The Drone That Rebooted the Wrong Server
Date: 2026-01-27
Issue: Multiple drones offline, plus mysterious gateway reboots
Result: Swarm restored to 58 cores across 4 active drones
Discovery: One drone was accidentally rebooting the gateway
After Alpha-Centauri recovered from a reboot (unrelated incident, but foreshadowing), I noticed the build swarm was in rough shape. Multiple drones offline, packages not building, the usual distributed systems chaos.
What I didn’t expect was to discover that one of the drones had been accidentally rebooting the gateway server every time a build failed.
The Initial Damage Assessment
The swarm status after the gateway came back online:
| Drone | Status | Problem |
|---|---|---|
| drone-Tau-Ceti (10.42.0.194) | Offline | Service stopped |
| drone-Mirach (192.168.20.77) | Blocked | Hardcoded block in gateway code |
| drone-Titawin (192.168.20.196) | Online but dangerous | See below |
| drone-Icarus (10.42.0.203) | Online | Working fine |
The gateway itself wasn’t starting on boot — systemctl enable swarm-gateway had never been run. Easy fix, but annoying.
Fix #1: Gateway Service Persistence
ssh [email protected] 'systemctl start swarm-gateway'
ssh [email protected] 'systemctl enable swarm-gateway'
Now it survives reboots. Novel concept.
Fix #2: drone-Tau-Ceti (The Easy One)
drone-Tau-Ceti (10.42.0.194) was just stopped. SSH’d in, started it:
ssh [email protected] 'rc-service swarm-drone start'
Watched it register with the gateway. 8 cores back in the pool.
Fix #3: drone-Mirach (The Blocked One)
This drone lives on the remote network (192.168.20.x) and had been causing trouble previously. My past self had helpfully added a hardcoded block:
# Temporary Block: drone-Mirach (failed uploads, unreachable SSH)
if 'Mirach' in node_data.get('name', '') or 'Mirach' in node_data.get('id', ''):
    log.warning(f"Registration rejected for BLOCKED node: {node_data.get('name')}")
    return {'error': 'Node blocked pending maintenance'}
Past Me meant well, but forgot to remove it. Present Me fixed it:
ssh [email protected] "sed -i '93,96d' /opt/build-swarm/bin/swarm-gateway"
ssh [email protected] 'systemctl restart swarm-gateway'
Then the real work: configuring the drone to use Tailscale IPs for everything.
The Problem: drone-Mirach is on 192.168.20.x, but the gateway returns orchestrator IPs from 10.42.0.x. Those IPs are unreachable from the remote network.
The Solution: Environment variable overrides in the drone config:
GATEWAY_URL="http://100.64.0.88:8090"   # Gateway via its Tailscale IP
ORCHESTRATOR_IP="100.64.0.18"           # Orchestrator via its Tailscale IP
UPLOAD_HOST="100.64.0.18"               # Binary uploads via Tailscale
REPORT_IP="100.64.0.110"                # This drone's own Tailscale IP
All Tailscale IPs, all routable from anywhere in the mesh.
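Before restarting the drone, a quick connectivity probe over Tailscale confirms those endpoints are actually reachable from the remote network. A minimal sketch (ports taken from the gateway URL above and the orchestrator's default 8080; the script itself is mine):

import socket

# (host, port) pairs the cross-network drone must reach over Tailscale
ENDPOINTS = [
    ("100.64.0.88", 8090),   # gateway
    ("100.64.0.18", 8080),   # orchestrator
]

for host, port in ENDPOINTS:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK      {host}:{port}")
    except OSError as exc:
        print(f"FAILED  {host}:{port} ({exc})")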
Also had to patch the drone code to actually use these variables:
# Changed from:
orch_config = {'ip': None, 'port': 8080}
# To:
orch_config = {'ip': os.environ.get('ORCHESTRATOR_IP'), 'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080))}
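For completeness, here is how the rest of the overrides slot in on the drone side. This is a sketch, not the actual drone code; the function name and fallbacks are mine:

import os

def load_swarm_config():
    """Build the drone's runtime config from environment overrides."""
    return {
        'gateway_url': os.environ.get('GATEWAY_URL', 'http://10.42.0.199:8090'),
        'orchestrator': {
            'ip': os.environ.get('ORCHESTRATOR_IP'),
            'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080)),
        },
        'upload_host': os.environ.get('UPLOAD_HOST'),
        # The address this drone reports for itself; the orchestrator SSHes here.
        'report_ip': os.environ.get('REPORT_IP'),
    }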
20 cores back in the pool.
Fix #4: drone-Titawin (The Mystery Rebooter)
Now for the fun one.
Alpha-Centauri had been experiencing unexplained reboots — four on January 26th alone. Every single one was a clean shutdown, not a crash. No hardware errors, no OOM kills, nothing in the logs except… SSH connections from the orchestrator at the exact moment of each shutdown.
Wait. SSH connections to do what?
Here’s the build swarm’s auto-reboot feature: when a drone fails too many builds in a row, the orchestrator SSHs to it and runs reboot. Clean slate, start fresh.
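Roughly, the logic looks like this. A sketch from memory, not the orchestrator source; the threshold and names are approximate:

import subprocess

MAX_CONSECUTIVE_FAILURES = 3   # approximate; the real threshold may differ
failure_counts = {}            # node IP -> consecutive failed builds

def handle_build_result(node_ip, success):
    """Track per-node failures and reboot a node that keeps failing."""
    if success:
        failure_counts[node_ip] = 0
        return
    failure_counts[node_ip] = failure_counts.get(node_ip, 0) + 1
    if failure_counts[node_ip] >= MAX_CONSECUTIVE_FAILURES:
        # The orchestrator only knows the node by the IP it has on record.
        subprocess.run(['ssh', f'root@{node_ip}', 'reboot'], check=False)
        failure_counts[node_ip] = 0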
Here’s the bug: drone-Titawin (192.168.20.196) was configured to reach the gateway at http://10.42.0.199:8090. That IP isn’t directly routable from the remote network. Traffic goes through Tailscale subnet routing, which means it gets NAT’d (masqueraded) through Alpha-Centauri.
The orchestrator sees the source IP of drone-Titawin’s traffic as… 10.42.0.199. The gateway’s IP. Because that’s where the NAT is happening.
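My best guess at the underlying failure mode, as a sketch. The handler and field names are hypothetical, not the real gateway code:

def record_node(node_data, source_ip):
    """Hypothetical registration handler that falls back to the TCP source address."""
    node_ip = node_data.get('report_ip') or source_ip
    # drone-Titawin registers through the Tailscale subnet router, so after
    # NAT masquerade source_ip is 10.42.0.199, Alpha-Centauri's own address.
    return {'name': node_data.get('name'), 'ip': node_ip}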
When drone-Titawin failed builds and the orchestrator tried to reboot it:
# What the orchestrator thought it was doing:
ssh [email protected] 'reboot' # Reboot drone-Titawin
# What it actually did:
ssh [email protected] 'reboot' # Reboot the gateway
Same IP. Different intended targets.
The gateway was getting rebooted because a drone on another network was failing builds.
The drone-Titawin Fix
Same pattern as drone-Mirach — configure it to report its Tailscale IP and use Tailscale IPs for orchestrator communication:
GATEWAY_URL="http://10.42.0.199:8090" # Can stay, goes via Tailscale
ORCHESTRATOR_IP="100.64.0.18" # Use Tailscale IP directly
UPLOAD_HOST="100.64.0.18" # Use Tailscale IP directly
REPORT_IP="100.64.0.91" # Report actual Tailscale IP
Now when the orchestrator wants to reboot drone-Titawin, it SSHs to 100.64.0.91 — the drone’s actual Tailscale IP — not the gateway.
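Before trusting the auto-reboot feature again, it's worth a quick sanity check that the address on record lands on the drone and not the gateway. Something like this, run from the orchestrator (the root@ user matches the reboot command above; the script is mine):

import subprocess

# Confirm each recorded address answers with the hostname you expect.
for label, ip in [('drone-Titawin', '100.64.0.91'), ('gateway', '10.42.0.199')]:
    result = subprocess.run(['ssh', f'root@{ip}', 'hostname'],
                            capture_output=True, text=True, timeout=15)
    print(f"{label}: {result.stdout.strip()}")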
14 cores back in the pool, and the gateway stopped getting randomly rebooted.
The Final Swarm Status
Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online
Drones (58 total cores):
| Drone | IP | Cores | Status |
|---|---|---|---|
| drone-Mirach | 100.64.0.110 | 20 | Online |
| drone-Icarus | 10.42.0.203 | 16 | Online |
| drone-Titawin | 100.64.0.91 | 14 | Online |
| drone-Tau-Ceti | 10.42.0.194 | 8 | Online |
Cross-Network Drone Configuration (Reference)
For any drone not on the gateway’s local network:
- GATEWAY_URL - Can use local IP if Tailscale subnet routing works
- ORCHESTRATOR_IP - MUST be Tailscale IP (avoids NAT masquerade issue)
- UPLOAD_HOST - MUST be Tailscale IP (for binary uploads)
- REPORT_IP - MUST be drone’s Tailscale IP (so orchestrator can reach it)
The REPORT_IP is the critical one. That’s what the orchestrator uses for SSH commands. Get it wrong, and you reboot the wrong server.
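One way to make this less fragile, though it's just an idea and not something the drones do today: have the drone ask Tailscale for its own address at startup instead of hardcoding it.

import os
import subprocess

def detect_report_ip():
    """Prefer an explicit REPORT_IP; otherwise ask Tailscale for this host's IPv4."""
    explicit = os.environ.get('REPORT_IP')
    if explicit:
        return explicit
    # `tailscale ip -4` prints this machine's Tailscale IPv4 address.
    out = subprocess.run(['tailscale', 'ip', '-4'],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip().splitlines()[0]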
Tailscale IPs Reference (Build Swarm)
| Host | Local IP | Tailscale IP | Role |
|---|---|---|---|
| Alpha-Centauri | 10.42.0.199 | 100.64.0.88 | Gateway |
| Icarus-Orchestrator | 10.42.0.201 | 100.64.0.18 | Orchestrator |
| drone-Icarus | 10.42.0.203 | 100.64.0.126 | Drone |
| drone-Titawin | 192.168.20.196 | 100.64.0.91 | Drone |
| drone-Tau-Ceti | 10.42.0.194 | 100.64.0.125 | Drone |
| drone-Mirach | 192.168.20.77 | 100.64.0.110 | Drone |
Lessons Learned
- NAT masquerade hides real IPs — if your distributed system uses IP addresses for identity, NAT will betray you
- REPORT_IP is critical — the IP a node reports to the coordinator should be reachable by the coordinator for management commands
- Tailscale subnet routing is powerful but subtle — traffic gets masqueraded, which can confuse services that care about source IPs
- Auto-reboot features need the right target — if your orchestrator can reboot nodes, make absolutely sure it’s rebooting the right ones
- Past Me leaves landmines — that “temporary block” in the gateway code? Five days later and I forgot it existed
The swarm is back to 58 cores. The gateway stopped getting randomly rebooted. And I added comments to explain why REPORT_IP matters, so Future Me doesn’t have to rediscover this lesson in 6 months.