The Drone That Rebooted the Wrong Server
Date: 2026-01-27
Issue: Multiple drones offline, plus mysterious gateway reboots
Result: Swarm restored to 58 cores across 4 active drones
Discovery: One drone was accidentally rebooting the gateway
After Alpha-Centauri recovered from a reboot (unrelated incident, but foreshadowing), I noticed the build swarm was in rough shape. Multiple drones offline, packages not building, the usual distributed systems chaos.
What I didn’t expect was to discover that one of the drones had been accidentally rebooting the gateway server every time a build failed.
The Initial Damage Assessment
The swarm status after the gateway came back online:
| Drone | Status | Problem |
|---|---|---|
| drone-Tau-Ceti (10.42.0.194) | Offline | Service stopped |
| drone-Mirach (192.168.20.77) | Blocked | Hardcoded block in gateway code |
| drone-Titawin (192.168.20.196) | Online but dangerous | See below |
| drone-Icarus (10.42.0.203) | Online | Working fine |
The gateway itself wasn’t starting on boot — systemctl enable swarm-gateway had never been run. Easy fix, but annoying.
Fix #1: Gateway Service Persistence
ssh [email protected] 'systemctl start swarm-gateway'
ssh [email protected] 'systemctl enable swarm-gateway'
Now it survives reboots. Novel concept.
Fix #2: drone-Tau-Ceti (The Easy One)
drone-Tau-Ceti (10.42.0.194) was just stopped. SSH’d in, started it:
ssh [email protected] 'rc-service swarm-drone start'
Watched it register with the gateway. 8 cores back in the pool.
Fix #3: drone-Mirach (The Blocked One)
This drone lives on the remote network (192.168.20.x) and had been causing trouble previously. My past self had helpfully added a hardcoded block:
# Temporary Block: drone-Mirach (failed uploads, unreachable SSH)
if 'Mirach' in node_data.get('name', '') or 'Mirach' in node_data.get('id', ''):
    log.warning(f"Registration rejected for BLOCKED node: {node_data.get('name')}")
    return {'error': 'Node blocked pending maintenance'}
Past Me meant well, but forgot to remove it. Present Me fixed it:
ssh [email protected] "sed -i '93,96d' /opt/build-swarm/bin/swarm-gateway"
ssh [email protected] 'systemctl restart swarm-gateway'
Then the real work: configuring the drone to use Tailscale IPs for everything.
The Problem: drone-Mirach is on 192.168.20.x, but the gateway returns orchestrator IPs from 10.42.0.x. Those IPs are unreachable from the remote network.
The Solution: Environment variable overrides in the drone config:
GATEWAY_URL="http://100.64.0.88:8090"   # Gateway via its Tailscale IP
ORCHESTRATOR_IP="100.64.0.18"           # Orchestrator via its Tailscale IP
UPLOAD_HOST="100.64.0.18"               # Binary uploads via Tailscale
REPORT_IP="100.64.0.110"                # This drone's own Tailscale IP
All Tailscale IPs, all routable from anywhere in the mesh.
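Before restarting the drone, a quick connectivity probe over Tailscale confirms those endpoints are actually reachable from the remote network. A minimal sketch (ports taken from the gateway URL above and the orchestrator's default 8080; the script itself is mine):

import socket

# (host, port) pairs the cross-network drone must reach over Tailscale
ENDPOINTS = [
    ("100.64.0.88", 8090),   # gateway
    ("100.64.0.18", 8080),   # orchestrator
]

for host, port in ENDPOINTS:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK      {host}:{port}")
    except OSError as exc:
        print(f"FAILED  {host}:{port} ({exc})")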
Also had to patch the drone code to actually use these variables:
# Changed from:
orch_config = {'ip': None, 'port': 8080}
# To:
orch_config = {'ip': os.environ.get('ORCHESTRATOR_IP'), 'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080))}
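For completeness, here is how the rest of the overrides slot in on the drone side. This is a sketch, not the actual drone code; the function name and fallbacks are mine:

import os

def load_swarm_config():
    """Build the drone's runtime config from environment overrides."""
    return {
        'gateway_url': os.environ.get('GATEWAY_URL', 'http://10.42.0.199:8090'),
        'orchestrator': {
            'ip': os.environ.get('ORCHESTRATOR_IP'),
            'port': int(os.environ.get('ORCHESTRATOR_PORT', 8080)),
        },
        'upload_host': os.environ.get('UPLOAD_HOST'),
        # The address this drone reports for itself; the orchestrator SSHes here.
        'report_ip': os.environ.get('REPORT_IP'),
    }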
20 cores back in the pool.
Fix #4: drone-Titawin (The Mystery Rebooter)
Now for the fun one.
Alpha-Centauri had been experiencing unexplained reboots — four on January 26th alone. Every single one was a clean shutdown, not a crash. No hardware errors, no OOM kills, nothing in the logs except… SSH connections from the orchestrator at the exact moment of each shutdown.
Wait. SSH connections to do what?
Here’s the build swarm’s auto-reboot feature: when a drone fails too many builds in a row, the orchestrator SSHs to it and runs reboot. Clean slate, start fresh.
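Roughly, the logic looks like this. A sketch from memory, not the orchestrator source; the threshold and names are approximate:

import subprocess

MAX_CONSECUTIVE_FAILURES = 3   # approximate; the real threshold may differ
failure_counts = {}            # node IP -> consecutive failed builds

def handle_build_result(node_ip, success):
    """Track per-node failures and reboot a node that keeps failing."""
    if success:
        failure_counts[node_ip] = 0
        return
    failure_counts[node_ip] = failure_counts.get(node_ip, 0) + 1
    if failure_counts[node_ip] >= MAX_CONSECUTIVE_FAILURES:
        # The orchestrator only knows the node by the IP it has on record.
        subprocess.run(['ssh', f'root@{node_ip}', 'reboot'], check=False)
        failure_counts[node_ip] = 0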
Here’s the bug: drone-Titawin (192.168.20.196) was configured to reach the gateway at http://10.42.0.199:8090. That IP isn’t directly routable from the remote network. Traffic goes through Tailscale subnet routing, which means it gets NAT’d (masqueraded) through Alpha-Centauri.
The orchestrator sees the source IP of drone-Titawin’s traffic as… 10.42.0.199. The gateway’s IP. Because that’s where the NAT is happening.
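My best guess at the underlying failure mode, as a sketch. The handler and field names are hypothetical, not the real gateway code:

def record_node(node_data, source_ip):
    """Hypothetical registration handler that falls back to the TCP source address."""
    node_ip = node_data.get('report_ip') or source_ip
    # drone-Titawin registers through the Tailscale subnet router, so after
    # NAT masquerade source_ip is 10.42.0.199, Alpha-Centauri's own address.
    return {'name': node_data.get('name'), 'ip': node_ip}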
When drone-Titawin failed builds and the orchestrator tried to reboot it:
# What the orchestrator thought it was doing:
ssh [email protected] 'reboot' # Reboot drone-Titawin
# What it actually did:
ssh [email protected] 'reboot' # Reboot the gateway
Same IP. Different intended targets.
The gateway was getting rebooted because a drone on another network was failing builds.
The drone-Titawin Fix
Same pattern as drone-Mirach — configure it to report its Tailscale IP and use Tailscale IPs for orchestrator communication:
GATEWAY_URL="http://10.42.0.199:8090" # Can stay, goes via Tailscale
ORCHESTRATOR_IP="100.64.0.18" # Use Tailscale IP directly
UPLOAD_HOST="100.64.0.18" # Use Tailscale IP directly
REPORT_IP="100.64.0.91" # Report actual Tailscale IP
Now when the orchestrator wants to reboot drone-Titawin, it SSHs to 100.64.0.91 — the drone’s actual Tailscale IP — not the gateway.
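Before trusting the auto-reboot feature again, it's worth a quick sanity check that the address on record lands on the drone and not the gateway. Something like this, run from the orchestrator (the root@ user matches the reboot command above; the script is mine):

import subprocess

# Confirm each recorded address answers with the hostname you expect.
for label, ip in [('drone-Titawin', '100.64.0.91'), ('gateway', '10.42.0.199')]:
    result = subprocess.run(['ssh', f'root@{ip}', 'hostname'],
                            capture_output=True, text=True, timeout=15)
    print(f"{label}: {result.stdout.strip()}")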
14 cores back in the pool, and the gateway stopped getting randomly rebooted.
The Final Swarm Status
Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online
Drones (58 total cores):
| Drone | IP | Cores | Status |
|---|---|---|---|
| drone-Mirach | 100.64.0.110 | 20 | Online |
| drone-Icarus | 10.42.0.203 | 16 | Online |
| drone-Titawin | 100.64.0.91 | 14 | Online |
| drone-Tau-Ceti | 10.42.0.194 | 8 | Online |
Cross-Network Drone Configuration (Reference)
For any drone not on the gateway’s local network:
- GATEWAY_URL - Can use local IP if Tailscale subnet routing works
- ORCHESTRATOR_IP - MUST be Tailscale IP (avoids NAT masquerade issue)
- UPLOAD_HOST - MUST be Tailscale IP (for binary uploads)
- REPORT_IP - MUST be drone’s Tailscale IP (so orchestrator can reach it)
The REPORT_IP is the critical one. That’s what the orchestrator uses for SSH commands. Get it wrong, and you reboot the wrong server.
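One way to make this less fragile, though it's just an idea and not something the drones do today: have the drone ask Tailscale for its own address at startup instead of hardcoding it.

import os
import subprocess

def detect_report_ip():
    """Prefer an explicit REPORT_IP; otherwise ask Tailscale for this host's IPv4."""
    explicit = os.environ.get('REPORT_IP')
    if explicit:
        return explicit
    # `tailscale ip -4` prints this machine's Tailscale IPv4 address.
    out = subprocess.run(['tailscale', 'ip', '-4'],
                         capture_output=True, text=True, check=True)
    return out.stdout.strip().splitlines()[0]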
Tailscale IPs Reference (Build Swarm)
| Host | Local IP | Tailscale IP | Role |
|---|---|---|---|
| Alpha-Centauri | 10.42.0.199 | 100.64.0.88 | Gateway |
| Icarus-Orchestrator | 10.42.0.201 | 100.64.0.18 | Orchestrator |
| drone-Icarus | 10.42.0.203 | 100.64.0.126 | Drone |
| drone-Titawin | 192.168.20.196 | 100.64.0.91 | Drone |
| drone-Tau-Ceti | 10.42.0.194 | 100.64.0.125 | Drone |
| drone-Mirach | 192.168.20.77 | 100.64.0.110 | Drone |
Lessons Learned
- NAT masquerade hides real IPs — if your distributed system uses IP addresses for identity, NAT will betray you
- REPORT_IP is critical — the IP a node reports to the coordinator should be reachable by the coordinator for management commands
- Tailscale subnet routing is powerful but subtle — traffic gets masqueraded, which can confuse services that care about source IPs
- Auto-reboot features need the right target — if your orchestrator can reboot nodes, make absolutely sure it’s rebooting the right ones
- Past Me leaves landmines — that “temporary block” in the gateway code? Five days later and I forgot it existed
The swarm is back to 58 cores. The gateway stopped getting randomly rebooted. And I added comments to explain why REPORT_IP matters, so Future Me doesn’t have to rediscover this lesson in 6 months.