The Gateway That Rebooted Itself
Date: 2026-01-27
Issue: Gateway server rebooting 4+ times per day
Root Cause: NAT masquerade + auto-heal = friendly fire
Lesson: When automation targets the wrong IP, everyone loses
The Mystery
Alpha-Centauri (the gateway server at 10.42.0.199) was rebooting randomly. Four times on January 26th alone. No pattern I could find.
- No kernel panics
- No OOM kills
- No hardware errors
- NVMe temperature normal (46°C)
- Memory fine
- CPU fine
Every reboot was a clean shutdown. Like someone typed reboot.
The Evidence
I checked the auth logs:
Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...
Someone SSH’d in from 100.64.0.18 — which is Icarus-Orchestrator — and the system shut down two seconds later.
The orchestrator was rebooting my gateway. But why?
The Build Swarm Architecture
Quick context: The build swarm has three components:
- Gateway (Alpha-Centauri, 10.42.0.199): Registration hub for drones
- Orchestrator (Icarus-Orchestrator, 10.42.0.201): Assigns packages, monitors drones
- Drones: Worker nodes that compile packages
The orchestrator has an auto-heal feature: if a drone stops responding, it can SSH to the drone and reboot it. Useful for stuck builds.
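For context, the auto-heal path is roughly this shape. A minimal sketch, not the orchestrator's actual code (the function names are mine); the detail that matters is that the reboot goes to whatever IP is on file for the drone, with no check that the host answering there is really the drone.

```python
# Minimal sketch of the auto-heal idea; names are illustrative.
import subprocess

def is_responsive(ip: str) -> bool:
    """Single ping with a 2-second timeout."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        capture_output=True,
    )
    return result.returncode == 0

def auto_heal(drones: dict[str, str]) -> None:
    """drones maps drone name -> IP recorded at registration time."""
    for name, ip in drones.items():
        if not is_responsive(ip):
            # Reboot whichever host answers at that IP.
            subprocess.run(["ssh", f"root@{ip}", "reboot"])
```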
The Culprit
drone-Titawin is a VM on the 192.168.20.x network — different from the main 10.42.0.x network. It can’t directly reach 10.42.0.x, so it uses Tailscale.
Here’s where it gets interesting:
- drone-Titawin (192.168.20.196) was configured to reach the gateway at 10.42.0.199
- That IP isn't directly routable from 192.168.20.x
- Traffic goes through Tailscale, which routes it via Alpha-Centauri (subnet router)
- Alpha-Centauri NAT masquerades the traffic
- The gateway sees the source IP as 10.42.0.199, its own IP
- When drone-Titawin goes offline, the orchestrator tries to reboot it
- The orchestrator SSHs to the IP on file: 10.42.0.199
- Alpha-Centauri reboots itself
The drone was accidentally reporting the gateway’s IP as its own. Every time the orchestrator tried to recover the drone, it shot the gateway instead.
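I'm not quoting the gateway's registration handler here, but the failure mode is the classic "trust the socket's source address" pattern. A hypothetical sketch, assuming the gateway falls back to the connection's source IP when the drone doesn't report one explicitly:

```python
# Hypothetical registration handler illustrating the failure mode,
# not the gateway's actual code. Behind the masquerading subnet
# router, the TCP source address is 10.42.0.199: the gateway itself.
registry: dict[str, str] = {}

def register(node_data: dict, remote_addr: str) -> dict:
    ip = node_data.get('report_ip') or remote_addr  # masqueraded source
    registry[node_data['name']] = ip
    return {'registered': node_data['name'], 'ip': ip}
```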
The Fix
Same pattern that would later save my workstation: make the drone use Tailscale IPs directly.
# On drone-Titawin
cat >> /etc/build-swarm/drone.conf << 'EOF'
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.91"
EOF
Now the drone reports its Tailscale IP (100.64.0.91) instead of the masqueraded gateway IP. When the orchestrator tries to reboot it, it hits the right target.
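Hardcoding the IPs works; another option would be to discover the drone's own Tailscale address at startup, since `tailscale ip -4` prints the node's IPv4 Tailscale address. A sketch of that idea, not what I deployed above:

```python
# Possible refinement: ask Tailscale for this node's own address
# instead of hardcoding REPORT_IP in drone.conf.
import subprocess

def my_tailscale_ip() -> str:
    out = subprocess.run(
        ["tailscale", "ip", "-4"],
        capture_output=True, text=True, check=True,
    )
    # First line is this node's own 100.64.x.x address.
    return out.stdout.strip().splitlines()[0]
```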
The Other Fixes (Same Session)
This wasn’t the only issue. While I was at it:
Gateway Service Not Enabled
ssh [email protected] 'systemctl enable swarm-gateway'
The gateway wasn’t set to start on boot. Every time it rebooted (four times that day), someone had to manually start it.
drone-Tau-Ceti Not Registering
ssh [email protected] 'rc-service swarm-drone start'
Service was stopped. Simple fix.
drone-Mirach Blocked
The gateway had a hardcoded block for “masaimara” nodes from a previous troubleshooting session:
# Temporary Block: dr-masaimara (failed uploads, unreachable SSH)
if 'masaimara' in node_data.get('name', ''):
    log.warning(f"Registration rejected for BLOCKED node")
    return {'error': 'Node blocked pending maintenance'}
I forgot to remove it after fixing the actual issue. Classic.
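In hindsight, a config-driven deny list would have been harder to forget than a hardcoded if. A sketch of what that could look like; the path and file format are illustrative, not the gateway's actual config:

```python
# Config-driven block list instead of a hardcoded check.
import json
from pathlib import Path

BLOCKLIST_PATH = Path("/etc/build-swarm/blocked-nodes.json")

def is_blocked(name: str) -> bool:
    if not BLOCKLIST_PATH.exists():
        return False
    patterns = json.loads(BLOCKLIST_PATH.read_text())
    return any(p in name for p in patterns)
```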
Final Swarm Status
After all fixes:
Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online
Drones (58 total cores):
drone-Mirach | 100.64.0.110 | 20 cores | Online
drone-Icarus | 10.42.0.203 | 16 cores | Online
drone-Titawin | 100.64.0.91 | 14 cores | Online
drone-Tau-Ceti | 10.42.0.194 | 8 cores | Online
Lessons Learned
- NAT masquerade hides the real source IP. When Tailscale routes traffic through a subnet router with masquerade enabled, the destination sees the router's IP, not the original source.
- Auto-heal features need safeguards. If the orchestrator had a "protected hosts" list, this wouldn't have happened. (I added one later; a sketch follows this list.)
- Cross-network drones need special configuration. Any drone not on the main network must use Tailscale IPs for ORCHESTRATOR_IP, UPLOAD_HOST, and REPORT_IP.
- Remove temporary workarounds. That hardcoded block was fine during debugging. Forgetting to remove it caused a support ticket from myself to myself.
- Clean shutdowns have a cause. If there's no crash, something initiated the shutdown. Check auth logs.
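A minimal sketch of that protected-hosts guard, using the IPs from the reference table below; names and exact shape are illustrative, not the orchestrator's actual code:

```python
# Refuse to auto-heal hosts that must never receive reboot commands.
import subprocess

PROTECTED_HOSTS = {
    "10.42.0.199", "100.64.0.88",   # Alpha-Centauri (gateway)
    "10.42.0.201", "100.64.0.18",   # Icarus-Orchestrator
}

def reboot_drone(ip: str) -> None:
    if ip in PROTECTED_HOSTS:
        raise RuntimeError(f"Refusing to reboot protected host {ip}")
    subprocess.run(["ssh", f"root@{ip}", "reboot"], check=True)
```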
Network Reference
For future me debugging this again:
| Host | Local IP | Tailscale IP | Role |
|---|---|---|---|
| Alpha-Centauri | 10.42.0.199 | 100.64.0.88 | Gateway |
| Icarus-Orchestrator | 10.42.0.201 | 100.64.0.18 | Orchestrator |
| drone-Icarus | 10.42.0.203 | 100.64.0.126 | Drone |
| drone-Titawin | 192.168.20.196 | 100.64.0.91 | Drone |
| drone-Tau-Ceti | 10.42.0.194 | 100.64.0.125 | Drone |
| drone-Mirach | 192.168.20.77 | 100.64.0.110 | Drone |
Rule: If a drone isn’t on 10.42.0.x, it must use Tailscale IPs for everything.
Prevention
- Protected hosts list: IPs that should never receive reboot commands
- Enable services on boot: Don’t rely on manual starts after reboot
- Document Tailscale mappings: Keep a reference of which IP is which
- Remove debug blocks: Temporary workarounds shouldn’t be permanent
The gateway rebooted four times before I found the cause. The orchestrator was just trying to help.