user@argobox:~/journal/2026-01-27-the-gateway-that-rebooted-itself
$ cat entry.md

The Gateway That Rebooted Itself


Date: 2026-01-27
Issue: Gateway server rebooting 4+ times per day
Root Cause: NAT masquerade + auto-heal = friendly fire
Lesson: When automation targets the wrong IP, everyone loses


The Mystery

Alpha-Centauri (the gateway server at 10.42.0.199) was rebooting randomly. Four times on January 26th alone. No pattern I could find.

  • No kernel panics
  • No OOM kills
  • No hardware errors
  • NVMe temperature normal (46°C)
  • Memory fine
  • CPU fine

Every reboot was a clean shutdown. Like someone typed reboot.


The Evidence

I checked the auth logs:

Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...

Someone SSH’d in from 100.64.0.18 — which is Icarus-Orchestrator — and the system shut down two seconds later.

The orchestrator was rebooting my gateway. But why?


The Build Swarm Architecture

Quick context: The build swarm has three components:

  1. Gateway (Alpha-Centauri, 10.42.0.199): Registration hub for drones
  2. Orchestrator (Icarus-Orchestrator, 10.42.0.201): Assigns packages, monitors drones
  3. Drones: Worker nodes that compile packages

The orchestrator has an auto-heal feature: if a drone stops responding, it can SSH to the drone and reboot it. Useful for stuck builds.
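
The auto-heal logic isn't reproduced here, but in spirit it's something like this minimal Python sketch (the drone dict, auto_heal, and HEARTBEAT_TIMEOUT are my names, not the real ones):

import subprocess
import time

HEARTBEAT_TIMEOUT = 300  # seconds of silence before the orchestrator steps in

def auto_heal(drone):
    """Reboot a drone that has stopped responding, using the IP it registered with."""
    if time.time() - drone['last_seen'] < HEARTBEAT_TIMEOUT:
        return  # drone is still checking in; nothing to do
    # SSH to the IP on file and reboot it. Nothing here verifies that the IP
    # actually belongs to a drone -- that's the gap this incident exposed.
    subprocess.run(['ssh', f"root@{drone['ip']}", 'reboot'], timeout=30, check=False)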


The Culprit

drone-Titawin is a VM on the 192.168.20.x network — different from the main 10.42.0.x network. It can’t directly reach 10.42.0.x, so it uses Tailscale.

Here’s where it gets interesting:

  1. drone-Titawin (192.168.20.196) was configured to reach the gateway at 10.42.0.199
  2. That IP isn’t directly routable from 192.168.20.x
  3. Traffic goes through Tailscale, which routes it via Alpha-Centauri (subnet router)
  4. Alpha-Centauri NAT masquerades the traffic
  5. The gateway sees the source IP as 10.42.0.199 — its own IP
  6. When drone-Titawin goes offline, the orchestrator tries to reboot it
  7. The orchestrator SSHs to the IP on file: 10.42.0.199
  8. Alpha-Centauri reboots itself

The drone's registration ended up with the gateway's IP on file as its own. Every time the orchestrator tried to recover the drone, it shot the gateway instead.
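
I don't have the gateway's registration handler pasted here, so this is only a sketch of the failure mode, assuming the gateway falls back to the connection's source address when a drone doesn't explicitly report one (record_drone_ip and report_ip are placeholder names):

def record_drone_ip(node_data, peer_addr):
    # If the drone doesn't supply an explicit IP, trust the socket's source
    # address -- which, behind the subnet router's masquerade, is 10.42.0.199,
    # not the drone's real address.
    return node_data.get('report_ip') or peer_addr

# Before the fix:
#   record_drone_ip({'name': 'drone-Titawin'}, '10.42.0.199')      -> '10.42.0.199'
# After the fix:
#   record_drone_ip({'name': 'drone-Titawin',
#                    'report_ip': '100.64.0.91'}, '10.42.0.199')   -> '100.64.0.91'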


The Fix

Same pattern that would later save my workstation: make the drone use Tailscale IPs directly.

# On drone-Titawin
cat >> /etc/build-swarm/drone.conf << 'EOF'
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.91"
EOF

Now the drone reports its Tailscale IP (100.64.0.91) instead of the masqueraded gateway IP. When the orchestrator tries to reboot it, it hits the right target.


The Other Fixes (Same Session)

This wasn’t the only issue. While I was at it:

Gateway Service Not Enabled

ssh root@10.42.0.199 'systemctl enable swarm-gateway'

The gateway wasn’t set to start on boot. Every time it rebooted (four times that day), someone had to manually start it.

drone-Tau-Ceti Not Registering

ssh root@10.42.0.194 'rc-service swarm-drone start'

Service was stopped. Simple fix.

drone-Mirach Blocked

The gateway had a hardcoded block for “masaimara” nodes from a previous troubleshooting session:

# Temporary Block: dr-masaimara (failed uploads, unreachable SSH)
if 'masaimara' in node_data.get('name', ''):
    log.warning(f"Registration rejected for BLOCKED node")
    return {'error': 'Node blocked pending maintenance'}

I forgot to remove it after fixing the actual issue. Classic.


Final Swarm Status

After all fixes:

Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online

Drones (58 total cores):
  drone-Mirach     | 100.64.0.110 | 20 cores | Online
  drone-Icarus     | 10.42.0.203  | 16 cores | Online
  drone-Titawin    | 100.64.0.91  | 14 cores | Online
  drone-Tau-Ceti   | 10.42.0.194  |  8 cores | Online

Lessons Learned

  1. NAT masquerade hides the real source IP. When Tailscale routes traffic through a subnet router with masquerade enabled, the destination sees the router’s IP, not the original source.

  2. Auto-heal features need safeguards. If the orchestrator had a “protected hosts” list, this wouldn’t have happened. (I added one later; a rough sketch follows this list.)

  3. Cross-network drones need special configuration. Any drone not on the main network must use Tailscale IPs for ORCHESTRATOR_IP, UPLOAD_HOST, and REPORT_IP.

  4. Remove temporary workarounds. That hardcoded block was fine during debugging. Forgetting to remove it caused a support ticket from myself to myself.

  5. Clean shutdowns have a cause. If there’s no crash, something initiated the shutdown. Check auth logs.
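
For lesson 2, the guard I added to the orchestrator's auto-heal path looks roughly like this (a sketch, not the exact code; the set contents come from the reference table below, and safe_to_reboot is my name for it):

PROTECTED_HOSTS = {
    '10.42.0.199', '100.64.0.88',   # Alpha-Centauri (gateway)
    '10.42.0.201', '100.64.0.18',   # Icarus-Orchestrator
}

def safe_to_reboot(ip):
    """Refuse to auto-heal anything on the protected list."""
    return ip not in PROTECTED_HOSTS

# Checked right before the SSH-and-reboot step, something like:
#   if not safe_to_reboot(drone['ip']):
#       log.error(f"Refusing to reboot protected host {drone['ip']}")
#       return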


Network Reference

For future me debugging this again:

Host                 | Local IP        | Tailscale IP  | Role
Alpha-Centauri       | 10.42.0.199     | 100.64.0.88   | Gateway
Icarus-Orchestrator  | 10.42.0.201     | 100.64.0.18   | Orchestrator
drone-Icarus         | 10.42.0.203     | 100.64.0.126  | Drone
drone-Titawin        | 192.168.20.196  | 100.64.0.91   | Drone
drone-Tau-Ceti       | 10.42.0.194     | 100.64.0.125  | Drone
drone-Mirach         | 192.168.20.77   | 100.64.0.110  | Drone

Rule: If a drone isn’t on 10.42.0.x, it must use Tailscale IPs for everything.


Prevention

  1. Protected hosts list: IPs that should never receive reboot commands
  2. Enable services on boot: Don’t rely on manual starts after reboot
  3. Document Tailscale mappings: Keep a reference of which IP is which
  4. Remove debug blocks: Temporary workarounds shouldn’t be permanent

The gateway rebooted four times before I found the cause. The orchestrator was just trying to help.