user@argobox:~/journal/2026-01-27-the-gateway-that-rebooted-itself
$ cat entry.md

The Gateway That Rebooted Itself


Date: 2026-01-27
Issue: Gateway server rebooting 4+ times per day
Root Cause: NAT masquerade + auto-heal = friendly fire
Lesson: When automation targets the wrong IP, everyone loses


The Mystery

Alpha-Centauri (the gateway server at 10.42.0.199) was rebooting randomly. Four times on January 26th alone. No pattern I could find.

  • No kernel panics
  • No OOM kills
  • No hardware errors
  • NVMe temperature normal (46°C)
  • Memory fine
  • CPU fine

Every reboot was a clean shutdown. Like someone typed reboot.


The Evidence

I checked the auth logs:

Jan 26 14:23:47 Alpha-Centauri sshd: Accepted publickey for root from 100.64.0.18
Jan 26 14:23:49 Alpha-Centauri systemd: Stopping all remaining mounts...

Someone SSH’d in from 100.64.0.18 — which is Icarus-Orchestrator — and the system shut down two seconds later.

The orchestrator was rebooting my gateway. But why?


The Build Swarm Architecture

Quick context: The build swarm has three components:

  1. Gateway (Alpha-Centauri, 10.42.0.199): Registration hub for drones
  2. Orchestrator (Icarus-Orchestrator, 10.42.0.201): Assigns packages, monitors drones
  3. Drones: Worker nodes that compile packages

The orchestrator has an auto-heal feature: if a drone stops responding, it can SSH to the drone and reboot it. Useful for stuck builds.
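
The auto-heal logic isn't reproduced here, but in spirit it's something like this minimal Python sketch (the drone dict, auto_heal, and HEARTBEAT_TIMEOUT are my names, not the real ones):

import subprocess
import time

HEARTBEAT_TIMEOUT = 300  # seconds of silence before the orchestrator steps in

def auto_heal(drone):
    """Reboot a drone that has stopped responding, using the IP it registered with."""
    if time.time() - drone['last_seen'] < HEARTBEAT_TIMEOUT:
        return  # drone is still checking in; nothing to do
    # SSH to the IP on file and reboot it. Nothing here verifies that the IP
    # actually belongs to a drone -- that's the gap this incident exposed.
    subprocess.run(['ssh', f"root@{drone['ip']}", 'reboot'], timeout=30, check=False)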


The Culprit

drone-Titawin is a VM on the 192.168.20.x network — different from the main 10.42.0.x network. It can’t directly reach 10.42.0.x, so it uses Tailscale.

Here’s where it gets interesting:

  1. drone-Titawin (192.168.20.196) was configured to reach the gateway at 10.42.0.199
  2. That IP isn’t directly routable from 192.168.20.x
  3. Traffic goes through Tailscale, which routes it via Alpha-Centauri (subnet router)
  4. Alpha-Centauri NAT masquerades the traffic
  5. The gateway sees the source IP as 10.42.0.199 — its own IP
  6. When drone-Titawin goes offline, the orchestrator tries to reboot it
  7. The orchestrator SSHs to the IP on file: 10.42.0.199
  8. Alpha-Centauri reboots itself

The drone's registration ended up with the gateway's IP on file as its own. Every time the orchestrator tried to recover the drone, it shot the gateway instead.
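
I don't have the gateway's registration handler pasted here, so this is only a sketch of the failure mode, assuming the gateway falls back to the connection's source address when a drone doesn't explicitly report one (record_drone_ip and report_ip are placeholder names):

def record_drone_ip(node_data, peer_addr):
    # If the drone doesn't supply an explicit IP, trust the socket's source
    # address -- which, behind the subnet router's masquerade, is 10.42.0.199,
    # not the drone's real address.
    return node_data.get('report_ip') or peer_addr

# Before the fix:
#   record_drone_ip({'name': 'drone-Titawin'}, '10.42.0.199')      -> '10.42.0.199'
# After the fix:
#   record_drone_ip({'name': 'drone-Titawin',
#                    'report_ip': '100.64.0.91'}, '10.42.0.199')   -> '100.64.0.91'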


The Fix

Same pattern that would later save my workstation: make the drone use Tailscale IPs directly.

# On drone-Titawin
cat >> /etc/build-swarm/drone.conf << 'EOF'
ORCHESTRATOR_IP="100.64.0.18"
UPLOAD_HOST="100.64.0.18"
REPORT_IP="100.64.0.91"
EOF

Now the drone reports its Tailscale IP (100.64.0.91) instead of the masqueraded gateway IP. When the orchestrator tries to reboot it, it hits the right target.


The Other Fixes (Same Session)

This wasn’t the only issue. While I was at it:

Gateway Service Not Enabled

ssh root@10.42.0.199 'systemctl enable swarm-gateway'

The gateway wasn’t set to start on boot. Every time it rebooted (four times that day), someone had to manually start it.

drone-Tau-Ceti Not Registering

ssh root@10.42.0.194 'rc-service swarm-drone start'

Service was stopped. Simple fix.

drone-Mirach Blocked

The gateway had a hardcoded block for “masaimara” nodes from a previous troubleshooting session:

# Temporary Block: dr-masaimara (failed uploads, unreachable SSH)
if 'masaimara' in node_data.get('name', ''):
    log.warning(f"Registration rejected for BLOCKED node")
    return {'error': 'Node blocked pending maintenance'}

I forgot to remove it after fixing the actual issue. Classic.


Final Swarm Status

After all fixes:

Gateway: 10.42.0.199:8090 - Online
Orchestrator: 10.42.0.201:8080 - Online

Drones (58 total cores):
  drone-Mirach     | 100.64.0.110 | 20 cores | Online
  drone-Icarus     | 10.42.0.203  | 16 cores | Online
  drone-Titawin    | 100.64.0.91  | 14 cores | Online
  drone-Tau-Ceti   | 10.42.0.194  |  8 cores | Online

Lessons Learned

  1. NAT masquerade hides the real source IP. When Tailscale routes traffic through a subnet router with masquerade enabled, the destination sees the router’s IP, not the original source.

  2. Auto-heal features need safeguards. If the orchestrator had a “protected hosts” list, this wouldn’t have happened. (I added one later; a rough sketch follows this list.)

  3. Cross-network drones need special configuration. Any drone not on the main network must use Tailscale IPs for ORCHESTRATOR_IP, UPLOAD_HOST, and REPORT_IP.

  4. Remove temporary workarounds. That hardcoded block was fine during debugging. Forgetting to remove it caused a support ticket from myself to myself.

  5. Clean shutdowns have a cause. If there’s no crash, something initiated the shutdown. Check auth logs.
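
For lesson 2, the guard I added to the orchestrator's auto-heal path looks roughly like this (a sketch, not the exact code; the set contents come from the reference table below, and safe_to_reboot is my name for it):

PROTECTED_HOSTS = {
    '10.42.0.199', '100.64.0.88',   # Alpha-Centauri (gateway)
    '10.42.0.201', '100.64.0.18',   # Icarus-Orchestrator
}

def safe_to_reboot(ip):
    """Refuse to auto-heal anything on the protected list."""
    return ip not in PROTECTED_HOSTS

# Checked right before the SSH-and-reboot step, something like:
#   if not safe_to_reboot(drone['ip']):
#       log.error(f"Refusing to reboot protected host {drone['ip']}")
#       return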


Network Reference

For future me debugging this again:

Host                 | Local IP        | Tailscale IP  | Role
Alpha-Centauri       | 10.42.0.199     | 100.64.0.88   | Gateway
Icarus-Orchestrator  | 10.42.0.201     | 100.64.0.18   | Orchestrator
drone-Icarus         | 10.42.0.203     | 100.64.0.126  | Drone
drone-Titawin        | 192.168.20.196  | 100.64.0.91   | Drone
drone-Tau-Ceti       | 10.42.0.194     | 100.64.0.125  | Drone
drone-Mirach         | 192.168.20.77   | 100.64.0.110  | Drone

Rule: If a drone isn’t on 10.42.0.x, it must use Tailscale IPs for everything.


Prevention

  1. Protected hosts list: IPs that should never receive reboot commands
  2. Enable services on boot: Don’t rely on manual starts after reboot
  3. Document Tailscale mappings: Keep a reference of which IP is which
  4. Remove debug blocks: Temporary workarounds shouldn’t be permanent

The gateway rebooted four times before I found the cause. The orchestrator was just trying to help.