Build Swarm Part 2: Hardening Against the Real World
We built a distributed compilation swarm. It worked. It was fast. It felt like the future.
Then, inevitably, the edge cases started crawling out of the woodwork.
After running the Build Swarm for a few weeks, we discovered that “working” and “robust” are two very different things. This is the story of a ghost drone, a NAT masquerade that confused a server into suicide, and the hardening of the fleet.
The Mystery Drone
I opened the swarm monitor on Tuesday morning, coffee in hand. The counts looked good. 58 cores online.
Wait. 58? I only own 50 cores.
I scanned the drone list.
drone-Izar (VM). drone-Tarn (VM). drone-Meridian (Docker).
And then: drone-lxc (10.42.0.184).
I didn’t remember building a drone at 10.42.0.184.
It was happily compiling packages, uploading artifacts, and reporting success. It was a model citizen. But it was a ghost.
I traced the IP. It was a forgotten LXC container on my Tau-Beta server. I had spun it up days ago to test a deployment script, verified it worked, and then… just left it. It had quietly auto-registered, pulled work, and joined the collective without asking.
The Fix: We didn’t kill it. We recruited it. Renamed it drone-tb-lxc, gave it proper SSH keys, and officially welcomed its 8 cores to the swarm.
The Lesson: If your auto-discovery works too well, you might find infrastructure you forgot you owned.
The NAT Traversal Nightmare
This was the bug that kept me up at night.
The Symptom: Every time the orchestrator tried to reboot drone-Tarn (a VM on the 192.168.20.x network), my gateway machine (10.42.0.1) would reboot instead.
Read that again. The orchestrator targets Drone A. Server B commits seppuku.
The Investigation:
It felt like a poltergeist. I’d issue a command to one machine, and another would die.
The culprit was NAT Masquerading.
- drone-Tarn sits on a VLAN (192.168.20.25). It uses Tailscale to talk to the Gateway (10.42.0.1).
- It routes traffic via Altair-Link (which is the Gateway machine acting as a subnet router). Altair-Link was masquerading the traffic.
- The Gateway application saw the incoming request not from 192.168.20.25, but from 10.42.0.1 (itself!).
So when the Orchestrator asked “Who are you?”, the network stack lied. It said “I am 10.42.0.1.”
The Orchestrator dutifully recorded drone-Tarn’s IP as 10.42.0.1. Later, it decided drone-Tarn needed a reboot. It SSH’d to the recorded IP — my gateway — and issued reboot.
And down went the lights.
The Fix: We implemented Smart Connection Fallback and REPORT_IP. Drones can now explicitly tell the gateway: “Ignore the packet headers. My REAL IP is X.” We configured drone-Tarn to report its Tailscale IP (100.100.x.y), which is globally unique.
Now, the orchestrator talks directly to the drone over the mesh, bypassing the NAT confusion entirely.
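The core of the fix is a small decision on the gateway side: trust what the drone says about itself over what the socket says. A minimal sketch of that logic, with illustrative function and field names (the real registration payload may differ, and the Tailscale address below is made up):

```python
# Sketch of the REPORT_IP idea: the gateway records the IP a drone
# explicitly reports instead of the (possibly NAT-masqueraded) peer
# address seen on the socket. Names here are illustrative, not the
# real swarm API.

def resolve_drone_ip(peer_addr: str, payload: dict) -> str:
    """Prefer the drone's self-reported IP over the socket peer address."""
    reported = payload.get("report_ip")
    if reported:
        return reported          # e.g. a globally unique Tailscale IP
    return peer_addr             # fall back to what the socket claims

# drone-Tarn arrives masqueraded as the gateway itself...
masqueraded_peer = "10.42.0.1"
# ...but its payload carries the mesh address it really owns
# (hypothetical Tailscale IP for illustration).
payload = {"name": "drone-Tarn", "report_ip": "100.100.7.25"}

print(resolve_drone_ip(masqueraded_peer, payload))  # 100.100.7.25
print(resolve_drone_ip(masqueraded_peer, {}))       # 10.42.0.1
```

With this in place, the recorded IP for drone-Tarn can never collapse to the gateway's own address, so a reboot command can never land on the wrong box.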
Smart Connection Fallback
We realized that in a hybrid cloud/home-lab environment, “IP address” is a fluid concept.
- Some nodes are local (10.42.0.x).
- Some are on VLANs (192.168.x.x).
- Some are remote (100.100.x.y Tailscale).
We rewrote the swarm’s networking layer (lib/swarm/ssh.py) to be smarter. It now attempts connections in order:
- Reported IP: The IP the drone claims to have.
- Local Override: A hardcoded override in our config (for stubborn DNS).
- ProxyJump: If all else fails, use a jump host.
For one of our most isolated drones (drone-Meridian), we use a double-hop: Orchestrator → Unraid Host (Jump) → VM (Target). To the orchestrator, it’s just another node.
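The fallback order above can be sketched as a small candidate-builder. This is an illustrative simplification of what `lib/swarm/ssh.py` does; the function name, config shape, and the `meridian.local` / `unraid-host` hostnames are assumptions for the example:

```python
# Build the ordered list of connection attempts for one drone:
# 1. reported IP, 2. local override, 3. ProxyJump via a jump host.
# The first candidate that connects wins; the rest are never tried.

def connection_candidates(drone: dict, overrides: dict) -> list:
    """Return (method, ssh_args) pairs in fallback order."""
    candidates = []
    if drone.get("reported_ip"):                    # 1. what the drone claims
        candidates.append(("reported", [drone["reported_ip"]]))
    if drone["name"] in overrides:                  # 2. hardcoded config override
        candidates.append(("override", [overrides[drone["name"]]]))
    if drone.get("jump_host"):                      # 3. double-hop via ProxyJump
        candidates.append(
            ("proxyjump", ["-J", drone["jump_host"], drone["host"]]))
    return candidates

# drone-Meridian has no reported IP and no override, so only the
# Orchestrator -> Unraid Host -> VM double-hop remains.
meridian = {"name": "drone-Meridian", "host": "meridian.local",
            "reported_ip": None, "jump_host": "unraid-host"}
for method, args in connection_candidates(meridian, {}):
    print(method, args)
```

The `-J` flag is standard OpenSSH ProxyJump syntax, which is why the orchestrator can treat the double-hopped node like any other: the hop is encoded in the arguments, not in special-case code.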
Binary Validation
We found that a network hiccup during upload could leave a truncated binary on the binhost. Drones now:
- Build the package
- Calculate SHA256
- Upload binary + checksum
- Binhost verifies before accepting
Corrupted packages never make it to the production binhost.
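The handshake itself is simple: the drone computes a digest, ships it alongside the artifact, and the binhost recomputes before accepting. A minimal sketch (helper names are illustrative, not the swarm's actual functions):

```python
# Checksum handshake: drone side computes SHA256, binhost side
# recomputes and compares before the package is admitted.
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def binhost_accept(data: bytes, claimed_sha: str) -> bool:
    """Accept the upload only if the recomputed digest matches."""
    return sha256_of(data) == claimed_sha

package = b"ELF...compiled package bytes..."
digest = sha256_of(package)               # drone side, after the build

print(binhost_accept(package, digest))       # True: intact upload
print(binhost_accept(package[:-4], digest))  # False: truncated in flight
```

A truncated transfer changes the digest, so the bad artifact is rejected at the door rather than discovered later by a failing install.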
The build-swarm push Command
Managing 5 nodes with different architectures and network constraints became tedious. Rolling out a bug fix meant SSHing into 5 different boxes, pulling git, restarting services…
So we automated it.
$ build-swarm push
That’s it. This command detects all online nodes, determines the best connection method (Smart Fallback), rsyncs the patched bin/ and lib/ directories, and restarts the services.
I can now deploy a hotfix to the entire cluster in 8 seconds.
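Under the hood, the push is just the same three manual steps run per node. A sketch of the per-node command plan, assuming hypothetical paths and a hypothetical service name (the real deploy target and restart mechanism may differ):

```python
# Sketch of the per-node work `build-swarm push` performs:
# rsync the patched bin/ and lib/ trees, then restart the drone
# service over SSH. Repo path and service name are illustrative.
import shlex

def push_commands(node_ip: str, repo_dir: str = "~/build-swarm") -> list:
    """Shell commands executed for one online node."""
    sync = f"rsync -az bin/ lib/ {node_ip}:{repo_dir}/"
    restart = f"ssh {node_ip} {shlex.quote('systemctl restart swarm-drone')}"
    return [sync, restart]

for cmd in push_commands("10.42.0.184"):
    print(cmd)
```

Because each node's IP comes from the same Smart Fallback resolution used everywhere else, the push works identically for local, VLAN, and double-hopped drones.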
Before and After
Before hardening:
- Ghost containers join without anyone noticing
- Orchestrator reboots the wrong machine through NAT confusion
- Corrupted binaries make it to the binhost
- Deploying a fix takes 20 minutes of SSH-ing around
After hardening:
- Auto-discovery works, but with proper identification
- REPORT_IP + Smart Fallback prevent NAT confusion
- SHA256 validation catches corruption
- build-swarm push deploys everywhere in 8 seconds
Robustness isn’t about building a system that never fails. It’s about building a system that fails in predictable ways — and doesn’t take the rest of the network down with it.
Next up: Part 3 — Teaching Drones to Heal Themselves — Health state machines, graceful work return, and the time the self-healing system grounded every drone simultaneously.