Build Swarm Part 2: Hardening Against the Real World
We built a distributed compilation swarm. It worked. It was fast. It felt like the future.
Then, inevitably, the edge cases started crawling out of the woodwork.
After running the Build Swarm for a few weeks, we discovered that “working” and “robust” are two very different things. This is the story of a ghost drone, a NAT masquerade that confused a server into suicide, and the hardening of the fleet.
The Mystery Drone
I opened the swarm monitor on Tuesday morning, coffee in hand. The counts looked good. 58 cores online.
Wait. 58? I only own 50 cores.
I scanned the drone list.
drone-Izar (VM). drone-Tarn (VM). drone-Meridian (Docker).
And then: drone-lxc (10.42.0.184).
I didn’t remember building a drone at 10.42.0.184.
It was happily compiling packages, uploading artifacts, and reporting success. It was a model citizen. But it was a ghost.
I traced the IP. It was a forgotten LXC container on my Tau-Beta server. I had spun it up days ago to test a deployment script, verified it worked, and then… just left it. It had quietly auto-registered, pulled work, and joined the collective without asking.
The Fix: We didn’t kill it. We recruited it. Renamed it drone-tb-lxc, gave it proper SSH keys, and officially welcomed its 8 cores to the swarm.
The Lesson: If your auto-discovery works too well, you might find infrastructure you forgot you owned.
The NAT Traversal Nightmare
This was the bug that kept me up at night.
The Symptom: Every time the orchestrator tried to reboot drone-Tarn (a VM on the 192.168.20.x network), my gateway machine (10.42.0.1) would reboot instead.
Read that again. The orchestrator targets Drone A. Server B commits seppuku.
The Investigation:
It felt like a poltergeist. I’d issue a command to one machine, and another would die.
The culprit was NAT Masquerading.
- drone-Tarn sits on a VLAN (192.168.20.25). It uses Tailscale to talk to the Gateway (10.42.0.1).
- It routes traffic via Altair-Link (which is the Gateway machine acting as a subnet router). Altair-Link was masquerading the traffic.
- The Gateway application saw the incoming request not from 192.168.20.25, but from 10.42.0.1 (itself!).
So when the Orchestrator asked “Who are you?”, the network stack lied. It said “I am 10.42.0.1.”
The Orchestrator dutifully recorded drone-Tarn’s IP as 10.42.0.1. Later, it decided drone-Tarn needed a reboot. It SSH’d to the recorded IP — my gateway — and issued reboot.
And down went the lights.
The Fix: We implemented Smart Connection Fallback and REPORT_IP. Drones can now explicitly tell the gateway: “Ignore the packet headers. My REAL IP is X.” We configured drone-Tarn to report its Tailscale IP (100.100.x.y), which is globally unique.
Now, the orchestrator talks directly to the drone over the mesh, bypassing the NAT confusion entirely.
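The core of the fix is a small decision on the gateway side: trust what the drone says about itself over what the socket says. A minimal sketch of that logic, with illustrative function and field names (the real registration payload may differ, and the Tailscale address below is made up):

```python
# Sketch of the REPORT_IP idea: the gateway records the IP a drone
# explicitly reports instead of the (possibly NAT-masqueraded) peer
# address seen on the socket. Names here are illustrative, not the
# real swarm API.

def resolve_drone_ip(peer_addr: str, payload: dict) -> str:
    """Prefer the drone's self-reported IP over the socket peer address."""
    reported = payload.get("report_ip")
    if reported:
        return reported          # e.g. a globally unique Tailscale IP
    return peer_addr             # fall back to what the socket claims

# drone-Tarn arrives masqueraded as the gateway itself...
masqueraded_peer = "10.42.0.1"
# ...but its payload carries the mesh address it really owns
# (hypothetical Tailscale IP for illustration).
payload = {"name": "drone-Tarn", "report_ip": "100.100.7.25"}

print(resolve_drone_ip(masqueraded_peer, payload))  # 100.100.7.25
print(resolve_drone_ip(masqueraded_peer, {}))       # 10.42.0.1
```

With this in place, the recorded IP for drone-Tarn can never collapse to the gateway's own address, so a reboot command can never land on the wrong box.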
Smart Connection Fallback
We realized that in a hybrid cloud/home-lab environment, “IP address” is a fluid concept.
- Some nodes are local (10.42.0.x).
- Some are on VLANs (192.168.x.x).
- Some are remote (100.100.x.y Tailscale).
We rewrote the swarm’s networking layer (lib/swarm/ssh.py) to be smarter. It now attempts connections in order:
- Reported IP: The IP the drone claims to have.
- Local Override: A hardcoded override in our config (for stubborn DNS).
- ProxyJump: If all else fails, use a jump host.
For one of our most isolated drones (drone-Meridian), we use a double-hop: Orchestrator → Unraid Host (Jump) → VM (Target). To the orchestrator, it’s just another node.
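The fallback order above can be sketched as a small candidate-builder. This is an illustrative simplification of what `lib/swarm/ssh.py` does; the function name, config shape, and the `meridian.local` / `unraid-host` hostnames are assumptions for the example:

```python
# Build the ordered list of connection attempts for one drone:
# 1. reported IP, 2. local override, 3. ProxyJump via a jump host.
# The first candidate that connects wins; the rest are never tried.

def connection_candidates(drone: dict, overrides: dict) -> list:
    """Return (method, ssh_args) pairs in fallback order."""
    candidates = []
    if drone.get("reported_ip"):                    # 1. what the drone claims
        candidates.append(("reported", [drone["reported_ip"]]))
    if drone["name"] in overrides:                  # 2. hardcoded config override
        candidates.append(("override", [overrides[drone["name"]]]))
    if drone.get("jump_host"):                      # 3. double-hop via ProxyJump
        candidates.append(
            ("proxyjump", ["-J", drone["jump_host"], drone["host"]]))
    return candidates

# drone-Meridian has no reported IP and no override, so only the
# Orchestrator -> Unraid Host -> VM double-hop remains.
meridian = {"name": "drone-Meridian", "host": "meridian.local",
            "reported_ip": None, "jump_host": "unraid-host"}
for method, args in connection_candidates(meridian, {}):
    print(method, args)
```

The `-J` flag is standard OpenSSH ProxyJump syntax, which is why the orchestrator can treat the double-hopped node like any other: the hop is encoded in the arguments, not in special-case code.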
Binary Validation
We found that a network hiccup during upload could leave a truncated binary on the binhost. Drones now:
- Build the package
- Calculate SHA256
- Upload binary + checksum
- Binhost verifies before accepting
Corrupted packages never make it to the production binhost.
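The handshake itself is simple: the drone computes a digest, ships it alongside the artifact, and the binhost recomputes before accepting. A minimal sketch (helper names are illustrative, not the swarm's actual functions):

```python
# Checksum handshake: drone side computes SHA256, binhost side
# recomputes and compares before the package is admitted.
import hashlib

def sha256_of(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def binhost_accept(data: bytes, claimed_sha: str) -> bool:
    """Accept the upload only if the recomputed digest matches."""
    return sha256_of(data) == claimed_sha

package = b"ELF...compiled package bytes..."
digest = sha256_of(package)               # drone side, after the build

print(binhost_accept(package, digest))       # True: intact upload
print(binhost_accept(package[:-4], digest))  # False: truncated in flight
```

A truncated transfer changes the digest, so the bad artifact is rejected at the door rather than discovered later by a failing install.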
The build-swarm push Command
Managing 5 nodes with different architectures and network constraints became tedious. Rolling out a bug fix meant SSHing into 5 different boxes, pulling git, restarting services…
So we automated it.
$ build-swarm push
That’s it. This command detects all online nodes, determines the best connection method (Smart Fallback), rsyncs the patched bin/ and lib/ directories, and restarts the services.
I can now deploy a hotfix to the entire cluster in 8 seconds.
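Under the hood, the push is just the same three manual steps run per node. A sketch of the per-node command plan, assuming hypothetical paths and a hypothetical service name (the real deploy target and restart mechanism may differ):

```python
# Sketch of the per-node work `build-swarm push` performs:
# rsync the patched bin/ and lib/ trees, then restart the drone
# service over SSH. Repo path and service name are illustrative.
import shlex

def push_commands(node_ip: str, repo_dir: str = "~/build-swarm") -> list:
    """Shell commands executed for one online node."""
    sync = f"rsync -az bin/ lib/ {node_ip}:{repo_dir}/"
    restart = f"ssh {node_ip} {shlex.quote('systemctl restart swarm-drone')}"
    return [sync, restart]

for cmd in push_commands("10.42.0.184"):
    print(cmd)
```

Because each node's IP comes from the same Smart Fallback resolution used everywhere else, the push works identically for local, VLAN, and double-hopped drones.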
Before and After
Before hardening:
- Ghost containers join without anyone noticing
- Orchestrator reboots the wrong machine through NAT confusion
- Corrupted binaries make it to the binhost
- Deploying a fix takes 20 minutes of SSH-ing around
After hardening:
- Auto-discovery works, but with proper identification
- REPORT_IP + Smart Fallback prevent NAT confusion
- SHA256 validation catches corruption
- build-swarm push deploys everywhere in 8 seconds
Robustness isn’t about building a system that never fails. It’s about building a system that fails in predictable ways — and doesn’t take the rest of the network down with it.
Next up: Part 3 — Teaching Drones to Heal Themselves — Health state machines, graceful work return, and the time the self-healing system grounded every drone simultaneously.