Making the Build Swarm Bulletproof: Lessons from the Edge

We built a distributed compilation swarm. It worked. It was fast. It felt like the future.

Then, inevitably, the edge cases started crawling out of the woodwork.

After running the Gentoo Build Swarm for a few weeks, we discovered that “working” and “robust” are two very different things. This is the story of a ghost drone, a NAT masquerade that confused a server into suicide, and the hardening of the fleet.


1. The Mystery Drone

I opened the swarm monitor on Tuesday morning, coffee in hand. The counts looked good. 58 cores online.

Wait. 58? I only own 50 cores.

I scanned the drone list. drone-Izar (VM). drone-Tarn (VM). dr-mm2 (Docker). And then: drone-lxc (10.42.0.184).

I didn’t remember building a drone at 10.42.0.184.

It was happily compiling packages, uploading artifacts, and reporting success. It was a model citizen. But it was a ghost.

I traced the IP. It was a forgotten LXC container on my Tau-Ceti-Lab server. I had spun it up days ago to test a deployment script, verified it worked, and then… just left it. It had quietly auto-registered, pulled work, and joined the collective without asking.

The Fix: We didn’t kill it. We recruited it. We renamed it drone-tb-lxc, gave it proper SSH keys, and officially welcomed its 8 cores to the swarm.

The Lesson: If your auto-discovery works too well, you might find infrastructure you forgot you owned.
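
For the curious, the auto-registration that made this possible is not magic. A drone is just a loop that announces itself to the gateway and asks for work. Here is a minimal sketch, assuming an HTTP gateway with a /register endpoint; the URL, endpoint, and payload fields are illustrative, not the swarm's actual protocol:

    # Illustrative drone heartbeat loop: any machine that can reach the
    # gateway and run this will quietly join the swarm. The endpoint and
    # payload fields are assumptions, not the real protocol.
    import json
    import os
    import socket
    import time
    import urllib.request

    GATEWAY = os.environ.get("SWARM_GATEWAY", "http://10.42.0.1:8080")

    def announce_forever():
        payload = {"hostname": socket.gethostname(), "cores": os.cpu_count()}
        while True:
            req = urllib.request.Request(
                f"{GATEWAY}/register",
                data=json.dumps(payload).encode(),
                headers={"Content-Type": "application/json"},
            )
            urllib.request.urlopen(req, timeout=10)   # gateway marks us online
            time.sleep(30)                            # silence means "offline"

Which is exactly how a container you spun up for a quick test can end up compiling packages indefinitely.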


2. The NAT Traversal Nightmare (Why Orchestrators Need Glasses)

This was the bug that kept me up at night.

The Symptom: Every time the orchestrator tried to reboot drone-Tarn (a VM on the 192.168.10.x network), my gateway machine (10.42.0.1) would reboot instead.

Read that again. The orchestrator targets Drone A. Server B commits seppuku.

The Investigation: It felt like a poltergeist. I’d issue a command to one machine, and another would die.

The culprit was NAT Masquerading.

  1. drone-Tarn sits on a VLAN (192.168.10.25). It uses Tailscale to talk to the Gateway (10.42.0.1).
  2. It routes traffic via Altair-Link (which is the Gateway machine acting as a subnet router).
  3. Altair-Link was masquerading the traffic, rewriting the source address of every packet to its own.
  4. The Gateway application saw the incoming request not from 192.168.10.25, but from 10.42.0.1 (itself!).

So when the Orchestrator asked “Who are you?”, the network stack lied. It said “I am 10.42.0.1.” The Orchestrator dutifully recorded drone-Tarn’s IP as 10.42.0.1.

Later, the Orchestrator decided drone-Tarn needed a reboot. It SSH’d to the recorded IP (10.42.0.1)—my gateway—and issued reboot. And down went the lights.

The Fix: We implemented Smart Connection Fallback and REPORT_IP. Drones can now explicitly tell the gateway: “Ignore the packet headers. My REAL IP is X.” We configured drone-Tarn to report its Tailscale IP (100.100.x.y), which is stable and unique across the whole tailnet.

Now, the orchestrator talks directly to the drone over the mesh, bypassing the NAT confusion entirely.
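
In code, the whole mechanism is tiny. A minimal sketch, assuming the heartbeat payload carries a report_ip field and the drone reads its address from a REPORT_IP environment variable (both names are mine, for illustration only):

    import os

    def build_heartbeat(payload: dict) -> dict:
        # Drone side: e.g. REPORT_IP=100.100.x.y set in the drone's environment.
        if os.environ.get("REPORT_IP"):
            payload["report_ip"] = os.environ["REPORT_IP"]
        return payload

    def resolve_drone_ip(peer_addr: str, payload: dict) -> str:
        # Gateway side: the socket's peer address may really be a masquerading
        # subnet router, so a self-reported address always wins.
        return payload.get("report_ip", peer_addr)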


3. Smart Connection Fallback

We realized that in a hybrid cloud/home-lab environment, “IP address” is a fluid concept.

  • Some nodes are local (10.42.0.x).
  • Some are on VLANs (192.168.x.x).
  • Some are remote (100.100.x.y Tailscale).

We rewrote the swarm’s networking layer (lib/swarm/ssh.py) to be smarter. It now attempts connections in order:

  1. Reported IP: The IP the drone claims to have.
  2. Local Override: A hardcoded override in our config (for stubborn DNS).
  3. ProxyJump: If all else fails, use a jump host.

For one of our most isolated drones (drone-Meridian), we use a double-hop: Orchestrator -> Unraid Host (Jump) -> VM (Target)

It works seamlessly. To the orchestrator, it’s just another node.
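
Roughly, the fallback boils down to trying addresses in order of trust and only reaching for a jump host when nothing else answers. A simplified sketch; the Node fields and helper name are my own, not the actual contents of lib/swarm/ssh.py:

    import subprocess
    from dataclasses import dataclass

    @dataclass
    class Node:
        name: str
        reported_ip: str | None = None     # 1. what the drone claims via REPORT_IP
        override_ip: str | None = None     # 2. hardcoded in config for stubborn DNS
        jump_hosts: tuple[str, ...] = ()   # 3. e.g. ("root@unraid-host",)

    def ssh_run(node: Node, command: str) -> bool:
        """Try every known way to reach a node, in order of preference."""
        for addr in filter(None, (node.reported_ip, node.override_ip)):
            cmd = ["ssh", "-o", "ConnectTimeout=5", f"root@{addr}", command]
            if subprocess.run(cmd).returncode == 0:
                return True
        if node.jump_hosts:  # last resort: ProxyJump through one or more hops
            cmd = ["ssh", "-o", "ConnectTimeout=5",
                   "-J", ",".join(node.jump_hosts), f"root@{node.name}", command]
            return subprocess.run(cmd).returncode == 0
        return False

The drone-Meridian double-hop is just the ProxyJump branch with the Unraid host as the jump.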


4. The build-swarm push Command

Managing 5 nodes with different architectures and network constraints became tedious. Rolling out a bug fix meant SSHing into 5 different boxes, running git pull, restarting services…

So we automated it.

$ build-swarm push

That’s it. This command:

  1. Detects all online nodes.
  2. Determines the best connection method (Smart Fallback).
  3. Rsyncs the patched bin/ and lib/ directories to each node.
  4. Restarts the services.

I can now deploy a hotfix to the entire cluster in 8 seconds.
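
Under the hood it is not much more than a loop over the reachable nodes. A stripped-down sketch, assuming the code lands in /opt/build-swarm and the drone runs as a systemd service called build-swarm-drone (both placeholders, not the real paths):

    import subprocess

    def push(nodes):
        """nodes: (name, address) pairs already resolved via Smart Fallback."""
        for name, addr in nodes:
            print(f"pushing to {name} ({addr})")
            # Sync the patched bin/ and lib/ trees (no trailing slashes, so the
            # directories themselves are copied into the deploy root).
            subprocess.run(["rsync", "-az", "bin", "lib",
                            f"root@{addr}:/opt/build-swarm/"], check=True)
            # Bounce the drone so it picks up the new code.
            subprocess.run(["ssh", f"root@{addr}",
                            "systemctl restart build-swarm-drone"], check=True)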


Conclusion

Robustness isn’t about building a system that never fails. It’s about building a system that fails in predictable ways—and doesn’t take the rest of the network down with it.

We moved from “it works on my machine” to “it works on 5 machines across 2 networks and a VPN tunnel.”

The swarm is now processing builds faster and more reliably than ever. And drone-Tarn hasn’t rebooted the gateway since Tuesday.

Next up: We’re looking at adding a dashboard widget for “Total Watts Consumed” so I can explain the electric bill.