Teaching Drones to Heal Themselves: Self-Recovery in the Build Swarm

A drone disappears at 2 AM. No error. No goodbye. Just… gone.

The orchestrator keeps waiting. The package queue stalls. I wake up to a build that’s been stuck for six hours because drone-Tarn ran out of memory compiling dev-qt/qtwebengine and silently choked to death.

This kept happening. Different drones, different failures. Kernel panics. Network blips. Disk pressure. The result was always the same: a drone vanishes, the orchestrator has no idea, and packages sit in limbo until I manually intervene.

I got tired of being the swarm’s immune system.


The Silent Failure Problem

The Build Swarm was designed around a simple contract: the orchestrator assigns work, drones compile it, drones report back. Clean and elegant.

Except the contract assumed drones would always report back. In the real world, drones fail in the middle of things. They don’t send a polite “I’m dying” message. They just stop responding. And the orchestrator, bless its deterministic heart, would sit there holding a package assignment open indefinitely, waiting for a corpse to phone home.

The symptoms were brutal:

Build Progress:
  Needed:    0
  Building:  3
  Complete:  146
  Failed:    0

Zero needed, zero failed, three stuck “building” forever. Those three packages were assigned to a drone that had been dead for four hours. The swarm looked 98% done but would never finish without me SSH’ing in and kicking things.

Distributed systems don’t fail gracefully by default. You have to teach them how.


Health States: Knowing When You’re Sick

The first thing I needed was self-awareness. Drones needed to know they were in trouble before they crashed.

I added a health state machine to every drone. Four states, continuously evaluated:

from enum import Enum

class DroneHealth(Enum):
    HEALTHY    = "healthy"      # All systems nominal
    DEGRADED   = "degraded"     # Warning signs present
    CRITICAL   = "critical"     # Unable to compile reliably
    RECOVERING = "recovering"   # Coming back from failure

Every 15 seconds, the drone checks itself. Memory usage. Disk space. Load average. Recent build success rate. It’s a self-diagnostic loop that runs alongside the build process.

HEALTHY is straightforward — memory under 85%, disk under 90%, builds completing normally. The drone accepts work, compiles packages, life is good.

DEGRADED is where it gets interesting. Memory climbing past 85%. Disk over 90%. The last three builds took twice as long as expected. Something is wrong but not catastrophic. A degraded drone starts shedding work. It stops accepting new packages but finishes whatever is currently on the workbench.

CRITICAL means the drone can’t reliably compile anymore. Repeated build failures. Out-of-memory kills. The emerge process crashing. At this point, the drone enters a defensive posture — it stops everything and starts returning work to the orchestrator.

RECOVERING is the state after a drone comes back from critical. It runs a self-check sequence: verify portage tree is intact, confirm disk space, test that emerge --info returns clean output. Only after passing the checks does it transition back to HEALTHY and start accepting work again.
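For illustration, the RECOVERING self-check might look something like the following: a minimal sketch of a drone method using shutil and subprocess, with assumed paths (the Gentoo tree and build scratch locations) rather than the drone’s actual configuration.

import os
import shutil
import subprocess

def run_recovery_checks(self) -> bool:
    """Self-check before leaving RECOVERING (sketch; paths are assumptions)."""
    # Portage tree present and readable?
    if not os.path.isdir("/var/db/repos/gentoo/metadata"):
        return False

    # Enough free disk in the build scratch area (assumed path)?
    usage = shutil.disk_usage("/var/tmp/portage")
    if usage.free / usage.total < 0.10:
        return False

    # Does emerge itself run cleanly against the installed tree?
    result = subprocess.run(["emerge", "--info"],
                            capture_output=True, text=True, timeout=120)
    return result.returncode == 0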

HEALTHY ──[warning thresholds]──> DEGRADED
DEGRADED ──[continued degradation]──> CRITICAL
CRITICAL ──[conditions improve]──> RECOVERING
RECOVERING ──[self-check passes]──> HEALTHY
DEGRADED ──[conditions improve]──> HEALTHY
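The evaluation behind those transitions can stay small. Here is a sketch of the idea, assuming psutil for the metrics (the real drone may read /proc directly); the failure count, load threshold, and disk path are illustrative, while the 85% memory and 90% disk numbers come from above.

import os
import psutil  # assumption: metrics via psutil; the real drone may read /proc

def evaluate_health(recent_failures: int, oom_killed: bool) -> DroneHealth:
    """Map current system metrics onto a health state (sketch)."""
    mem = psutil.virtual_memory().percent         # % of RAM in use
    disk = psutil.disk_usage("/var/tmp").percent  # % of build scratch space used (assumed path)
    load = os.getloadavg()[0] / (os.cpu_count() or 1)  # 1-minute load per core

    # Hard failures trump everything: repeated build failures or an OOM kill.
    if oom_killed or recent_failures >= 3:
        return DroneHealth.CRITICAL

    # Warning thresholds: memory past 85%, disk past 90%, sustained overload.
    if mem > 85 or disk > 90 or load > 2.0:
        return DroneHealth.DEGRADED

    return DroneHealth.HEALTHY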

The goal isn’t perfection. It’s awareness. A drone that knows it’s sick can do something about it. A drone that doesn’t just dies silently and takes its assigned packages into the void.


Graceful Work Return: The Key Innovation

This was the part that changed everything.

Previously, when a drone failed, its assigned packages were just… lost. The orchestrator had to time out (which took ages) or I had to manually reclaim them. Now, a drone in trouble actively hands its work back.

The sequence:

A drone detects it’s entering CRITICAL state. Instead of crashing, it pauses accepting new packages from the queue, finishes whatever is currently mid-compile if the system is stable enough to complete it, then gathers every uncompleted package in its local queue and sends them back to the orchestrator with a returned status.

def return_work(self, reason: str):
    """Return all uncompleted packages to orchestrator."""
    for pkg in self.local_queue:
        if pkg.status != 'complete':
            self.orchestrator.return_package(
                package=pkg.atom,
                drone_id=self.drone_id,
                status='returned',
                reason=reason
            )
    self.local_queue.clear()

On the orchestrator side, returned packages go back into the main queue. But there’s a twist — the orchestrator records which drone returned them in a returned_by field. When reassigning, it avoids sending the same package back to the drone that just gave up on it.

This prevents a nasty loop where a package that causes OOM on drone-Tarn (14 cores, 16GB RAM) keeps getting reassigned to drone-Tarn and failing over and over. Instead, it gets routed to drone-Meridian (24 cores, 64GB RAM) which can handle the memory-hungry compile without breaking a sweat.
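On paper it’s a tiny amount of code. Here is a sketch of both halves, matching the return_package call above but with hypothetical orchestrator internals (queue, healthy_drones, free_memory_mb are illustrative names, not the actual API).

def return_package(self, package: str, drone_id: str, status: str, reason: str):
    """Handle a package a drone handed back (sketch; field names illustrative)."""
    pkg = self.queue.get(package)
    pkg.status = 'needed'
    pkg.returned_by.add(drone_id)   # remember who gave up on it
    self.log(f"[{status.upper()}] {package} by {drone_id}: {reason}")

def pick_drone(self, pkg):
    """Choose a drone for a package, skipping ones that already returned it."""
    candidates = [d for d in self.healthy_drones()
                  if d.drone_id not in pkg.returned_by]
    if not candidates:
        return None   # every healthy drone has bounced this one; leave it queued
    # Prefer the drone with the most free memory for the retry.
    return max(candidates, key=lambda d: d.free_memory_mb)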

The result: packages find the drone that can actually build them. Work migrates naturally toward capable hardware. No manual intervention. No stale queues.


The Grounding Incident

I deployed v2.5.0 with the self-healing system to all five drones and both orchestrators on a Tuesday night. Felt good about it. The code was clean, the logic was sound, the tests passed.

Wednesday morning I checked the monitor.

drone-Izar:     GROUNDED
drone-Tarn:     GROUNDED
drone-Meridian: GROUNDED
drone-Tau-Beta: GROUNDED
sweeper-Capella: GROUNDED

Every single drone had been grounded. The orchestrator had locked them all out.

[GROUNDED] drone-Tarn - too many failures, reclaiming work
[GROUNDED] drone-Izar - too many failures, reclaiming work
[GROUNDED] drone-Meridian - too many failures, reclaiming work

My beautiful self-healing system had detected that every drone was failing, correctly grounded them all to prevent further damage, and then had nothing left to assign work to. The swarm was technically functioning perfectly — protecting itself from bad drones — except all the drones were “bad” and no packages were being built.

Helpful.

The root cause was embarrassingly mundane. The build queue was stale. It had been generated days earlier when the portage tree listed sys-apps/systemd-utils-6.22.0. But the driver system (my desktop, running the build-swarm CLI) still had version 6.20.0 installed. The dependency resolver kept trying to build packages against 6.22.0, but the emerge environment couldn’t reconcile the slot conflicts.

Every drone was hitting the same wall:

!!! Multiple package instances within a single package slot have been pulled
!!! into the dependency graph, resulting in a slot conflict:

sys-apps/systemd-utils:0
  (sys-apps/systemd-utils-6.22.0 pulled in by ...)
  (sys-apps/systemd-utils-6.20.0 installed)

Autounmask requirements, slot conflicts, dependency resolution failures. Every single package that touched systemd-utils would fail. And since half the tree depends on it transitively, that was… most of the queue.

Five failures and you’re grounded. Every drone hit five failures within minutes.
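The grounding rule itself is nothing exotic, just a consecutive-failure counter. A simplified sketch, using the threshold of five from above and hypothetical orchestrator fields:

GROUNDING_THRESHOLD = 5  # consecutive failures before a drone is locked out

def record_result(self, drone_id: str, success: bool):
    """Track build outcomes and ground a drone that keeps failing (sketch)."""
    drone = self.drones[drone_id]
    if success:
        drone.consecutive_failures = 0
        return
    drone.consecutive_failures += 1
    if drone.consecutive_failures >= GROUNDING_THRESHOLD:
        drone.grounded = True
        self.reclaim_work(drone_id)   # pull its assignments back into the queue
        self.log(f"[GROUNDED] {drone_id} - too many failures, reclaiming work")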

The fix was one command:

build-swarm fresh

Nuked the entire stale queue. Resynced portage across all nodes. Rediscovered packages from scratch. The fresh run found 149 packages to build, all with consistent dependency trees.

Within ten minutes, every drone was ungrounded and actively building. No more slot conflicts. No more grounding.

Stale state is the silent killer of distributed systems. The self-healing code was doing exactly what it should — protecting the swarm from repeated failures. The bug wasn’t in the self-healing. It was in the state I fed it. Garbage in, grounded out.


The Tailscale Connectivity Detour

While debugging the grounding issue, I noticed drone-Tarn and orchestrator-Tarn were unreachable over the mesh. Not grounded — genuinely offline. I couldn’t SSH to them at all.

Checked the host. The tailscaled daemon wasn’t running.

* status: stopped
* start-stop-daemon: failed to start `/usr/sbin/tailscaled'

The init script depended on net.lxcbr0 — the LXC bridge interface. The bridge existed, but the init system was trying to create it again on boot, and you can’t enslave a bridge to a bridge:

Cannot enslave a bridge to a bridge

So net.lxcbr0 failed to start, which meant tailscaled never started, which meant the Andromeda-network drones dropped off the mesh entirely. They were running fine locally on 192.168.20.x but invisible to the orchestrator on the Milky Way side.

Had to start tailscaled manually on Tarn-Host and then verify the mesh reconnected. Not a self-healing problem per se, but a reminder that your fancy application-level recovery means nothing if the network layer is broken underneath it.


The LXC Container Surprise

While I was already SSH’d into the remote network fixing Tailscale, I checked on sweeper-Capella. It’s an LXC container running as a utility drone — handles cleanup tasks, package verification, that sort of thing.

It was stopped.

The container had AUTOSTART=1 in its config, but it hadn’t survived the last host reboot. The lxcbr0 bridge was up but had no IP assigned, so the container couldn’t get network access even after starting.

ip addr add 10.0.3.1/24 dev lxcbr0
lxc-start -n sweeper-Capella

Two commands and it was back. But this is the kind of thing that erodes trust in “it’ll just come back up after a reboot.” Autostart means nothing if the network prerequisite isn’t there.

I added the bridge IP assignment to the host’s local.start script. One more line of boot-time infrastructure that shouldn’t be necessary but absolutely is.


Rolling Out v2.5.0

Deploying the self-healing update across the fleet was its own adventure.

The orchestrators (orchestrator-Izar and orchestrator-Tarn) got their code via rsync. Straightforward — push the updated lib/ and bin/ directories, restart the service.

The drones used deploy.sh, a script that handles the full update cycle: stop the drone service, rsync new code, verify the install, restart. Four of the five drones took the update without drama.

drone-Meridian was special. It’s a QEMU VM running inside the Unraid box at the Andromeda site. To reach it, I had to SSH through Tailscale to the Unraid host, then hop into the VM. The deploy script handles ProxyJump configs for exactly this scenario, but I still hold my breath every time I deploy to a machine that requires two network hops to reach.

All five drones. Both orchestrators. v2.5.0 across the board.


The Fresh Build

After clearing the stale state and getting all nodes back online, the fresh build was almost anticlimactic.

$ build-swarm fresh

Syncing portage tree across all nodes...
  drone-Izar:      synced (2026-02-04 22:14:08)
  drone-Tarn:      synced (2026-02-04 22:14:08)
  drone-Meridian:  synced (2026-02-04 22:14:08)
  drone-Tau-Beta:  synced (2026-02-04 22:14:08)
  sweeper-Capella: synced (2026-02-04 22:14:08)

Discovering packages...
  149 packages need building

Distributing to 5 drones (66 cores)...
  Build started.

149 packages. Five healthy drones. Consistent portage timestamps everywhere. No slot conflicts. No grounding.

I watched the monitor for a few minutes — packages ticking off, drones churning through their queues, the self-healing system quietly monitoring health states and finding nothing to complain about. Boring. Exactly how infrastructure should be.


What I Actually Learned

Self-healing isn’t about preventing failures. It’s about making failures recoverable. Drones will always crash. Memory will always run out. Networks will always hiccup. The question is whether the system can put itself back together without me waking up at 3 AM.

Graceful work return is worth more than any amount of monitoring. Knowing a drone is dead is useful. Having the drone hand its work back before it dies is transformative. The difference between “I need to manually reclaim 12 packages” and “the swarm redistributed 12 packages while I was asleep.”

Stale state will activate your safety systems against you. The grounding logic was correct. The queue data was wrong. When your protection mechanisms are working perfectly but your inputs are garbage, you get a swarm that protects itself into total paralysis. Always validate state freshness before trusting automated decisions.
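In practice that lesson became a guard: check how old the queue is relative to the last portage sync before trusting it. A hedged sketch with assumed paths and limits, not the actual build-swarm code:

import os
import time

MAX_QUEUE_AGE_S = 24 * 3600  # assumption: treat a queue older than a day as stale

def queue_is_fresh(queue_generated_at: float,
                   portage_dir: str = "/var/db/repos/gentoo") -> bool:
    """Refuse to trust a build queue that predates the last portage sync."""
    # timestamp.chk is written when the Gentoo tree syncs; its mtime marks the last sync.
    portage_synced_at = os.path.getmtime(
        os.path.join(portage_dir, "metadata", "timestamp.chk"))
    if queue_generated_at < portage_synced_at:
        return False                       # tree moved since the queue was built
    return (time.time() - queue_generated_at) < MAX_QUEUE_AGE_S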

Infrastructure layers fail independently. Tailscale down because of an LXC bridge init ordering bug. Container autostart failing because the bridge has no IP. Application-level self-healing means nothing if the VM can’t reach the network. Recovery has to happen at every layer, not just the one you wrote last week.

One command should fix the common case. build-swarm fresh became the reset button. Stale queue? Fresh. Weird dependency conflicts? Fresh. Drones all grounded for mysterious reasons? Fresh. Having a nuclear option that reliably returns you to a known-good state is worth more than trying to make incremental recovery handle every edge case.


The self-healing system is running in production now. It’s not dramatic. Drones occasionally enter DEGRADED, shed some work, and recover on their own. The orchestrator tracks health states and routes work around struggling nodes. Packages find their way to hardware that can build them.

Most of the time, I don’t even notice it working. That’s the point.

This post is part of the Build Swarm series. Previously: Hardening the Build Swarm covered ghost drones and NAT confusion. Next up: the overnight build review and what 66 cores can accomplish while you sleep.