Build Swarm Part 4: From Chaos to Production

The swarm was hardened. The self-healing worked. The architecture was solid. But “solid” and “production-ready” aren’t the same thing.

Production means it runs without you. Not for a few hours — indefinitely. Through cron failures, network blips, and the thousand small things that go wrong when nobody’s watching.

This is the story of the week it became real.


The Cron Job That Wasn’t

Saturday, 8 AM. No builds had run overnight. The queue was full of needed packages but nothing was moving. All drones showed “Online” but sat idle.

The binhost server runs Gentoo with dcron. When I’d updated the crontab on Friday, I used cronie-style syntax:

# This works in cronie
0 2 * * * /opt/build-swarm/scripts/nightly-update.sh

But dcron doesn’t understand the same format. It silently failed to load the entry.

The fix: Switched to cronie for consistency. Added a health check that verifies cron jobs are actually scheduled. The nightly update script now logs “STARTED” at the beginning so we know it ran.
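A minimal sketch of that scheduling health check, assuming Python on the binhost. The job name and the alerting are illustrative, not the project’s actual code:

```python
import subprocess

EXPECTED_JOB = "nightly-update.sh"  # hypothetical: the fragment we expect in the crontab

def entry_scheduled(crontab_text: str, job_fragment: str) -> bool:
    """Return True if a non-comment crontab line contains the job."""
    return any(
        job_fragment in line
        for line in crontab_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )

def check_crontab(job_fragment: str = EXPECTED_JOB) -> bool:
    """Read the current user's crontab and verify the job is scheduled."""
    result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    if result.returncode != 0:
        return False  # no crontab installed at all — also a failure
    return entry_scheduled(result.stdout, job_fragment)
```

The key point is checking what cron actually loaded (`crontab -l`), not the file you edited — that’s exactly the gap the dcron failure slipped through.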

Lesson: “Silent failure” is the worst failure mode. Test cron changes immediately.


The Runaway Drone

While debugging the cron issue, I noticed something worse: drone-Tarn was in a retry loop of death. It would claim jobs from the queue, fail them instantly due to a configuration issue, then immediately claim them again. This cycle blocked the entire build queue.

The “Split Brain” issue was also real: the orchestrator thought a drone was working, but the drone had actually crashed. Jobs would sit in the delegated state for hours.

Circuit Breakers

Borrowed from microservices architecture:

Five Strikes Rule: If a drone reports 5 consecutive failures, it gets “Grounded” — no new work for 5 minutes.

Auto-Reclaim: When a drone is grounded, any work delegated to it immediately goes back in the needed queue.

Maintenance Loop: Runs every 60 seconds, sweeps for offline drones, reclaims their work.

Auto-Reboot: If enabled, the orchestrator can SSH into a grounded drone and restart it:

ssh drone-Izar "rc-service build-drone restart"
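The grounding logic itself is small. A sketch of how the five-strikes breaker might look inside the orchestrator — class and method names are mine, not the project’s:

```python
import time

GROUND_AFTER = 5      # consecutive failures before grounding
GROUND_SECONDS = 300  # grounded drones get no new work for 5 minutes

class DroneBreaker:
    """Per-drone circuit breaker: five strikes, then grounded."""

    def __init__(self):
        self.failures = 0
        self.grounded_until = 0.0

    def record_success(self):
        # Any success resets the strike count.
        self.failures = 0

    def record_failure(self) -> bool:
        """Record one failure; return True if this one grounds the drone."""
        self.failures += 1
        if self.failures >= GROUND_AFTER:
            self.grounded_until = time.time() + GROUND_SECONDS
            self.failures = 0  # fresh start when the timeout expires
            return True
        return False

    def can_take_work(self) -> bool:
        return time.time() >= self.grounded_until
```

When `record_failure()` returns True, the orchestrator would trigger the auto-reclaim step: pull the drone’s delegated jobs back into the needed queue before anything else can stall on them.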

Self-Healing Goes Live

Three levels of automatic recovery, deployed the same week:

Level 1: Job Reclamation — When a drone goes offline, its delegated work is automatically reclaimed after 60 seconds.

Level 2: Drone Restart — If a drone is grounded (too many failures), the orchestrator can SSH in and restart the service.

Level 3: Full Reboot — For hardware-level issues, the orchestrator can trigger a full system reboot. Gated behind confirmation and only used after multiple restart attempts fail.
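The Level 1 sweep is the workhorse of the three. A sketch of what one pass of the maintenance loop might do — the data structures here are illustrative stand-ins for the real state store:

```python
import time

OFFLINE_AFTER = 60  # seconds without a heartbeat before a drone counts as offline

def reclaim_stale_jobs(drones, delegated, needed, now=None):
    """Move jobs delegated to offline drones back into the needed queue.

    drones:    drone name -> last heartbeat timestamp
    delegated: job id -> drone name currently holding it
    needed:    list of job ids awaiting delegation
    Returns the set of drones judged offline this pass.
    """
    now = time.time() if now is None else now
    offline = {name for name, seen in drones.items()
               if now - seen > OFFLINE_AFTER}
    for job, drone in list(delegated.items()):
        if drone in offline:
            del delegated[job]
            needed.append(job)  # back in the queue for any healthy drone
    return offline
```

Running this every 60 seconds is what turned the power-event timeline below from a multi-hour outage into a three-minute blip.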

The Test

The same day I deployed self-healing, the Andromeda network had a power event. Three drones went offline simultaneously.

  • 15:42 — Power blip at Andromeda site
  • 15:43 — drone-Tarn, drone-Meridian offline
  • 15:44 — Self-healing kicked in, reclaimed 12 jobs
  • 15:45 — Remaining drones picked up the work
  • 16:10 — Andromeda drones came back online
  • 16:11 — Automatic reintegration into swarm

What would have happened before: Jobs stuck for hours. Manual intervention required. Angry developer.


The Code Audit: 25 Bugs in One Night

After the self-healing deployment, I took a full night — 12.5 hours — and audited every line of the build swarm and apkg codebase. Four components, thousands of lines.

apkg Issues Found

  • Hardcoded paths — Several scripts had paths hardcoded instead of using $HOME
  • Missing error handling — SSH failures to binhost weren’t caught gracefully
  • Duplicate code — Package resolution logic was copied in three places

Fixed by moving to XDG paths, adding retry logic with exponential backoff, and creating a PackageResolver class.
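The retry logic for those binhost SSH calls is standard exponential backoff. A sketch, written generically so any flaky call can be wrapped — attempt count and delays are assumptions, not the deployed values:

```python
import time

def with_backoff(fn, attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on exception with exponentially growing delays.

    Delays between attempts: base_delay, 2*base_delay, 4*base_delay, ...
    Re-raises the last exception if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

Usage would look like `with_backoff(lambda: push_to_binhost(pkg))`, where `push_to_binhost` is whatever function wraps the SSH transfer.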

Build Swarm Issues Found

  • Race condition — Two drones could claim the same job if requests hit within milliseconds (fixed with Redis-based locking)
  • Memory leak — Build logs accumulated in memory without cleanup (fixed with log rotation)
  • No graceful shutdown — Killing a drone mid-build left orphaned jobs (fixed with SIGTERM handler that finishes the current build before exit)
  • Recursive retry bug — A failed package would trigger a retry, which would fail, which would trigger another retry, consuming the entire queue
  • State machine inconsistency — Drones could get stuck between states if a health check timed out during a state transition
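The race-condition fix hinges on Redis’s atomic SET with NX (“only set if the key does not exist”): two drones can race, but only one SET succeeds. A sketch, assuming a redis-py-style client; the key naming is illustrative:

```python
def claim_job(r, job_id, drone, ttl=3600):
    """Atomically claim a job; only one drone's claim can win.

    r is a redis-py-style client. nx=True makes the SET succeed only if
    the claim key doesn't already exist; ex=ttl auto-expires the claim so
    a crashed drone can't hold a job forever.
    """
    return bool(r.set(f"claim:{job_id}", drone, nx=True, ex=ttl))
```

The TTL doubles as a crude lease: it also backstops the split-brain problem, since a dead drone’s claim evaporates on its own.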

25 bugs total. All fixed. All deployed before sunrise.

Test coverage: 67% → 84%


The Production Numbers

After the chaos week, the hardening, the self-healing deployment, and the code audit, the swarm ran its first fully automated production build:

Total packages in queue:  475
Packages delegated:       6 (currently building)
Packages received:        157 (completed binaries)
Packages blocked:         1 (dependency on quarantined package)
Successful builds:        159
Failed builds:            0
Success rate:             100%

One hundred percent success rate. On a distributed build system. Across multiple machines. Building Gentoo packages.

I stared at those numbers for a while.

What the Numbers Don’t Show

The 100% success rate comes with context. The quarantine system keeps known-bad packages out of the failure count — they show up as “blocked” or “quarantined” instead of “failed.” The honest metric: 159 packages built successfully out of 159 attempted.

The Edge Cases That Surfaced

ICU slot conflicts — The International Components for Unicode library had version slot conflicts that cascaded through dozens of dependent packages. Required manual package.use intervention.

egl-gbm ABI flags — GPU-related ABI flags needed to match across the entire graphics stack. A mismatch on one drone produced binaries that worked on the drone but crashed on the workstation.

harfbuzz circular dependency — harfbuzz depends on freetype, freetype depends on harfbuzz. Classic Gentoo. The resolver handles it with staged builds, but it took two iterations to get the staging order right.


The User Experience

From my workstation’s perspective, the build swarm is invisible. I run:

emerge --usepkg --usepkgonly --update @world

Portage checks the binhost, finds pre-built packages for everything, and installs them. A 475-package update takes about 15 minutes instead of 3 weeks.

My desktop never compiles. The fans never spin up. The system stays responsive the entire time.

That’s the whole point.


Next up: Part 5 — Lessons Learned — What worked, what didn’t, and what I’d do differently building a distributed system with homelab hardware.