Build Swarm: From Chaos to Production

Managing a distributed build system is fun until a drone goes rogue and fails the same build 500 times in a minute. This is the story of a week spent making the swarm actually production-ready.

Saturday Morning: The Cron Job That Wasn't

Saturday, 8 AM. No builds had run overnight. The queue was full of needed packages but nothing was moving. All drones showed "Online" but sat idle.

The binhost server runs Gentoo with dcron. When I'd updated the crontab on Friday, I used cronie-style syntax:

# This works in cronie
0 2 * * * /opt/build-swarm/scripts/nightly-update.sh

But dcron doesn't understand the same format. It silently failed to load the entry.

The fix: Switched the binhost to cronie for consistency. Added a health check that verifies cron jobs are actually scheduled. The nightly update script now logs "STARTED" at the beginning so we know it ran.

Lesson: "Silent failure" is the worst failure mode. Test cron changes immediately.
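The health check boils down to asking cron itself whether the job is loaded. A minimal sketch (hypothetical helper names; the real check also verifies the "STARTED" log line):

```python
import subprocess

def entry_present(crontab_text: str, marker: str) -> bool:
    """True if any non-comment line of the crontab text mentions `marker`."""
    return any(
        marker in line
        for line in crontab_text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    )

def cron_job_scheduled(marker: str) -> bool:
    """Ask `crontab -l` whether the entry actually got loaded."""
    try:
        out = subprocess.run(
            ["crontab", "-l"], capture_output=True, text=True, check=True
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return False
    return entry_present(out, marker)
```

Running `cron_job_scheduled("nightly-update.sh")` in the health check would have caught dcron's silent rejection the same day.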


The Runaway Drone Problem

While debugging the cron issue, I noticed something worse: drone-Tarn was in a retry loop of death. It would claim jobs from the queue, fail them instantly due to a configuration issue, then immediately claim them again. This cycle blocked the entire build queue.

The โ€œSplit Brainโ€ issue was also real: The orchestrator thought a drone was working, but the drone had actually crashed. Jobs would sit in delegated state for hours.

The Solution: Circuit Breakers

Borrowed from microservices architecture:

Three Strikes Rule (well, Five): If a drone reports 5 consecutive failures, it gets "Grounded" – no new work for 5 minutes.

Auto-Reclaim: When a drone is grounded, any work delegated to it immediately goes back in the needed queue.

Maintenance Loop: Runs every 60 seconds, sweeps for offline drones, reclaims their work.

Auto-Reboot: If enabled, the orchestrator can SSH into a grounded drone and restart it:

ssh drone-Izar "rc-service build-drone restart"
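The grounding logic itself is small. A sketch of the failure-counting side (hypothetical class and constant names, with an injectable clock so the cool-down is testable):

```python
import time

GROUND_THRESHOLD = 5     # consecutive failures before grounding
GROUND_SECONDS = 5 * 60  # how long a grounded drone sits out

class CircuitBreaker:
    """Per-drone circuit breaker: N consecutive failures trips it open."""

    def __init__(self, now=time.monotonic):
        self._now = now
        self._failures = {}        # drone_id -> consecutive failure count
        self._grounded_until = {}  # drone_id -> monotonic deadline

    def record_success(self, drone_id):
        self._failures[drone_id] = 0

    def record_failure(self, drone_id):
        n = self._failures.get(drone_id, 0) + 1
        self._failures[drone_id] = n
        if n >= GROUND_THRESHOLD:
            self._grounded_until[drone_id] = self._now() + GROUND_SECONDS

    def is_grounded(self, drone_id):
        deadline = self._grounded_until.get(drone_id)
        if deadline is None:
            return False
        if self._now() >= deadline:  # cool-down elapsed: reset and retry
            del self._grounded_until[drone_id]
            self._failures[drone_id] = 0
            return False
        return True
```

The orchestrator simply skips grounded drones when handing out jobs, which is what breaks the retry loop of death: the misconfigured drone fails five times, then sits out while healthy drones drain the queue.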

Upgrading the Monitor

The original build-swarm monitor was ugly. Fixed-width columns that broke on long package names. No colors. No resource tracking. Just โ€œOnlineโ€ or โ€œOfflineโ€.

Resource Tracking

Drones now report resource usage in their heartbeat:

{
  "drone_id": "drone-Izar",
  "status": "building",
  "current_task": "sys-devel/llvm-19.1.7",
  "cpu_percent": 87.3,
  "memory_percent": 45.2
}
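On the drone side the payload is just assembled and posted each cycle. A sketch of the builder (in practice the numbers would come from something like psutil's `cpu_percent()` and `virtual_memory()`; the field names match the example above, the `timestamp` field is an assumption):

```python
import json
import time

def heartbeat_payload(drone_id, status, task, cpu_percent, memory_percent):
    """Assemble the heartbeat JSON a drone reports to the orchestrator."""
    return json.dumps({
        "drone_id": drone_id,
        "status": status,
        "current_task": task,
        "cpu_percent": round(cpu_percent, 1),
        "memory_percent": round(memory_percent, 1),
        "timestamp": time.time(),  # lets the orchestrator detect staleness
    })
```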

Visual Improvements

The monitor now uses Python's rich library:

┌─────────────────────────────────────────────────────────────────┐
│ Build Swarm Monitor          Needed: 45  Built: 312  Rate: 8/hr │
├─────────────────────────────────────────────────────────────────┤
│ DRONE          │ STATUS   │ TASK                  │ CPU  │ RAM  │
├────────────────┼──────────┼───────────────────────┼──────┼──────┤
│ drone-Izar     │ BUILDING │ sys-devel/llvm-19.1.7 │ 87%  │ 45%  │
│ drone-Tarn     │ BUILDING │ dev-qt/qtbase-6.8.2   │ 92%  │ 38%  │
│ drone-Tau-Beta │ IDLE     │ -                     │ 2%   │ 12%  │
│ drone-Meridian │ BUILDING │ www-client/firefox    │ 95%  │ 62%  │
└────────────────┴──────────┴───────────────────────┴──────┴──────┘

Green for building, yellow for idle, red for grounded. Build watching became a spectator sport.
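The rendering is a straightforward rich Table. A sketch of one frame (hypothetical function names; the real monitor wraps this in a `rich.live.Live` refresh loop):

```python
from rich.console import Console
from rich.table import Table

STATUS_STYLE = {"BUILDING": "green", "IDLE": "yellow", "GROUNDED": "red"}

def build_table(drones):
    """Turn a list of heartbeat dicts into one monitor frame."""
    table = Table(title="Build Swarm Monitor")
    for col in ("DRONE", "STATUS", "TASK", "CPU", "RAM"):
        table.add_column(col)
    for d in drones:
        status = d["status"].upper()
        style = STATUS_STYLE.get(status, "white")
        table.add_row(
            d["drone_id"],
            f"[{style}]{status}[/{style}]",  # color-code by status
            d.get("current_task") or "-",
            f"{d['cpu_percent']:.0f}%",
            f"{d['memory_percent']:.0f}%",
        )
    return table

if __name__ == "__main__":
    Console().print(build_table([
        {"drone_id": "drone-Izar", "status": "building",
         "current_task": "sys-devel/llvm-19.1.7",
         "cpu_percent": 87.3, "memory_percent": 45.2},
    ]))
```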


Code Review: Finding the Gremlins

Took a systematic pass through both apkg and the build swarm codebase.

apkg Issues

  • Hardcoded paths – Several scripts had paths hardcoded instead of using $HOME
  • Missing error handling – SSH failures to binhost weren't caught gracefully
  • Duplicate code – Package resolution logic was copied in three places

Fixed by moving to XDG paths, adding retry logic with exponential backoff, and creating a PackageResolver class.
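The backoff part is the classic pattern: retry with exponentially growing sleeps, then give up. A sketch (hypothetical helper; the real version wraps the SSH call to the binhost):

```python
import time

def with_backoff(fn, attempts=4, base=1.0, sleep=time.sleep):
    """Wrap `fn` so transient failures are retried with exponential
    backoff: base, 2*base, 4*base, ... seconds between attempts."""
    def wrapped(*args, **kwargs):
        for i in range(attempts):
            try:
                return fn(*args, **kwargs)
            except Exception:
                if i == attempts - 1:
                    raise  # out of retries: surface the real error
                sleep(base * 2 ** i)
    return wrapped
```

`sleep` is injectable so tests don't actually wait; production code just uses the default `time.sleep`.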

Build Swarm Issues

  • Race condition – Two drones could claim the same job if requests hit within milliseconds
  • Memory leak – Build logs accumulated in memory without cleanup
  • No graceful shutdown – Killing a drone mid-build left orphaned jobs

Fixed with Redis-based locking for job claims, log rotation with configurable retention, and a SIGTERM handler that finishes the current build before exit.
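The race fix leans on Redis's atomic SET with NX ("set only if not exists"): two drones can race to claim a job, but Redis guarantees exactly one write wins. A sketch (hypothetical key naming; `r` is anything exposing redis-py's `set(name, value, nx=, ex=)`):

```python
def claim_job(r, job_id, drone_id, ttl=3600):
    """Atomically claim `job_id` for `drone_id`.

    SET NX succeeds only if no claim key exists yet, so exactly one
    drone wins even if requests land within the same millisecond.
    The EX ttl means a crashed drone's claim expires on its own.
    Returns True iff this drone won the claim.
    """
    return bool(r.set(f"claim:{job_id}", drone_id, nx=True, ex=ttl))
```

In production `r` would be a `redis.Redis(...)` client; the test below uses a tiny in-memory fake with the same `set` semantics.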

Test coverage: 67% → 84%


Self-Healing Goes Live

Built three levels of automatic recovery:

Level 1: Job Reclamation – When a drone goes offline, its delegated work is automatically reclaimed after 60 seconds.

Level 2: Drone Restart – If a drone is grounded (too many failures), the orchestrator can SSH in and restart the service.

Level 3: Full Reboot – For hardware-level issues, the orchestrator can trigger a full system reboot. Gated behind confirmation and only used after multiple restart attempts fail.
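Level 1 is just the maintenance loop comparing heartbeat timestamps against a deadline. A sketch of one sweep (hypothetical function name; the real loop runs every 60 seconds):

```python
import time

OFFLINE_AFTER = 60  # seconds without a heartbeat before a drone is "offline"

def reclaim_stale(delegated, last_heartbeat, needed, now=None):
    """One maintenance sweep: move jobs delegated to offline drones
    back onto the needed queue.

    delegated:      {job_id: drone_id}, mutated in place
    last_heartbeat: {drone_id: unix timestamp of last heartbeat}
    needed:         list of job_ids awaiting a builder
    """
    now = time.time() if now is None else now
    for job_id, drone_id in list(delegated.items()):
        if now - last_heartbeat.get(drone_id, 0) > OFFLINE_AFTER:
            del delegated[job_id]
            needed.append(job_id)
    return needed
```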

The Test

Same day I deployed self-healing, the Andromeda network had a power event. Three drones went offline simultaneously.

What happened:

  • 15:42 – Power blip at Andromeda site
  • 15:43 – drone-Tarn, drone-Meridian offline
  • 15:44 – Self-healing kicked in, reclaimed 12 jobs
  • 15:45 – Remaining drones picked up the work
  • 16:10 – Andromeda drones came back online
  • 16:11 – Automatic reintegration into swarm

What would have happened before: Jobs stuck for hours. Manual intervention required. Angry developer.


Automation Polish

Portage Sync Tolerance

The nightly sync occasionally failed due to mirror timeouts. Added retry logic with fallback:

emerge --sync || emerge-webrsync
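The one-liner covers the fallback; the retry part can be sketched as a small wrapper (hypothetical helper, with an injectable `run` so it can be exercised without Portage installed):

```python
import subprocess
import time

def sync_with_fallback(run=None, retries=3, delay=30, sleep=time.sleep):
    """Try `emerge --sync` up to `retries` times with growing pauses,
    then fall back to emerge-webrsync (HTTP snapshot) if the rsync
    mirrors keep timing out. Returns True on any successful sync."""
    if run is None:
        run = lambda cmd: subprocess.run(cmd).returncode
    for attempt in range(retries):
        if run(["emerge", "--sync"]) == 0:
            return True
        sleep(delay * (attempt + 1))  # back off between mirror attempts
    return run(["emerge-webrsync"]) == 0
```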

Binary Validation

Found that network hiccups during upload could result in truncated binaries. Drones now:

  1. Build the package
  2. Calculate SHA256
  3. Upload binary + checksum
  4. Binhost verifies before accepting
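The binhost-side check in steps 2–4 is a streamed SHA256 compare. A sketch (hypothetical function names; the real transport for the checksum is assumed to ride alongside the upload):

```python
import hashlib

def sha256_of(path):
    """Stream-hash a binary package file without loading it into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def accept_upload(binary_path, claimed_sha256):
    """Binhost-side gate: a truncated or corrupted upload hashes
    differently from what the drone computed, so it gets rejected."""
    return sha256_of(binary_path) == claimed_sha256
```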

Better Status Reporting

The apkg status command now shows everything at a glance:

$ apkg status
Sync: 2026-01-23 02:00 (12 hours ago)
World: 264 packages
Binary: 98.2% available (4,637/4,722)
Swarm: 4/5 drones online, 12 packages building

Before and After

Before:

  • Drone fails → Retry loop of death
  • Drone offline → Job stuck for 3 hours
  • Observability → "It's probably working"

After:

  • Drone fails → Grounded → Reboots → Recovers
  • Drone offline → Work reclaimed in 60s
  • Observability → "drone-Izar is building mesa-25.3.3 at 87% CPU"

The swarm went from "needs babysitting" to "runs itself." That's the difference between a prototype and production.

The swarm is watching itself now.