Managing a distributed build system is fun until a drone goes rogue and fails the same build 500 times in a minute. This is the story of a week spent making the swarm actually production-ready.
Saturday Morning: The Cron Job That Wasn't
Saturday, 8 AM. No builds had run overnight. The queue was full of needed packages but nothing was moving. All drones showed "Online" but sat idle.
The binhost server runs Gentoo with dcron. When I'd updated the crontab on Friday, I used cronie-style syntax:
# This works in cronie
0 2 * * * /opt/build-swarm/scripts/nightly-update.sh
But dcron doesn't understand the same format. It silently failed to load the entry.
The fix: Switched the binhost to cronie for consistency. Added a health check that verifies cron jobs are actually scheduled. The nightly update script now logs "STARTED" at the beginning so we know it ran.
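A minimal sketch of such a health check, assuming the job can be identified by its script name (the real check may do more):

# check_cron.py: alert if the nightly job is missing from the crontab (illustrative)
import subprocess
import sys

def cron_entry_present(needle):
    # crontab -l prints the current user's crontab; non-zero exit means none exists
    result = subprocess.run(["crontab", "-l"], capture_output=True, text=True)
    return result.returncode == 0 and needle in result.stdout

if not cron_entry_present("nightly-update.sh"):
    print("ALERT: nightly-update.sh is not scheduled", file=sys.stderr)
    sys.exit(1)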
Lesson: "Silent failure" is the worst failure mode. Test cron changes immediately.
The Runaway Drone Problem
While debugging the cron issue, I noticed something worse: drone-Tarn was in a retry loop of death. It would claim jobs from the queue, fail them instantly due to a configuration issue, then immediately claim them again. This cycle blocked the entire build queue.
The "split brain" issue was also real: the orchestrator thought a drone was working, but the drone had actually crashed. Jobs would sit in the delegated state for hours.
The Solution: Circuit Breakers
Borrowed from microservices architecture:
Three Strikes Rule (well, Five): If a drone reports 5 consecutive failures, it gets "grounded" and receives no new work for 5 minutes.
Auto-Reclaim: When a drone is grounded, any work delegated to it immediately goes back in the needed queue.
Maintenance Loop: Runs every 60 seconds, sweeps for offline drones, reclaims their work.
Auto-Reboot: If enabled, the orchestrator can SSH into a grounded drone and restart it:
ssh drone-Izar "rc-service build-drone restart"
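A minimal sketch of the grounding logic, assuming a per-drone breaker object inside the orchestrator (names and constants are illustrative):

import time

FAILURE_LIMIT = 5        # consecutive failures before grounding
GROUND_SECONDS = 300     # grounded drones sit out for 5 minutes

class DroneBreaker:
    def __init__(self):
        self.consecutive_failures = 0
        self.grounded_until = 0.0

    def record_result(self, success):
        # Any success resets the count; the 5th straight failure trips the breaker
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= FAILURE_LIMIT:
            self.grounded_until = time.time() + GROUND_SECONDS

    def can_take_work(self):
        return time.time() >= self.grounded_until

When record_result trips the breaker, the orchestrator also runs the auto-reclaim step, so the grounded drone's jobs go straight back into the needed queue.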
Upgrading the Monitor
The original build-swarm monitor was ugly. Fixed-width columns that broke on long package names. No colors. No resource tracking. Just "Online" or "Offline".
Resource Tracking
Drones now report resource usage in their heartbeat:
{
  "drone_id": "drone-Izar",
  "status": "building",
  "current_task": "sys-devel/llvm-19.1.7",
  "cpu_percent": 87.3,
  "memory_percent": 45.2
}
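On the drone side, a payload like this can be assembled with the psutil library; a minimal sketch, assuming the heartbeat sender takes a JSON string:

import json
import psutil

def build_heartbeat(drone_id, status, current_task=None):
    # cpu_percent(interval=1) samples CPU usage over one second
    return json.dumps({
        "drone_id": drone_id,
        "status": status,
        "current_task": current_task,
        "cpu_percent": psutil.cpu_percent(interval=1),
        "memory_percent": psutil.virtual_memory().percent,
    })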
Visual Improvements
The monitor now uses Python's rich library:
┌───────────────────────────────────────────────────────────────┐
│ Build Swarm Monitor    Needed: 45    Built: 312    Rate: 8/hr │
├────────────────┬──────────┬───────────────────────┬─────┬─────┤
│ DRONE          │ STATUS   │ TASK                  │ CPU │ RAM │
├────────────────┼──────────┼───────────────────────┼─────┼─────┤
│ drone-Izar     │ BUILDING │ sys-devel/llvm-19.1.7 │ 87% │ 45% │
│ drone-Tarn     │ BUILDING │ dev-qt/qtbase-6.8.2   │ 92% │ 38% │
│ drone-Tau-Beta │ IDLE     │ -                     │  2% │ 12% │
│ drone-Meridian │ BUILDING │ www-client/firefox    │ 95% │ 62% │
└────────────────┴──────────┴───────────────────────┴─────┴─────┘
Green for building, yellow for idle, red for grounded. Build watching became a spectator sport.
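Under the hood this is a plain rich Table rebuilt on each refresh. A stripped-down sketch, with the status-to-color mapping as described above and field names matching the heartbeat example:

from rich.console import Console
from rich.table import Table

STYLE = {"BUILDING": "green", "IDLE": "yellow", "GROUNDED": "red"}

def render(drones):
    # drones: list of heartbeat dicts, one per drone
    table = Table(title="Build Swarm Monitor")
    for col in ("DRONE", "STATUS", "TASK", "CPU", "RAM"):
        table.add_column(col)
    for d in drones:
        table.add_row(
            d["drone_id"],
            d["status"],
            d.get("current_task") or "-",
            f"{d['cpu_percent']:.0f}%",
            f"{d['memory_percent']:.0f}%",
            style=STYLE.get(d["status"], "white"),  # colorize the whole row
        )
    Console().print(table)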
Code Review: Finding the Gremlins
Took a systematic pass through both the apkg and build-swarm codebases.
apkg Issues
- Hardcoded paths: several scripts had paths hardcoded instead of using $HOME
- Missing error handling: SSH failures to the binhost weren't caught gracefully
- Duplicate code: package resolution logic was copied in three places
Fixed by moving to XDG paths, adding retry logic with exponential backoff, and creating a PackageResolver class.
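The backoff wrapper is nothing exotic; a sketch of the idea, with the command shape and delay constants as illustrative assumptions:

import subprocess
import time

def ssh_with_retry(host, command, attempts=4):
    # Retry SSH to the binhost with exponentially growing delays: 1s, 2s, 4s, ...
    for attempt in range(attempts):
        result = subprocess.run(["ssh", host, command], capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        time.sleep(2 ** attempt)
    raise RuntimeError(f"SSH to {host} failed after {attempts} attempts")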
Build Swarm Issues
- Race condition: two drones could claim the same job if requests hit within milliseconds
- Memory leak: build logs accumulated in memory without cleanup
- No graceful shutdown: killing a drone mid-build left orphaned jobs
Fixed with Redis-based locking for job claims, log rotation with configurable retention, and a SIGTERM handler that finishes the current build before exit.
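The locking relies on Redis's atomic SET with NX and EX, so a claim can never be taken twice and expires on its own if the holder dies. A minimal sketch (key naming and TTL are assumptions):

import redis

r = redis.Redis()  # the orchestrator's Redis instance

def claim_job(job_id, drone_id, ttl=3600):
    # SET key value NX EX ttl: succeeds only if no other drone holds the claim;
    # the TTL releases the claim automatically if the drone dies mid-build.
    return bool(r.set(f"claim:{job_id}", drone_id, nx=True, ex=ttl))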
Test coverage: 67% → 84%
Self-Healing Goes Live
Built three levels of automatic recovery:
Level 1: Job Reclamation. When a drone goes offline, its delegated work is automatically reclaimed after 60 seconds.
Level 2: Drone Restart. If a drone is grounded (too many failures), the orchestrator can SSH in and restart the service.
Level 3: Full Reboot. For hardware-level issues, the orchestrator can trigger a full system reboot. Gated behind confirmation and only used after multiple restart attempts fail.
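The escalation ladder itself is short. A sketch, using the rc-service command shown earlier and treating the retry count as a placeholder:

import subprocess

def recover(host, restart_attempts=3, allow_reboot=False):
    # Level 2: try restarting the drone service over SSH a few times
    for _ in range(restart_attempts):
        if subprocess.run(["ssh", host, "rc-service build-drone restart"]).returncode == 0:
            return True
    # Level 3: full reboot, only if explicitly enabled by the operator
    if allow_reboot:
        return subprocess.run(["ssh", host, "reboot"]).returncode == 0
    return False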
The Test
The same day I deployed self-healing, the Andromeda network had a power event. Three drones went offline simultaneously.
What happened:
- 15:42 → Power blip at Andromeda site
- 15:43 → drone-Tarn, drone-Meridian offline
- 15:44 → Self-healing kicked in, reclaimed 12 jobs
- 15:45 → Remaining drones picked up the work
- 16:10 → Andromeda drones came back online
- 16:11 → Automatic reintegration into swarm
What would have happened before: Jobs stuck for hours. Manual intervention required. Angry developer.
Automation Polish
Portage Sync Tolerance
The nightly sync occasionally failed due to mirror timeouts. Added retry logic with fallback:
emerge --sync || emerge-webrsync
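Wrapped in retries, the sync step looks roughly like this (attempt count and delays are illustrative):

import subprocess
import time

def sync_portage(attempts=3):
    # Mirror timeouts are usually transient, so retry emerge --sync a few times...
    for attempt in range(attempts):
        if subprocess.run(["emerge", "--sync"]).returncode == 0:
            return True
        time.sleep(60 * (attempt + 1))
    # ...then fall back to emerge-webrsync, which fetches a full snapshot over HTTP
    return subprocess.run(["emerge-webrsync"]).returncode == 0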
Binary Validation
Found that network hiccups during upload could result in truncated binaries. Drones now:
- Build the package
- Calculate SHA256
- Upload binary + checksum
- The binhost verifies the checksum before accepting
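On the binhost side, verification is a chunked SHA256 comparison; a sketch, assuming a .sha256 sidecar file sits next to each uploaded binary:

import hashlib
from pathlib import Path

def verify_upload(binary: Path) -> bool:
    # Recompute the digest in 1 MB chunks so multi-GB binaries don't load into RAM
    digest = hashlib.sha256()
    with binary.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    # Sidecar follows sha256sum's "hash  filename" format; compare the hash field
    expected = Path(str(binary) + ".sha256").read_text().split()[0]
    return digest.hexdigest() == expected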
Better Status Reporting
The apkg status command now shows everything at a glance:
$ apkg status
Sync: 2026-01-23 02:00 (12 hours ago)
World: 264 packages
Binary: 98.2% available (4,637/4,722)
Swarm: 4/5 drones online, 12 packages building
Before and After
Before:
- Drone fails → Retry loop of death
- Drone offline → Job stuck for 3 hours
- Observability → "It's probably working"
After:
- Drone fails → Grounded → Reboots → Recovers
- Drone offline → Work reclaimed in 60s
- Observability → "drone-Izar is building mesa-25.3.3 at 87% CPU"
The swarm went from "needs babysitting" to "runs itself." That's the difference between a prototype and production.
The swarm is watching itself now.