After three months of development, the Gentoo Build Swarm hits v1.0. It’s one thing to have a working system; it’s another to have one someone else could deploy.
Three months ago, a drone went rogue and failed the same build 500 times in a minute. The orchestrator thought a crashed drone was still working. Jobs sat in limbo for hours. I woke up to broken queues and angry logs.
Today? A drone fails, gets grounded after 5 strikes, its work is reclaimed in 60 seconds, and the swarm keeps building. Self-healing isn’t a feature—it’s the difference between “hobby project” and “infrastructure I trust.”
That’s what v1.0 means.
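The recovery behavior above boils down to two rules: a per-drone strike counter and a grace period before work is reclaimed. A minimal sketch of that logic, assuming hypothetical names like `DroneState` and `record_failure` (the real orchestrator's internals may differ):

```python
import time
from dataclasses import dataclass, field

MAX_STRIKES = 5        # ground a drone after this many failed builds
RECLAIM_AFTER = 60.0   # seconds of silence before its jobs are reclaimed

@dataclass
class DroneState:
    strikes: int = 0
    grounded: bool = False
    last_seen: float = field(default_factory=time.monotonic)

    def record_failure(self) -> None:
        """Count a failed build; ground the drone once it hits the limit."""
        self.strikes += 1
        if self.strikes >= MAX_STRIKES:
            self.grounded = True

    def record_success(self) -> None:
        """A successful build resets the strike counter."""
        self.strikes = 0

    def should_reclaim(self, now: float) -> bool:
        """Work is taken back if the drone is grounded or silent too long."""
        return self.grounded or (now - self.last_seen) > RECLAIM_AFTER
```

Resetting strikes on success is what separates "one flaky build" from the rogue drone that failed 500 times in a minute: only sustained failure grounds a drone.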
The Documentation Sprint
Spent a weekend writing comprehensive docs:
Architecture Overview — How gateway, orchestrator, and drones fit together. State machine diagrams for build jobs. The flow from “package needed” to “binary available.”
Deployment Guide — Step-by-step setup. Install scripts. Configuration templates. What ports to open, what services to enable.
Configuration Reference — Every config option with defaults. Environment variables. Tuning parameters for different hardware profiles.
Troubleshooting Guide — The problems I hit and how to fix them. Drone won’t connect? Check these five things. Builds failing? Here’s the debug process.
API Reference — All endpoints with examples. How to query build status. How to manually trigger rebuilds.
~15,000 words total. Not glamorous, but future-me (and anyone else who touches this system) will thank past-me.
Client Sync Workflow
The `apkg sync` command now handles the full update cycle:

```shell
apkg sync
# 1. Sync Portage tree
# 2. Check for available updates
# 3. Verify binary availability on binhost
# 4. Install updates (binary-only mode)
# 5. Create post-update snapshot
```
If packages aren’t available as binaries, it warns but doesn’t fail:
```
Warning: 3 packages not available as binaries
  - sys-kernel/linux-firmware (too large)
  - app-misc/screen (pending build)
  - dev-util/ctags (blocked)
Run `apkg sync --compile-missing` to build locally.
```
This is the workflow I wanted from the start: type one command, get updated, know exactly what happened.
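Step 3 (verify binary availability) is what makes the warn-but-don't-fail behavior possible: pending updates get split into binhost installs and local-only leftovers before anything is touched. A sketch under assumed names (`plan_sync` and the package atoms are illustrative, not the actual `apkg` API):

```python
def plan_sync(updates, binhost_packages):
    """Split pending updates into binary installs and missing leftovers.

    updates: list of package atoms with newer versions available
    binhost_packages: set of atoms the binhost currently serves
    """
    binary, missing = [], []
    for atom in updates:
        (binary if atom in binhost_packages else missing).append(atom)
    return binary, missing

# Steps 3-4 of `apkg sync`: install what the binhost has, warn about the rest.
binary, missing = plan_sync(
    ["app-misc/screen", "dev-util/ctags", "sys-apps/coreutils"],
    {"sys-apps/coreutils"},
)
```

The `missing` list is exactly what feeds the warning above, and `--compile-missing` would hand it to a local emerge instead.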
Version Tags
Tagged everything as v1.0:
- Gateway: v1.0.0
- Orchestrator: v1.0.0
- Drone: v1.0.0
- apkg: v0.5.0 (still in active development for Nix integration)
What We Built
A distributed compilation system for Gentoo that:
- Parallelizes builds across 5 drones with 66 total CPU cores
- Self-heals from drone failures, network issues, and bad builds
- Monitors itself with a real-time TUI dashboard
- Integrates with client systems via apkg for seamless updates
The power event at the Andromeda site was the real test. Three drones went offline simultaneously. The swarm detected it in 60 seconds, reclaimed 12 jobs, redistributed work to surviving drones, and kept building. When the Andromeda drones came back 25 minutes later, they rejoined automatically.
I didn’t have to touch anything. That’s v1.0.
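The whole recovery hinges on one loop in the orchestrator: drones that stop heartbeating past the timeout have their jobs pushed back onto the shared queue. A minimal sketch, with my own names (`reclaim_jobs`, the dict shapes) standing in for the real implementation:

```python
from typing import Dict, List

HEARTBEAT_TIMEOUT = 60.0  # seconds without a heartbeat before a drone is dead

def reclaim_jobs(last_heartbeat: Dict[str, float],
                 assignments: Dict[str, List[str]],
                 queue: List[str],
                 now: float) -> List[str]:
    """Move jobs from silent drones back onto the shared queue.

    last_heartbeat: drone name -> timestamp of its last heartbeat
    assignments:    drone name -> jobs currently assigned to it
    Returns the list of drones declared offline.
    """
    offline = [d for d, t in last_heartbeat.items()
               if now - t > HEARTBEAT_TIMEOUT]
    for drone in offline:
        queue.extend(assignments.pop(drone, []))  # requeue its work
    return offline
```

A drone that comes back online simply starts heartbeating again and gets handed fresh work, which is why the Andromeda drones could rejoin 25 minutes later without any manual intervention.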
By the Numbers
| Metric | Value |
|---|---|
| Lines of code | ~8,000 (Python) |
| Documentation | ~15,000 words |
| Test coverage | 84% |
| Packages built | 4,700+ |
| Average build rate | 12 packages/hour |
| Uptime since January | 99.2% |
What I Learned
Distributed systems fail in distributed ways. Every component needs to handle the others being down. The orchestrator can’t assume drones are healthy. Drones can’t assume the binhost is reachable. Everything needs fallbacks.
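Most of those fallbacks follow the same shape: retry the preferred path a few times, then degrade gracefully (binhost unreachable → compile locally, and so on). A generic sketch of that pattern; the helper name and parameters are mine, not the project's:

```python
import time

def with_fallback(primary, fallback, retries=3, delay=1.0):
    """Try the primary action a few times, then run the fallback.

    primary and fallback are zero-argument callables; any exception
    from primary counts as a failure.
    """
    for attempt in range(retries):
        try:
            return primary()
        except Exception:
            if attempt < retries - 1:
                time.sleep(delay)
    return fallback()
```

Something like `with_fallback(fetch_binary, compile_locally)` captures the client-side behavior, and the orchestrator-to-binhost path degrades the same way.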
Monitoring is not optional. If you can’t see what the system is doing, you can’t debug it. The TUI monitor went from “nice to have” to “how did I ever run this without it.”
Self-healing saves sanity. Waking up to a recovered system instead of a broken one is worth every line of recovery code.
Documentation is a feature. Three months from now, I won’t remember why I made certain decisions. The docs will.
What’s Next
- v1.1: Prometheus metrics export for Grafana dashboards
- v1.2: Web dashboard (not just TUI)
- v2.0: Multi-architecture support (amd64 + arm64)
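The v1.1 item mostly means exposing numbers the swarm already tracks in Prometheus's text exposition format. A stdlib-only sketch of what that export might look like; the metric names are my guesses, not a committed schema:

```python
def render_metrics(stats: dict) -> str:
    """Render swarm stats in Prometheus text exposition format.

    stats maps metric name -> (help text, current value); everything is
    exposed as a gauge for simplicity in this sketch.
    """
    lines = []
    for name, (help_text, value) in stats.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

metrics = render_metrics({
    "swarm_drones_online": ("Drones currently heartbeating", 5),
    "swarm_build_rate_packages_per_hour": ("Recent build throughput", 12),
})
```

In practice the `prometheus_client` library would handle this (and the HTTP endpoint), but the format itself is simple enough that the sketch shows everything Grafana needs to scrape.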
The swarm is alive, and it’s hungry for packages.
v1.0: shipped.