Build Swarm: When Your Computer Can’t Update for Three Weeks
It was 11 PM on New Year’s Eve.
I was watching 70 packages compile across 4 machines in my basement. The terminal read:
Building: 68
Complete: 2
Estimated: 6h 42m
By morning, everything would be compiled. No fan noise. No thermal throttling. No babysitting.
I had finally solved the problem that makes most people abandon Gentoo within a week: the compile times.
But getting here? That took months.
The Series at a Glance
| Part | What It Covers | Key Theme |
|---|---|---|
| Part 1: The Problem & Architecture (this page) | Why Gentoo compilation doesn’t scale, and the system I built to fix it | Understand the pain, then design the solution |
| Part 2: Hardening Against the Real World | Ghost drones, NAT nightmares, smart connection fallback | Edge cases will find you |
| Part 3: Teaching Drones to Heal Themselves | Health states, graceful work return, the grounding incident | Self-healing isn’t optional |
| Part 4: From Chaos to Production | Cron failures, circuit breakers, code review, overnight builds | Production means “runs without you” |
| Part 5: Lessons Learned | What worked, what didn’t, what I’d do differently | Distributed systems at home scale |
The Problem Nobody Talks About
Here’s the dirty secret of source-based distributions: they’re miserable to maintain.
The Gentoo philosophy is beautiful in theory. Compile everything from source with custom optimizations. Tailor every package to your hardware. Strip out bloat. Achieve the leanest, fastest system possible.
The reality?
$ time emerge -uDN @world
# ...48 hours later...
I timed a full system update on my desktop once. 48 hours. That’s with an i7-4790K running all 8 threads at 100% for two straight days. The office sounded like the machine was trying to achieve liftoff, and the GPU hit 85°C from radiant heat alone.
Now imagine needing to reinstall Firefox because of a security patch. That’s a 2-hour compile every time.
I love Gentoo. I love -march=native. I love USE flags. I love having a system that’s exactly what I want and nothing else. But I was spending more time maintaining the system than using it.
Something had to change.
The Idea: Make Other Computers Do the Work
The concept was simple:
- Compile once on dedicated machines that don’t mind the heat
- Store the results as binary packages
- Install anywhere in minutes instead of hours
This isn’t new. It’s essentially what Arch does with its official binary repos, what Red Hat does with Koji, and what Debian does with its build farms.
The difference? I was going to do it with whatever hardware I had lying around, connected via Tailscale, running on my ISP’s residential connection.
What could go wrong?
The First Attempt (v1): SSH and Hope
Version 1 was ugly. A bash script that parsed `emerge --pretend @world` to get the package list, SSH’d into each “drone” machine, ran `emerge <package>`, and waited.
It worked for about 20 minutes. Then I hit dependency ordering. You can’t build kde-plasma/plasma-workspace before dev-qt/qtbase is done. The bash script didn’t know that. It just fired packages at machines and hoped for the best.
v1 taught me the most important lesson of the project: this is a distributed systems problem, not a scripting problem.
The Architecture
After v1 crashed and burned, I designed a real system.
The Cast of Characters
| Name | Hardware | Cores | Role |
|---|---|---|---|
| drone-Izar | i7 VM on Proxmox | 16 | Primary builder |
| drone-Tarn | Ryzen VM | 14 | Secondary builder |
| drone-Meridian | Docker on Unraid NAS | 24 | Heavy lifter |
| Tau-Beta | Bare-metal desktop | 8 | Backup (Windows dual-boot risk) |
| sweeper-Capella | LXC container | 8 | Cleanup & maintenance |
Total: 70 cores available for parallel compilation.
The Data Flow
┌──────────────────────┐
│ MY DESKTOP │
│ (Package Consumer) │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ GATEWAY │
│ 10.42.0.199:8090 │
│ │
│ • Node registration │
│ • API routing │
│ • Auto-failover │
└──────────┬───────────┘
│
▼
┌──────────────────────┐
│ ORCHESTRATOR │
│ 10.42.0.201:8080 │
│ │
│ • Package queue │
│ • Work assignment │
│ • Build tracking │
└──────────┬───────────┘
│
┌───────────┬───────────┼───────────┬───────────┐
▼ ▼ ▼ ▼ │
┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
│ Drone 1 │ │ Drone 2 │ │ Drone 3 │ │ Drone 4 │ │
│ 16 core │ │ 14 core │ │ 24 core │ │ 8 core │ │
│ (VM) │ │ (VM) │ │ (Docker)│ │ (Bare) │ │
└────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘ │
│ │ │ │ │
└───────────┴───────────┴───────────┘ │
│ │
▼ │
┌──────────────────────┐ │
│ BINHOST │◄──────────┘
│ (nginx) │
└──────────────────────┘
The Build Loop
1. I run `build-swarm fresh` on my desktop
2. The orchestrator syncs Portage on all nodes
3. It calculates which packages need updates
4. Packages enter a queue with dependency ordering (Kahn’s algorithm for topological sorting)
5. Drones poll for work every 30 seconds
6. Each drone claims a package, runs `emerge --buildpkg`, uploads the `.gpkg.tar` to staging, and reports success or failure
7. When all packages complete, I run `build-swarm finalize`
8. The staging directory moves atomically to the production binhost
9. My desktop runs `emerge --usepkg` and gets binaries
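Step 4’s dependency ordering is the heart of the queue. Here is a minimal, standalone sketch of Kahn’s algorithm; the `deps` mapping and package names are illustrative, not the orchestrator’s real data:

```python
from collections import deque

def topo_order(deps: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm: order packages so each appears after its dependencies.

    deps maps package -> set of packages it depends on.
    Raises ValueError if the graph contains a cycle.
    """
    indegree = {pkg: len(d) for pkg, d in deps.items()}
    # Reverse edges: dependency -> packages waiting on it
    dependents: dict[str, list[str]] = {pkg: [] for pkg in deps}
    for pkg, d in deps.items():
        for dep in d:
            dependents[dep].append(pkg)

    ready = deque(pkg for pkg, n in indegree.items() if n == 0)
    order: list[str] = []
    while ready:
        pkg = ready.popleft()
        order.append(pkg)
        for waiter in dependents[pkg]:
            indegree[waiter] -= 1
            if indegree[waiter] == 0:
                ready.append(waiter)

    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# Hypothetical slice of the plasma example from v1:
deps = {
    "dev-qt/qtbase": set(),
    "kde-frameworks/kconfig": {"dev-qt/qtbase"},
    "kde-plasma/plasma-workspace": {"dev-qt/qtbase", "kde-frameworks/kconfig"},
}
print(topo_order(deps))
# ['dev-qt/qtbase', 'kde-frameworks/kconfig', 'kde-plasma/plasma-workspace']
```

This is exactly why v1’s bash script failed: without the indegree bookkeeping, `kde-plasma/plasma-workspace` gets fired at a drone before `dev-qt/qtbase` exists.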
The Failures (And There Were Many)
Failure #1: The Race Condition
Day 3. Two drones claimed the same package. Both compiled it. Both uploaded. The second upload overwrote the first with a corrupted file.
The fix: Atomic package claiming with server-side locking:
def claim_package(self, drone_id: str, package: str) -> bool:
    # Hold the server-side lock so claims are atomic across request threads
    with self.lock:
        if package in self.claimed:
            return False  # Already taken
        self.claimed[package] = drone_id
        return True
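Because every drone races for work, the lock is worth stress-testing. A self-contained sketch, where the `Claims` class is an illustrative stand-in for the orchestrator’s claim table:

```python
import threading

class Claims:
    """Minimal stand-in for the orchestrator's claim table (illustrative)."""
    def __init__(self) -> None:
        self.lock = threading.Lock()
        self.claimed: dict[str, str] = {}  # package -> drone_id

    def claim_package(self, drone_id: str, package: str) -> bool:
        with self.lock:
            if package in self.claimed:
                return False  # already taken by another drone
            self.claimed[package] = drone_id
            return True

claims = Claims()
winners: list[str] = []

def drone(drone_id: str) -> None:
    # Every drone races for the same package; exactly one should win.
    if claims.claim_package(drone_id, "dev-libs/openssl"):
        winners.append(drone_id)

threads = [threading.Thread(target=drone, args=(f"drone-{i}",)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(winners))  # 1
```

Without the `with self.lock:` guard, two threads can both see the package as unclaimed between the check and the write, which is precisely the Day 3 double-upload.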
Failure #2: The Gateway Died
Week 2. The gateway container ran out of memory. None of the drones could register. The orchestrator had no workers. The entire swarm sat idle for 6 hours while I was at work.
The fix: Heartbeat monitoring. If the gateway stops responding, drones cache their last-known orchestrator URL, the orchestrator promotes itself to standalone mode, and I get an alert via Uptime Kuma.
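The drone-side half of that fallback can be sketched as a tiny resolver. The class name, probe interface, and URLs here are my assumptions, not the project’s actual code; a real drone would replace the probe with an HTTP call to the gateway:

```python
from typing import Callable, Optional

class OrchestratorResolver:
    """Ask the gateway where the orchestrator lives; fall back to a cached URL.

    probe_gateway returns the orchestrator URL, or None when the gateway
    is down or unreachable (illustrative stand-in for an HTTP request).
    """
    def __init__(self, probe_gateway: Callable[[], Optional[str]]) -> None:
        self.probe_gateway = probe_gateway
        self.last_known: Optional[str] = None

    def resolve(self) -> Optional[str]:
        url = self.probe_gateway()
        if url is not None:
            self.last_known = url   # remember for the next outage
            return url
        return self.last_known      # gateway dead: use the cached URL

# Simulate one healthy cycle, then a gateway outage:
responses = iter(["http://10.42.0.201:8080", None, None])
resolver = OrchestratorResolver(lambda: next(responses))
print(resolver.resolve())  # http://10.42.0.201:8080  (gateway healthy)
print(resolver.resolve())  # http://10.42.0.201:8080  (gateway down, cached)
```

The key property: a drone that has registered at least once keeps working through a gateway outage instead of sitting idle for 6 hours.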
Failure #3: The Portage Sync Drift
Week 3. Drone A synced Portage. Drone B didn’t. Drone A compiled qt-base-6.7.2. Drone B tried to compile something that depended on qt-base-6.7.1. Dependency collision.
The fix: The orchestrator verifies all nodes are synced to the same Portage timestamp before starting builds:
build-swarm sync-verify
# ✓ drone-Izar: 2026-01-25 08:14:22
# ✓ drone-Tarn: 2026-01-25 08:14:22
# ✗ drone-Meridian: 2026-01-24 16:30:00 ← STALE
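Under the hood, the check only needs to compare Portage timestamps across nodes. A minimal sketch; the function name and the fixed-width timestamp format are assumptions:

```python
def verify_sync(node_timestamps: dict[str, str]) -> list[str]:
    """Return the nodes whose Portage timestamp lags the newest one.

    node_timestamps maps node name -> "YYYY-MM-DD HH:MM:SS"; because the
    format is fixed-width, lexicographic comparison orders correctly.
    """
    newest = max(node_timestamps.values())
    return [node for node, ts in node_timestamps.items() if ts != newest]

stale = verify_sync({
    "drone-Izar": "2026-01-25 08:14:22",
    "drone-Tarn": "2026-01-25 08:14:22",
    "drone-Meridian": "2026-01-24 16:30:00",
})
print(stale)  # ['drone-Meridian']
```

If `stale` is non-empty, the orchestrator refuses to start the build until those nodes re-sync.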
Failure #4: The Bare-Metal Brick
Week 4. Tau-Beta is my one bare-metal drone. It dual-boots Windows and Gentoo. The swarm has an “auto-restart on stuck build” feature. It restarted Tau-Beta.
It booted into Windows.
The fix: AUTO_REBOOT=false in the drone config. Tau-Beta is special.
Failure #5: The Disk Full
Month 2. Drones were keeping old build artifacts. /var/cache/binpkgs filled up. Builds started failing with cryptic “disk write error” messages that took hours to diagnose.
The fix: Auto-cleanup after successful uploads. Drones don’t hoard packages anymore.
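The cleanup itself fits in a few lines. A sketch under stated assumptions: the function name, directory layout, and the uploaded-filename set are illustrative, not the drone service’s real interface:

```python
from pathlib import Path

def cleanup_after_upload(binpkg_dir: Path, uploaded: set[str]) -> int:
    """Delete local .gpkg.tar artifacts that were successfully uploaded.

    binpkg_dir is the drone's local cache (e.g. /var/cache/binpkgs);
    uploaded holds the filenames confirmed received by the binhost.
    Returns the number of files removed.
    """
    removed = 0
    for artifact in binpkg_dir.rglob("*.gpkg.tar"):
        if artifact.name in uploaded:
            artifact.unlink()
            removed += 1
    return removed
```

Running this right after a confirmed upload keeps the cache bounded; the binhost is the single source of truth, so the drone never needs the local copy again.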
The Payoff
Here’s what my system update looks like now:
$ build-swarm fresh
═══ GENTOO BUILD SWARM ═══
Starting fresh build...
✓ Cleared staging directory
✓ Reset orchestrator state
✓ Synced portage on all nodes
✓ Discovered 83 needed packages
Build started. Monitor with: build-swarm monitor
Then I go to bed.
$ build-swarm status
═══ BUILD SWARM STATUS ═══
Gateway: ✓ 10.42.0.199
Orchestrator: ✓ 10.42.0.201 (API Online)
Build Progress:
Needed: 0
Building: 0
Complete: 83
Blocked: 0
═══ NEXT ACTION ═══
All builds complete!
→ Run: build-swarm finalize
Then on my desktop:
$ sudo apkg update
Updating from binhost... [83 packages]
>>> Installing www-client/firefox-133.0.3
>>> Installing app-office/libreoffice-24.8.4
...
Completed in 4m 32s
4 minutes. Not 48 hours. Not even 4 hours. Four minutes.
The Monitor
I built a TUI because I like watching things work:
╔═════════════════════════════════════════════════════════╗
║ GENTOO BUILD SWARM v2.6 ║
╠═════════════════════════════════════════════════════════╣
║ ORCH: 10.42.0.201 (primary) GATE: 10.42.0.199 (ok) ║
╠═════════════════════════════════════════════════════════╣
║ DRONE │ CORES │ STATUS ║
╠═════════════════════════════════════════════════════════╣
║ drone-Izar │ 16 │ Building: dev-libs/openssl ║
║ drone-Tarn │ 14 │ Building: www-client/firefox ║
║ drone-Merid │ 24 │ Building: kde-plasma/plasma... ║
║ Tau-Beta │ 8 │ Idle ║
╠═════════════════════════════════════════════════════════╣
║ QUEUE: 47 │ DONE: 36 │ BLOCKED: 0 │ ETA: 2h 14m ║
╚═════════════════════════════════════════════════════════╝
It’s oddly satisfying to watch package counts climb while doing nothing.
The Stack
| Component | Technology |
|---|---|
| Drones | Python service + OpenRC |
| Orchestrator | Python + Flask API |
| Gateway | Python + Flask |
| Networking | Tailscale mesh VPN |
| Binhost | nginx static file serving |
| Monitoring | Custom TUI + Uptime Kuma |
| Code deploy | Git + SSH |
| Package format | Gentoo .gpkg.tar |
Total lines of Python: ~4,500
Packages compiled while I slept: 2,847 (and counting)
Next up: Part 2 — Hardening Against the Real World — A ghost drone appears from nowhere, the orchestrator reboots the wrong server via NAT confusion, and we learn that “working” and “robust” are two different things.