Build Swarm: When Your Computer Can’t Update for Three Weeks

It was 11 PM on New Year’s Eve.

I was watching 70 packages compile across 4 machines in my basement. The terminal read:

Building:  68
Complete:  2
Estimated: 6h 42m

By morning, everything would be compiled. No fan noise. No thermal throttling. No babysitting.

I had finally solved the problem that makes most people abandon Gentoo within a week: the compile times.

But getting here? That took months.


The Series at a Glance

| Part | What It Covers | Key Theme |
|------|----------------|-----------|
| Part 1: The Problem & Architecture (this page) | Why Gentoo compilation doesn’t scale, and the system I built to fix it | Understand the pain, then design the solution |
| Part 2: Hardening Against the Real World | Ghost drones, NAT nightmares, smart connection fallback | Edge cases will find you |
| Part 3: Teaching Drones to Heal Themselves | Health states, graceful work return, the grounding incident | Self-healing isn’t optional |
| Part 4: From Chaos to Production | Cron failures, circuit breakers, code review, overnight builds | Production means “runs without you” |
| Part 5: Lessons Learned | What worked, what didn’t, what I’d do differently | Distributed systems at home scale |

The Problem Nobody Talks About

Here’s the dirty secret of source-based distributions: they’re miserable to maintain.

The Gentoo philosophy is beautiful in theory. Compile everything from source with custom optimizations. Tailor every package to your hardware. Strip out bloat. Achieve the leanest, fastest system possible.

The reality?

$ time emerge -uDN @world
# ...48 hours later...

I timed a full system update on my desktop once. 48 hours. That’s with an i7-4790K running all 8 threads at 100% for two straight days. My office sounded like a jet engine trying to achieve liftoff, and the GPU hit 85°C from radiant heat alone.

Now imagine needing to reinstall Firefox because of a security patch. That’s a 2-hour compile every time.

I love Gentoo. I love -march=native. I love USE flags. I love having a system that’s exactly what I want and nothing else. But I was spending more time maintaining the system than using it.

Something had to change.

The Idea: Make Other Computers Do the Work

The concept was simple:

  1. Compile once on dedicated machines that don’t mind the heat
  2. Store the results as binary packages
  3. Install anywhere in minutes instead of hours
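On the Gentoo side, those three steps map onto standard Portage binhost configuration. A minimal sketch (the repo name, host, and path are placeholders for illustration):

```ini
# Builder side: /etc/portage/make.conf
FEATURES="buildpkg"      # emit a binary package for everything compiled
BINPKG_FORMAT="gpkg"     # modern Gentoo binary package format

# Consumer side: /etc/portage/binrepos.conf
[homelab-binhost]
priority = 10
sync-uri = http://binhost.example.lan/packages
```

With that in place, `emerge --usepkg` on the consumer pulls prebuilt packages from the binhost instead of compiling.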

This isn’t new. It’s basically what Arch does with its official binary repos. What Red Hat does with Koji. What Debian does with its build farms.

The difference? I was going to do it with whatever hardware I had lying around, connected via Tailscale, running on my ISP’s residential connection.

What could go wrong?


The First Attempt (v1): SSH and Hope

Version 1 was ugly. A bash script that parsed emerge --pretend @world to get the package list, SSH’d into each “drone” machine, ran emerge <package>, and waited.

It worked for about 20 minutes. Then I hit dependency ordering. You can’t build kde-plasma/plasma-workspace before dev-qt/qtbase is done. The bash script didn’t know that. It just fired packages at machines and hoped for the best.

v1 taught me the most important lesson of the project: this is a distributed systems problem, not a scripting problem.


The Architecture

After v1 crashed and burned, I designed a real system.

The Cast of Characters

| Name | Hardware | Cores | Role |
|------|----------|-------|------|
| drone-Izar | i7 VM on Proxmox | 16 | Primary builder |
| drone-Tarn | Ryzen VM | 14 | Secondary builder |
| drone-Meridian | Docker on Unraid NAS | 24 | Heavy lifter |
| Tau-Beta | Bare-metal desktop | 8 | Backup (Windows dual-boot risk) |
| sweeper-Capella | LXC container | 8 | Cleanup & maintenance |

Total: 66 cores available for parallel compilation.

The Data Flow

                    ┌──────────────────────┐
                    │     MY DESKTOP       │
                    │  (Package Consumer)  │
                    └──────────┬───────────┘


                    ┌──────────────────────┐
                    │       GATEWAY        │
                    │   10.42.0.199:8090   │
                    │                      │
                    │ • Node registration  │
                    │ • API routing        │
                    │ • Auto-failover      │
                    └──────────┬───────────┘


                    ┌──────────────────────┐
                    │    ORCHESTRATOR      │
                    │   10.42.0.201:8080   │
                    │                      │
                    │ • Package queue      │
                    │ • Work assignment    │
                    │ • Build tracking     │
                    └──────────┬───────────┘

       ┌───────────┬───────────┼───────────┬───────────┐
       ▼           ▼           ▼           ▼           │
  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐      │
  │ Drone 1 │ │ Drone 2 │ │ Drone 3 │ │ Drone 4 │      │
  │ 16 core │ │ 14 core │ │ 24 core │ │  8 core │      │
  │  (VM)   │ │  (VM)   │ │ (Docker)│ │ (Bare)  │      │
  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘      │
       │           │           │           │           │
       └───────────┴───────────┴───────────┘           │
                               │                       │
                               ▼                       │
                    ┌──────────────────────┐           │
                    │       BINHOST        │◄──────────┘
                    │       (nginx)        │
                    └──────────────────────┘

The Build Loop

  1. I run build-swarm fresh on my desktop
  2. The orchestrator syncs Portage on all nodes
  3. It calculates which packages need updates
  4. Packages enter a queue with dependency ordering (Kahn’s algorithm for topological sorting)
  5. Drones poll for work every 30 seconds
  6. Each drone claims a package, runs emerge --buildpkg, uploads the .gpkg.tar to staging, reports success/failure
  7. When all packages complete, I run build-swarm finalize
  8. Staging directory moves atomically to the production binhost
  9. My desktop runs emerge --usepkg and gets binaries
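Step 4 is what killed v1: packages have to enter the queue in an order where every dependency is built before the packages that need it. Kahn’s algorithm does exactly that. A minimal sketch (package names illustrative, not the orchestrator’s actual code):

```python
from collections import deque

def build_order(deps: dict[str, set[str]]) -> list[str]:
    """Kahn's algorithm: topologically sort packages so every
    dependency is built before the packages that depend on it."""
    # in-degree = number of not-yet-built dependencies per package
    indegree = {pkg: len(d) for pkg, d in deps.items()}
    # reverse edges: dependency -> packages waiting on it
    waiting: dict[str, set[str]] = {pkg: set() for pkg in deps}
    for pkg, d in deps.items():
        for dep in d:
            waiting[dep].add(pkg)
    ready = deque(pkg for pkg, n in indegree.items() if n == 0)
    order = []
    while ready:
        pkg = ready.popleft()
        order.append(pkg)
        for dependent in waiting[pkg]:
            indegree[dependent] -= 1
            if indegree[dependent] == 0:
                ready.append(dependent)
    if len(order) != len(deps):
        raise ValueError("dependency cycle detected")
    return order

# qtbase has to come out before plasma-workspace goes in
deps = {
    "dev-qt/qtbase": set(),
    "kde-plasma/plasma-workspace": {"dev-qt/qtbase"},
}
print(build_order(deps))
```

Packages with zero remaining dependencies are “ready” and can be handed to any idle drone, which is what makes the parallelism safe.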

The Failures (And There Were Many)

Failure #1: The Race Condition

Day 3. Two drones claimed the same package. Both compiled it. Both uploaded. The second upload overwrote the first with a corrupted file.

The fix: Atomic package claiming with server-side locking:

def claim_package(self, drone_id: str, package: str) -> bool:
    """Atomically assign a package to a drone; the first claim wins."""
    with self.lock:  # threading.Lock shared across orchestrator threads
        if package in self.claimed:
            return False  # Already taken
        self.claimed[package] = drone_id
        return True
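To see the lock earning its keep, here’s a standalone sketch (a minimal stand-in class, not the orchestrator itself) where two “drones” race for the same package and exactly one wins:

```python
import threading

class Claims:
    """Minimal stand-in for the orchestrator's claim table."""
    def __init__(self):
        self.lock = threading.Lock()
        self.claimed: dict[str, str] = {}

    def claim_package(self, drone_id: str, package: str) -> bool:
        with self.lock:  # check-and-set happens atomically
            if package in self.claimed:
                return False
            self.claimed[package] = drone_id
            return True

claims = Claims()
results: dict[str, bool] = {}

def worker(drone_id: str) -> None:
    results[drone_id] = claims.claim_package(drone_id, "www-client/firefox")

threads = [threading.Thread(target=worker, args=(d,))
           for d in ("drone-Izar", "drone-Tarn")]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(results)  # exactly one True, one False
```

Without the lock, both threads can pass the `in self.claimed` check before either writes, which is precisely the double-build-and-corrupt-upload failure from day 3.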

Failure #2: The Gateway Died

Week 2. The gateway container ran out of memory. None of the drones could register. The orchestrator had no workers. The entire swarm sat idle for 6 hours while I was at work.

The fix: Heartbeat monitoring. If the gateway stops responding, drones cache their last-known orchestrator URL, the orchestrator promotes itself to standalone mode, and I get an alert via Uptime Kuma.

Failure #3: The Portage Sync Drift

Week 3. Drone A synced Portage. Drone B didn’t. Drone A compiled qt-base-6.7.2. Drone B tried to compile something that depended on qt-base-6.7.1. Dependency collision.

The fix: The orchestrator verifies all nodes are synced to the same Portage timestamp before starting builds:

build-swarm sync-verify
# ✓ drone-Izar:     2026-01-25 08:14:22
# ✓ drone-Tarn:     2026-01-25 08:14:22
# ✗ drone-Meridian: 2026-01-24 16:30:00  ← STALE
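The check itself is simple once each node reports its snapshot timestamp: anything behind the newest one is stale. A sketch of the comparison (helper name is mine; timestamps in this format compare correctly as strings):

```python
def find_stale_nodes(timestamps: dict[str, str]) -> list[str]:
    """Return nodes whose Portage snapshot is older than the newest
    one in the swarm; any stale node blocks the build from starting."""
    newest = max(timestamps.values())
    # "YYYY-MM-DD HH:MM:SS" strings sort chronologically, so plain
    # string comparison is enough here
    return sorted(node for node, ts in timestamps.items() if ts < newest)

timestamps = {
    "drone-Izar":     "2026-01-25 08:14:22",
    "drone-Tarn":     "2026-01-25 08:14:22",
    "drone-Meridian": "2026-01-24 16:30:00",
}
print(find_stale_nodes(timestamps))
```

An empty result means every node agrees on the tree and the build can start.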

Failure #4: The Bare-Metal Brick

Week 4. Tau-Beta is my one bare-metal drone. It dual-boots Windows and Gentoo. The swarm has an “auto-restart on stuck build” feature. It restarted Tau-Beta.

It booted into Windows.

The fix: AUTO_REBOOT=false in the drone config. Tau-Beta is special.

Failure #5: The Disk Full

Month 2. Drones were keeping old build artifacts. /var/cache/binpkgs filled up. Builds started failing with cryptic “disk write error” messages that took hours to diagnose.

The fix: Auto-cleanup after successful uploads. Drones don’t hoard packages anymore.
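The sweep can be sketched in a few lines (a hypothetical helper; the real cache nests packages per category, which `rglob` handles):

```python
from pathlib import Path

def purge_uploaded(cache_dir: str, uploaded: set[str]) -> int:
    """Delete binary packages that were already uploaded to the
    binhost so /var/cache/binpkgs can't silently fill the disk."""
    removed = 0
    for pkg in Path(cache_dir).rglob("*.gpkg.tar"):
        if pkg.name in uploaded:
            pkg.unlink()
            removed += 1
    return removed
```

Run after every confirmed upload, this keeps each drone’s cache holding only in-flight artifacts instead of months of history.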


The Payoff

Here’s what my system update looks like now:

$ build-swarm fresh

═══ GENTOO BUILD SWARM ═══

Starting fresh build...
  ✓ Cleared staging directory
  ✓ Reset orchestrator state
  ✓ Synced portage on all nodes
  ✓ Discovered 83 needed packages

Build started. Monitor with: build-swarm monitor

Then I go to bed.

$ build-swarm status

═══ BUILD SWARM STATUS ═══

Gateway:      ✓ 10.42.0.199
Orchestrator: ✓ 10.42.0.201 (API Online)

Build Progress:
  Needed:    0
  Building:  0
  Complete:  83
  Blocked:   0

═══ NEXT ACTION ═══
All builds complete!
  → Run: build-swarm finalize

Then on my desktop:

$ sudo apkg update
Updating from binhost... [83 packages]
>>> Installing www-client/firefox-133.0.3
>>> Installing app-office/libreoffice-24.8.4
...
Completed in 4m 32s

4 minutes. Not 48 hours. Not even 4 hours. Four minutes.


The Monitor

I built a TUI because I like watching things work:

╔═════════════════════════════════════════════════════════╗
║              GENTOO BUILD SWARM v2.6                    ║
╠═════════════════════════════════════════════════════════╣
║ ORCH: 10.42.0.201 (primary)  GATE: 10.42.0.199 (ok)    ║
╠═════════════════════════════════════════════════════════╣
║ DRONE        │ CORES │ STATUS                           ║
╠═════════════════════════════════════════════════════════╣
║ drone-Izar   │  16   │ Building: dev-libs/openssl       ║
║ drone-Tarn   │  14   │ Building: www-client/firefox     ║
║ drone-Merid  │  24   │ Building: kde-plasma/plasma...   ║
║ Tau-Beta     │   8   │ Idle                             ║
╠═════════════════════════════════════════════════════════╣
║ QUEUE: 47 │ DONE: 36 │ BLOCKED: 0 │ ETA: 2h 14m         ║
╚═════════════════════════════════════════════════════════╝

It’s oddly satisfying to watch package counts climb while doing nothing.


The Stack

| Component | Technology |
|-----------|------------|
| Drones | Python service + OpenRC |
| Orchestrator | Python + Flask API |
| Gateway | Python + Flask |
| Networking | Tailscale mesh VPN |
| Binhost | nginx static file serving |
| Monitoring | Custom TUI + Uptime Kuma |
| Code deploy | Git + SSH |
| Package format | Gentoo .gpkg.tar |

Total lines of Python: ~4,500
Packages compiled while I slept: 2,847 (and counting)


Next up: Part 2 — Hardening Against the Real World — A ghost drone appears from nowhere, the orchestrator reboots the wrong server via NAT confusion, and we learn that “working” and “robust” are two different things.