How I Solved Gentoo’s 40-Hour Compile Problem with 4 Machines and 62 Cores

It was 11 PM on New Year’s Eve. I was watching 70 packages compile across 4 machines in my basement. The terminal read:

Building:  68
Complete:  2
Estimated: 6h 42m

By morning, everything would be compiled. No fan noise. No thermal throttling. No babysitting.

I had finally solved the problem that makes most people abandon Gentoo within a week: the compile times.


The Problem Nobody Talks About

Here’s the dirty secret of source-based distributions: they’re miserable to maintain.

The Gentoo philosophy is beautiful in theory. Compile everything from source with custom optimizations. Tailor every package to your hardware. Strip out bloat. Achieve the leanest, fastest system possible.

The reality?

$ time emerge -uDN @world
# ...48 hours later...

I timed a full system update on my desktop once: 48 hours, with an i7-4790K running all 8 threads at 100% for two straight days. My office sounded like a jet engine, and the GPU hit 85°C from radiant heat alone.

Now imagine needing to reinstall Firefox because of a security patch. That’s a 2-hour compile every time.

Most guides say “just use binary packages” and move on. But Gentoo’s binary package support is… complicated. The official binhost has packages you probably don’t want (generic builds, different USE flags). And if you want packages optimized for your hardware, you have to compile them yourself.

So I did what any reasonable person would do: I built a distributed compilation cluster in my basement.


The Idea: Make Other Computers Do the Work

The concept was simple:

  1. Compile once on dedicated machines that don’t mind the heat
  2. Store the results as binary packages
  3. Install anywhere in minutes instead of hours
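
In Portage terms, the plumbing for this is mostly stock. Roughly — and note the repo name and binhost URL below are placeholders, not my actual config:

# On each build drone: /etc/portage/make.conf
FEATURES="buildpkg"            # produce a binary package for everything emerged
BINPKG_FORMAT="gpkg"           # modern Gentoo binary package format
PKGDIR="/var/cache/binpkgs"    # where the .gpkg.tar files land

# On each consumer: /etc/portage/binrepos.conf
[build-swarm]
priority = 9999
sync-uri = https://binhost.example.lan/packages

# Then install from binaries, falling back to source only when needed:
emerge --ask --usepkg --getbinpkg @world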

This isn’t new. It’s basically what third-party binary repos do for Arch’s AUR packages. What Red Hat does with Koji. What Debian does with its build farms.

The difference? I was going to do it with whatever hardware I had lying around, connected via Tailscale, running on my ISP’s residential connection.

What could go wrong?


The Cast of Characters

The Drones (The Muscle)

Name           Hardware               Cores   Role
drone-Izar     i7 VM on Proxmox       16      Primary builder
drone-Tarn     Ryzen VM               14      Secondary builder
dr-mm2         Docker on Unraid NAS   24      Heavy lifter
Tau-Ceti-Lab   Bare-metal desktop      8      Backup (Windows dual-boot risk)

Total: 62 cores available for parallel compilation.

The Orchestrator (The Brain)

A small LXC container (orchestrator-Izar) running Python. Its job:

  • Maintain a queue of packages to build
  • Assign work to drones based on availability
  • Track what’s complete, what’s failed, what’s blocked
  • Manage the binary package staging area

The Gateway (The Router)

Another container (10.42.0.199) that:

  • Provides node discovery (drones phone home here)
  • Routes requests to the active orchestrator
  • Handles failover if the primary orchestrator dies
  • Serves the binhost URL for package downloads
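
A stripped-down sketch of the gateway in Flask — the route names and payload fields here are my guesses at the shape, not the real API:

# Hypothetical gateway sketch (Flask); routes and fields are invented.
import time
import requests
from flask import Flask, jsonify, request

app = Flask(__name__)
nodes = {}                        # drone_id -> {"cores": int, "last_seen": float}
orchestrator = "http://10.42.0.201:8080"

@app.route("/register", methods=["POST"])
def register():
    info = request.get_json()
    nodes[info["drone_id"]] = {"cores": info["cores"], "last_seen": time.time()}
    # Tell the drone where the active orchestrator lives.
    return jsonify({"orchestrator": orchestrator})

@app.route("/status")
def status():
    # Pass status requests through to whichever orchestrator is active.
    return requests.get(f"{orchestrator}/status", timeout=5).json()

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8090)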

The Driver (Me)

My desktop (Canopus-Outpost) runs the CLI that talks to this whole mess:

build-swarm status    # What's happening?
build-swarm fresh     # Start a clean build
build-swarm monitor   # Watch it work

The Architecture

Here’s what the data flow looks like:

                    ┌──────────────────────┐
                    │     MY DESKTOP       │
                    │  (Package Consumer)  │
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │       GATEWAY        │
                    │   10.42.0.199:8090   │
                    │                      │
                    │ • Node registration  │
                    │ • API routing        │
                    │ • Auto-failover      │
                    └──────────┬───────────┘
                               │
                               ▼
                    ┌──────────────────────┐
                    │    ORCHESTRATOR      │
                    │   10.42.0.201:8080   │
                    │                      │
                    │ • Package queue      │
                    │ • Work assignment    │
                    │ • Build tracking     │
                    └──────────┬───────────┘

       ┌───────────┬───────────┼───────────┬───────────┐
       ▼           ▼           ▼           ▼           │
  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐      │
  │ Drone 1 │ │ Drone 2 │ │ Drone 3 │ │ Drone 4 │      │
  │ 16 core │ │ 14 core │ │ 24 core │ │  8 core │      │
  │  (VM)   │ │  (VM)   │ │ (Docker)│ │ (Bare)  │      │
  └────┬────┘ └────┬────┘ └────┬────┘ └────┬────┘      │
       │           │           │           │           │
       └───────────┴───────────┴───────────┘           │
                               │                       │
                               ▼                       │
                    ┌──────────────────────┐           │
                    │       BINHOST        │◄──────────┘
                    │       (nginx)        │
                    └──────────────────────┘

The Build Loop

  1. I run build-swarm fresh on my desktop
  2. The orchestrator syncs Portage on all nodes
  3. It calculates which packages need updates
  4. Packages enter a queue with dependency ordering
  5. Drones poll for work every 30 seconds
  6. Each drone:
    • Claims a package
    • Runs emerge --buildpkg <package>
    • Uploads the resulting .gpkg.tar to staging
    • Reports success/failure
  7. When all packages complete, I run build-swarm finalize
  8. Staging directory → Production binhost (atomic move)
  9. My desktop runs emerge --usepkg and gets binaries
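
Steps 5 and 6 are where the drones earn their keep. A rough sketch of that poll-claim-build-upload loop — the endpoints, IDs, and the upload step are assumptions, not the actual drone code:

# Hypothetical drone work loop; URLs, endpoints, and paths are assumptions.
import subprocess
import time
import requests

ORCH = "http://10.42.0.201:8080"
DRONE_ID = "drone-Izar"

while True:
    # 1. Ask the orchestrator for an unclaimed package.
    resp = requests.get(f"{ORCH}/claim", params={"drone": DRONE_ID}, timeout=10)
    package = resp.json().get("package")
    if not package:
        time.sleep(30)           # nothing to do; poll again in 30 seconds
        continue

    # 2. Build it as a binary package (lands in PKGDIR as a .gpkg.tar).
    ok = subprocess.run(["emerge", "--buildpkg", "--oneshot", package]).returncode == 0

    # 3. Upload the artifact to staging and report the result.
    #    (Upload itself elided -- rsync/scp to the staging directory.)
    requests.post(f"{ORCH}/report",
                  json={"drone": DRONE_ID, "package": package, "ok": ok},
                  timeout=10)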

The Failures (And There Were Many)

Failure #1: The Race Condition

Day 3. Two drones claimed the same package. Both compiled it. Both uploaded. The second upload overwrote the first with a corrupted file.

The fix: Atomic package claiming with server-side locking:

# On the orchestrator: self.lock is a threading.Lock(),
# self.claimed maps package -> drone_id.
def claim_package(self, drone_id, package) -> bool:
    with self.lock:
        if package in self.claimed:
            return False  # Already taken
        self.claimed[package] = drone_id
        return True
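
The point is that the claim is decided inside a single request under one lock on the orchestrator, so two drones asking at the same instant can never both get a yes.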

Failure #2: The Gateway Died

Week 2. The gateway container ran out of memory. None of the drones could register. The orchestrator had no workers. The entire swarm sat idle for 6 hours while I was at work.

The fix: Heartbeat monitoring. If the gateway stops responding:

  1. Drones cache their last-known orchestrator URL
  2. Orchestrator promotes itself to “standalone” mode
  3. I get an alert via Uptime Kuma
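
On the drone side, the failover logic boils down to "ask the gateway, and if it's dead, reuse whatever answer you got last time." A sketch, with the cache path and endpoint invented for illustration:

# Hypothetical drone-side failover; paths and endpoints are placeholders.
import json
import requests

GATEWAY = "http://10.42.0.199:8090"
CACHE = "/var/lib/build-swarm/last_orchestrator.json"

def find_orchestrator():
    try:
        # Normal path: the gateway tells us where the active orchestrator is.
        url = requests.get(f"{GATEWAY}/orchestrator", timeout=5).json()["url"]
        with open(CACHE, "w") as f:
            json.dump({"url": url}, f)
        return url
    except requests.RequestException:
        # Gateway is down: fall back to the last orchestrator that answered.
        with open(CACHE) as f:
            return json.load(f)["url"]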

Failure #3: The Portage Sync Drift

Week 3. Drone A synced Portage; Drone B didn’t. Drone A compiled qt-base-6.7.2. Drone B tried to compile something that depended on qt-base-6.7.1, because that’s what its stale tree said. Instant dependency conflict.

The fix: The orchestrator now verifies all nodes are synced to the same Portage timestamp before starting builds:

build-swarm sync-verify
# ✓ drone-Izar:   2026-01-25 08:14:22
# ✓ drone-Tarn:   2026-01-25 08:14:22
# ✗ dr-mm2:       2026-01-24 16:30:00  ← STALE
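
The check itself can be as simple as making every node report its tree timestamp (in a standard rsync-synced tree that's metadata/timestamp.chk) and demanding that they all match. How the real tool gathers them is my assumption; something like:

# Hypothetical sync check: compare the tree timestamp reported by each drone.
import requests

DRONES = ["drone-Izar", "drone-Tarn", "dr-mm2", "Tau-Ceti-Lab"]

def verify_sync(orch="http://10.42.0.201:8080"):
    # Each drone reports its tree timestamp (e.g. from metadata/timestamp.chk).
    stamps = {d: requests.get(f"{orch}/nodes/{d}/tree-timestamp", timeout=5).text.strip()
              for d in DRONES}
    reference = stamps["drone-Izar"]        # any node can serve as the reference
    for drone, stamp in stamps.items():
        mark = "✓" if stamp == reference else "✗ STALE"
        print(f"{mark} {drone}: {stamp}")
    return len(set(stamps.values())) == 1   # True only if every tree matches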

Failure #4: The Bare-Metal Brick

Week 4. Tau-Ceti-Lab is my one bare-metal drone. It dual-boots Windows and Gentoo. The swarm has an “auto-restart on stuck build” feature. It restarted Tau-Ceti-Lab.

It booted into Windows.

The fix: A config flag:

# /etc/build-swarm/drone.conf
AUTO_REBOOT=false  # Tau-Ceti-Lab is special

Failure #5: The Disk Full

Month 2. Drones were keeping old build artifacts. /var/cache/binpkgs filled up. Builds started failing with cryptic “disk write error” messages that took hours to diagnose.

The fix: Auto-cleanup after successful uploads. Drones don’t hoard packages anymore.
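
The cleanup itself is unglamorous: once the upload is confirmed, delete the local copy out of PKGDIR. A sketch — the function and its argument are invented for illustration:

# Hypothetical post-upload cleanup: drop local binary packages once staged.
from pathlib import Path

PKGDIR = Path("/var/cache/binpkgs")

def cleanup_after_upload(uploaded):
    """Remove local .gpkg.tar artifacts that were confirmed uploaded to staging."""
    for pkg_file in PKGDIR.rglob("*.gpkg.tar"):
        if pkg_file.name in uploaded:
            pkg_file.unlink()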


The Payoff

Here’s what my system update looks like now:

$ build-swarm fresh

═══ GENTOO BUILD SWARM ═══

Starting fresh build...
  ✓ Cleared staging directory
  ✓ Reset orchestrator state
  ✓ Synced portage on all nodes
  ✓ Discovered 83 needed packages

Build started. Monitor with: build-swarm monitor

Then I go to bed.

$ build-swarm status

═══ BUILD SWARM STATUS ═══

Gateway:      ✓ 10.42.0.199
Orchestrator: ✓ 10.42.0.201 (API Online)

Build Progress:
  Needed:    0
  Building:  0
  Complete:  83
  Blocked:   0

═══ NEXT ACTION ═══
All builds complete!
  → Run: build-swarm finalize

Then on my desktop:

$ sudo apkg update
Updating from binhost... [83 packages]
>>> Installing www-client/firefox-133.0.3
>>> Installing app-office/libreoffice-24.8.4
...
Completed in 4m 32s

4 minutes. Not 48 hours. Not even 4 hours. Four minutes.


The Monitor

I built a TUI (Terminal User Interface) because I like watching things work:

╔═════════════════════════════════════════════════════════╗
║              🐝 GENTOO BUILD SWARM v2.6                 ║
╠═════════════════════════════════════════════════════════╣
║ ORCH: 10.42.0.201 (primary)  GATE: 10.42.0.199 (ok)     ║
╠═════════════════════════════════════════════════════════╣
║ DRONE        │ CORES │ STATUS                           ║
╠═════════════════════════════════════════════════════════╣
║ 🟢 drone-1   │  16   │ Building: dev-libs/openssl       ║
║ 🟢 drone-2   │  14   │ Building: www-client/firefox     ║
║ 🟢 drone-3   │  24   │ Building: kde-plasma/plasma...   ║
║ 🟢 drone-4   │   8   │ Idle                             ║
╠═════════════════════════════════════════════════════════╣
║ QUEUE: 47 │ DONE: 36 │ BLOCKED: 0 │ ETA: 2h 14m         ║
╠═════════════════════════════════════════════════════════╣
║ [q] Quit  [b] Balance  [u] Unblock  [R] Reset           ║
╚═════════════════════════════════════════════════════════╝

It’s oddly satisfying to watch package counts climb while doing nothing.


Would I Recommend This?

For your job: Absolutely not. Use containers. Use CI/CD. Use literally anything from this decade.

For your home lab: Maybe.

Here’s my honest assessment:

You Should Consider This If:

  • You run Gentoo on 2+ machines
  • You have spare hardware (VMs count)
  • You find distributed systems problems interesting
  • You’re okay with things breaking

You Should NOT Do This If:

  • You just want a Linux desktop that works
  • You value your free time
  • You don’t enjoy debugging at 2 AM
  • You’re sane

The Build Swarm took about 60 hours to build and has saved me maybe 200 hours of compile time over 6 months. The ROI is positive, but barely.

The real value was the learning. I now understand:

  • How package managers work at a low level
  • How distributed task queues operate
  • How to handle network partitions and node failures
  • Why Kubernetes is actually pretty impressive (it does all this and more)

The Stack

For the curious, here’s what runs the swarm:

Component        Technology
Drones           Python service + OpenRC
Orchestrator     Python + Flask API
Gateway          Python + Flask
Networking       Tailscale mesh VPN
Binhost          nginx static file serving
Monitoring       Custom TUI + Uptime Kuma
Code deploy      Git + SSH
Package format   Gentoo .gpkg.tar

Total lines of Python: ~4,500
Total headaches: Countless
Packages compiled while I slept: 2,847 (and counting)


The Philosophy

Gentoo’s official position is “compile everything yourself.” My position is “I’d rather have the computer compile everything while I sleep.”

The Build Swarm is my compromise. I still get source-based optimization. I still control every USE flag. I still compile from upstream.

I just don’t have to watch.


This post is part of the Argo OS Journey series, documenting the creation of a custom Gentoo-based distribution across my home lab.

Related Posts: