25 Bugs in One Night: A Build Swarm Codebase Review

It started at 6 PM on a Friday. “I’ll just do a quick audit of the drone code before dinner.”

It ended at 6:30 AM Saturday, 25 bugs deep, four components rewritten, and every node in the swarm running new code. Dinner never happened. Coffee happened three times.

The Gentoo Build Swarm has been running in production for weeks. It compiles packages, distributes work, collects binaries. It works. But “works” and “works correctly” are not the same thing, and I’d been accumulating a nagging feeling that I hadn’t really read the codebase end-to-end since the initial development sprint. There were patches on patches. Hotfixes from 2 AM debugging sessions. Code written by a version of me who was more tired and less careful than I’d like to admit.

So I sat down and went through everything. Component by component, file by file. The drone. The coordinator. The orchestrator. The Flask dashboard. Every script in the lib directory. A systematic review of the entire codebase.

What I found was… humbling.


The Recursive Retry From Hell

I’m going to lead with the worst one because I deserve the embarrassment.

The drone has a retry mechanism. When a package fails to build, it can attempt a fix (usually adjusting USE flags or keywords) and retry. Simple concept. The implementation?

The original code concatenated " --retry-after-fix" onto the package atom string.

So =kde-frameworks/ki18n-6.22.0 becomes:

=kde-frameworks/ki18n-6.22.0 --retry-after-fix

That’s not a valid atom. That’s garbage. But it gets worse. On the second retry:

=kde-frameworks/ki18n-6.22.0 --retry-after-fix --retry-after-fix

And the third:

=kde-frameworks/ki18n-6.22.0 --retry-after-fix --retry-after-fix --retry-after-fix

The string just kept growing. Every retry appended another flag to the package name. This got fed directly into emerge, which would choke on the malformed atom, fail, trigger another retry, append another flag, fail again…

The drone was generating progressively more corrupted emerge commands in an infinite loop, logging each one as a “retry attempt,” and burning CPU cycles on doomed builds until something else killed the process.

I stared at this for a solid minute. Past Me wrote this. Past Me tested this. Past Me shipped this.

The fix was trivial. Replace string concatenation with a boolean parameter:

def _attempt_build(self, package, retried=False):
    # ... build logic; a failed build sets `error` ...
    if not retried and self._can_fix(package, error):
        self._apply_fix(package, error)
        # retry at most once
        return self._attempt_build(package, retried=True)

One boolean. That’s it. That’s what it should have been from the start. The atom string never gets touched. The retry flag is a parameter, not a suffix glued onto a package name like it’s a URL query string from 2003.


The Drone: 7 Critical Bugs

The retry corruption was the headline, but the drone had six more problems hiding in its 1,200 lines.

The /proc/meminfo parser assumed line order. It was reading lines[1] to get free memory. On most kernels, line 1 is MemFree. On some, it’s MemAvailable. On others, after certain kernel configs, the order shifts entirely. The fix: parse by key name like a sane person.

# Before: pray that line order never changes
mem_free = int(lines[1].split()[1])

# After: actually look for the right key
for line in lines:
    if line.startswith('MemAvailable:'):
        mem_available = int(line.split()[1])
        break

The watchdog double-report. The drone has a watchdog timer that fires if a build runs too long. But the watchdog callback and the normal build completion path both reported results to the orchestrator. If a build finished at exactly the wrong moment — watchdog fires, then the build completes 200ms later — the orchestrator would get two reports for the same package. One “timed out,” one “completed.” The state machine did not enjoy this.
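
The guard is a once-only report. A minimal sketch of the shape, with BuildJob and the orchestrator client call being stand-ins rather than the drone's real classes:

import threading

class BuildJob:
    def __init__(self, package):
        self.package = package
        self._report_lock = threading.Lock()
        self._reported = False

    def report_once(self, orchestrator, result):
        # Both the watchdog callback and the normal completion path land here;
        # only the first caller gets to send a report for this package.
        with self._report_lock:
            if self._reported:
                return
            self._reported = True
        orchestrator.report(self.package, result)   # stand-in client call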

Stale orchestrator IP. The drone reads the orchestrator’s address from a config file at startup. But builds can run for hours. If the orchestrator moves (IP change, failover, config update), the drone keeps talking to the old address. Now it re-reads the config before each poll cycle.
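
The change is small; roughly this, with the config path and key names being assumptions about the drone's config format:

import configparser

def orchestrator_url(config_path="/etc/drone/drone.conf"):
    # Re-read the config on every poll cycle instead of caching the address
    # that happened to be valid at startup.
    cfg = configparser.ConfigParser()
    cfg.read(config_path)
    host = cfg["orchestrator"]["host"]
    port = cfg["orchestrator"].get("port", "8080")
    return f"http://{host}:{port}"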

pgrep -f matching its own invocation. Classic. The drone uses pgrep -f "emerge" to check if a build is still running. But the shell command that launches the check has the string "emerge" in its arguments, so pgrep always finds at least one match: the check itself. The quick fix is pgrep -f "emerge" | grep -v $BASHPID, but the real fix was switching to process group tracking instead of grepping for command strings.
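
Process group tracking looks roughly like this; it's a sketch of the technique, not the drone's actual code:

import os
import signal
import subprocess

def start_build(atom):
    # start_new_session=True gives emerge (and everything it spawns) its own
    # process group, so the whole tree can be tracked and killed by pgid.
    return subprocess.Popen(["emerge", "--oneshot", atom],
                            start_new_session=True)

def build_running(proc):
    # No string matching against command lines; just ask the process itself.
    return proc.poll() is None

def kill_build(proc):
    # Signal the whole process group, not just the top-level emerge.
    os.killpg(os.getpgid(proc.pid), signal.SIGTERM)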

Metrics computed twice. The monitoring loop calculated system metrics (CPU, memory, disk, load average) and then 40 lines later, calculated them again for the status report. Each calculation took about 300ms because of the /proc reads. That’s 0.6 seconds of blocking I/O on every cycle, for numbers we already had. Deleted the duplicate. Done.


The Coordinator: 8 Bugs, 3 Embarrassing

The coordinator manages the build queue — what packages need building, who’s working on what, what’s done. It’s the brain. The brain had issues.

verify command: NameError. The verify subcommand was supposed to check build state consistency. It called state.stats(). The variable state did not exist in that scope. This command has been broken since the day it was written. Nobody noticed because verify is the kind of thing you write “for later” and then never run. Until you’re doing a codebase review at 11 PM and you actually try it.

def cmd_verify(args):
    # state was never assigned in this function
    print(state.stats())  # NameError: name 'state' is not defined
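
The fix is just to load the state like every other subcommand does; sketched below, with load_state() and args.state_file standing in for whatever the coordinator actually uses:

def cmd_verify(args):
    # Load the build state before touching it; load_state() and
    # args.state_file are stand-ins for the real loader and path.
    state = load_state(args.state_file)
    print(state.stats())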

init-from-list feeding garbage atoms. The init-from-list command reads the output of emerge --pretend to populate the build queue. It was parsing lines that started with [ebuild and extracting the package atom. Except it was also grabbing [ebuild R ] lines (reinstalls) and [blocks] entries, and the extraction logic split on whitespace without checking which field actually held the atom. Result: the queue got populated with fragments like "[ebuild", "R", "]", and "N" alongside real atoms. The drones would then try to build a package called R and get confused.
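
The stricter parser only accepts genuine [ebuild ...] lines, skips reinstalls, and pulls the atom out of the right field. A sketch under the assumption that the pretend format is roughly "[ebuild <flags>] <cpv> ..."; the real coordinator code may differ:

import re

# "[ebuild  N    ] kde-frameworks/ki18n-6.22.0:6::gentoo  USE=..." and friends
EBUILD_LINE = re.compile(r"^\[ebuild\s+([^\]]*)\]\s+(\S+)")

def atoms_from_pretend(output):
    atoms = []
    for line in output.splitlines():
        m = EBUILD_LINE.match(line)
        if not m:
            continue                      # drops [blocks], [nomerge], noise
        flags, cpv = m.groups()
        if "R" in flags.split():
            continue                      # drop plain reinstalls
        cpv = cpv.split(":")[0]           # strip slot/repo suffix if present
        atoms.append("=" + cpv)
    return atoms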

finalize using stale state after releasing the lock. The coordinator uses file-based locking for concurrent access. The finalize command was: acquire lock, read state, release lock, then write state. In the gap between releasing the lock and writing, another process could modify the file. The finalize would then overwrite those changes with its stale copy. Moved the write inside the lock.
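
The fixed shape, sketched with fcntl since that's one common way to do file locking; the JSON state format and mark_finalized() are assumptions:

import fcntl
import json

def finalize(state_path):
    # Keep the whole read-modify-write cycle inside one exclusive lock so no
    # other process can slip a change in between our read and our write.
    with open(state_path, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)
        try:
            state = json.load(f)
            mark_finalized(state)         # stand-in for the real mutation
            f.seek(0)
            json.dump(state, f, indent=2)
            f.truncate()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)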

add missing duplicate detection for = prefix. Adding kde-frameworks/ki18n would check for duplicates. Adding =kde-frameworks/ki18n-6.22.0 would not, because the dedup logic normalized atoms down to category/package but never stripped the = operator or the trailing version, so versioned atoms never matched their bare counterparts. You could end up with the same package queued three times in slightly different atom formats.
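
Deduplication now reduces every incoming atom to the same key before comparing. A sketch of the idea, not the coordinator's exact regex:

import re

def dedup_key(atom):
    # "=kde-frameworks/ki18n-6.22.0" and "kde-frameworks/ki18n" both
    # reduce to "kde-frameworks/ki18n".
    atom = atom.lstrip("=<>~")                       # drop version operators
    return re.sub(r"-\d[\w.]*(-r\d+)?$", "", atom)   # drop trailing version

# dedup_key("=kde-frameworks/ki18n-6.22.0") == dedup_key("kde-frameworks/ki18n")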

Variable shadowing. A function called ssh_cmd() shared its name with a variable holding the constructed command string, so depending on which assignment had run last, a reference to ssh_cmd got you either the function or the string. This one was intermittent and maddening.

balance assigning to incompatible builders. The load balancer distributed packages across drones without checking architecture compatibility. This matters when one of your drones is a QEMU VM with different CFLAGS. Packages built with -march=znver3 don’t run on a host expecting -march=skylake.
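
The balancer now filters drones by toolchain target before assigning anything; roughly this, with the profile attributes being my invention rather than the coordinator's real schema:

def eligible_drones(drones, target):
    # Never hand znver3-targeted work to a skylake host, or vice versa;
    # the 'march' and 'chost' attribute names are illustrative.
    return [d for d in drones
            if d.march == target.march and d.chost == target.chost]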

File handle leak in init-from-list. Opened a file, read it, never closed it. In a long-running session with multiple init calls, this would eventually exhaust file descriptors. with statement. That’s all it needed.


The Dashboard: One Line That Broke Everything

This one made me want to throw my keyboard.

The Command Center dashboard uses WebSockets to push live status updates to the browser. The WebSocket handler calls various status functions and bundles them up. One of those calls looked like this:

status['orchestrator'] = get_orchestrator_status()

Seems fine, right? The problem: status['orchestrator'] already contained detailed orchestrator data — timing information, event history, package lists, build durations, recent completions, pipeline state. The get_orchestrator_status() function returns a simplified 4-field summary: name, state, uptime, version.

One assignment replaced all the detailed data with a stripped-down summary. Every WebSocket update overwrote the good data with the bad data. Session stats: gone. Timing display: gone. Completion logging: gone. Pipeline visualization: blank.

I had been staring at a half-broken dashboard for weeks thinking the frontend rendering was buggy. Nope. The data was fine until the backend helpfully replaced it with nothing useful, 10 times per second.
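
The fix can be as small as giving the summary its own key so it stops landing on top of the detailed payload; one way to do it, assuming the summary is still worth sending at all:

# Publish the 4-field summary alongside the detailed data, not over it.
status['orchestrator_summary'] = get_orchestrator_status()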

One line. Weeks of broken UI. I need to take a walk.


Dashboard Fixes: The Rest

Once I was already in the Flask code with murder in my heart, I kept going.

The fire-and-forget build endpoints. The /start and /fresh endpoints launched builds using nohup in a subprocess and immediately returned “started!” to the browser. No error checking. No feedback. If the build failed to launch — bad path, missing config, permission error — the dashboard would cheerfully display “Build started!” while nothing happened. Fixed it with a sleep 2 && head -5 on the subprocess output to catch early failures before responding.
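
In Flask terms the endpoint now looks roughly like this; the route, script path, and log path are placeholders, not the Command Center's real ones:

import subprocess
import time
from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/start", methods=["POST"])
def start_build():
    # Launch the build with output going to a log file, wait two seconds,
    # then check whether it died immediately (bad path, missing config, perms).
    log = open("/var/log/swarm/build.log", "w")                 # placeholder
    proc = subprocess.Popen(["/opt/swarm/bin/start-build.sh"],  # placeholder
                            stdout=log, stderr=subprocess.STDOUT)
    log.close()
    time.sleep(2)
    if proc.poll() is not None:            # exited already: launch failure
        with open("/var/log/swarm/build.log") as f:
            head = "".join(f.readlines()[:5])
        return jsonify(status="failed", detail=head), 500
    return jsonify(status="started")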

Wrong staging path. The promote endpoint (moving packages from staging to production) was looking for /var/cache/binpkgs/staging. The actual directory is /var/cache/binpkgs-staging. A slash versus a hyphen. The promote button had never worked. I’d been promoting packages manually from the command line and never connected the dots.

OpenRC restart bug. The dashboard’s “restart service” button ran rc-service <name> restart. On OpenRC, restart sometimes fails if the service is in a weird state. The fix: rc-service <name> stop; sleep 1; rc-service <name> start. Sequential stop-then-start is more reliable than a single restart command. It’s ugly, but it works every time.

Missing state indicators. The dashboard had no concept of “paused” or “initializing.” If you paused the orchestrator, the dashboard showed it as “running.” If you were loading 200 packages into the queue, the dashboard showed… nothing. Added a paused state indicator and an init progress tracking endpoint that reports how many packages have been loaded out of the total.
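
The progress endpoint itself is tiny; a sketch, with get_init_state() standing in for wherever the coordinator actually keeps those counters:

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/api/init-progress")
def init_progress():
    # Report how many packages have been loaded into the queue so far.
    loaded, total = get_init_state()       # stand-in for the real counters
    return jsonify(loaded=loaded, total=total,
                   done=total > 0 and loaded >= total)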

No sync button. The drones need their Portage tree synced before building. This was a manual SSH command. Added a button. It's 2026; I shouldn't have to open a terminal for something a button can do.


The Auto-Balance Feature

While I was in the orchestrator code fixing bugs, I added something I’d been thinking about for weeks.

The problem: drone-Meridian has 24 cores and finishes its queue fast. drone-Izar has 16 cores and gets the same number of packages. Meridian finishes, sits idle for 40 minutes, while Izar is still grinding through its backlog.

The solution: work stealing. When a drone polls for work and its queue is empty, the orchestrator looks at other drones’ queues. If any drone has 3+ packages remaining, the idle drone steals from the back of their queue (the packages they haven’t started yet). The donor keeps at least 2 packages — one active, one buffer — so it never runs dry.

The target allocation is core-weighted. A 24-core drone should get roughly 50% more work than a 16-core drone. It’s not perfect (some packages are way heavier than others), but it keeps the fleet busy instead of having fast drones nap while slow ones sweat.
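
Stripped to its core, the stealing rule looks something like this; the queue and drone objects are simplified stand-ins for the orchestrator's real state:

def steal_work(idle_drone, drones):
    # Called when idle_drone polls with an empty queue. Pick the most loaded
    # donor and take unstarted packages off the back of its queue, always
    # leaving it at least two: one active, one buffer.
    donors = [d for d in drones
              if d is not idle_drone and len(d.queue) >= 3]
    if not donors:
        return []
    donor = max(donors, key=lambda d: len(d.queue))

    # Core-weighted target: a 24-core drone ends up with roughly 50% more
    # work than a 16-core drone.
    share = idle_drone.cores / (idle_drone.cores + donor.cores)
    want = max(1, int(len(donor.queue) * share))
    can_give = len(donor.queue) - 2

    stolen = [donor.queue.pop() for _ in range(min(want, can_give))]
    idle_drone.queue.extend(stolen)
    return stolen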

Tagged this as orchestrator v2.5.0. The auto-balance alone was worth the overnight session.


Deployment: 4 Drones, 2 Networks, 1 Exhausted Engineer

By 4 AM I had a pile of fixes and a new feature. Time to deploy.

Four drones across two networks. Each one is a special snowflake in how you get code onto it.

drone-Izar (10.42.0.203) — local network. Straightforward scp of the updated files, rc-service drone restart. The easy one.

drone-Tarn (100.64.0.91) — remote site, reached via Tailscale. Same scp workflow but over the mesh network. 38ms latency, transfers are fine.

drone-Meridian (100.64.0.110) — this one’s a QEMU VM at the remote site, also on Tailscale. Same as Tarn but I always triple-check the CFLAGS compatibility because this is the one that’ll silently build broken binaries if the flags drift.

sweeper-Capella (10.42.0.x) — an LXC container on the local network. Can’t just scp into a container. You lxc-attach and work from inside, or push files through the container’s rootfs mount. I went with lxc-attach because at 4 AM I couldn’t remember the rootfs path and didn’t feel like looking it up.

Coordinator deployed to orchestrator-Izar (10.42.0.201). Command Center rebuilt and redeployed via Docker on Altair-Link. Full stack, every component, every node.

The whole deployment took about 45 minutes. Most of that was waiting for the Docker rebuild and double-checking that each drone registered cleanly after the update.


Verification: The Payoff

6 AM. Everything deployed. Kicked off a build to test.

Completed:  152
Building:     6
Remaining:   18
Cores:       58 (4 drones active)

152 packages done. 6 in progress. 18 to go. 58 cores churning across four machines on two networks.

Wall clock: about 3.2 hours. Effective build time per package: 18.3 seconds average. The auto-balance was already working — drone-Meridian had pulled 12 extra packages from Izar’s queue and was chewing through them.

No double-reports. No corrupted atoms. No stale IPs. No NameError from verify. The dashboard was showing full orchestrator data — timing, events, completions, the whole thing — for the first time in weeks.

I watched the last 18 packages tick down. Everything green. Everything working the way it should have been working all along.


What 12.5 Hours Teaches You

Code that “works” in production can be deeply broken. The recursive retry bug had been shipping for weeks. The dashboard was clobbering its own data every update cycle. The verify command was dead on arrival. These weren’t theoretical bugs — they were actively causing problems that I’d attributed to other things.

A systematic review catches what testing misses. I have tests. They passed. They tested the happy path. They didn’t test what happens when you retry twice, or when two status functions fight over the same dictionary key, or when a file handle isn’t closed on the third invocation. Reading the code, line by line, found bugs that no amount of “run it and see” would have surfaced.

Past Me is not to be trusted. Every one of these bugs was written by someone who was “pretty sure” the code was correct. String concatenation for retry tracking? Sure, that’ll work. Hardcoded line numbers for /proc/meminfo? Close enough. The dashboard status override? I probably meant to fix that later. Past Me is an optimist. Present Me has to clean up after him.

Deploy at 4 AM, verify at 6 AM. There’s a certain brutal efficiency to overnight sessions. No meetings. No interruptions. No context switching. Just you, the code, and a growing list of things that need fixing. The sun comes up and either everything works or it doesn’t, and either way you’re going to sleep.

Twelve and a half hours. Twenty-five bugs. Four components. Every node in the swarm running clean code by sunrise.

Coffee count: three. Regrets: zero. Well, maybe the recursive retry thing. That one stings.