Skip to main content
user@argobox:~/journal/2026-02-05-stale-queue-grounding
$ cat entry.md

The Stale Queue That Grounded My Drones

○ NOT REVIEWED

The Stale Queue That Grounded My Drones

Date: 2026-02-05 Duration: 7+ hours Issue: All drones idle despite 149 packages in queue Root Cause: Version mismatch, stale queue, undeployed code, corrupt binpkgs Fix: Version bump, force deploy, build-swarm fresh


The Symptom

149 packages needed building. Five drones available. All five showing IDLE.

drone-Izar       IDLE    16 cores available
drone-Tarn       IDLE    14 cores available
drone-Meridian   IDLE    24 cores available
drone-Tau-Beta   IDLE     8 cores available
sweeper-Capella  IDLE     6 cores available

68 cores doing absolutely nothing while almost 150 packages sat in the queue collecting dust.

10:30 — First Theory: Coordinator Bug

Checked the coordinator logs. No errors. It was happily reporting “queue populated, dispatching work” — but the drones weren’t picking it up. Or more accurately, they were picking it up, failing, and getting grounded.

[orchestrator-Izar] drone-Izar: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Tarn: grounded after 3 consecutive failures
[orchestrator-Izar] drone-Meridian: grounded after 3 consecutive failures

Every drone was attempting builds, failing immediately, and getting grounded by the orchestrator’s safety logic. Three strikes and you’re out.

12:15 — Second Theory: Package Issue

Pulled the build logs from drone-Izar. The emerge commands were failing because the packages referenced in the queue didn’t match what was available. Queue had version 6.22.0 packages. The actual driver on the drones had 6.20.0.

The queue was stale. It had been generated against a newer portage tree, but the drones hadn’t been synced. Every package atom in the queue was either the wrong version or had dependency conflicts.

14:00 — The Real Problem: Undeployed Code

While digging, I found something worse. The code changes from the previous session — bug fixes, improvements, the works — had never been deployed to the remote nodes.

How did I miss this? The VERSION file was missing on the remotes. My deploy script checks VERSION to decide if an update is needed. No VERSION file means “up to date” (because the check returns “no difference”). Beautiful logic, Past Me.

ssh orchestrator-Izar "cat /opt/build-swarm/VERSION"
# cat: /opt/build-swarm/VERSION: No such file or directory

So the drones were running old code with a stale queue generated by new code. No wonder everything was failing.

15:45 — Corrupt Binary Packages

While I was in there, found another surprise: corrupt binary packages for LLVM on the binhost. Three .xpak files that were zero-byte. Any drone that tried to install LLVM from the binhost would fail, get retried, fail again, and get grounded.

LLVM is a dependency for a lot of packages. Those corrupt binpkgs were a hidden landmine.

16:30 — The Fix

Three-part fix:

Version bump to 0.4.3 — new VERSION file, new deploy marker.

Force push to all nodes — skip the “is it already up to date?” check. Just push everything.

for host in orchestrator-Izar drone-Izar drone-Tarn drone-Meridian drone-Tau-Beta sweeper-Capella; do
    rsync -avz --delete /opt/build-swarm/ $host:/opt/build-swarm/
done

build-swarm fresh — nuclear option. Clears the stale queue, regenerates from current portage tree, rebuilds the corrupt binpkgs, and resets all drone grounding states.

[build-swarm] Clearing stale queue...
[build-swarm] Regenerating package list...
[build-swarm] Discovered 149 packages
[build-swarm] Resetting drone states...
[build-swarm] All drones cleared for duty

17:15 — Back Online

All five drones building. Queue draining. No grounding events.

The version mismatch was the root cause, but the missing VERSION file is what let it fester. If the deploy had worked correctly last session, the queue and the drones would have been in sync from the start.

Added a check: if VERSION doesn’t exist on a remote, treat it as “needs update” instead of “up to date.”


68 cores idle for 7 hours because a file didn’t exist. The drones weren’t broken — they were grounded by stale data and a deploy that never happened.