# The Binpkg Cleanup Bug

**Date:** January 28, 2026 · **Duration:** ~3 hours · **Mood:** That satisfying feeling when you finally see it

## The Problem
drone-Izar had accumulated 4.9GB of binary packages. drone-Tarn had 1.7GB.
These should have been cleaned up automatically after upload. That’s… literally what the cleanup function does.
Except it wasn’t.
## Finding It
The cleanup code looked for this:

```python
# Old code - looking for per-package directories
pkg_path = os.path.join(pkgdir, cat, pkg_name)
if os.path.exists(pkg_path):
    shutil.rmtree(pkg_path)
```
But Portage switched to flat `.gpkg.tar` files months ago. There are no directories.

The cleanup was checking for paths that literally don't exist. Every. Single. Time.
```
# What cleanup expected:
/var/cache/binpkgs/kde-frameworks/kconfig/

# What actually exists:
/var/cache/binpkgs/kde-frameworks/kconfig-6.22.0.gpkg.tar
```
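A minimal repro of the mismatch, using a temp dir (the package names are illustrative, not the real binhost contents):

```python
import glob
import os
import tempfile

# Recreate the layout: a category dir containing only a flat GPKG tarball
pkgdir = tempfile.mkdtemp()
cat = "kde-frameworks"
os.makedirs(os.path.join(pkgdir, cat))
flat = os.path.join(pkgdir, cat, "kconfig-6.22.0.gpkg.tar")
open(flat, "w").close()

# The old check: a per-package directory that no longer exists
old_path = os.path.join(pkgdir, cat, "kconfig")
print(os.path.exists(old_path))  # False - "nothing to clean", every time

# The flat file is what's really on disk
print(glob.glob(os.path.join(pkgdir, cat, "kconfig-*.gpkg.tar")) == [flat])  # True
```

The directory check can never succeed against the flat layout, so the cleanup silently became a no-op.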
## The Fix
Added `import glob` and rewrote the cleanup to actually find the files:

```python
# New code - find the actual .gpkg.tar files
gpkg_pattern = os.path.join(pkgdir, cat, f"{pv}.gpkg.tar")
gpkg_files = glob.glob(gpkg_pattern)
for gpkg_file in gpkg_files:
    os.remove(gpkg_file)
Simple. Obvious. Took three hours to notice.
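In context, the fix slots into something like this. This is my reconstruction, not the actual `swarm-drone` code: the function name, parameters, and error handling are assumptions.

```python
import glob
import os


def cleanup_binpkgs(pkgdir, cat, pv):
    """Remove an uploaded binary package stored as a flat GPKG tarball.

    Hypothetical sketch: pkgdir is the binpkg root (e.g. /var/cache/binpkgs),
    cat the category, pv the package-version string (e.g. "kconfig-6.22.0").
    Returns the list of paths actually removed.
    """
    removed = []
    pattern = os.path.join(pkgdir, cat, f"{pv}.gpkg.tar")
    for gpkg_file in glob.glob(pattern):
        try:
            os.remove(gpkg_file)
            removed.append(gpkg_file)
        except OSError:
            # A vanished file just means something else cleaned it first
            pass
    return removed
```

Returning the removed paths makes the no-op case observable, which is exactly what the original silent failure lacked.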
## The Cascade
While I was in there, I found more issues:
**Staging never releasing to production.** Packages were being validated and uploaded to staging (`/var/cache/binpkgs-staging`) but never promoted to the actual binhost. I found 33 packages stuck in staging, 0 in production.
**Health check cutting off early.** The `deploy.sh health` command was dying after section 6 because `((issues++))` returns exit status 1 when incrementing from 0: the post-increment expression evaluates to the old value, 0, which bash arithmetic treats as false. With `set -e` enabled, bash saw that as a failure and aborted.
Changed all of these:

```bash
((issues++))
```

to:

```bash
issues=$((issues + 1))
```
Bash arithmetic. Never trust it.
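The failure mode reproduces in one line. This is a minimal sketch with `bash -c` standing in for the script:

```shell
# ((n++)) post-increments, so the expression evaluates to the OLD value.
# An arithmetic result of 0 gives exit status 1, and set -e aborts on it.
bash -c 'set -e; n=0; ((n++)); echo unreachable' || echo "script died"

# The $((...)) expansion inside an assignment is not a command by itself,
# so there is no exit status for set -e to trip on.
bash -c 'set -e; n=0; n=$((n + 1)); echo "n=$n"'
```

Note the trap only fires on the first increment; once `n` is 1, `((n++))` evaluates to 1 and succeeds, which makes the bug intermittent-looking.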
**KDE Frameworks bootstrap loop.** 69 packages were blocked by slot conflicts: the drones had kde-frameworks 6.20.x installed but were trying to build 6.22.x, and the new versions were sitting in staging instead of production.
## The Aftermath
Moved staging to production:
```bash
for pkg in /var/cache/binpkgs-staging/*/*.gpkg.tar; do
    cat=$(dirname "$pkg" | xargs basename)
    mkdir -p "/var/cache/binpkgs/$cat"
    mv "$pkg" "/var/cache/binpkgs/$cat/"
done

emaint binhost --fix
```
Cleared the blocked packages. Restarted the orchestrator.
```
╔════════════════════════════════════════════════════════════════╗
║                    BUILD SWARM HEALTH CHECK                    ║
╚════════════════════════════════════════════════════════════════╝
1. BINARY VERIFICATION     ✓ All match
2. SERVICE STATUS          ✓ All running
3. GATEWAY CONNECTIVITY    ✓ Online (4 drones, 2 orchestrators)
4. BUILD QUEUE STATUS      Needed: 10 | Blocked: 0
5. GROUNDED DRONES         ✓ None
╔════════════════════════════════════════════════════════════════╗
║                  RESULT: ✓ ALL CHECKS PASSED                   ║
╚════════════════════════════════════════════════════════════════╝
```
## What I Learned
**Portage changes things.** The switch to flat `.gpkg.tar` files happened quietly. My code was still looking for the old directory structure.
**Bash arithmetic is weird.** `((n++))` exits with status 1 when incrementing from 0, because the expression evaluates to 0 and bash treats that as false. With `set -e`, that's a script-killer.
**Check staging.** I assumed the release pipeline was working. It wasn't. 33 packages just… sitting there.
## Files Modified
| File | Changes |
|---|---|
| `bin/swarm-drone` | Added `glob` import, fixed `cleanup_local_binaries()` |
| `bin/swarm-orchestrator` | Added release logic, fixed pattern matching |
| `deploy.sh` | Added health check sections 6-7, fixed arithmetic |
Tomorrow I’m monitoring a full 263-package rebuild. Let’s see what else breaks.