user@argobox:~/journal/2026-01-23-the-race-condition-that-ate-my-binaries
$ cat entry.md

The Race Condition That Ate My Binaries

Date: 2026-01-23 | Session: Major swarm overhaul | Bugs Fixed: 4 | New Scripts: 3


The Missing Binaries Mystery

Packages were building successfully. Drones were reporting completion. But when I checked the orchestrator’s staging area, some binaries were just… gone.

The orchestrator logs showed validation failures for packages that definitely built. What was happening?


Bug #1: The rsync Delete Flag

Found it in the drone code:

rsync_cmd = [
    'rsync', '-av', '--remove-source-files',  # 💀
    f'{pkgdir}/',
    f'root@{orchestrator_ip}:{staging_path}/'
]

--remove-source-files deletes the local file immediately after upload. Sounds efficient, right?

The problem: the orchestrator runs validation after receiving the file. If validation fails for any reason (wrong checksum, path mismatch, whatever), the orchestrator rejects the package. But the drone already deleted its local copy. The binary is gone forever.

The fix: Don’t delete until the orchestrator confirms acceptance.

def upload_binary(orchestrator_ip, package):
    # rsync WITHOUT --remove-source-files
    rsync_cmd = [
        'rsync', '-av', '--ignore-existing',
        f'{pkgdir}/',
        f'root@{orchestrator_ip}:{staging_path}/'
    ]
    subprocess.check_call(rsync_cmd)
    # Local copy still exists

def cleanup_local_binaries(package):
    """Only called after orchestrator confirms success."""
    # Now it's safe to delete

The orchestrator’s completion response now includes an accepted field:

self.send_json({'status': 'ok', 'accepted': accepted})

Drones only clean up when accepted: true.
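
Roughly, the drone-side flow now looks like this. This is a sketch: report_completion() is a stand-in for whatever HTTP call the drone actually makes, and only the accepted field comes from the real protocol.

def finish_package(orchestrator_ip, package):
    # Upload first; the local copy survives the transfer now.
    upload_binary(orchestrator_ip, package)

    # Report completion and wait for the orchestrator's verdict.
    # report_completion() is a placeholder, not the real drone API.
    response = report_completion(orchestrator_ip, package)

    if response.get('accepted'):
        # Orchestrator validated and staged the binary; safe to delete.
        cleanup_local_binaries(package)
    else:
        # Validation failed; keep the local copy so nothing is lost.
        print(f'orchestrator rejected {package}, keeping local binaries')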


Bug #2: Global Variable in Threading

The heartbeat worker thread was throwing UnboundLocalError:

UnboundLocalError: local variable 'current_package' referenced before assignment

Classic Python scoping gotcha. Module-level globals are perfectly visible from spawned threads; the real problem is that the worker also assigns to current_package and build_start_time further down, so Python treats them as locals for the entire function and the read at the top fires before any assignment.

# BEFORE (broken)
def heartbeat_worker():
    while True:
        phone_home()
        if current_package and build_start_time > 0:  # 💥
            # ...

# AFTER (fixed)
def heartbeat_worker():
    global current_package, build_start_time  # Added this
    while True:
        phone_home()
        if current_package and build_start_time > 0:
            # ...

One line. Hours of debugging.
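
A minimal repro, with no threads involved, makes the scoping rule obvious (names here are just for illustration):

current_package = None

def worker_broken():
    if current_package:          # UnboundLocalError: the assignment below makes
        print(current_package)   # 'current_package' local to the whole function
    current_package = 'libseccomp'

def worker_fixed():
    global current_package       # read and write the module-level name instead
    if current_package:
        print(current_package)
    current_package = 'libseccomp'

worker_fixed()      # fine
# worker_broken()   # raises UnboundLocalError on its first line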


Bug #3: The Crash-Looping Docker Container

drone-Mirach on Mirach-Maia-Silo (the Unraid box) was in a perpetual restart loop. Logs showed:

Error: SSH key not found at /root/.ssh/id_rsa

The container was using an old entrypoint from when drones communicated over SSH. We switched to HTTP months ago. The container image never got updated.

The fix: Rebuild with the v2 HTTP-based drone:

docker run -d \
    --name dr-mm2 \
    --hostname dr-mm2 \
    --network host \
    --restart unless-stopped \
    -e GATEWAY_URL=http://10.42.0.199:8090 \
    -v drone-swarm-code:/opt/build-swarm:ro \
    -v dr-mm2-portage:/var/db/repos/gentoo \
    -v /root/.ssh/id_ed25519:/root/.ssh/id_rsa:ro \
    --entrypoint /bin/bash \
    gentoo-drone:v2 \
    -c '
        export PYTHONPATH=/opt/build-swarm/lib:$PYTHONPATH
        exec python3 /opt/build-swarm/bin/swarm-drone
    '

Key changes:

  • --network host so the container shares the host’s network stack; the drone is reachable at the host’s IP with no Docker NAT in between
  • Entrypoint overridden to run the Python drone directly
  • SSH key still mounted for rsync uploads (the only place we still use SSH)

Bug #4: Binary Path Validation

This one’s still partially open. The orchestrator’s binary validation is looking for files in the wrong path:

Looking for: /var/cache/binpkgs/sys-libs/libseccomp-2.6.0/libseccomp-2.6.0-r3.gpkg.tar
Should be:   /var/cache/binpkgs/sys-libs/libseccomp/libseccomp-2.6.0-r3.gpkg.tar

The version number shouldn’t be in the directory path. This causes “missing_binary” errors for packages that actually built successfully. Added to the fix list.
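
When I do fix it, it’ll probably be a small path-building helper on the orchestrator side. A sketch, using a naive version split as a stand-in for Portage’s own version-parsing helpers:

import re

def expected_binpkg_path(binpkg_root, category, pvr):
    # pvr is e.g. 'libseccomp-2.6.0-r3'. The directory component must be the
    # bare package name, so cut the version off at the first '-<digit>'.
    pn = re.sub(r'-\d.*$', '', pvr)
    return f'{binpkg_root}/{category}/{pn}/{pvr}.gpkg.tar'

# expected_binpkg_path('/var/cache/binpkgs', 'sys-libs', 'libseccomp-2.6.0-r3')
#   -> '/var/cache/binpkgs/sys-libs/libseccomp/libseccomp-2.6.0-r3.gpkg.tar'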


New Automation Scripts

Got tired of manually setting up nodes. Wrote scripts:

setup-drone.sh:

./scripts/setup-drone.sh 10.42.0.194 drone-Tau-Ceti

What it does:

  1. Fixes hostname resolution
  2. Creates directories
  3. Deploys drone code
  4. Creates OpenRC init script
  5. Configures sleep prevention (no hibernating mid-build)
  6. Syncs portage tree
  7. Enables and starts service
  8. Verifies gateway registration

Also wrote setup-orchestrator.sh and setup-gateway.sh for the other components.
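
Step 8 (verifying gateway registration) is just a poll of the gateway’s node list, the same endpoint the status check at the end of this entry hits. A rough Python equivalent, with the response shape treated as an assumption (it only checks that the drone’s name shows up at all):

import json
import sys
import urllib.request

GATEWAY = 'http://10.42.0.199:8090'

def drone_registered(name):
    # Pull the node list and look for the drone's name anywhere in it;
    # crude, but good enough for a post-setup sanity check.
    with urllib.request.urlopen(f'{GATEWAY}/api/v1/nodes', timeout=5) as resp:
        nodes = json.load(resp)
    return name in json.dumps(nodes)

if __name__ == '__main__':
    name = sys.argv[1] if len(sys.argv) > 1 else 'drone-Tau-Ceti'
    ok = drone_registered(name)
    print('registered' if ok else 'NOT registered')
    sys.exit(0 if ok else 1)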


LXC Container Conversion

Converted drone-Tau-Ceti from bare-metal to an LXC container. Better isolation, easier to reset if something goes wrong.

Container config:

lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.link = eno1
lxc.cgroup2.cpuset.cpus = 0-5
lxc.cgroup2.memory.max = 24G
lxc.start.auto = 1

6 cores, 24GB RAM, auto-starts on boot. Gets its own IP via DHCP on the macvlan bridge.


Sleep Prevention

Drones kept going to sleep mid-build on systems with power management. Added elogind config:

# /etc/elogind/logind.conf.d/no-sleep.conf
[Login]
HandlePowerKey=ignore
HandleSuspendKey=ignore
HandleHibernateKey=ignore
HandleLidSwitch=ignore
IdleAction=ignore
IdleActionSec=infinity

No more surprise naps.


Current Swarm Status

After all the fixes:

Component       Count   Status
Gateway         1       ✅ Online
Orchestrators   2       ✅ Both online
Drones          3       ✅ All building
Total Cores     46      Active

Test build results:

  • 30 packages queued
  • 23 successful (76%)
  • 7 blocked (mostly path validation bug + nvidia-drivers needing kernel sources)

Deployment Commands (Reference)

# Deploy drone code to all drones
for drone in drone-Mirach drone-Icarus drone-Tau-Ceti; do
  scp bin/swarm-drone root@$(build-swarm $drone ip):/opt/build-swarm/bin/
  build-swarm $drone restart
done

# Setup a new drone
./scripts/setup-drone.sh <ip> [name]

# Check swarm status
curl -s http://10.42.0.199:8090/api/v1/nodes | python3 -m json.tool

Four bugs. One race condition eating binaries. One Docker container from a different era. One missing global statement. One path validation issue still pending.

The swarm is at 46 cores now. Almost broke 50, but one of the drones kept going to sleep. Fixed that too.