The Race Condition That Ate My Binaries
Date: 2026-01-23
Session: Major swarm overhaul
Bugs Fixed: 4
New Scripts: 3
The Missing Binaries Mystery
Packages were building successfully. Drones were reporting completion. But when I checked the orchestrator’s staging area, some binaries were just… gone.
The orchestrator logs showed validation failures for packages that definitely built. What was happening?
Bug #1: The rsync Delete Flag
Found it in the drone code:
rsync_cmd = [
    'rsync', '-av', '--remove-source-files',  # 💀
    f'{pkgdir}/',
    f'root@{orchestrator_ip}:{staging_path}/'
]
--remove-source-files deletes the local file immediately after upload. Sounds efficient, right?
The problem: the orchestrator runs validation after receiving the file. If validation fails for any reason (wrong checksum, path mismatch, whatever), the orchestrator rejects the package. But the drone already deleted its local copy. The binary is gone forever.
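For context, the orchestrator's acceptance check works roughly like this. A sketch under assumptions: the function name, the arguments, and the bare checksum comparison are stand-ins for whatever the real orchestrator does.

import hashlib
import os

def validate_staged_binary(staging_path, filename, expected_sha256):
    """Decide whether to accept a binary a drone just rsynced into staging.

    Illustrative only -- the real validation also checks path layout, etc.
    """
    path = os.path.join(staging_path, filename)
    if not os.path.isfile(path):
        return False  # path mismatch: reject
    with open(path, 'rb') as f:
        digest = hashlib.sha256(f.read()).hexdigest()
    return digest == expected_sha256  # checksum mismatch: reject

Any False here means the package is rejected, and with --remove-source-files the drone has already thrown away the only copy.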
The fix: Don’t delete until the orchestrator confirms acceptance.
def upload_binary(orchestrator_ip, package):
    # rsync WITHOUT --remove-source-files
    rsync_cmd = [
        'rsync', '-av', '--ignore-existing',
        f'{pkgdir}/',
        f'root@{orchestrator_ip}:{staging_path}/'
    ]
    subprocess.check_call(rsync_cmd)
    # Local copy still exists

def cleanup_local_binaries(package):
    """Only called after orchestrator confirms success."""
    # Now it's safe to delete
    shutil.rmtree(pkgdir, ignore_errors=True)
The orchestrator’s completion response now includes an accepted field:
self.send_json({'status': 'ok', 'accepted': accepted})
Drones only clean up when accepted: true.
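On the drone side, the report-then-cleanup flow ends up looking roughly like this. A minimal sketch that reuses cleanup_local_binaries from above; the endpoint path, the payload shape, and the use of requests are assumptions, not the real drone code.

import logging

import requests  # assumed HTTP client; the drones already talk HTTP to the swarm

log = logging.getLogger('swarm-drone')

def report_and_cleanup(orchestrator_url, package):
    """Report a finished build; delete the local binpkg only if it was accepted."""
    resp = requests.post(f'{orchestrator_url}/complete',  # hypothetical endpoint path
                         json={'package': package}, timeout=30)
    body = resp.json()
    if body.get('status') == 'ok' and body.get('accepted'):
        cleanup_local_binaries(package)  # safe: orchestrator confirmed acceptance
    else:
        # Keep the local copy so a failed validation can't destroy the build
        log.warning('orchestrator rejected %s, keeping local binpkg', package)

Worst case now is a duplicate binary sitting on the drone, which is a lot cheaper than a rebuild.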
Bug #2: Global Variable in Threading
The heartbeat worker thread was throwing UnboundLocalError:
UnboundLocalError: local variable 'current_package' referenced before assignment
Classic Python scoping gotcha rather than a threading one: threads see module-level globals just fine. The real problem is that heartbeat_worker assigns to these names further down, which makes Python treat them as locals for the entire function, so the read at the top fails before anything has been assigned.
# BEFORE (broken)
def heartbeat_worker():
    while True:
        phone_home()
        if current_package and build_start_time > 0:  # 💥
            # ...

# AFTER (fixed)
def heartbeat_worker():
    global current_package, build_start_time  # Added this
    while True:
        phone_home()
        if current_package and build_start_time > 0:
            # ...
One line. Hours of debugging.
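For the record, the gotcha is about assignment, not threads. A minimal repro, independent of the swarm code:

current_package = None

def reader():
    print(current_package)      # fine: a plain read falls through to the global

def writer_broken():
    print(current_package)      # UnboundLocalError: the assignment below makes
    current_package = 'built'   # the name local to the *entire* function

def writer_fixed():
    global current_package
    print(current_package)      # fine again
    current_package = 'built'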
Bug #3: The Crash-Looping Docker Container
drone-Mirach on Mirach-Maia-Silo (the Unraid box) was in a perpetual restart loop. Logs showed:
Error: SSH key not found at /root/.ssh/id_rsa
The container was using an old entrypoint from when drones communicated over SSH. We switched to HTTP months ago. The container image never got updated.
The fix: Rebuild with the v2 HTTP-based drone:
docker run -d \
  --name dr-mm2 \
  --hostname dr-mm2 \
  --network host \
  --restart unless-stopped \
  -e GATEWAY_URL=http://10.42.0.199:8090 \
  -v drone-swarm-code:/opt/build-swarm:ro \
  -v dr-mm2-portage:/var/db/repos/gentoo \
  -v /root/.ssh/id_ed25519:/root/.ssh/id_rsa:ro \
  --entrypoint /bin/bash \
  gentoo-drone:v2 \
  -c '
    export PYTHONPATH=/opt/build-swarm/lib:$PYTHONPATH
    exec python3 /opt/build-swarm/bin/swarm-drone
  '
Key changes:
- --network host so the container shares the host's network stack (and IP) directly
- Override the entrypoint to run the Python drone directly
- Still mount the SSH key for rsync uploads (the only direction we use SSH now)
Bug #4: Binary Path Validation
This one’s still partially open. The orchestrator’s binary validation is looking for files in the wrong path:
Looking for: /var/cache/binpkgs/sys-libs/libseccomp-2.6.0/libseccomp-2.6.0-r3.gpkg.tar
Should be: /var/cache/binpkgs/sys-libs/libseccomp/libseccomp-2.6.0-r3.gpkg.tar
The version number shouldn’t be in the directory path. This causes “missing_binary” errors for packages that actually built successfully. Added to the fix list.
New Automation Scripts
Got tired of manually setting up nodes. Wrote scripts:
setup-drone.sh:
./scripts/setup-drone.sh 10.42.0.194 drone-Tau-Ceti
What it does:
- Fixes hostname resolution
- Creates directories
- Deploys drone code
- Creates OpenRC init script
- Configures sleep prevention (no hibernating mid-build)
- Syncs portage tree
- Enables and starts service
- Verifies gateway registration
Also wrote setup-orchestrator.sh and setup-gateway.sh for the other components.
LXC Container Conversion
Converted drone-Tau-Ceti from bare-metal to an LXC container. Better isolation, easier to reset if something goes wrong.
Container config:
lxc.net.0.type = macvlan
lxc.net.0.macvlan.mode = bridge
lxc.net.0.link = eno1
lxc.cgroup2.cpuset.cpus = 0-5
lxc.cgroup2.memory.max = 24G
lxc.start.auto = 1
6 cores, 24GB RAM, auto-starts on boot. Gets its own IP via DHCP on the macvlan bridge.
Sleep Prevention
Drones kept going to sleep mid-build on systems with power management. Added elogind config:
# /etc/elogind/logind.conf.d/no-sleep.conf
[Login]
HandlePowerKey=ignore
HandleSuspendKey=ignore
HandleHibernateKey=ignore
HandleLidSwitch=ignore
IdleAction=ignore
IdleActionSec=infinity
No more surprise naps.
Current Swarm Status
After all the fixes:
| Component | Count | Status |
|---|---|---|
| Gateway | 1 | ✅ Online |
| Orchestrators | 2 | ✅ Both online |
| Drones | 3 | ✅ All building |
| Total Cores | 46 | Active |
Test build results:
- 30 packages queued
- 23 successful (77%)
- 7 blocked (mostly path validation bug + nvidia-drivers needing kernel sources)
Deployment Commands (Reference)
# Deploy drone code to all drones
for drone in drone-Mirach drone-Icarus drone-Tau-Ceti; do
  scp bin/swarm-drone root@$(build-swarm $drone ip):/opt/build-swarm/bin/
  build-swarm $drone restart
done
# Setup a new drone
./scripts/setup-drone.sh <ip> [name]
# Check swarm status
curl -s http://10.42.0.199:8090/api/v1/nodes | python3 -m json.tool
Four bugs. One race condition eating binaries. One Docker container from a different era. One missing global statement. One path validation issue still pending.
The swarm is at 46 cores now. Almost broke 50, but one of the drones kept going to sleep. Fixed that too.