Three Rsync Bugs In One Day
Date: 2026-01-29
Duration: ~7 hours
Bugs found: 3 (all in the same rsync command)
Bonus: Built a CLI tool out of frustration
The Situation
Full system build running across the swarm. 261 packages, 4 drones. Should be routine at this point.
It was not routine.
Drones were getting stuck. Uploads hung indefinitely. The circuit breaker was grounding nodes left and right. And I kept having to SSH into different machines to figure out what was happening.
Bug #1: No Timeout On Rsync
The first problem was obvious once I looked: rsync had no timeout. A stuck transfer would hang forever, blocking the drone from doing any other work.
One drone had been “uploading” the same package for 3 hours. The rsync process was still alive but doing nothing.
The fix:
rsync_cmd = [
    'rsync', '-av',
    '--timeout=300',   # 5 minute stall timeout
    # ...
]
subprocess.check_call(rsync_cmd, timeout=600)  # 10 minute total limit
Now if a transfer stalls for 5 minutes, rsync kills itself. If the whole operation takes more than 10 minutes, Python kills it.
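The drone side also has to survive the Python-side kill instead of falling over on an unhandled exception. A minimal sketch of that layer (the wrapper and its return convention are hypothetical, not the actual drone code):

import subprocess

def run_upload(rsync_cmd):
    # Let rsync's --timeout handle stalls, let Python's timeout cap the
    # whole operation, and turn both into a clean failure.
    try:
        subprocess.check_call(rsync_cmd, timeout=600)
        return True
    except subprocess.TimeoutExpired:
        # Python killed a transfer that blew the 10 minute budget.
        return False
    except subprocess.CalledProcessError:
        # rsync exited non-zero; exit code 30 is its own stall timeout firing.
        return False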
Bug #2: The Flag That Only Works Sometimes
Added --contimeout=30 for a 30-second connection timeout. Tested it locally. Deployed.
Every single upload started failing with exit code 1.
Turns out --contimeout is only valid when connecting to an rsync daemon. We’re using SSH transport. With SSH, that flag is completely invalid and rsync just dies.
The fix is to use SSH options instead:
rsync_cmd = [
    'rsync', '-av', '--timeout=300',
    '-e', 'ssh -o ConnectTimeout=30 -o ServerAliveInterval=30',
    # ...
]
Same functionality, but specified where rsync can actually understand it.
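The real lesson is that "tested locally" never exercised the SSH transport. A dry run over the actual transport would have caught it; something like this (host and paths are placeholders):

import subprocess

def preflight_rsync(host):
    # Exercise the exact flags over SSH without transferring anything.
    # An unsupported option (like --contimeout over SSH) fails here with
    # exit code 1, rsync's "syntax or usage error".
    cmd = [
        'rsync', '-av', '--dry-run', '--timeout=300',
        '-e', 'ssh -o ConnectTimeout=30 -o ServerAliveInterval=30',
        '/etc/hostname', f'root@{host}:/tmp/',
    ]
    return subprocess.run(cmd, capture_output=True, timeout=60).returncode == 0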
Bug #3: Uploading Everything Instead Of One File
This was the subtle one.
After a drone builds a package, it needs to upload the binary to the orchestrator’s staging area. One package. Maybe 50-100 MB typically.
Except the drones were uploading /var/cache/binpkgs/ - the entire binary package cache. 3.2 GB. Hundreds of files. On every single build completion.
The transfer would take forever (hence the timeout issues), use massive bandwidth, and often fail partway through because the orchestrator would reset the connection.
The fix: Upload only the specific package that was just built:
# Find the specific package file
pkg_file = f'{category}/{name}-{version}.gpkg.tar'
if os.path.exists(os.path.join(pkgdir, pkg_file)):
    rsync_cmd = [
        'rsync', '-av', '--timeout=120',
        '-e', 'ssh -o ConnectTimeout=30 -o ServerAliveInterval=30',
        '--relative',
        f'./{pkg_file}',
        f'root@{upload_host}:{staging_path}/'
    ]
    subprocess.check_call(rsync_cmd, cwd=pkgdir, timeout=300)
else:
    # Fall back to full sync only if we can't find the specific file
    # ...
The --relative flag preserves the directory structure (so app-misc/foo-1.0.gpkg.tar ends up in the right subdirectory), but only transfers that one file.
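Concretely, the "./" marks the cut point. The paths below are made-up examples, not the real staging layout:

import os

pkgdir = '/var/cache/binpkgs'
staging_path = '/srv/swarm/staging'   # made-up staging dir
pkg_file = 'app-misc/foo-1.0.gpkg.tar'

# Source argument handed to rsync, relative to cwd=pkgdir:
src = f'./{pkg_file}'
# With --relative, everything after the "./" is recreated on the remote:
print(f'{src} -> {os.path.join(staging_path, pkg_file)}')
# ./app-misc/foo-1.0.gpkg.tar -> /srv/swarm/staging/app-misc/foo-1.0.gpkg.tar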
Upload time dropped from “forever” to “a few seconds.”
The Circuit Breaker Problem
While fixing rsync, I kept fighting the circuit breaker. It’s supposed to ground misbehaving drones after 5 failures, but I’d made it too aggressive.
Original logic: 5 failures → grounded for 5 minutes → 10 minutes → 20 minutes → 30 minutes (exponential backoff).
The problem? With rsync broken, every drone hit 5 failures almost immediately. Then they were all grounded for increasingly long periods. The swarm ground to a halt.
The fix: Simple 2-minute cooloff. After 5 failures, wait 2 minutes, try again. If it works, reset the failure count to 0.
# In get_work_for_drone():
if health.get('failures', 0) >= 5:
    last_fail = datetime.fromisoformat(health.get('last_failure'))
    if datetime.now() - last_fail < timedelta(minutes=2):
        # Still grounded - reclaim work, return None
        return None
    else:
        # Cool-off expired - give drone another chance
        health['failures'] = 0
The goal is to get drones back online quickly, not punish them with exponential waits.
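The other half is the bookkeeping that feeds that check. Roughly this (names are illustrative, not the actual orchestrator code):

from datetime import datetime

def record_result(health, success):
    # Illustrative sketch of the failure tracking the cooloff check reads.
    if success:
        # Any success wipes the slate clean; no lingering strike count.
        health['failures'] = 0
    else:
        health['failures'] = health.get('failures', 0) + 1
        health['last_failure'] = datetime.now().isoformat()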
The CLI Tool (Born Of Frustration)
By hour 5, I was tired of this workflow:
# Check what drone-Titawin is doing
ssh [email protected] "tail -20 /var/log/build-swarm/drone.log"
# Check drone-Mirach's queue
curl -s http://100.64.0.18:8080/api/v1/status | jq
# Restart drone-Tau-Ceti
ssh [email protected] "rc-service swarm-drone restart"
So I built a CLI tool instead:
# Same things, but sane:
build-swarm drone-Titawin logs 20
build-swarm status
build-swarm drone-Tau-Ceti restart
Features:
- Partial name matching: Titawin-Host → drone-Titawin, masai → drone-Mirach (rough sketch of the matching after the command reference below)
- Auto-discovers gateway and orchestrator URLs
- Dynamic drone list from the gateway API
build-swarm status # Overall swarm status
build-swarm drones # List all drones
build-swarm <drone> queue # Show drone's package queue
build-swarm <drone> status # Show drone details
build-swarm <drone> restart # Restart drone service
build-swarm <drone> logs [N] # Show last N log lines
build-swarm <drone> ssh [cmd] # SSH to drone or run command
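The name matching is nothing clever. Roughly this, assuming the gateway hands back a plain list of drone names (the real tool's rules may differ slightly):

import re

def resolve_drone(query, drones):
    # Sketch only: match any alphanumeric token of the query against the
    # drone names, case-insensitively, and insist on a unique hit.
    tokens = [t.lower() for t in re.split(r'[^A-Za-z0-9]+', query) if t]
    matches = [d for d in drones if any(t in d.lower() for t in tokens)]
    if len(matches) == 1:
        return matches[0]
    raise SystemExit(f"could not uniquely resolve '{query}' (candidates: {matches})")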
Installed it to /usr/local/bin/build-swarm. No more remembering which IP is which drone.
Gateway Selection Bug
While I was in there, I also fixed the gateway’s orchestrator selection. It was directing drones to the orchestrator with the emptiest queue (makes sense for load balancing). But we only have one orchestrator that actually has work. The other is a backup.
So drones kept going to the empty backup, getting no work, and the primary sat there with 176 packages waiting.
Fixed it to send drones to whichever orchestrator has the most work. Dynamic primary selection based on actual queue size.
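In sketch form (the orchestrator records and their queue_size field are assumptions about the gateway's internals, not the actual code):

def pick_orchestrator(orchestrators):
    # Old behaviour: min() on queue size, i.e. classic load balancing. With
    # one busy primary and one idle backup, that routed every drone to the
    # idle backup. New behaviour: follow the work.
    return max(orchestrators, key=lambda o: o.get('queue_size', 0))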
The Final Count
After 7 hours:
| Component | Bugs Fixed |
|---|---|
| swarm-drone | 3 (rsync timeout, SSH flags, targeted upload) |
| swarm-orchestrator | 1 (circuit breaker cooloff) |
| swarm-gateway | 1 (orchestrator selection) |
| NEW: build-swarm CLI | Created from scratch |
Build progress by end of session:
Progress: 72% (188/261)
Queued: 65, Building: 8
All 4 drones active
Deployment Commands (For Future Reference)
# Deploy drone changes to all drones
for drone in drone-Mirach drone-Icarus drone-Tau-Ceti drone-Titawin; do
scp bin/swarm-drone root@$(build-swarm $drone ip):/opt/build-swarm/bin/
build-swarm $drone restart
done
# Deploy orchestrator changes
scp bin/swarm-orchestrator [email protected]:/opt/build-swarm/bin/
ssh [email protected] "rc-service swarm-orchestrator restart"
# Deploy gateway changes
scp bin/swarm-gateway [email protected]:/opt/build-swarm/bin/
ssh [email protected] "rc-service swarm-gateway restart"
Remaining Issues
- Duplicate log lines - Every entry appears twice. Probably a logging handler being added twice during init. Not critical but annoying. (Likely fix sketched after this list.)
- KDE package failures - 41+ packages failing. Needs USE flag investigation.
- drone-Mirach upload issues - Tailscale path to orchestrator might be flaky. The targeted upload fix should help.
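If the double-handler diagnosis is right, the usual guard looks something like this (a sketch; the logger name is an assumption, the log path is the one from earlier):

import logging

def get_drone_logger():
    # Attach the file handler at most once, even if init runs twice.
    logger = logging.getLogger('swarm-drone')
    if not logger.handlers:
        handler = logging.FileHandler('/var/log/build-swarm/drone.log')
        handler.setFormatter(logging.Formatter('%(asctime)s %(levelname)s %(message)s'))
        logger.addHandler(handler)
    return logger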
Three bugs. One rsync command. Seven hours. And now I have a CLI tool, so at least something good came out of it.