The Crash Recovery & The Lab Engine Goes Live
Started: 2026-02-09 00:30 · Ended: 17:15+ · Sessions: 4 parallel workstreams · Issues resolved: 7 critical infrastructure problems · Status: lab engine in production, fleet synchronizing, drift detection designed
00:30 — The Crash
Callisto crashed hard. 100% CPU, 100% RAM, then silence. This followed a Proxmox incident where the IO host went OOM and Tailscale hijacked the routing table.
The work didn’t vanish — three repos had uncommitted changes when the system went down:
- `gentoo-build-swarm`: v2.6.0 in progress (7 modified, 6 new untracked files)
- `argobox`: admin panel work (active button states, SSR migration)
- Multiple Vaults with session logs
Nothing lost on disk. But the system was down. Everything had to be recovered and re-verified.
01:00 — The Inventory of Damage
Spun up and took stock:
Tailscale was down. Service had stopped on callisto. The lxcbr0 bridge config had a bug — it was trying to enslave the bridge to itself — which prevented the network init script from completing.
Tau-Beta (drone-testbed) was offline. The LXC container on 10.0.0.194 was in STOPPED state. The swarm-drone service was disabled.
Sweeper-Capella was stopped. The LXC container on callisto was down. User chose not to restart it — they prefer using Meridian as the sweeper if needed.
The orchestrator was PAUSED. 28 packages queued, 0 building, 4 drones sitting idle. The orchestrator-sync service had crashed, and the orchestrator had been paused for 2 days without anyone noticing. That’s a problem we’ll have to solve.
Drone-masaimara was disabled. The config said “DISABLED: Unraid host unreachable as of 2026-02-02” but the drone was actually online.
Titan’s binhost cache was empty. The /var/cache/binpkgs/ directory structure existed but had zero actual .gpkg.tar files. 775 stale entries in the Packages index pointing to nonexistent files.
All of this got fixed in the first recovery hour. Tailscale started manually. Tau-Beta container restarted. Orchestrator unpaused via curl. Masaimara re-enabled. Binhost acknowledged as stale.
02:00 — The Fleet Comparison
Took a hard look at what was running across the three systems:
| System | Packages | Role |
|---|---|---|
| callisto (local) | 1,607 | Desktop/Driver |
| orchestrator-io | 629 | Primary orchestrator (4 cores, 2GB RAM) |
| orchestrator-titan | 830 | Secondary orchestrator + binhost (Tailscale) |
They were NOT in sync.
64 version mismatches between io and titan. 13 packages only on io. 214 packages only on titan. Titan had KDE 6.22 / Plasma 6.5.5 while io had 6.20 / 6.5.4.
Callisto vs. the orchestrators: 87 total mismatches.
- 32 packages where callisto was newer (openssl 3.5.5, mesa 25.2.8, libdrm 2.4.131)
- 41 packages where titan was newer (KDE Frameworks 6.22, gnupg 2.5.17)
This is what happens when systems drift for weeks without active sync. The binhost can’t keep up. Drones build things but nothing remembers what should be installed where.
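The mismatch counts above come from comparing per-node package inventories. A minimal sketch of that comparison, with toy inventories standing in for the real package lists (names and versions below are illustrative, not the actual fleet state):

```python
def compare_nodes(a, b):
    """Bucket the differences between two {package: version} maps.

    Versions are compared as opaque strings here; real Portage tooling
    would use proper version ordering to decide which side is newer.
    """
    return {
        "only_a": sorted(set(a) - set(b)),
        "only_b": sorted(set(b) - set(a)),
        "mismatched": sorted(p for p in set(a) & set(b) if a[p] != b[p]),
    }

# Toy stand-ins for io's and titan's inventories.
io_pkgs = {"dev-libs/openssl": "3.5.5", "kde-frameworks/kconfig": "6.20"}
titan_pkgs = {"dev-libs/openssl": "3.5.5",
              "kde-frameworks/kconfig": "6.22",
              "app-crypt/gnupg": "2.5.17"}

report = compare_nodes(io_pkgs, titan_pkgs)
```

Run across all node pairs, this is exactly the shape of report that produced the "64 mismatches / 13 only on io / 214 only on titan" numbers.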
03:00 — The Sync Strategy
io became the source of truth (local LAN, ~30 MB/s transfers). Titan sits behind Tailscale (~100 KB/s), so syncing 7.9 GB there would have taken ~22 hours; pivoted to io instead.
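The 22-hour figure is just bandwidth arithmetic, using the two observed transfer rates:

```python
# Why syncing through Tailscale was abandoned: 7.9 GB at ~100 KB/s
# (Tailscale) vs. ~30 MB/s (local LAN to io).
payload_bytes = 7.9e9

tailscale_hours = payload_bytes / 100e3 / 3600  # roughly 22 hours
lan_minutes = payload_bytes / 30e6 / 60         # a few minutes
```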
Step 1: quickpkg’d all 32 packages where callisto was newer. 32/32 succeeded. 273MB total.
Step 2: rsync’d the 32 binpkgs to io at 29.6 MB/s. Took 10 seconds. Regenerated the Packages index via emaint binhost --fix. All 32 verified present and indexed.
Step 3: Tried rsync to titan. Killed it after partial transfer. That’s a background task for later when bandwidth is cheaper.
Step 4: Io needs 93 packages for a full @world update (14 new, 48 upgrades, 30 rebuilds). Kicked off parallel builds:
- Heavy packages on masaimara (20 cores, 52GB RAM): LLVM/clang, mesa, GTK4, Python 3.13
- Light packages on io (4 cores, 2GB RAM): KDE Frameworks 6.22 (35 packages), system libs
Status: 1,299 needed, 368 received, 234 synced to io. The fleet is synchronizing.
08:10 — Lab Engine Deployed to Production
The lab engine got deployed to Proxmox CT 110 on titan with full Cloudflare Tunnel routing. This is real now.
Issues hit during deployment:
- Cloudflare Pages wasn’t rebuilding. The site still had old code. Turns out the build DID happen, but without `PUBLIC_LAB_API_SECRET` set, so HMAC signing wouldn’t work in production. Set the env var via the Cloudflare API, triggered a new build, and verified the HMAC secret was baked into the JS bundle.
- CORS headers were missing on 403 responses. Starlette’s `BaseHTTPMiddleware` early returns bypass `CORSMiddleware` entirely. Converted the HMAC middleware to pure ASGI and added CORS headers directly to error responses.
- Stale uvicorn process wouldn’t die. `rc-service lab-engine restart` showed “no matching processes found” but the old process still held port 8094. Had to run `killall uvicorn; killall python3; rc-service lab-engine zap; rc-service lab-engine start`.
- `PROXMOX_PASSWORD` wasn’t set. Proxmox requires password auth for terminal tickets (not API tokens). Set it via `pveum passwd lab-engine@pve`, added it to CT 110’s `.env`, and restarted.
All fixed. Lab engine is live on labs.argobox.com.
14:00 — Lab Engine UX Fixes
The lab engine was running but the experience was broken:
- Terminal wouldn’t auto-login. The Proxmox console showed a getty login prompt, and the container password was a random `secrets.token_urlsafe(12)` that users would never know. Added server-side auto-login in the terminal WebSocket proxy: it detects `login:` and `Password:` prompts in the output stream and injects credentials automatically using Proxmox protocol framing. Users see the login happen in real time but never type the password.
- Container Lab page used the wrong fetch. It called raw `fetch()` instead of `labFetch()`, so requests went to Cloudflare Pages instead of labs.argobox.com and HMAC signing failed. Switched all API calls to `labFetch()`.
- SSH from CT 110 was broken. There was no `ssh` binary in the Alpine container. Installed `openssh-client`, generated an ed25519 key, and added it to titan’s authorized_keys. Now `exec_command()` works via SSH fallback, so setup commands like `apk add bash vim curl` can run inside containers.
- False “Simulation Mode” banner. The health check and reconnect were racing: if the health check timed out (3s), the fallback to simulation mode fired even though the real session had connected. Fixed by only running the health check when there’s no saved session to reconnect.
- No clear “live vs. simulation” indicators. Added:
  - Green “LIVE” badge on the terminal when the WebSocket connects to a real container
  - Terminal title shows `root@lab-xxxxxxxx` instead of a generic session ID
  - Green “Connected to live container” banner with a pulsing dot
  - Descriptive provisioning steps: “Creating LXC container on Proxmox…”, “Booting Alpine Linux…”, “Opening live terminal session…”
  - Simulation banner only shows when the engine is actually unreachable
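The auto-login fix above amounts to a small prompt-watcher sitting in the output path of the proxy. A sketch of that idea (class and method names are hypothetical, and Proxmox's terminal framing of the injected bytes is omitted):

```python
class AutoLogin:
    """Watch terminal output for getty prompts and answer each once."""

    def __init__(self, username, password):
        # Prompts are matched case-insensitively against the tail of
        # the output stream.
        self._answers = {"login:": username + "\n",
                         "password:": password + "\n"}
        self._tail = ""

    def feed(self, chunk):
        """Consume an output chunk; return credentials to inject, if any."""
        self._tail = (self._tail + chunk.lower())[-256:]  # bounded memory
        for prompt, answer in list(self._answers.items()):
            if self._tail.rstrip().endswith(prompt):
                del self._answers[prompt]  # never answer a prompt twice
                self._tail = ""
                return answer
        return None
```

The proxy would call `feed()` on every chunk relayed to the browser and write any returned answer back toward the container, so the user watches the login scroll past without ever knowing the password.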
Lab experience is now honest about what’s real and what’s simulated.
17:00 — Multi-Container Labs Planned
Started work on supporting multi-container labs (Networking Lab: 3 containers, IaC Playground: 2 containers, future Argo OS GUI: QEMU VM with noVNC).
Backend done:
- WebSocket route reads `container_id` from query params
- Terminal proxy validates and uses the correct container from the session
- Increased `MAX_WEBSOCKET_PER_SESSION` from 1 to 4
- Session manager ready for `vm_type` branching
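The backend pieces above reduce to a per-socket container lookup with a cap. A hedged sketch, with a made-up session shape and function name (the real route lives inside the lab engine's web framework, but the check itself is framework-independent):

```python
from urllib.parse import parse_qs

MAX_WEBSOCKET_PER_SESSION = 4  # raised from 1 for multi-container labs

def resolve_container(session, query_string, open_sockets):
    """Pick the container a new terminal WebSocket should attach to.

    `session` is a dict with a "containers" list of container IDs;
    `query_string` is the raw query string from the WebSocket URL.
    """
    if open_sockets >= MAX_WEBSOCKET_PER_SESSION:
        raise ValueError("too many terminals for this session")
    params = parse_qs(query_string)
    # Default to the session's first container when none is requested,
    # which keeps single-container labs working unchanged.
    container_id = params.get("container_id", [session["containers"][0]])[0]
    if container_id not in session["containers"]:
        raise ValueError(f"container {container_id!r} not in session")
    return container_id
```

Validating against the session's own container list is what stops one lab session from opening a terminal into another session's container.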
Frontend pending:
- `labWebSocketUrl()` needs a `containerId` parameter
- TerminalEmbed needs to pass `container_id`
- Networking page needs 3 TerminalEmbed instances with tab switching
- IaC page needs 2 terminals
- argo-os page needs VNC support (QEMU VM, noVNC component)
Full plan at ~/.claude/plans/prancy-beaming-sphinx.md.
The Drift Detection Feature (Designed, Not Yet Built)
The orchestrator was paused for 2 days. Nobody noticed. Packages stayed queued. Drones sat idle. This needs automation.
Designed a new v2.7.0 feature: Fleet Drift Detection
What it does:
- Periodically collects package lists from all nodes via heartbeat
- Compares versions across nodes
- Auto-queues builds for outdated packages
- Auto-promotes binpkgs to binhost after builds
- Alerts when drift exceeds threshold
- Restarts orchestrator if paused for >1 hour
Three types of drift detected:
- Version drift (same package, different versions)
- Missing packages (package on some nodes but not others)
- Stale binhost (index entries pointing to nonexistent files)
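Two of these checks are easy to sketch in isolation. Function names are hypothetical; the one-hour threshold is from the design above:

```python
import time

PAUSE_RESTART_SECS = 3600  # restart threshold from the v2.7.0 design

def stale_index_entries(index_entries, files_on_disk):
    """Packages-index entries whose binpkg file no longer exists.

    Titan's 775 stale entries were exactly this failure mode: the index
    referenced .gpkg.tar files that were gone from /var/cache/binpkgs/.
    """
    return sorted(set(index_entries) - set(files_on_disk))

def should_restart_orchestrator(paused_since, now=None):
    """True once the orchestrator has been paused for over an hour."""
    if paused_since is None:
        return False
    now = time.time() if now is None else now
    return now - paused_since > PAUSE_RESTART_SECS
```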
This is what keeps a fleet actually synchronized instead of slowly degrading into chaos.
The Day in Numbers
- 7 infrastructure issues resolved
- 64 version mismatches between orchestrators (now being synced)
- 32 binpkgs created and transferred in 10 seconds
- 5 deployment bugs fixed (auth, CORS, process management, passwords, UX)
- 4 lab engine UX issues resolved
- 3 multi-container labs partially implemented
The orchestrator was paused for 2 days and nobody noticed. That’s the bug we need to solve. Not today, but soon. v2.7.0.