user@argobox:~/journal/2026-02-09-the-crash-recovery-and-the-lab-engine-goes-live
$ cat entry.md

The Crash Recovery & The Lab Engine Goes Live


Started: 2026-02-09 00:30
Ended: 17:15+
Sessions: 4 parallel workstreams
Issues resolved: 7 critical infrastructure problems
Status: Lab engine in production, fleet synchronized, drift detection designed


00:30 — The Crash

Callisto crashed hard. 100% CPU, 100% RAM, then silence. This followed a Proxmox incident where the IO host went OOM and Tailscale hijacked the routing table.

The work didn’t vanish — three repos had uncommitted changes when the system went down:

  • gentoo-build-swarm v2.6.0 in progress (7 modified, 6 new untracked files)
  • argobox admin panel work (active button states, SSR migration)
  • Multiple Vaults with session logs

Nothing lost on disk. But the system was down. Everything had to be recovered and re-verified.


01:00 — The Inventory of Damage

Spun up and took stock:

Tailscale was down. Service had stopped on callisto. The lxcbr0 bridge config had a bug — it was trying to enslave the bridge to itself — which prevented the network init script from completing.

Tau-Beta (drone-testbed) was offline. The LXC container on 10.0.0.194 was in STOPPED state. The swarm-drone service was disabled.

Sweeper-Capella was stopped. The LXC container on callisto was down. Left it stopped deliberately; Meridian can act as the sweeper if needed.

The orchestrator was PAUSED. 28 packages queued, 0 building, 4 drones sitting idle. The orchestrator-sync service had crashed. It had been paused for 2 days and nobody noticed. That’s a problem we’ll have to solve.

Drone-masaimara was disabled. The config said “DISABLED: Unraid host unreachable as of 2026-02-02” but the drone was actually online.

Titan’s binhost cache was empty. The /var/cache/binpkgs/ directory structure existed but had zero actual .gpkg.tar files. 775 stale entries in the Packages index pointing to nonexistent files.

All of this got fixed in the first recovery hour. Tailscale started manually. Tau-Beta container restarted. Orchestrator unpaused via curl. Masaimara re-enabled. Binhost acknowledged as stale.


02:00 — The Fleet Comparison

Took a hard look at what was running across the three systems:

| System | Packages | Role |
|---|---|---|
| callisto (local) | 1,607 | Desktop/Driver |
| orchestrator-io | 629 | Primary orchestrator (4 cores, 2GB RAM) |
| orchestrator-titan | 830 | Secondary orchestrator + binhost (Tailscale) |

They were NOT in sync.

64 version mismatches between io and titan. 13 packages only on io. 214 packages only on titan. Titan had KDE Frameworks 6.22 / Plasma 6.5.5 while io had Frameworks 6.20 / Plasma 6.5.4.

Callisto vs. the orchestrators: 87 total mismatches.

  • 32 packages where callisto was newer (openssl 3.5.5, mesa 25.2.8, libdrm 2.4.131)
  • 41 packages where titan was newer (KDE Frameworks 6.22, gnupg 2.5.17)

This is what happens when systems drift for weeks without active sync. The binhost can’t keep up. Drones build things but nothing remembers what should be installed where.
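The mismatch counts above fall out of a straight comparison of per-node version maps. A minimal sketch, using hypothetical package dicts rather than the real fleet data:

```python
def compare_nodes(a: dict[str, str], b: dict[str, str]):
    """Compare {package: version} maps from two nodes.

    Returns (version mismatches, packages only on a, packages only on b).
    """
    common = a.keys() & b.keys()
    mismatches = {p: (a[p], b[p]) for p in common if a[p] != b[p]}
    return mismatches, a.keys() - b.keys(), b.keys() - a.keys()

# Toy data standing in for the real io/titan package lists.
io = {"dev-libs/openssl": "3.5.4", "kde-frameworks/kio": "6.20"}
titan = {"dev-libs/openssl": "3.5.4", "kde-frameworks/kio": "6.22",
         "app-misc/extra": "1.0"}

mismatches, only_io, only_titan = compare_nodes(io, titan)
# kio differs in version; app-misc/extra exists only on titan
```

Scaled up to the full fleet, the same three return values are exactly the "64 mismatches / 13 only on io / 214 only on titan" numbers.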


03:00 — The Sync Strategy

Titan sits behind Tailscale (~100 KB/s), so syncing 7.9 GB there would take about 22 hours. Pivoted to io as the source of truth (local LAN, ~30 MB/s transfer).
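The 22-hour figure is simple bandwidth arithmetic (decimal units assumed):

```python
size = 7.9e9       # 7.9 GB to sync, in bytes
tailscale = 100e3  # ~100 KB/s through the tunnel
lan = 30e6         # ~30 MB/s on the local LAN

hours_via_tailscale = size / tailscale / 3600  # ~22 hours
minutes_via_lan = size / lan / 60              # ~4.4 minutes
```

Roughly a 300x difference, which is why the sync target flipped to io.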

Step 1: quickpkg’d all 32 packages where callisto was newer. 32/32 succeeded. 273MB total.

Step 2: rsync’d the 32 binpkgs to io at 29.6 MB/s. Took 10 seconds. Regenerated the Packages index via emaint binhost --fix. All 32 verified present and indexed.

Step 3: Tried rsync to titan. Killed it after partial transfer. That’s a background task for later when bandwidth is cheaper.

Step 4: Io needs 93 packages for a full @world update (14 new, 48 upgrades, 30 rebuilds). Kicked off parallel builds:

  • Heavy packages on masaimara (20 cores, 52GB RAM): LLVM/clang, mesa, GTK4, Python 3.13
  • Light packages on io (4 cores, 2GB RAM): KDE Frameworks 6.22 (35 packages), system libs
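Steps 1–2 form a repeatable promotion pipeline: quickpkg the newer packages, rsync the binpkg cache over, regenerate the index. A sketch that only builds the commands (the host name is hypothetical, and real use would add error handling):

```python
import subprocess

def promote(packages: list[str], binhost: str, execute: bool = False):
    """Build (and optionally run) the quickpkg -> rsync -> emaint pipeline."""
    cmds = [
        # Snapshot installed packages into /var/cache/binpkgs
        ["quickpkg", "--include-config=y", *packages],
        # Push the binpkg cache to the binhost
        ["rsync", "-a", "/var/cache/binpkgs/", f"{binhost}:/var/cache/binpkgs/"],
        # Rebuild the Packages index on the far side
        ["ssh", binhost, "emaint", "binhost", "--fix"],
    ]
    if execute:
        for cmd in cmds:
            subprocess.run(cmd, check=True)
    return cmds

cmds = promote(["dev-libs/openssl"], "io.lan")  # "io.lan" is a placeholder host
```

With `execute=True` this runs the same three-step flow described above end to end.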

Status: 1,299 needed, 368 received, 234 synced to io. The fleet is synchronizing.


08:10 — Lab Engine Deployed to Production

The lab engine got deployed to Proxmox CT 110 on titan with full Cloudflare Tunnel routing. This is real now.

Issues hit during deployment:

  1. Cloudflare Pages wasn’t rebuilding. The site still had old code. Turns out the build DID happen but without PUBLIC_LAB_API_SECRET set, so the HMAC signing wouldn’t work in production. Set the env var via Cloudflare API, triggered a new build. Verified the HMAC secret was baked into the JS bundle.

  2. CORS headers were missing on 403 responses. Starlette’s BaseHTTPMiddleware early returns bypass CORSMiddleware entirely. Converted the HMAC middleware to pure ASGI and added CORS headers directly to error responses.

  3. Stale uvicorn process wouldn’t die. rc-service lab-engine restart showed “no matching processes found” but the old process still held port 8094. Had to killall uvicorn; killall python3; rc-service lab-engine zap; rc-service lab-engine start.

  4. PROXMOX_PASSWORD wasn’t set. Proxmox requires password auth for terminal tickets (not API tokens). Set it via pveum passwd lab-engine@pve, added to CT 110 .env, restarted.
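The pure-ASGI conversion in item 2 looks roughly like this. The header name, allowed origin, and the path-only signing scheme are all simplifications, not the real lab engine code; the point is that a pure-ASGI middleware controls its own response messages, so it can attach CORS headers to its own 403s instead of relying on an outer CORSMiddleware it may have bypassed:

```python
import hashlib
import hmac

ALLOWED_ORIGIN = b"https://labs.argobox.com"  # assumption: the Pages origin
SECRET = b"dev-secret"                        # assumption: real key comes from env

class HMACMiddleware:
    """Pure-ASGI HMAC check that emits CORS headers on its own 403s,
    so rejected requests are still readable by the browser."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            return await self.app(scope, receive, send)
        headers = dict(scope["headers"])
        sig = headers.get(b"x-lab-signature", b"")  # hypothetical header name
        expected = hmac.new(SECRET, scope["path"].encode(),
                            hashlib.sha256).hexdigest().encode()
        if not hmac.compare_digest(sig, expected):
            # Send the 403 directly, with CORS headers included.
            await send({"type": "http.response.start", "status": 403,
                        "headers": [(b"access-control-allow-origin", ALLOWED_ORIGIN),
                                    (b"content-type", b"text/plain")]})
            await send({"type": "http.response.body", "body": b"bad signature"})
            return
        await self.app(scope, receive, send)
```

Because it speaks raw ASGI messages, nothing upstream can strip or skip the headers on the error path.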

All fixed. Lab engine is live on labs.argobox.com.


14:00 — Lab Engine UX Fixes

The lab engine was running but the experience was broken:

  1. Terminal wouldn’t auto-login. Proxmox console showed a getty login prompt. The container password was a random secrets.token_urlsafe(12) that users would never know. Added server-side auto-login in the terminal WebSocket proxy. Detects login: and Password: prompts in the output stream, injects credentials automatically using Proxmox protocol framing. Users see the login happen in real-time but never type the password.

  2. Container Lab page had wrong fetch. Used raw fetch() instead of labFetch(), so requests went to Cloudflare Pages instead of labs.argobox.com. HMAC signing failed. Fixed to use labFetch() for all API calls.

  3. SSH from CT 110 was broken. No ssh binary in the Alpine container. Installed openssh-client, generated an ed25519 key, added it to titan’s authorized_keys. Now exec_command() works via SSH fallback, so setup commands like apk add bash vim curl can run inside containers.

  4. False “Simulation Mode” banner. Health check and reconnect were racing. If health check timed out (3s), fallback to sim mode would fire even though the real session connected. Fixed by only running health check when there’s no saved session to reconnect.

  5. No clear “live vs simulation” indicators. Added:

    • Green “LIVE” badge on terminal when WebSocket connects to real container
    • Terminal title shows root@lab-xxxxxxxx instead of generic session ID
    • Green “Connected to live container” banner with pulsing dot
    • Descriptive provisioning steps: “Creating LXC container on Proxmox…”, “Booting Alpine Linux…”, “Opening live terminal session…”
    • Simulation banner only shows when engine is actually unreachable

Lab experience is now honest about what’s real and what’s simulated.


17:00 — Multi-Container Labs Planned

Started work on supporting multi-container labs (Networking Lab: 3 containers, IaC Playground: 2 containers, future Argo OS GUI: QEMU VM with noVNC).

Backend done:

  • WebSocket route reads container_id from query params
  • Terminal proxy validates and uses correct container from session
  • Increased MAX_WEBSOCKET_PER_SESSION from 1 to 4
  • Session manager ready for vm_type branching
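The backend pieces above can be sketched as one plain function, independent of the web framework. The session record shape and field names are hypothetical:

```python
from urllib.parse import parse_qs

MAX_WEBSOCKET_PER_SESSION = 4  # raised from 1 for multi-container labs

def resolve_container(query_string: bytes, session: dict) -> str:
    """Pick the container a terminal WebSocket should attach to.

    `session` is a hypothetical record: {"containers": [...], "sockets": int}.
    """
    params = parse_qs(query_string.decode())
    # Default to the first container for single-container labs.
    container_id = params.get("container_id", [session["containers"][0]])[0]
    if container_id not in session["containers"]:
        raise ValueError(f"unknown container {container_id!r}")
    if session["sockets"] >= MAX_WEBSOCKET_PER_SESSION:
        raise ValueError("too many websockets for this session")
    return container_id

session = {"containers": ["lab-net-a", "lab-net-b", "lab-net-c"], "sockets": 1}
cid = resolve_container(b"container_id=lab-net-b", session)
```

Validating against the session's own container list is what stops a client from attaching a terminal to someone else's container by guessing IDs.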

Frontend pending:

  • labWebSocketUrl() needs containerId parameter
  • TerminalEmbed needs to pass container_id
  • Networking page needs 3 TerminalEmbed instances with tab switching
  • IaC page needs 2 terminals
  • argo-os page needs VNC support (QEMU VM, noVNC component)

Full plan at ~/.claude/plans/prancy-beaming-sphinx.md.


The Drift Detection Feature (Designed, Not Yet Built)

The orchestrator was paused for 2 days. Nobody noticed. Packages stayed queued. Drones sat idle. This needs automation.

Designed a new v2.7.0 feature: Fleet Drift Detection

What it does:

  • Periodically collects package lists from all nodes via heartbeat
  • Compares versions across nodes
  • Auto-queues builds for outdated packages
  • Auto-promotes binpkgs to binhost after builds
  • Alerts when drift exceeds threshold
  • Restarts orchestrator if paused for >1 hour

Three types of drift detected:

  1. Version drift (same package, different versions)
  2. Missing packages (package on some nodes but not others)
  3. Stale binhost (index entries pointing to nonexistent files)
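Drift type 3 is mechanically checkable: walk the Packages index and compare entries against the files actually on disk, which is how the 775 stale titan entries surfaced. A sketch, assuming each index stanza carries a PATH: field (a simplification of the real index format):

```python
from pathlib import Path

def stale_entries(index_text: str, cache: Path) -> list[str]:
    """Return index PATH entries whose binpkg no longer exists on disk."""
    stale = []
    for line in index_text.splitlines():
        if line.startswith("PATH: "):
            rel = line[len("PATH: "):].strip()
            if not (cache / rel).is_file():
                stale.append(rel)
    return stale
```

Run periodically from the heartbeat, a non-empty return here would trigger either an index fix or a rebuild of the missing packages.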

This is what keeps a fleet actually synchronized instead of slowly degrading into chaos.


The Day in Numbers

  • 7 infrastructure issues resolved
  • 64 version mismatches between orchestrators (now being synced)
  • 32 binpkgs created and transferred in 10 seconds
  • 5 deployment bugs fixed (auth, CORS, process management, passwords, UX)
  • 4 lab engine UX issues resolved
  • 3 multi-container labs partially implemented

The orchestrator was paused for 2 days and nobody noticed. That’s the bug we need to solve. Not today, but soon. v2.7.0.