user@argobox:~/journal/2026-03-21-titan-goes-dark
$ cat entry.md

Titan Goes Dark


Date: 2026-03-21
Duration: 43+ hours offline and counting
Issue: Titan Proxmox host unreachable since 2:32 AM March 20
Root Cause: Unknown — can't read logs on a dead machine


The Timeline

I was pushing commits at 2:00 AM. ArgoBeat stuff, OpenClaw, pubDate fixes. Normal late-night code session. Nothing unusual.

2:12 AM — last commit pushed.

2:30 AM — standard Linux maintenance window kicks in. Probably unattended-upgrades doing its thing.

2:32 AM — Titan drops off Tailscale.

That's it. That's the whole timeline. Tailscale last-seen timestamp is the only forensic evidence I have because the machine that holds the actual logs is the machine that's dead.
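For the record, the "43+ hours" figure checks out; a quick sketch with GNU date arithmetic (the end timestamp is my assumption for roughly when this entry was written, not a logged value):

```shell
# Downtime since the Tailscale last-seen timestamp; the "as of"
# moment is assumed, since nothing else got logged.
last_seen=$(date -u -d "2026-03-20 02:32" +%s)
as_of=$(date -u -d "2026-03-21 21:32" +%s)
echo "$(( (as_of - last_seen) / 3600 )) hours offline"
```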


What's Down

CT 103 — the Legal RAG container — is completely unreachable. Port 8100 returns nothing. No health endpoint. No search. No LLM generation. The ArgoBox activity feed just says "unavailable" which is technically accurate but deeply unhelpful.

CourtListener sync missed both the March 20 and March 21 runs at 2 AM. The 12,455 10th Circuit cases from the March 18 plan were supposed to be ingested incrementally; instead they're queued up somewhere on a filesystem I can't reach.


What's Safe

The data. Thank god for AllShare.

/mnt/AllShare/colorado-legal-rag/
├── chroma_db/     3.5 GB (vectors, backup from 2026-03-08)
└── data/          2.9 GB (statutes, case law, rules, forms)

Persistent NTFS storage, physically independent of Titan. Even if the Proxmox host is bricked, the RAG data survives. The source code is in git. The container setup is documented in TITAN-CT103-SETUP.md. Everything is recoverable.

Everything except the container itself. CT 103 has no Proxmox snapshots. Not on Io, not anywhere. I never set up automated backups. That's the kind of oversight that doesn't matter until it matters.
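One cron line would have closed that gap. A sketch, assuming a backup-capable Proxmox storage named "allshare" exists on Titan — the storage name and schedule here are placeholders, not anything actually configured:

```shell
# /etc/cron.d/vzdump-ct103 (illustrative) — the same job can be
# configured under Datacenter -> Backup in the Proxmox web UI.
# Nightly snapshot-mode backup of CT 103, zstd-compressed:
15 3 * * * root vzdump 103 --storage allshare --mode snapshot --compress zstd --quiet 1
```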


The Hypothesis

My best guess: CourtListener daily sync was running at 2 AM. System updates kicked in at 2:30 AM. Resource conflict — maybe OOM, maybe a failed reboot from a kernel update while the sync was hammering disk I/O.

I cannot confirm this. The machine that could confirm it is the machine that's offline. Circular problem.
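When Titan does answer, this is the first forensics pass I'll run. A sketch assuming stock Debian/Proxmox log locations; it's guarded so it degrades to a one-line status message while the host is still dark:

```shell
# Forensics for the 2:32 AM window, safe to run before Titan is back.
# Log paths are the Debian defaults (assumed, not verified).
TITAN="[email protected]"
status=$(ssh -o ConnectTimeout=5 -o BatchMode=yes "$TITAN" true 2>/dev/null && echo up || echo down)
if [ "$status" = "up" ]; then
  # Kernel log: any OOM-killer activity during the window?
  ssh "$TITAN" 'journalctl -k --since "2026-03-20 02:00" --until "2026-03-20 03:00" | grep -i "out of memory"'
  # Did unattended-upgrades run (and maybe reboot) at 2:30?
  ssh "$TITAN" 'grep "2026-03-20 02:" /var/log/unattended-upgrades/unattended-upgrades.log'
  # Reboot/shutdown records around the crash:
  ssh "$TITAN" 'last -x reboot shutdown | head -n 5'
else
  echo "titan still unreachable"
fi
```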


Recovery Plan

Phase 1 is embarrassingly simple: walk over and press the power button. Or access the hypervisor console if remote management cooperates. Neither option is available to me right now.

Phase 2 is verification:

curl http://192.168.50.100:8100/health
curl "http://192.168.50.100:8100/api/search?q=custody"

Phase 3 is configuring the LEGAL_RAG_API_URL environment variable in CF Pages so the activity feed actually works. Turns out this was never configured — which means the "unavailable" status in ArgoBox predates the outage. I just didn't notice because I assumed it was a tunnel issue.

Phase 4, if the index is corrupted:

ssh [email protected] "pct exec 103 -- bash -c \
  'cd /opt/legal-rag && \
   nohup nice -n 19 python scripts/build_index.py \
     --collection co_caselaw --batch-size 8 &'"

nice -n 19, because the last thing I need is another resource conflict at 2 AM. And bash -c, because without it the && would escape pct exec and run the rebuild on the Proxmox host instead of inside the container.


The Known Unknowns

Four questions I can't answer until Titan is back:

  1. What actually caused the 2:32 AM crash? Sync + updates? OOM? Disk full? Network?
  2. Does CT 103 still exist on Titan's ZFS? If the container is gone, it's a full rebuild from the setup guide.
  3. Is the 2 AM cron job actually configured? Session notes mention it as a plan. Actual crontab on CT 103 — unverified.
  4. Why did I never set up backups? No documented reason. Just never did it.
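Questions 2 and 3 at least have one-line answers the moment the host responds. A sketch using the same guarded pattern (container ID 103 from the setup guide; safe to run while Titan is still down):

```shell
# Only does anything once Titan is reachable over SSH.
TITAN="[email protected]"
reachable=$(ssh -o ConnectTimeout=5 -o BatchMode=yes "$TITAN" true 2>/dev/null && echo yes || echo no)
if [ "$reachable" = "yes" ]; then
  ssh "$TITAN" "pct status 103"              # question 2: does the container still exist?
  ssh "$TITAN" "pct exec 103 -- crontab -l"  # question 3: is the 2 AM sync actually scheduled?
else
  echo "titan still unreachable"
fi
```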

What This Exposed

Two infrastructure gaps that should have been obvious:

No container backups. CT 103 is recoverable from source code + setup guide + AllShare mount, but it's not instant. A Proxmox snapshot would make this a 5-minute restore instead of a 2-hour rebuild.

No monitoring. Titan went offline at 2:32 AM and I didn't notice until the next day. No alerting. No health checks. Just Tailscale's last-seen timestamp staring at me from the admin panel.
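The monitoring gap is the cheapest one to close. A minimal sketch to cron every five minutes from any always-on machine that isn't Titan — the echo is a stand-in for whatever alerting actually reaches me:

```shell
# Health probe for CT 103 from outside Titan; a "down" here also
# catches the whole host going dark, not just the container.
URL="http://192.168.50.100:8100/health"
if curl -fsS --max-time 10 "$URL" >/dev/null 2>&1; then
  state="up"
else
  state="down"
  echo "ALERT $(date -u +%FT%TZ): legal-rag health check failed"
fi
```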

The data being safe on AllShare is the one bright spot. At least past-me got that right.

43 hours of silence from a machine I depend on. Probably should have noticed sooner.