The Week After Chaos: Auditing, Documenting, and Finally Breathing

The dust had settled.

Power crisis at the Andromeda site. Network ghost on the Unraid server. Four days of debugging, phone calls, and teaching dad to click “Console” instead of “that other thing.” The command center dashboard went from concept to critical infrastructure in a week.

Now came the boring part. The essential part. Making sure it all actually worked—and that it would keep working when I wasn’t watching.

The Audit

Saturday morning. Coffee. A blank checklist.

I’d learned something from the chaos: “it’s probably fine” is not a status. I needed to verify every service, manually, with my own eyes.

Build Infrastructure

Service         Status        Notes
Gateway         Responding    API returns valid JSON
Orchestrator    Processing    Queue depth: 3 packages
Drones (5/5)    Online        All reporting heartbeats
Binhost         Serving       SSH connection stable
Client sync     Working       Pulled test package successfully

The swarm had survived the power event. Self-healing kicked in exactly as designed. When the Andromeda drones came back online, they automatically rejoined the collective without intervention.

That felt good.

Media Services

Service          Status        Notes
Plex (local)     Streaming     Tested playback on three clients
Plex (remote)    Streaming     Tailscale latency: 42ms
Tautulli         Collecting    Stats updating in real time
Overseerr        Processing    Test request fulfilled
Sonarr/Radarr    Monitoring    Calendar showing upcoming releases

Everything worked. But Tautulli had stale session data—ghosts of streams that ended days ago. A restart cleared it. Added to the “things to check after incidents” list.

Storage

Service                 Status        Notes
Meridian-Mako shares    Mounted       NFS responsive
Cassiel-Silo            Accessible    DSM dashboard loading
Backup jobs             Running       Last backup: 6 hours ago
Array health            No errors     All drives healthy

The Unraid server was behaving. The folder structure had stabilized (dad hadn’t reorganized anything in three days—a new record).

Network

Service               Status        Notes
Tailscale mesh        Connected     All nodes visible
Cross-site latency    Acceptable    38-45ms to Andromeda
DNS resolution        Working       Internal and external
VPN tunnel            Stable        No drops in 48 hours

Green across the board.

The Issues (Minor)

Not everything was perfect:

Tautulli session ghosts — Fixed with a restart. Root cause: the service didn’t gracefully handle the power event. Sessions that were active when Tarn-Host went down were never properly closed.

Backup schedule drift — One job was set to 2 AM instead of 3 AM. This meant it competed with the Portage sync for disk I/O. Corrected.

DNS record stale — An old A record pointed to an IP that hadn’t existed for months. Cleaned up.

None of these would have caused an outage. All of them would have caused confusion during the next incident.


The Documentation Sprint

Auditing revealed gaps. Not in the services—in the documentation.

When I was guiding dad through Proxmox over the phone, I realized: there was no “click here, then here, then here” guide. I was improvising from memory while he was watching football.

Never again.

What I Wrote

Network Topology Diagram

Not just “here are the IPs.” A full visual:

  • Both sites with their subnets
  • Which machines are hypervisors vs VMs vs containers
  • Where the Tailscale subnet routers live
  • The path a packet takes from my desktop to dad’s Plex server

Service Catalog

Every running service, documented:

  • What it does
  • What depends on it
  • Where the config lives
  • How to restart it
  • What “broken” looks like
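
A typical entry looks something like this (the config path and restart command below are placeholders; the real catalog has the exact ones):

    Service:                 Tautulli
    What it does:            Collects Plex playback stats and session history
    Depends on:              Plex (local)
    Config lives at:         /opt/docker/tautulli/          (placeholder path)
    How to restart:          docker restart tautulli        (placeholder command)
    What broken looks like:  Ghost sessions that never close, stats not updating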

Recovery Runbooks

Step-by-step procedures for:

  • “Site is completely down” (check power, then UPS, then network, then hypervisor)
  • “Service X isn’t responding” (flowchart by service type)
  • “How to add a new service” (the right way, not the fast way)
  • “How to explain to dad what broke” (simplified terminology, no jargon)

That last one is real. When I’m debugging at 11 PM and need dad to check a blinking light, I can’t use words like “hypervisor” or “LXC container.” The runbook has translations.

Emergency Contact Procedures

If something breaks and I’m unreachable:

  1. Check if it’s just the internet (can you load Google?)
  2. Check if Tailscale is connected (green icon in system tray)
  3. If both are fine, the problem is probably on my end—wait
  4. If the internet is down, restart the router (unplug, wait 30 seconds, plug back in)
  5. If that doesn’t work, call the ISP

This is printed and taped to the inside of dad’s network cabinet. Laminated.


Cleanup Tasks

While documenting, I found cruft:

Orphaned containers — Three Docker containers that hadn’t run in months. Old experiments. Removed.

Configuration archaeology — Files named traefik.yml.backup.old.working.FINAL. Archived to a dated folder, originals deleted.

Naming inconsistencies — Some configs said “titan,” others said “Tarn-Host.” Standardized everything to the Galactic naming scheme.

DNS debt — Records for services that no longer existed. Hostnames that pointed to old IPs. Cleaned.

Small things. The kind of things that don’t break anything—until they do.


Maintenance Automation

Some tasks I was doing manually every week:

  • Check certificate expiration dates
  • Verify backups actually ran
  • Check disk space on all nodes
  • Rotate logs before they fill drives

These are now scripts. They run via cron. They report to Uptime Kuma.

If a certificate is expiring in < 14 days, I get an alert. If a backup hasn’t run in > 36 hours, I get an alert. If any disk is > 85% full, I get an alert.
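
The disk-space check is representative. A trimmed-down version looks roughly like this, assuming an Uptime Kuma push monitor on the receiving end (the push URL, token, and mount list are placeholders; the real values live with the script):

    #!/usr/bin/env python3
    """Disk-space check, run from cron. Pushes its result to an Uptime Kuma push monitor."""
    import shutil
    import urllib.parse
    import urllib.request

    PUSH_URL = "http://uptime.internal:3001/api/push/EXAMPLETOKEN"  # placeholder push monitor
    THRESHOLD = 85  # percent used; matches the alert rule above
    MOUNTS = ["/", "/mnt/user", "/var/lib/docker"]  # placeholder mount points

    def full_mounts() -> list[str]:
        """Return a description of every mount over the threshold."""
        problems = []
        for mount in MOUNTS:
            usage = shutil.disk_usage(mount)
            percent = usage.used / usage.total * 100
            if percent > THRESHOLD:
                problems.append(f"{mount} at {percent:.0f}%")
        return problems

    def report(problems: list[str]) -> None:
        """Tell Uptime Kuma whether this check passed, with a readable message."""
        status = "down" if problems else "up"
        msg = "; ".join(problems) or "OK"
        query = urllib.parse.urlencode({"status": status, "msg": msg})
        urllib.request.urlopen(f"{PUSH_URL}?{query}", timeout=10)

    if __name__ == "__main__":
        report(full_mounts())

The certificate and backup checks follow the same shape: measure one thing, compare it to a threshold, push up or down.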

The goal: no surprises.


Synthetic Health Checks

The command center dashboard shows if services are “up.” But “up” doesn’t mean “working.”

A container can be running while the application inside has crashed. A web server can respond to health checks while returning 500 errors to users.

I added synthetic checks:

Plex: Every 5 minutes, try to play a 10-second test video. If it fails, alert.

Build swarm: Every 5 minutes, query the gateway API. If it returns invalid JSON or times out, alert.

NAS: Every 5 minutes, list a test directory. If it fails or times out, alert.

These catch the “it’s running but broken” failures that status pages miss.
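
The build-swarm check is the simplest of the three. Stripped down, it is roughly this (the gateway URL is a stand-in, and the alerting wrapper is the same push pattern as the maintenance scripts):

    #!/usr/bin/env python3
    """Synthetic check: the build gateway must return valid JSON within the timeout."""
    import json
    import sys
    import urllib.request

    GATEWAY_URL = "http://gateway.internal:8080/api/status"  # stand-in for the real endpoint
    TIMEOUT = 10  # seconds; anything slower counts as broken

    def gateway_healthy() -> bool:
        """True only if the gateway answers in time and the body parses as JSON."""
        try:
            with urllib.request.urlopen(GATEWAY_URL, timeout=TIMEOUT) as resp:
                json.loads(resp.read())  # "running" isn't enough; the response has to parse
            return True
        except (OSError, ValueError):
            # Timeout, refused connection, HTTP error, or invalid JSON: all failures.
            return False

    if __name__ == "__main__":
        # Non-zero exit is what triggers the alert.
        sys.exit(0 if gateway_healthy() else 1)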


The Philosophy

After a week of chaos and a weekend of cleanup, I had time to think about what I was actually building.

Infrastructure isn’t done when it works.

It’s done when:

  1. Someone else could run it. If I got hit by a bus, could dad keep Plex running? Now, maybe. There’s a runbook.

  2. Future-me won’t curse past-me. In six months, I won’t remember why I configured the subnet router that way. Now there’s a comment explaining it.

  3. The documentation matches reality. Every IP in the docs is accurate. Every hostname is current. I verified them all.

  4. Recovery doesn’t require heroics. Self-healing handles the common failures. Runbooks handle the uncommon ones. Phone calls to dad are the last resort, not the first.

None of this is exciting.

None of it makes for dramatic blog posts.

All of it is why I can sleep through power outages now.


The Scorecard

Metric                                   Before      After
Services documented                      ~40%        100%
Recovery runbooks                        0           4
Orphaned containers                      3           0
Stale DNS records                        7           0
Automated health checks                  0           12
“Things I should probably write down”    Infinite    Finite

The week started with a power blip and ended with a laminated emergency guide in dad’s network cabinet.

The swarm is stable. The documentation is current. The runbooks are written.

And next time something breaks at halftime, dad knows exactly where to click.

System: maintainable. Finally.