The Week After Chaos: Auditing, Documenting, and Finally Breathing

The dust had settled.

Power crisis at the Andromeda site. Network ghost on the Unraid server. Four days of debugging, phone calls, and teaching dad to click “Console” instead of “that other thing.” The command center dashboard went from concept to critical infrastructure in a week.

Now came the boring part. The essential part. Making sure it all actually worked—and that it would keep working when I wasn’t watching.

The Audit

Saturday morning. Coffee. A blank checklist.

I’d learned something from the chaos: “it’s probably fine” is not a status. I needed to verify every service, manually, with my own eyes.

Build Infrastructure

Service         Status        Notes
Gateway         Responding    API returns valid JSON
Orchestrator    Processing    Queue depth: 3 packages
Drones (5/5)    Online        All reporting heartbeats
Binhost         Serving       SSH connection stable
Client sync     Working       Pulled test package successfully

The swarm had survived the power event. Self-healing kicked in exactly as designed. When the Andromeda drones came back online, they automatically rejoined the collective without intervention.

That felt good.

Media Services

Service          Status        Notes
Plex (local)     Streaming     Tested playback on three clients
Plex (remote)    Streaming     Tailscale latency: 42ms
Tautulli         Collecting    Stats updating in real time
Overseerr        Processing    Test request fulfilled
Sonarr/Radarr    Monitoring    Calendar showing upcoming releases

Everything worked. But Tautulli had stale session data—ghosts of streams that ended days ago. A restart cleared it. Added to the “things to check after incidents” list.

Storage

Service                 Status        Notes
Meridian-Mako shares    Mounted       NFS responsive
Cassiel-Silo            Accessible    DSM dashboard loading
Backup jobs             Running       Last backup: 6 hours ago
Array health            No errors     All drives healthy

The Unraid server was behaving. The folder structure had stabilized (dad hadn’t reorganized anything in three days—a new record).

Network

Service               Status        Notes
Tailscale mesh        Connected     All nodes visible
Cross-site latency    Acceptable    38-45ms to Andromeda
DNS resolution        Working       Internal and external
VPN tunnel            Stable        No drops in 48 hours

Green across the board.

The Issues (Minor)

Not everything was perfect:

Tautulli session ghosts — Fixed with a restart. Root cause: the service didn’t gracefully handle the power event. Sessions that were active when Tarn-Host went down were never properly closed.

Backup schedule drift — One job was set to 2 AM instead of 3 AM. This meant it competed with the Portage sync for disk I/O. Corrected.

DNS record stale — An old A record pointed to an IP that hadn’t existed for months. Cleaned up.

None of these would have caused an outage. All of them would have caused confusion during the next incident.


The Documentation Sprint

Auditing revealed gaps. Not in the services—in the documentation.

When I was guiding dad through Proxmox over the phone, I realized: there was no “click here, then here, then here” guide. I was improvising from memory while he was watching football.

Never again.

What I Wrote

Network Topology Diagram

Not just “here are the IPs.” A full visual:

  • Both sites with their subnets
  • Which machines are hypervisors vs VMs vs containers
  • Where the Tailscale subnet routers live
  • The path a packet takes from my desktop to dad’s Plex server

Service Catalog

Every running service, documented:

  • What it does
  • What depends on it
  • Where the config lives
  • How to restart it
  • What “broken” looks like
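
A typical entry looks something like this (the config path and restart command below are placeholders; the real catalog has the exact ones):

    Service:                 Tautulli
    What it does:            Collects Plex playback stats and session history
    Depends on:              Plex (local)
    Config lives at:         /opt/docker/tautulli/          (placeholder path)
    How to restart:          docker restart tautulli        (placeholder command)
    What broken looks like:  Ghost sessions that never close, stats not updating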

Recovery Runbooks

Step-by-step procedures for:

  • “Site is completely down” (check power, then UPS, then network, then hypervisor)
  • “Service X isn’t responding” (flowchart by service type)
  • “How to add a new service” (the right way, not the fast way)
  • “How to explain to dad what broke” (simplified terminology, no jargon)

That last one is real. When I’m debugging at 11 PM and need dad to check a blinking light, I can’t use words like “hypervisor” or “LXC container.” The runbook has translations.

Emergency Contact Procedures

If something breaks and I’m unreachable:

  1. Check if it’s just the internet (can you load Google?)
  2. Check if Tailscale is connected (green icon in system tray)
  3. If both are fine, the problem is probably on my end—wait
  4. If the internet is down, restart the router (unplug, wait 30 seconds, plug back in)
  5. If that doesn’t work, call the ISP

This is printed and taped to the inside of dad’s network cabinet. Laminated.


Cleanup Tasks

While documenting, I found cruft:

Orphaned containers — Three Docker containers that hadn’t run in months. Old experiments. Removed.

Configuration archaeology — Files named traefik.yml.backup.old.working.FINAL. Archived to a dated folder, originals deleted.

Naming inconsistencies — Some configs said “titan,” others said “Tarn-Host.” Standardized everything to the Galactic naming scheme.

DNS debt — Records for services that no longer existed. Hostnames that pointed to old IPs. Cleaned.

Small things. The kind of things that don’t break anything—until they do.


Maintenance Automation

Some tasks I was doing manually every week:

  • Check certificate expiration dates
  • Verify backups actually ran
  • Check disk space on all nodes
  • Rotate logs before they fill drives

These are now scripts. They run via cron. They report to Uptime Kuma.

If a certificate is expiring in < 14 days, I get an alert. If a backup hasn’t run in > 36 hours, I get an alert. If any disk is > 85% full, I get an alert.
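
The disk-space check is representative. A trimmed-down version looks roughly like this, assuming an Uptime Kuma push monitor on the receiving end (the push URL, token, and mount list are placeholders; the real values live with the script):

    #!/usr/bin/env python3
    """Disk-space check, run from cron. Pushes its result to an Uptime Kuma push monitor."""
    import shutil
    import urllib.parse
    import urllib.request

    PUSH_URL = "http://uptime.internal:3001/api/push/EXAMPLETOKEN"  # placeholder push monitor
    THRESHOLD = 85  # percent used; matches the alert rule above
    MOUNTS = ["/", "/mnt/user", "/var/lib/docker"]  # placeholder mount points

    def full_mounts() -> list[str]:
        """Return a description of every mount over the threshold."""
        problems = []
        for mount in MOUNTS:
            usage = shutil.disk_usage(mount)
            percent = usage.used / usage.total * 100
            if percent > THRESHOLD:
                problems.append(f"{mount} at {percent:.0f}%")
        return problems

    def report(problems: list[str]) -> None:
        """Tell Uptime Kuma whether this check passed, with a readable message."""
        status = "down" if problems else "up"
        msg = "; ".join(problems) or "OK"
        query = urllib.parse.urlencode({"status": status, "msg": msg})
        urllib.request.urlopen(f"{PUSH_URL}?{query}", timeout=10)

    if __name__ == "__main__":
        report(full_mounts())

The certificate and backup checks follow the same shape: measure one thing, compare it to a threshold, push up or down.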

The goal: no surprises.


Synthetic Health Checks

The command center dashboard shows if services are “up.” But “up” doesn’t mean “working.”

A container can be running while the application inside has crashed. A web server can respond to health checks while returning 500 errors to users.

I added synthetic checks:

Plex: Every 5 minutes, try to play a 10-second test video. If it fails, alert.

Build swarm: Every 5 minutes, query the gateway API. If it returns invalid JSON or times out, alert.

NAS: Every 5 minutes, list a test directory. If it fails or times out, alert.

These catch the “it’s running but broken” failures that status pages miss.
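
The build-swarm check is the simplest of the three. Stripped down, it is roughly this (the gateway URL is a stand-in, and the alerting wrapper is the same push pattern as the maintenance scripts):

    #!/usr/bin/env python3
    """Synthetic check: the build gateway must return valid JSON within the timeout."""
    import json
    import sys
    import urllib.request

    GATEWAY_URL = "http://gateway.internal:8080/api/status"  # stand-in for the real endpoint
    TIMEOUT = 10  # seconds; anything slower counts as broken

    def gateway_healthy() -> bool:
        """True only if the gateway answers in time and the body parses as JSON."""
        try:
            with urllib.request.urlopen(GATEWAY_URL, timeout=TIMEOUT) as resp:
                json.loads(resp.read())  # "running" isn't enough; the response has to parse
            return True
        except (OSError, ValueError):
            # Timeout, refused connection, HTTP error, or invalid JSON: all failures.
            return False

    if __name__ == "__main__":
        # Non-zero exit is what triggers the alert.
        sys.exit(0 if gateway_healthy() else 1)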


The Philosophy

After a week of chaos and a weekend of cleanup, I had time to think about what I was actually building.

Infrastructure isn’t done when it works.

It’s done when:

  1. Someone else could run it. If I got hit by a bus, could dad keep Plex running? Now, maybe. There’s a runbook.

  2. Future-me won’t curse past-me. In six months, I won’t remember why I configured the subnet router that way. Now there’s a comment explaining it.

  3. The documentation matches reality. Every IP in the docs is accurate. Every hostname is current. I verified them all.

  4. Recovery doesn’t require heroics. Self-healing handles the common failures. Runbooks handle the uncommon ones. Phone calls to dad are the last resort, not the first.

None of this is exciting.

None of it makes for dramatic blog posts.

All of it is why I can sleep through power outages now.


The Scorecard

Metric                                   Before      After
Services documented                      ~40%        100%
Recovery runbooks                        0           4
Orphaned containers                      3           0
Stale DNS records                        7           0
Automated health checks                  0           12
“Things I should probably write down”    Infinite    Finite

The week started with a power blip and ended with a laminated emergency guide in dad’s network cabinet.

The swarm is stable. The documentation is current. The runbooks are written.

And next time something breaks at halftime, dad knows exactly where to click.

System: maintainable. Finally.