The dust had settled.
Power crisis at the Andromeda site. Network ghost on the Unraid server. Four days of debugging, phone calls, and teaching dad to click “Console” instead of “that other thing.” The command center dashboard went from concept to critical infrastructure in a week.
Now came the boring part. The essential part. Making sure it all actually worked—and that it would keep working when I wasn’t watching.
The Audit
Saturday morning. Coffee. A blank checklist.
I’d learned something from the chaos: “it’s probably fine” is not a status. I needed to verify every service, manually, with my own eyes.
Build Infrastructure
| Service | Status | Notes |
|---|---|---|
| Gateway | Responding | API returns valid JSON |
| Orchestrator | Processing | Queue depth: 3 packages |
| Drones (5/5) | Online | All reporting heartbeats |
| Binhost | Serving | SSH connection stable |
| Client sync | Working | Pulled test package successfully |
The swarm had survived the power event. Self-healing kicked in exactly as designed. When the Andromeda drones came back online, they automatically rejoined the collective without intervention.
That felt good.
Media Services
| Service | Status | Notes |
|---|---|---|
| Plex (local) | Streaming | Tested playback on three clients |
| Plex (remote) | Streaming | Tailscale latency: 42ms |
| Tautulli | Collecting | Stats updating in real-time |
| Overseerr | Processing | Test request fulfilled |
| Sonarr/Radarr | Monitoring | Calendar showing upcoming releases |
Everything worked. But Tautulli had stale session data—ghosts of streams that ended days ago. A restart cleared it. Added to the “things to check after incidents” list.
Storage
| Service | Status | Notes |
|---|---|---|
| Meridian-Mako shares | Mounted | NFS responsive |
| Cassiel-Silo | Accessible | DSM dashboard loading |
| Backup jobs | Running | Last backup: 6 hours ago |
| Array health | No errors | All drives healthy |
The Unraid server was behaving. The folder structure had stabilized (dad hadn’t reorganized anything in three days—a new record).
Network
| Service | Status | Notes |
|---|---|---|
| Tailscale mesh | Connected | All nodes visible |
| Cross-site latency | Acceptable | 38-45ms to Andromeda |
| DNS resolution | Working | Internal and external |
| VPN tunnel | Stable | No drops in 48 hours |
Green across the board.
The Issues (Minor)
Not everything was perfect:
Tautulli session ghosts — Fixed with a restart. Root cause: the service didn’t gracefully handle the power event. Sessions that were active when Tarn-Host went down were never properly closed.
Backup schedule drift — One job was set to 2 AM instead of 3 AM. This meant it competed with the Portage sync for disk I/O. Corrected.
DNS record stale — An old A record pointed to an IP that hadn’t existed for months. Cleaned up.
None of these would have caused an outage. All of them would have caused confusion during the next incident.
The Documentation Sprint
Auditing revealed gaps. Not in the services—in the documentation.
When I was guiding dad through Proxmox over the phone, I realized: there was no “click here, then here, then here” guide. I was improvising from memory while he was watching football.
Never again.
What I Wrote
Network Topology Diagram
Not just “here are the IPs.” A full visual:
- Both sites with their subnets
- Which machines are hypervisors vs VMs vs containers
- Where the Tailscale subnet routers live
- The path a packet takes from my desktop to dad’s Plex server
Service Catalog
Every running service, documented:
- What it does
- What depends on it
- Where the config lives
- How to restart it
- What “broken” looks like
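To keep entries consistent, every service gets the same fields. Here's a sketch of what one entry might look like, written as a Python dict purely for illustration; the field names, the path, and the restart command are placeholders, not the real catalog:

```python
# Illustrative catalog entry. The actual catalog is a document; the config path
# and restart command below are assumptions, not my real setup.
plex_entry = {
    "what_it_does": "Media streaming for local clients and remote clients over Tailscale",
    "depends_on": ["Meridian-Mako NFS shares", "Tailscale"],
    "config_lives_at": "/opt/plex/config",    # assumed path
    "how_to_restart": "docker restart plex",  # assumed command
    "broken_looks_like": "Library loads but every title buffers forever",
}
```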
Recovery Runbooks
Step-by-step procedures for:
- “Site is completely down” (check power, then UPS, then network, then hypervisor)
- “Service X isn’t responding” (flowchart by service type)
- “How to add a new service” (the right way, not the fast way)
- “How to explain to dad what broke” (simplified terminology, no jargon)
That last one is real. When I’m debugging at 11 PM and need dad to check a blinking light, I can’t use words like “hypervisor” or “LXC container.” The runbook has translations.
Emergency Contact Procedures
If something breaks and I’m unreachable:
- Check if it’s just the internet (can you load Google?)
- Check if Tailscale is connected (green icon in system tray)
- If both are fine, the problem is probably on my end—wait
- If the internet is down, restart the router (unplug, wait 30 seconds, plug back in)
- If that doesn’t work, call the ISP
This is printed and taped to the inside of dad’s network cabinet. Laminated.
Cleanup Tasks
While documenting, I found cruft:
Orphaned containers — Three Docker containers that hadn’t run in months. Old experiments. Removed.
Configuration archaeology — Files named traefik.yml.backup.old.working.FINAL. Archived to a dated folder, originals deleted.
Naming inconsistencies — Some configs said “titan,” others said “Tarn-Host.” Standardized everything to the Galactic naming scheme.
DNS debt — Records for services that no longer existed. Hostnames that pointed to old IPs. Cleaned.
Small things. The kind of things that don’t break anything—until they do.
Maintenance Automation
Some tasks I was doing manually every week:
- Check certificate expiration dates
- Verify backups actually ran
- Check disk space on all nodes
- Rotate logs before they fill drives
These are now scripts. They run via cron. They report to Uptime Kuma.
If a certificate is expiring in < 14 days, I get an alert. If a backup hasn’t run in > 36 hours, I get an alert. If any disk is > 85% full, I get an alert.
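The check script is nothing fancy. Here's a minimal sketch of the alerting half; the hostnames, the backup marker path, the mount points, and the Uptime Kuma push URL are all placeholders, but the thresholds are the real ones:

```python
#!/usr/bin/env python3
# Maintenance checks, run weekly from cron. A sketch, not the exact script:
# hostnames, paths, and the push URL are placeholders; thresholds match the text.
import os
import shutil
import socket
import ssl
import time
import urllib.parse
import urllib.request

KUMA_PUSH = "https://kuma.example.lan/api/push/XXXXXXXX"   # hypothetical push-monitor URL
CERT_HOSTS = ["plex.example.lan", "requests.example.lan"]  # hypothetical hostnames
BACKUP_MARKER = "/mnt/backups/.last-run"                   # assumed "backup finished" marker file
DISK_PATHS = ["/", "/mnt/cache", "/mnt/user"]              # assumed mount points to watch

def cert_days_left(host, port=443):
    """Days until the TLS certificate served on host:port expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            not_after = tls.getpeercert()["notAfter"]
    return (ssl.cert_time_to_seconds(not_after) - time.time()) / 86400

def run_checks():
    problems = []
    for host in CERT_HOSTS:
        if cert_days_left(host) < 14:
            problems.append(f"certificate for {host} expires in under 14 days")
    backup_age_h = (time.time() - os.path.getmtime(BACKUP_MARKER)) / 3600
    if backup_age_h > 36:
        problems.append(f"last backup finished {backup_age_h:.0f}h ago")
    for path in DISK_PATHS:
        usage = shutil.disk_usage(path)
        if usage.used / usage.total > 0.85:
            problems.append(f"{path} is over 85% full")
    return problems

if __name__ == "__main__":
    issues = run_checks()
    params = urllib.parse.urlencode({
        "status": "up" if not issues else "down",
        "msg": "; ".join(issues) or "OK",
    })
    # Uptime Kuma push monitors take status/msg as query parameters.
    urllib.request.urlopen(f"{KUMA_PUSH}?{params}", timeout=10)
```

Reporting through a push monitor instead of email has a nice side effect: if the cron job itself dies, the heartbeat stops arriving and Uptime Kuma flags the monitor anyway.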
The goal: no surprises.
Synthetic Health Checks
The command center dashboard shows if services are “up.” But “up” doesn’t mean “working.”
A container can be running while the application inside has crashed. A web server can respond to health checks while returning 500 errors to users.
I added synthetic checks:
Plex: Every 5 minutes, try to play a 10-second test video. If it fails, alert.
Build swarm: Every 5 minutes, query the gateway API. If it returns invalid JSON or times out, alert.
NAS: Every 5 minutes, list a test directory. If it fails or times out, alert.
These catch the “it’s running but broken” failures that status pages miss.
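All three follow the same pattern, so here's a single sketch of it; the URLs, the NAS path, and the push tokens are placeholders, and the Plex probe is simplified to a plain API request rather than the full 10-second playback test:

```python
#!/usr/bin/env python3
# Synthetic checks, run every 5 minutes from cron (*/5 * * * *). A sketch of the
# pattern: endpoints, the NAS path, and the push tokens are placeholders.
import json
import os
import urllib.parse
import urllib.request

GATEWAY_API = "http://gateway.example.lan:8080/api/status"  # hypothetical swarm gateway endpoint
PLEX_URL = "http://plex.example.lan:32400/identity"         # lightweight Plex endpoint, no token needed
NAS_TEST_DIR = "/mnt/cassiel-silo/healthcheck"              # assumed NFS-mounted test directory
KUMA_PUSH = {                                               # one push monitor per check
    "plex": "https://kuma.example.lan/api/push/AAAA",
    "swarm": "https://kuma.example.lan/api/push/BBBB",
    "nas": "https://kuma.example.lan/api/push/CCCC",
}

def check_plex():
    # The application has to answer, not just the container that wraps it.
    urllib.request.urlopen(PLEX_URL, timeout=10).read()

def check_swarm():
    # The gateway has to return parseable JSON inside the timeout.
    with urllib.request.urlopen(GATEWAY_API, timeout=10) as resp:
        json.loads(resp.read())

def check_nas():
    # Listing a directory over NFS catches hung mounts that "mounted" status misses.
    os.listdir(NAS_TEST_DIR)

def report(name, ok, msg):
    params = urllib.parse.urlencode({"status": "up" if ok else "down", "msg": msg})
    urllib.request.urlopen(f"{KUMA_PUSH[name]}?{params}", timeout=10)

if __name__ == "__main__":
    for name, check in [("plex", check_plex), ("swarm", check_swarm), ("nas", check_nas)]:
        try:
            check()
            report(name, True, "OK")
        except Exception as exc:  # any failure or timeout counts as down
            report(name, False, str(exc))
```

Each check gets its own push monitor, so the dashboard shows which layer failed, not just that something did.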
The Philosophy
After a week of chaos and a weekend of cleanup, I had time to think about what I was actually building.
Infrastructure isn’t done when it works.
It’s done when:
- Someone else could run it. If I got hit by a bus, could dad keep Plex running? Now, maybe. There’s a runbook.
- Future-me won’t curse past-me. In six months, I won’t remember why I configured the subnet router that way. Now there’s a comment explaining it.
- The documentation matches reality. Every IP in the docs is accurate. Every hostname is current. I verified them all.
- Recovery doesn’t require heroics. Self-healing handles the common failures. Runbooks handle the uncommon ones. Phone calls to dad are the last resort, not the first.
None of this is exciting.
None of it makes for dramatic blog posts.
All of it is why I can sleep through power outages now.
The Scorecard
| Metric | Before | After |
|---|---|---|
| Services documented | ~40% | 100% |
| Recovery runbooks | 0 | 4 |
| Orphaned containers | 3 | 0 |
| Stale DNS records | 7 | 0 |
| Automated health checks | 0 | 12 |
| “Things I should probably write down” | Infinite | Finite |
The week started with a power blip and ended with a laminated emergency guide in dad’s network cabinet.
The swarm is stable. The documentation is current. The runbooks are written.
And next time something breaks at halftime, dad knows exactly where to click.
System: maintainable. Finally.