Seven Services, One Hour: Deploying a Complete Monitoring Stack

There comes a point in every homelab where you stop being able to hold the whole thing in your head.

For me that point was somewhere around the fourteenth system. A 66-core build swarm with 5 drones spread across 2 networks, inside a fleet of 14 machines. Two Proxmox hypervisors. Docker hosts. NAS boxes. A Tailscale mesh stitching it all together across 40 miles.

When something went wrong, my debugging process was: SSH into the box. Check htop. Check logs. SSH into the next box. Repeat. Maybe tail -f three different terminals side by side and squint.

Flying blind. That’s what it was.

So on February 6th, I carved out an hour and deployed seven monitoring services to Altair-Link (10.42.0.199). From zero to dashboards in under an hour. Mostly.


The Stack

Seven services. All Docker containers. One host.

Service             Port    What It Does
Prometheus          9090    Time-series metrics collection
Grafana Enhanced    3006    Dashboards and visualization
Loki                3100    Log aggregation
cAdvisor            8082    Container-level metrics
Smokeping           8081    Network latency tracking
Healthchecks        8084    Dead man's switch for services
Promtail            n/a     Log shipping agent

The logic behind each one deserves explanation, because I didn’t pick seven services to feel important. Each one fills a gap that was actively hurting me.


Why Each One Matters

Prometheus is the foundation. Everything else is decoration without it. CPU, RAM, disk, network — pulled from node-exporters running on every machine in the fleet. Time-series data. Query it, graph it, alert on it. If a drone’s CPU has been pegged at 100% for three hours, Prometheus knows before I do.
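
Concretely, the pegged-drone check is one query against the Prometheus API on Altair-Link. A sketch; the 90% cutoff just mirrors the three-hours-at-100% example above, and the instance labels are whatever your node-exporters register as:

# Hosts that have averaged more than 90% CPU over the last three hours.
curl -sG 'http://10.42.0.199:9090/api/v1/query' --data-urlencode \
  'query=100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[3h])) * 100) > 90'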

Grafana is the translator. Raw PromQL queries are for people who enjoy suffering. I’ve done my time. Grafana turns those queries into dashboards with graphs, thresholds, and color-coded alerts. Green means I can go to bed. Red means I can’t.

Loki solves the log problem. Before Loki, checking build logs meant ssh drone-Tarn "tail -f /var/log/swarm-drone.log" and hoping I picked the right machine. Now every log ends up in one place. Query by host, by service, by time range. Grep across the entire fleet from a browser tab.
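
The same thing works from a shell, for what it's worth. A sketch against the Loki API; the host label depends on what Promtail attaches to each stream:

# Error lines from drone-Tarn over the last hour, fetched from Loki directly.
curl -sG 'http://10.42.0.199:3100/loki/api/v1/query_range' \
  --data-urlencode 'query={host="drone-Tarn"} |= "error"' \
  --data-urlencode "start=$(date -d '1 hour ago' +%s)000000000"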

Promtail is Loki’s delivery driver. It runs on each host, watches log files, and ships them to Loki. Invisible but essential. The plumbing that makes centralized logging work.
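
A minimal config sketch, assuming Loki lives at 10.42.0.199:3100. The watched path, the labels, and running it as a container are my choices here, not gospel:

# Write the Promtail config, then run the agent as a container on each host.
cat > /etc/promtail/config.yml <<'EOF'
server:
  http_listen_port: 9080

positions:
  filename: /var/lib/promtail/positions.yaml

clients:
  - url: http://10.42.0.199:3100/loki/api/v1/push

scrape_configs:
  - job_name: system
    static_configs:
      - targets: [localhost]
        labels:
          job: varlogs
          host: drone-Tarn          # set per machine
          __path__: /var/log/*.log
EOF

docker run -d --name promtail \
  -v /etc/promtail:/etc/promtail:ro \
  -v /var/lib/promtail:/var/lib/promtail \
  -v /var/log:/var/log:ro \
  grafana/promtail:latest \
  -config.file=/etc/promtail/config.yml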

cAdvisor answers the container question. When your infrastructure is 80% Docker containers, “which container is eating all the memory” is a question you ask weekly. cAdvisor instruments every container on the host — CPU, memory, network, disk I/O — and feeds it all to Prometheus.

Smokeping watches the network. I have two sites connected over Tailscale. The local Milky Way network and the Andromeda network, 40 miles away. That 38ms latency? It holds most of the time. Smokeping tracks it over days, weeks, months. When it spikes, I know. When it degrades slowly, I know that too.
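
The Targets file that drives this is short. A sketch, with a made-up Tailscale address and an assumed config path (adjust for wherever your Smokeping image mounts its config):

# Smokeping Targets sketch: one group per site, one entry per host.
cat > /opt/monitoring/smokeping/Targets <<'EOF'
*** Targets ***

probe = FPing

menu = Top
title = Network Latency

+ Andromeda
menu = Andromeda
title = Andromeda site over Tailscale

++ drone-Meridian
menu = drone-Meridian
title = drone-Meridian
host = 100.64.0.12
EOF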

Healthchecks is the dead man’s switch. Services check in on a schedule. If a cron job stops running, if a backup script dies silently, if a drone drops off the swarm — Healthchecks notices because the check-in stops arriving. The absence of a signal is itself the signal.


The Actual Deployment

Everything runs on Altair-Link, which already serves as the gateway host for the build swarm. Adding monitoring to it made sense — it has visibility into both networks and enough headroom for seven lightweight containers.

The deployment itself was a single script: deploy-monitoring-stack.sh. Pull images, create containers with the right port mappings and volume mounts, wire up the networking. Not glamorous, but effective.
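
The script itself isn't worth publishing, but the shape of it is roughly this. Image tags, the /opt/monitoring paths, and the shared Docker network name here are assumptions, not the real script:

#!/usr/bin/env bash
# Roughly the shape of deploy-monitoring-stack.sh.
set -euo pipefail

docker network create monitoring 2>/dev/null || true

docker run -d --name prometheus --network monitoring \
  -p 9090:9090 \
  -v /opt/monitoring/prometheus:/etc/prometheus \
  prom/prometheus:latest

docker run -d --name grafana --network monitoring \
  -p 3006:3000 \
  -v /opt/monitoring/grafana:/var/lib/grafana \
  grafana/grafana:latest

docker run -d --name loki --network monitoring \
  -p 3100:3100 \
  grafana/loki:latest

# ...and the same pattern for cadvisor, smokeping, healthchecks, promtail.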

The Name Collision

First snag: old containers with the same names were still hanging around from a previous experiment. Docker doesn’t let you create a container with a name that already exists, even if it’s stopped.

docker rm -f prometheus grafana loki cadvisor smokeping healthchecks promtail

Scorched earth. Start clean.

The Port Shuffle

cAdvisor defaults to port 8080. Guess what was already running on 8080. Right. So cAdvisor got bumped to 8082. This is the kind of thing that takes thirty seconds to fix but two hours to debug if you don’t notice it immediately.

docker run -d --name cadvisor \
  -p 8082:8080 \
  --volume=/:/rootfs:ro \
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \
  gcr.io/cadvisor/cadvisor:latest

Port 8082 externally, 8080 internally. Docker networking at its most mundane.

Total Time

About 45 minutes from first command to all seven services showing green in docker ps. That includes the troubleshooting. Not bad for a full observability stack.


The NodeJS Detour (Or: The Irony of Binary Packages)

Before I even got to the monitoring stack, I discovered something annoying. Altair-Link didn’t have npm.

This is a Gentoo system. A driver system, meaning it’s supposed to pull binary packages from the binhost and never compile anything locally. That’s the whole point of the build swarm — drones compile, drivers consume.

Except the nodejs binary package on the binhost had been built without the npm USE flag. No npm. No npx. Nothing.

emerge --info nodejs | grep npm
# USE="... -npm -corepack ..."

There it is. The -npm flag. The build drone compiled nodejs without npm because nobody told it otherwise.

The fix: recompile nodejs locally with the right flags.

USE="npm corepack" emerge nodejs

On the system that’s architecturally designed to never compile anything. The system whose entire reason for existing is “don’t compile on me, that’s what the drones are for.”

Compiling nodejs. On the no-compile box.

The build swarm guys would be ashamed of me if the build swarm guys weren’t also me.

After about twenty minutes of watching C++ scroll past, npm was available and I could move on with my life. I updated the USE flags on the binhost so this won’t happen again. Probably.
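
The binhost-side fix is one line of package.use on the drone that builds nodejs (the file name is my choice):

echo "net-libs/nodejs npm corepack" >> /etc/portage/package.use/nodejs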


What the Dashboards Actually Show

Dashboards are only useful if they show things you act on. Pretty graphs that nobody checks are just wasted pixels. Here’s what I configured and why.

Build Swarm Panel

The build swarm is the centerpiece. Five drones, 66 cores, packages churning through constantly.

  • Drone core allocation — how many cores each drone is using vs. available
  • Packages per hour — throughput across the swarm
  • Build queue depth — how many packages are waiting
  • Failure rates — which packages keep failing and on which drones

When drone-Meridian (the 24-core beast on the Andromeda network) drops to zero packages per hour, something’s wrong. Now I see it in seconds instead of discovering it hours later when a build times out.

Infrastructure Panel

The boring-but-critical stuff. Two of the underlying queries are sketched after the list.

  • Disk usage across all nodes (nothing kills you faster than a full disk at 3 AM)
  • CPU load averages across the fleet
  • RAM utilization per host
  • Swap usage (if swap is being touched, something is already wrong)
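
Run by hand against the Prometheus API, those two look like this. Standard node-exporter metrics: swap in use, and the 5-minute load average per host.

# Swap in use, in bytes. Anything persistently above zero deserves a look.
curl -sG 'http://10.42.0.199:9090/api/v1/query' --data-urlencode \
  'query=node_memory_SwapTotal_bytes - node_memory_SwapFree_bytes'

# 5-minute load average across the fleet.
curl -sG 'http://10.42.0.199:9090/api/v1/query' --data-urlencode \
  'query=node_load5'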

Network Panel

Two sites, one mesh.

  • Tailscale latency between Milky Way and Andromeda — that 38ms baseline
  • Bandwidth utilization on key links
  • Packet loss over time (Smokeping’s specialty)

The Andromeda network is at a remote site. When latency doubles, it’s usually someone streaming 4K. When it triples, something more interesting is happening.

Container Panel

Per-container metrics from cAdvisor, visualized in Grafana.

  • CPU time per container
  • Memory usage (and limits, when set)
  • Network I/O per container
  • Disk reads/writes

This is where you find the container that’s been quietly leaking memory for three weeks. The one that works fine until it doesn’t, and then you’re restarting it at midnight wondering what happened.
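
A rough leak-finder against the Prometheus API: the five containers whose working-set memory grew the most over the last day. The metric and its name label come from cAdvisor; the window and the top-5 cutoff are arbitrary.

# Biggest 24-hour memory growth per container.
curl -sG 'http://10.42.0.199:9090/api/v1/query' --data-urlencode \
  'query=topk(5, delta(container_memory_working_set_bytes{name!=""}[24h]))'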


Integration with the Build Swarm

The monitoring stack isn’t just passive observation. It ties directly into how I run the build swarm.

Prometheus scrapes node-exporters on every drone. Each build drone runs a node-exporter that publishes system metrics on port 9100. Prometheus pulls those metrics on a 15-second interval. CPU, memory, disk, network — all of it flowing into the time-series database.
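
The scrape job behind that is short. A sketch of prometheus.yml, with illustrative hostnames and the config path from the deploy sketch above:

cat > /opt/monitoring/prometheus/prometheus.yml <<'EOF'
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: node
    static_configs:
      - targets:
          - drone-Tarn:9100
          - drone-Meridian:9100
          - altair-link:9100      # ...and the rest of the fleet

  - job_name: cadvisor
    static_configs:
      - targets: ['10.42.0.199:8082']
EOF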

Alerts fire when things go sideways. Drone offline for more than 5 minutes? Alert. Memory on a drone exceeding 90%? Alert. Disk usage above 85%? Alert. These aren’t emails I ignore — they’re Grafana alerts that change dashboard colors. Hard to miss a red panel.
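
I wired these up as Grafana alerts, but the three conditions are just PromQL. For reference, expressed as a Prometheus rules file (the "for" durations on the last two are my choice; it would be referenced via rule_files: in prometheus.yml):

cat > /opt/monitoring/prometheus/alerts.yml <<'EOF'
groups:
  - name: fleet
    rules:
      - alert: DroneOffline
        expr: up{job="node"} == 0
        for: 5m
      - alert: HighMemory
        expr: 100 * (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 90
        for: 10m
      - alert: DiskFilling
        expr: 100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes{fstype!~"tmpfs|overlay"}) > 85
        for: 10m
EOF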

Build metrics feed the dashboards. Compilation start times, completion times, success/failure states — all of it lands in Prometheus and shows up in Grafana. Over time, I can see trends. Which packages take longest. Which drones are fastest. Where the bottlenecks live.

Loki captures everything for post-mortems. When a build fails, the logs are already in Loki. I don’t need to SSH anywhere. Just open the Loki data source in Grafana, filter by hostname and time range, and read. It’s the difference between “let me check” and “I already know.”


The Lessons

Monitoring should be boring. The best monitoring setup is one you deploy once and then forget about until it saves you at 2 AM. I’m not trying to build an observability platform. I’m trying to sleep through the night knowing my infrastructure will wake me up if it needs me.

Seven services sounds like a lot. It isn’t. Each one is a single container. Total resource footprint is maybe 2GB of RAM across all seven. The value they provide — centralized logs, metrics, latency tracking, dead man’s switches — is worth ten times that.

Port conflicts are the dumbest time sink. I’ve been doing this for years and I still lose ten minutes to port collisions every time I deploy something new. Just check first. ss -tlnp | grep 8080. Save yourself the confusion.

Binary package systems only work if the packages are built right. The nodejs situation was a reminder. My build swarm compiles everything, but USE flags still need to be correct. A binary package without npm is technically a valid package. It’s just useless for my purposes. Garbage in, garbage out — even with automation.

Start with metrics, add logs second. If I had to pick just two services, it would be Prometheus and Grafana. Metrics tell you what is happening. Logs tell you why. You need the “what” before the “why” is useful.


What’s Next

The monitoring stack is live, but it’s not complete. Three things on the list:

Node-exporters on all drones. Some of the build drones are still running without node-exporters. That means Prometheus can’t scrape them, which means they’re invisible to the dashboards. Unacceptable. Every drone gets an exporter.
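
One way to do it on a Docker-capable drone, sketched from the standard node-exporter container setup:

# Host networking keeps it on the expected port 9100; mounting / read-only
# lets it report real filesystem stats instead of the container's.
docker run -d --name node-exporter \
  --net=host --pid=host \
  -v /:/host:ro,rslave \
  prom/node-exporter:latest \
  --path.rootfs=/host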

Smokeping targets for cross-site monitoring. Right now Smokeping is running but only monitoring a handful of hosts. I need to add every critical endpoint across both networks — the Proxmox hosts, the NAS boxes, the gateway nodes. Full coverage.

Healthchecks cron for critical services. The dead man’s switch only works if services actually check in. I need to add curl-based health checks to cron on every machine running something important. Backup scripts, sync jobs, the build swarm gateway — all of it.
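
The pattern is one crontab line per job: run the thing, then ping the check only if it succeeded. The backup script path and the check UUID below are made up.

# Nightly backup at 02:30; Healthchecks only hears from it on success.
30 2 * * * /usr/local/bin/backup.sh && curl -fsS -m 10 --retry 3 -o /dev/null http://10.42.0.199:8084/ping/2fe1c8d0-0000-0000-0000-000000000000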

Forty-five minutes to deploy. A lifetime of not flying blind.

That’s a trade I’ll take every time.