Monitoring & Observability

Monitoring is distributed across both sites with overlapping coverage. There is no single centralized monitoring platform — instead, multiple specialized tools handle different aspects of observability. Glances provides real-time host metrics on every significant machine, Grafana aggregates dashboards, Uptime Kuma tracks service availability, Netdata provides deep per-host telemetry, Dozzle handles Docker log aggregation, and Tautulli monitors Plex specifically.

Monitoring Architecture Overview

┌────────────────────────────────────────────────────────────────────┐
│                        MONITORING TOPOLOGY                         │
├────────────────────────────────────────────────────────────────────┤
│                                                                    │
│  Milky Way (10.42.0.0/24)          Andromeda (192.168.20.0/24)     │
│  ────────────────────────          ───────────────────────────     │
│                                                                    │
│  Altair-Link (10.42.0.199)         Meridian-Host (192.168.20.50)   │
│  ├─ Homepage      :3001            ├─ Grafana       :3001          │
│  ├─ Grafana       :3002            ├─ Uptime Kuma   :3002          │
│  ├─ Uptime Kuma   :3003            ├─ Netdata       :19999         │
│  ├─ Netdata       :19999           ├─ Glances (v4)  :61208         │
│  ├─ Glances (v3)  :61208           ├─ Dozzle        :9999          │
│  ├─ Dozzle        :9999            └─ Tautulli      :8181          │
│  └─ RustDesk      :21115-17                                        │
│                                                                    │
└────────────────────────────────────────────────────────────────────┘

Glances (System Metrics)

Glances runs on every significant host, providing a web-based overview of CPU, memory, disk I/O, network, and process activity. It is the go-to tool for a quick health check on any individual machine.

Deployment

Host	IP	Port	Version	Access URL
Altair-Link	10.42.0.199	61208	v3	`http://10.42.0.199:61208`
Meridian-Host	192.168.20.50	61208	v4	`http://192.168.20.50:61208`
Capella-Outpost	10.42.0.100	61208	varies	`http://10.42.0.100:61208`
Other hosts	various	61208	varies	`http://<host-ip>:61208`

Version Differences

Glances v3 (Altair-Link): The legacy web UI. Functional but simpler layout. Runs as a Docker container.
Glances v4 (Meridian-Host and newer deployments): Redesigned web UI with improved charts and responsiveness. Preferred for new deployments.

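Both versions expose a REST API on port 61208, so either can feed Grafana regardless of UI version. The API path is versioned, however: v3 serves /api/3 and v4 serves /api/4, so any data source or script must use the path that matches the host's Glances version.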
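As a sketch of how a script might address either version: upstream Glances puts the major version in the API path (/api/3 for v3, /api/4 for v4), so the helper below takes the major version as an argument. The function name glances_api_url is hypothetical, not part of Glances; cpu and mem are standard Glances plugin endpoints.

```shell
# Build a Glances REST URL for a host, given the major version it runs.
# glances_api_url is a hypothetical helper name used only for illustration.
glances_api_url() {
  host="$1"; major="$2"; plugin="$3"
  case "$major" in
    3|4) printf 'http://%s:61208/api/%s/%s\n' "$host" "$major" "$plugin" ;;
    *)   echo "unsupported Glances major version: $major" >&2; return 1 ;;
  esac
}

# Example calls (need network access to the hosts, so left commented out):
#   curl -fsS "$(glances_api_url 10.42.0.199 3 cpu)"     # Altair-Link, v3
#   curl -fsS "$(glances_api_url 192.168.20.50 4 mem)"   # Meridian-Host, v4
```

Keeping the version as an explicit argument avoids hard-coding a path that would break when a v3 host is upgraded.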

Configuration

Glances runs in web server mode, which is enabled by passing -w through the GLANCES_OPT environment variable:

# Docker run example (v4)
docker run -d \
  --name glances \
  --restart unless-stopped \
  --pid host \
  --network host \
  --privileged \
  -e GLANCES_OPT="-w" \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  nicolargo/glances:latest-full

# v3 uses the same approach with an older image tag

The --pid host and --privileged flags give Glances full visibility into the host’s processes and hardware. The Docker socket mount enables container monitoring.

Use Cases

Quick health check: Open http://<host>:61208 to see if a machine is under load, out of memory, or experiencing disk pressure.
Process hunting: Identify runaway processes consuming CPU or memory.
Network throughput: Check real-time bandwidth on each interface.
Docker container stats: See per-container CPU and memory when the Docker socket is mounted.
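The quick health check can also be run from a terminal. The following is a minimal sketch that sweeps the three hosts named in the deployment table and reports whether each Glances web UI answers. The function names are hypothetical, and the 3-second timeout is an arbitrary choice.

```shell
# Name/IP pairs taken from the Glances deployment table.
glances_targets() {
  printf '%s\n' \
    "Altair-Link 10.42.0.199" \
    "Meridian-Host 192.168.20.50" \
    "Capella-Outpost 10.42.0.100"
}

# Request the Glances web UI on each host and report reachability.
glances_sweep() {
  glances_targets | while read -r name ip; do
    if curl -fsS -o /dev/null -m 3 "http://$ip:61208/" 2>/dev/null; then
      echo "$name up"
    else
      echo "$name unreachable"
    fi
  done
}

# Run manually with: glances_sweep
```

This only confirms that the web server responds; it says nothing about load, so it complements the UI rather than replacing it.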

Grafana (Dashboards)

Grafana provides dashboarding and visualization. Two independent instances run, one per site.

Instances

Host	IP	Port	Access URL	Scope
Altair-Link	10.42.0.199	3002	`http://10.42.0.199:3002`	Milky Way
Meridian-Host	192.168.20.50	3001	`http://192.168.20.50:3001`	Andromeda

Data Sources

Grafana pulls metrics from multiple sources:

Netdata (via Netdata’s built-in Prometheus exporter or direct API)
Glances API (REST endpoints on port 61208)
Prometheus (if deployed, scraping exporters)
InfluxDB (if deployed, for time-series storage)
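For the Netdata source, the agent serves its metrics in Prometheus text format at /api/v1/allmetrics?format=prometheus. The sketch below builds that URL for a host so the output can be inspected before wiring it into a scraper. netdata_prometheus_url is a hypothetical helper name.

```shell
# Build the URL of Netdata's Prometheus-format metrics endpoint for a host.
# netdata_prometheus_url is a hypothetical helper name.
netdata_prometheus_url() {
  printf 'http://%s:19999/api/v1/allmetrics?format=prometheus\n' "$1"
}

# Example (needs network access, so left commented out):
#   curl -fsS "$(netdata_prometheus_url 10.42.0.199)" | head -n 20
```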

Key Dashboards

Dashboard	Host	Purpose
Host Overview	Altair-Link	CPU, memory, disk for Milky Way hosts
Docker Containers	Altair-Link	Container resource usage
Build Swarm Status	Altair-Link	Drone health, build queue, package counts
Unraid Array	Meridian-Host	Disk utilization, temperatures, parity status
Network Throughput	Both	Bandwidth across interfaces and Tailscale

Configuration

Grafana runs as a Docker container with a persistent volume for dashboards and data source configs:

docker run -d \
  --name grafana \
  --restart unless-stopped \
  -p 3002:3000 \
  -v grafana-data:/var/lib/grafana \
  grafana/grafana-oss:latest

Dashboards are configured through the web UI. There is no Grafana-as-code or provisioning setup currently — dashboards are created and edited manually.
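The docker run example publishes host port 3002, which matches the Altair-Link row of the instances table; Meridian-Host is listed on 3001. As a sketch, assuming both sites otherwise use the same image and volume name (the source does not confirm this), the command can be parameterized by host port. grafana_run_cmd is a hypothetical helper that only prints the command for review.

```shell
# Print the docker run command for a Grafana instance on a given host port.
# Assumption: both sites use the same image and volume name as the example.
grafana_run_cmd() {
  host_port="$1"
  printf 'docker run -d --name grafana --restart unless-stopped -p %s:3000 -v grafana-data:/var/lib/grafana grafana/grafana-oss:latest\n' "$host_port"
}

# grafana_run_cmd 3002   # Altair-Link
# grafana_run_cmd 3001   # Meridian-Host
```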

Uptime Kuma (Availability Monitoring)

Uptime Kuma tracks the availability of services across both networks. It probes HTTP endpoints, TCP ports, and ICMP targets at regular intervals and sends notifications when a check fails.

Instances

Host	IP	Port	Access URL	Scope
Altair-Link	10.42.0.199	3003	`http://10.42.0.199:3003`	Milky Way + cross-site services
Meridian-Host	192.168.20.50	3002	`http://192.168.20.50:3002`	Andromeda services

Monitored Targets

The Altair-Link instance monitors:

All Docker services on Altair-Link (HTTP health checks)
Proxmox web UIs (Izar-Host, Arcturus-Prime, Tarn-Host via Tailscale)
Build swarm gateway (port 8090)
Cross-site services on Meridian-Host (via Tailscale)
Cloudflare Tunnel health
External endpoints (public-facing services)

The Meridian-Host instance monitors:

All Docker services on Meridian-Host (HTTP health checks)
Plex instances on Polaris-Media (ports 32400, 32401)
Synology DSM interfaces (Cassiel-Silo, Mobius-Silo)
ASUS gateway reachability
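To reproduce an HTTP health check by hand when debugging a monitor, something like the following can help. It is a rough sketch, not Uptime Kuma's implementation: the function names are hypothetical, and treating only 2xx responses as healthy is an assumption that should be matched to each monitor's accepted status codes.

```shell
# Return success when an HTTP status code is in the 200-299 range.
is_up_code() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *)           return 1 ;;
  esac
}

# Probe a URL once and print "up" or "down (<code>)".
# curl reports code 000 when no HTTP response was received at all.
http_probe() {
  code="$(curl -s -o /dev/null -m 5 -w '%{http_code}' "$1" || true)"
  if is_up_code "$code"; then
    echo "up"
  else
    echo "down ($code)"
  fi
}

# Example: http_probe http://10.42.0.199:3001
```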

Notifications

Uptime Kuma supports multiple notification channels. The channels that may be active on these instances are:

Browser notifications (when the Uptime Kuma UI is open)
Discord webhook (if configured)
Email (if SMTP is configured)

Check each instance’s Settings > Notifications page for active notification channels.

Status Pages

Uptime Kuma can generate public or private status pages showing the health of monitored services. These can be shared with specific people (e.g., a status page for dad showing Plex and NAS health on the Andromeda site).

Netdata (Deep Telemetry)

Netdata provides per-second granularity metrics with automatic anomaly detection. It collects hundreds of metrics per host out of the box with zero configuration.

Instances

Host	IP	Port	Access URL
Altair-Link	10.42.0.199	19999	`http://10.42.0.199:19999`
Meridian-Host	192.168.20.50	19999	`http://192.168.20.50:19999`

Capabilities

Per-second metrics: CPU, memory, disk I/O, network, interrupts, softnet, and more at 1-second resolution.
Automatic application monitoring: Detects running applications (nginx, Docker, systemd services) and creates dashboards automatically.
Anomaly detection: Built-in ML-based anomaly detection highlights unusual behavior.
Docker monitoring: Per-container metrics when the Docker socket is mounted.
Disk health: SMART data monitoring for physical drives.
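The same per-second data is available over Netdata's HTTP API, which is handy for pulling a short window of a single chart without opening the dashboard. The sketch below builds a /api/v1/data query. netdata_data_url is a hypothetical helper name, and system.cpu is used as the example chart.

```shell
# Build a Netdata data-API URL for the last N seconds of one chart.
# netdata_data_url is a hypothetical helper name.
netdata_data_url() {
  host="$1"; chart="$2"; seconds="$3"
  printf 'http://%s:19999/api/v1/data?chart=%s&after=-%s&format=json\n' \
    "$host" "$chart" "$seconds"
}

# Example (needs network access, so left commented out):
#   curl -fsS "$(netdata_data_url 10.42.0.199 system.cpu 60)"
```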

Netdata vs. Glances

Both provide host metrics, but they serve different purposes:

Feature	Netdata	Glances
Resolution	1 second	~3 seconds
History	Hours to days (local DB)	Real-time only
Depth	Hundreds of metrics	Key metrics overview
Setup	Agent-based, auto-discovers	Single container, minimal config
Use case	Deep investigation	Quick health check

Use Glances for “is this host okay?” and Netdata for “why is this host slow?”

Dozzle (Docker Logs)

Dozzle provides real-time Docker log viewing through a web UI. It reads from the Docker socket and streams container logs in the browser.

Instances

Host	IP	Port	Access URL
Altair-Link	10.42.0.199	9999	`http://10.42.0.199:9999`
Meridian-Host	192.168.20.50	9999	`http://192.168.20.50:9999`

Features

Real-time streaming: Logs appear as they are written, similar to docker logs -f.
Multi-container view: See logs from all containers simultaneously or filter to a specific one.
Search: Full-text search across log output.
No agents: Dozzle connects directly to the Docker socket — no sidecar containers or log shipping required.

Configuration

docker run -d \
  --name dozzle \
  --restart unless-stopped \
  -p 9999:8080 \
  -v /var/run/docker.sock:/var/run/docker.sock:ro \
  amir20/dozzle:latest

Dozzle is read-only — it cannot modify containers or their configurations. The Docker socket is mounted read-only (:ro).

Homepage (Dashboard)

Host: Altair-Link (10.42.0.199)
Port: 3001
Access: http://10.42.0.199:3001

Homepage is the central services dashboard. It is not strictly a monitoring tool, but it aggregates status widgets from other monitoring tools and provides quick-access links to every service across both networks.

What It Shows

Service status indicators (up/down via HTTP checks)
Resource usage widgets (CPU, memory, disk from Glances/Netdata APIs)
Docker container status
Quick links to all services organized by host and category
Weather, bookmarks, and other widget integrations

Homepage is the first page loaded when checking on the infrastructure. If a service is down, the Homepage widget turns red before you even open Uptime Kuma.

Tautulli (Plex Monitoring)

Host: Meridian-Host (192.168.20.50)
Port: 8181
Access: http://192.168.20.50:8181 or http://100.64.0.15.30:8181

Tautulli monitors the Plex Media Server instances running on Polaris-Media (192.168.20.201). It connects to the Plex API and tracks:

Active streams: Who is watching, what they are watching, stream quality, transcode status.
History: Complete playback history with timestamps, users, and content.
Library statistics: Media count, recently added, most played.
Notifications: Alerts for new content, stream issues, or server health.

Monitored Plex Instances

Instance	Host	Port	Library
Kraken-commander	Polaris-Media (192.168.20.201)	32400	Primary
Kraken-logistics-officer	Polaris-Media (192.168.20.201)	32401	Secondary

Tautulli provides the answer to “is anyone watching Plex right now?” and “what did people watch this week?” — useful for gauging whether Plex server changes (transcoding settings, library updates) are working correctly.
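The "is anyone watching right now?" question can also be asked from a script through Tautulli's v2 API, where the get_activity command returns the current stream count. This is a minimal sketch: tautulli_api_url is a hypothetical helper, TAUTULLI_API_KEY is an assumed environment variable holding the key from Tautulli's settings, and the jq step assumes jq is installed.

```shell
# Build a Tautulli v2 API URL for a given command.
# tautulli_api_url is a hypothetical helper name.
tautulli_api_url() {
  host="$1"; api_key="$2"; cmd="$3"
  printf 'http://%s:8181/api/v2?apikey=%s&cmd=%s\n' "$host" "$api_key" "$cmd"
}

# Example (needs network access and a real API key, so left commented out):
#   curl -fsS "$(tautulli_api_url 192.168.20.50 "$TAUTULLI_API_KEY" get_activity)" \
#     | jq '.response.data.stream_count'
```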

RustDesk (Remote Access)

Host: Altair-Link (10.42.0.199)
Tailscale: 100.64.0.234.88
Ports: 21115, 21116, 21117

RustDesk is a self-hosted remote desktop solution. The relay server runs on Altair-Link, and clients on any machine connect through it for remote desktop sessions.

Port Breakdown

Port	Protocol	Function
21115	TCP	NAT type testing
21116	TCP/UDP	ID registration and hole punching
21117	TCP	Relay traffic
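When a client cannot connect, checking that the three TCP listeners are reachable is a reasonable first step. The sketch below assumes a netcat build that supports -z and -w (most do). The function names are hypothetical, and the UDP side of port 21116 is not tested.

```shell
# TCP ports from the port breakdown table.
rustdesk_tcp_ports() {
  printf '%s\n' 21115 21116 21117
}

# Report whether each RustDesk TCP port accepts a connection on the host.
check_rustdesk() {
  host="$1"
  rustdesk_tcp_ports | while read -r port; do
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "$port open"
    else
      echo "$port closed"
    fi
  done
}

# Example: check_rustdesk 10.42.0.199
```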

Client Configuration

RustDesk clients are configured to point at the self-hosted relay:

ID Server: 100.64.0.234.88
Relay Server: 100.64.0.234.88

This keeps all remote desktop traffic within the Tailscale mesh — no data flows through RustDesk’s public relay servers.

Monitoring Gaps and Known Issues

No Centralized Log Aggregation

Docker logs are visible per-host via Dozzle, but there is no centralized logging solution (ELK, Loki, etc.) that aggregates logs from all hosts into a single searchable store. Non-Docker services (Proxmox, bare-metal drones) have no log forwarding at all.

No Alerting Pipeline

Uptime Kuma provides basic availability alerting, but there is no structured alerting pipeline (PagerDuty, OpsGenie) for critical failures. Notifications go to Discord or email if configured, but there are no escalation policies or on-call rotations (it is a homelab, after all).

No Long-Term Metrics Retention

Netdata retains metrics for hours to days, depending on available disk. Grafana dashboards show real-time data, but long-term trend analysis requires a proper time-series database (InfluxDB, or Prometheus with extended retention). This is a known gap.

Host Coverage

Not all hosts run the full monitoring stack. The monitoring tools are deployed primarily on Altair-Link and Meridian-Host. Other hosts (Izar-Host, Tarn-Host, Tau-Host, Capella-Outpost) may have Glances running but do not have Grafana, Uptime Kuma, or Netdata locally. They are monitored remotely by the Altair-Link and Meridian-Host instances.

Future Improvements

Deploy Prometheus + Grafana Loki for centralized metrics and log aggregation.
Add node_exporter on all Linux hosts for consistent Prometheus scraping.
Set up automated alerting for disk space, array health, and Tailscale peer connectivity.
Long-term metrics storage with InfluxDB or Prometheus with extended retention.
Unified status page for both sites accessible from a single URL.