
Tutorial Playbooks

Real syntax, real deployment flow: each playbook includes complete files plus explicit deploy, verify, and rollback steps.

16 hands-on playbooks · 74 blog posts · 87 journal entries

Source Method

Patterns here are aligned to primary documentation and adapted for ArgoBox-style infrastructure. Use these as starting templates, then tune hostnames, ports, and auth to your environment.
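For example, swapping the sample domain for your own across a copied playbook directory is a one-liner. This throwaway sketch (temp directory, hypothetical DOMAIN value, GNU sed) shows the pattern:

```shell
DOMAIN="example.com"                      # assumption: your real domain goes here
workdir=$(mktemp -d)                      # stand-in for a copied playbook directory
printf 'Host(`app.argobox.com`)\n' > "$workdir/docker-compose.yml"
# Replace every occurrence of the sample domain with yours
find "$workdir" -name '*.yml' -exec sed -i "s/argobox\.com/${DOMAIN}/g" {} +
cat "$workdir/docker-compose.yml"         # Host(`app.example.com`)
```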

Compose Stack with Traefik + Postgres + Healthchecks

A restart-safe stack with TLS routing, service health gates, and secret-based DB auth.

Intermediate 30-45 min docker
Prerequisites
  • Docker Engine + Compose v2
  • A DNS record pointing to your host
  • Ports 80/443 reachable
/opt/argobox/stack/docker-compose.yml
services:
  traefik:
    image: traefik:v3.1
    command:
      - --api.dashboard=true
      - --providers.docker=true
      - --providers.docker.exposedbydefault=false
      - --entrypoints.web.address=:80
      - --entrypoints.websecure.address=:443
      - --certificatesresolvers.letsencrypt.acme.email=admin@argobox.com
      - --certificatesresolvers.letsencrypt.acme.storage=/letsencrypt/acme.json
      - --certificatesresolvers.letsencrypt.acme.httpchallenge=true
      - --certificatesresolvers.letsencrypt.acme.httpchallenge.entrypoint=web
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
      - ./letsencrypt:/letsencrypt
    restart: unless-stopped

  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: app
      POSTGRES_USER: app
      POSTGRES_PASSWORD_FILE: /run/secrets/postgres_password
    secrets:
      - postgres_password
    volumes:
      - pgdata:/var/lib/postgresql/data
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U app -d app"]
      interval: 10s
      timeout: 5s
      retries: 6
    restart: unless-stopped

  app:
    image: ghcr.io/traefik/whoami:v1.10
    depends_on:
      postgres:
        condition: service_healthy
    labels:
      - traefik.enable=true
      - traefik.http.routers.app.rule=Host(`app.argobox.com`)
      - traefik.http.routers.app.entrypoints=websecure
      - traefik.http.routers.app.tls.certresolver=letsencrypt
      - traefik.http.services.app.loadbalancer.server.port=80
    restart: unless-stopped

secrets:
  postgres_password:
    file: ./secrets/postgres_password.txt

volumes:
  pgdata:
/opt/argobox/stack/secrets/postgres_password.txt
replace-with-a-long-random-password
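A quick way to fill that file with a real secret instead of the placeholder (assumes openssl; run from /opt/argobox/stack/secrets):

```shell
umask 077                          # files created below default to 0600
openssl rand -base64 32 > postgres_password.txt
wc -c < postgres_password.txt      # 45: 44 base64 characters plus a newline
```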

Deploy

mkdir -p /opt/argobox/stack/{letsencrypt,secrets}
chmod 700 /opt/argobox/stack/letsencrypt
chmod 600 /opt/argobox/stack/secrets/postgres_password.txt
cd /opt/argobox/stack
docker compose pull
docker compose up -d --remove-orphans

Verify

docker compose ps
docker compose logs --no-log-prefix postgres | tail -n 30
curl -I https://app.argobox.com
docker exec $(docker compose ps -q postgres) pg_isready -U app -d app

Rollback

cd /opt/argobox/stack
# Restore the previous known-good docker-compose.yml (from backup or git), then:
docker compose down
docker compose up -d --remove-orphans

Kubernetes Deployment with Startup/Readiness/Liveness + Ingress

A rollout-safe deployment that survives slow startup and only receives traffic when healthy.

Advanced 35-50 min kubernetes
Prerequisites
  • Working Kubernetes cluster
  • Ingress controller installed
  • kubectl access to target namespace
/opt/argobox/k8s/app-stack.yaml
apiVersion: v1
kind: Namespace
metadata:
  name: app-prod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-api
  namespace: app-prod
spec:
  replicas: 3
  revisionHistoryLimit: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0
      maxSurge: 1
  selector:
    matchLabels:
      app: app-api
  template:
    metadata:
      labels:
        app: app-api
    spec:
      containers:
        - name: api
          image: ghcr.io/acme/app-api:1.8.2
          ports:
            - containerPort: 8080
          env:
            - name: APP_ENV
              value: production
          resources:
            requests:
              cpu: 150m
              memory: 256Mi
            limits:
              cpu: 1
              memory: 1Gi
          startupProbe:
            httpGet:
              path: /health/startup
              port: 8080
            periodSeconds: 5
            failureThreshold: 30
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            periodSeconds: 10
            timeoutSeconds: 2
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            periodSeconds: 15
            timeoutSeconds: 2
            failureThreshold: 3
---
apiVersion: v1
kind: Service
metadata:
  name: app-api
  namespace: app-prod
spec:
  selector:
    app: app-api
  ports:
    - name: http
      port: 80
      targetPort: 8080
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-api
  namespace: app-prod
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  ingressClassName: nginx
  tls:
    - hosts: ["api.argobox.com"]
      secretName: app-api-tls
  rules:
    - host: api.argobox.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: app-api
                port:
                  number: 80
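A sanity check on the probe numbers above: the startup probe tolerates periodSeconds × failureThreshold of boot time before the kubelet gives up and restarts the container.

```shell
# Startup budget granted by the startupProbe (5s period, 30 failures allowed)
period=5
failures=30
echo "$(( period * failures ))s"   # prints 150s
```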

Deploy

kubectl apply -f /opt/argobox/k8s/app-stack.yaml
kubectl rollout status deploy/app-api -n app-prod --timeout=180s

Verify

kubectl get pods -n app-prod -o wide
kubectl get ingress -n app-prod
kubectl describe deploy app-api -n app-prod | rg "Strategy|Replicas|Liveness|Readiness|Startup"
curl -fsS https://api.argobox.com/health/ready

Rollback

kubectl rollout undo deploy/app-api -n app-prod
kubectl rollout status deploy/app-api -n app-prod

SSH Hardening + Fail2Ban for Admin Access

Password auth disabled, limited attack surface, modern cipher suite, group-restricted login, and automatic ban rules with incremental ban times for repeated auth failures.

Intermediate 25-35 min security
Prerequisites
  • Console/physical recovery path (IPMI, Proxmox console, or physical keyboard)
  • At least one tested SSH public key (ed25519 preferred)
  • Root/sudo access
  • fail2ban package available in repos (apt, dnf, or emerge)
/etc/ssh/sshd_config.d/99-hardening.conf
# ArgoBox SSH hardening — drop-in config
# Place in /etc/ssh/sshd_config.d/ to override defaults
# Test BEFORE reloading: sshd -t

Port 22
AddressFamily inet
ListenAddress 0.0.0.0

# Authentication
PermitRootLogin no
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
UsePAM yes
AuthenticationMethods publickey

# Restrict login to members of the 'ssh-users' group
# Add your user first: sudo groupadd ssh-users && sudo usermod -aG ssh-users commander
AllowGroups ssh-users

# Modern ciphers only — disable anything CBC or SHA1-based
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
KexAlgorithms sntrup761x25519-sha512@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org
HostKeyAlgorithms ssh-ed25519,rsa-sha2-512,rsa-sha2-256

# Forwarding — disable everything not needed
AllowAgentForwarding no
AllowTcpForwarding no
X11Forwarding no
PermitTunnel no
AllowStreamLocalForwarding no
GatewayPorts no
PermitUserEnvironment no

# Session limits
MaxAuthTries 3
MaxSessions 5
LoginGraceTime 30
ClientAliveInterval 300
ClientAliveCountMax 2

# Logging
LogLevel VERBOSE
/etc/fail2ban/jail.d/sshd.local
[sshd]
enabled  = true
backend  = systemd
port     = ssh
filter   = sshd
logpath  = %(sshd_log)s

# Ban after 4 failed attempts within 10 minutes
maxretry  = 4
findtime  = 10m
bantime   = 1h

# Incremental bans — each repeat offense doubles the ban
# 1h -> 2h -> 4h -> 8h ... up to maxbantime
bantime.increment = true
bantime.multipliers = 1 2 4 8 16
bantime.maxtime = 4w
bantime.rndtime = 5m

# Action: ban via nftables and send email with whois + log lines
action = %(action_mwl)s
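With bantime.increment enabled, the multipliers above expand bantime = 1h into the following schedule; a quick arithmetic check:

```shell
base=3600                              # bantime of 1h, in seconds
for m in 1 2 4 8 16; do                # bantime.multipliers
  printf '%dh ' $(( base * m / 3600 ))
done
echo                                   # prints: 1h 2h 4h 8h 16h
```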
/etc/fail2ban/jail.local
# Global Fail2ban settings — applies to all jails
[DEFAULT]
# Ban method: nftables (preferred) or iptables
banaction = nftables-multiport
banaction_allports = nftables-allports

# Email notifications — set your mail relay and destination
destemail = admin@argobox.com
sender    = fail2ban@argobox.com
mta       = sendmail

# Default ban parameters (jails can override)
bantime   = 1h
findtime  = 10m
maxretry  = 5

# Ignore local and Tailscale ranges
ignoreip  = 127.0.0.1/8 ::1 10.42.0.0/24 100.64.0.0/10

[sshd]
enabled = true

# Traefik auth failures (optional — enable if Traefik exposes basic auth)
[traefik-auth]
enabled  = true
port     = http,https
filter   = traefik-auth
logpath  = /var/log/traefik/access.log
maxretry = 5
findtime = 5m
bantime  = 30m
bantime.increment = true

Deploy

# Create the ssh-users group and add your account
sudo groupadd -f ssh-users
sudo usermod -aG ssh-users commander
# Install fail2ban if not present
sudo apt install -y fail2ban || sudo dnf install -y fail2ban || sudo emerge --ask net-analyzer/fail2ban
# Copy config files into place
sudo cp 99-hardening.conf /etc/ssh/sshd_config.d/
sudo cp sshd.local /etc/fail2ban/jail.d/
sudo cp jail.local /etc/fail2ban/
# Validate sshd config before reloading (CRITICAL — a bad config locks you out)
sshd -t
# Reload sshd to apply changes
sudo systemctl reload sshd || sudo rc-service sshd reload
# Enable and start fail2ban
sudo systemctl enable --now fail2ban || (sudo rc-update add fail2ban default && sudo rc-service fail2ban start)
# Verify fail2ban picked up the SSH jail
sudo fail2ban-client status sshd

Verify

# Confirm password auth is rejected
ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no commander@<host>  # should fail immediately
# Confirm pubkey auth still works (run from a machine with your key)
ssh -T commander@<host>
# Check which ciphers the server offers (should only show modern ones)
ssh -vv commander@<host> 2>&1 | grep "kex:" | head -5
# Verify fail2ban is running with the SSH jail active
sudo fail2ban-client status sshd
# Check auth log for recent activity
sudo journalctl -u sshd --since "1 hour ago" --no-pager | tail -20
# Deliberately trigger a ban (from a test IP, not your current session)
# Run 5 bad password attempts from another machine, then:
sudo fail2ban-client status sshd  # should show the test IP in "Banned IP list"
# Confirm unban happens after bantime expires, or manually unban:
sudo fail2ban-client set sshd unbanip <test-ip>

Rollback

sudo mv /etc/ssh/sshd_config.d/99-hardening.conf /etc/ssh/sshd_config.d/99-hardening.conf.bak
sudo systemctl reload sshd || sudo rc-service sshd reload
sudo fail2ban-client stop
sudo systemctl disable fail2ban || sudo rc-update del fail2ban default

Tailscale Subnet Router + ACL Policy + Exit Node

LAN subnets exposed through a controlled router node with auto-approved routes, exit node capability, IP forwarding, MagicDNS, and explicit ACL ownership.

Intermediate 20-30 min networking
Prerequisites
  • Tailscale installed on router node (v1.56+)
  • Admin access to Tailscale admin console or policy file (Settings > Access Controls)
  • Known LAN CIDR (e.g., 10.42.0.0/24)
  • Root/sudo access on the router node
  • Troubleshooting: if routes are advertised but not reachable, check that ip_forward is enabled and the router node firewall allows forwarding between tailscale0 and your LAN interface
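The troubleshooting check in that last prerequisite can be scripted on the router node; anything other than 1 here means forwarded packets are being dropped:

```shell
# Print the forwarding sysctls that subnet routing depends on
for key in net.ipv4.ip_forward net.ipv6.conf.all.forwarding; do
  printf '%s = %s\n' "$key" "$(sysctl -n "$key" 2>/dev/null || echo unknown)"
done
```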
/etc/tailscale/bootstrap-subnet-router.sh
#!/usr/bin/env bash
set -euo pipefail

# Tailscale subnet router bootstrap script
# This node will advertise local LAN routes and act as an exit node
# for remote clients who want full internet-via-homelab routing.

LAN_CIDR="10.42.0.0/24"

# Apply sysctl for IP forwarding (also persisted in /etc/sysctl.d/99-tailscale.conf)
sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1

tailscale up \
  --ssh \
  --advertise-routes="${LAN_CIDR}" \
  --advertise-exit-node \
  --accept-routes=false \
  --accept-dns=true \
  --hostname=argobox-subnet-router

echo ""
echo "Subnet router is advertising ${LAN_CIDR} and exit node capability."
echo ""
echo "If autoApprovers is configured in your ACL policy, routes will"
echo "be approved automatically. Otherwise, approve them manually:"
echo "  Tailscale Admin Console > Machines > this node > Edit route settings"
echo ""
echo "To use this as an exit node from a remote device:"
echo "  tailscale up --exit-node=argobox-subnet-router"
tailscale-acl.json
{
  "tagOwners": {
    "tag:subnet-router": ["autogroup:admin"],
    "tag:exit-node":     ["autogroup:admin"],
    "tag:server":        ["autogroup:admin"]
  },

  "groups": {
    "group:admins": ["[email protected]"]
  },

  "acls": [
    {
      "action": "accept",
      "src": ["group:admins"],
      "dst": ["*:*"],
      "comment": "Admins can reach everything on the tailnet and advertised subnets"
    },
    {
      "action": "accept",
      "src": ["tag:server"],
      "dst": ["tag:server:*"],
      "comment": "Servers can talk to each other (inter-node traffic)"
    },
    {
      "action": "accept",
      "src": ["group:admins"],
      "dst": ["10.42.0.0/24:*"],
      "comment": "Admins can reach the entire LAN via subnet router"
    }
  ],

  "autoApprovers": {
    "routes": {
      "10.42.0.0/24": ["tag:subnet-router"],
      "comment": "Auto-approve LAN subnet routes from tagged routers"
    },
    "exitNode": ["tag:exit-node"],
    "comment": "Auto-approve exit node capability from tagged nodes"
  },

  "ssh": [
    {
      "action": "accept",
      "src":  ["group:admins"],
      "dst":  ["tag:server"],
      "users": ["commander", "root"]
    }
  ],

  "dns": {
    "nameservers": ["10.42.0.1"],
    "domains":     ["argobox.tail"],
    "magicDNS":    true
  }
}
/etc/sysctl.d/99-tailscale.conf
# Required for Tailscale subnet routing and exit node functionality.
# Without these, the kernel will drop forwarded packets silently.
# Apply immediately: sudo sysctl -p /etc/sysctl.d/99-tailscale.conf

net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1

Deploy

# Persist IP forwarding settings
sudo cp 99-tailscale.conf /etc/sysctl.d/
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf
# Tag this machine in the admin console as tag:subnet-router and tag:exit-node
# (or use --advertise-tags if your ACL allows self-tagging)
# Apply the ACL policy in Tailscale Admin Console > Access Controls
# Paste the contents of tailscale-acl.json and save
# Run the bootstrap script
chmod +x /etc/tailscale/bootstrap-subnet-router.sh
sudo /etc/tailscale/bootstrap-subnet-router.sh
# Verify the routes appear as approved (not "awaiting approval")
tailscale status

Verify

# Confirm this node is advertising routes and exit node
tailscale status --json | jq "{ routes: .Self.AllowedIPs, exitNode: .Self.ExitNode, online: .Self.Online }"
# List all peers and their status
tailscale status --peers
# From a REMOTE device on the tailnet, test subnet route access:
ping -c 3 10.42.0.1        # Should reach the LAN gateway via subnet route
ssh [email protected]   # Should reach a LAN host via subnet route
# Test MagicDNS resolution (from any tailnet device)
dig argobox-subnet-router.argobox.tail
tailscale ping argobox-subnet-router
# Verify IP forwarding is active on the router node
sysctl net.ipv4.ip_forward  # Should show = 1
# Test exit node from a remote client:
# tailscale up --exit-node=argobox-subnet-router
# curl ifconfig.me  # Should show the homelab public IP

Rollback

# Remove subnet routes and exit node
tailscale up --advertise-routes= --advertise-exit-node=false --reset
# Disable IP forwarding if no longer needed
sudo rm /etc/sysctl.d/99-tailscale.conf
sudo sysctl -w net.ipv4.ip_forward=0
sudo sysctl -w net.ipv6.conf.all.forwarding=0
tailscale status

Proxmox VM Template with Cloud-Init

A reusable VM template with cloud-init that provisions with SSH keys, static IP, and user account in under 60 seconds.

Beginner 20-30 min proxmox
Prerequisites
  • Proxmox VE 8.x
  • Ubuntu/Debian cloud image ISO
  • SSH public key
/opt/argobox/proxmox/create-template.sh
#!/usr/bin/env bash
set -euo pipefail

TEMPLATE_VMID=9000
TEMPLATE_NAME="ubuntu-cloud-template"
CLOUD_IMAGE_URL="https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img"
CLOUD_IMAGE="/var/lib/vz/template/iso/noble-server-cloudimg-amd64.img"
STORAGE="local-lvm"

# Download cloud image if not present
if [[ ! -f "${CLOUD_IMAGE}" ]]; then
  wget -O "${CLOUD_IMAGE}" "${CLOUD_IMAGE_URL}"
fi

# Destroy existing template if present
if qm status "${TEMPLATE_VMID}" &>/dev/null; then
  qm destroy "${TEMPLATE_VMID}" --purge
fi

# Create VM
qm create "${TEMPLATE_VMID}" \
  --name "${TEMPLATE_NAME}" \
  --ostype l26 \
  --cpu cputype=host \
  --cores 2 \
  --memory 2048 \
  --net0 virtio,bridge=vmbr0 \
  --agent enabled=1 \
  --scsihw virtio-scsi-single

# Import disk
qm set "${TEMPLATE_VMID}" --scsi0 "${STORAGE}:0,import-from=${CLOUD_IMAGE},discard=on,iothread=1"

# Resize disk to 32GB
qm disk resize "${TEMPLATE_VMID}" scsi0 32G

# Add cloud-init drive
qm set "${TEMPLATE_VMID}" --ide2 "${STORAGE}:cloudinit"

# Set boot order
qm set "${TEMPLATE_VMID}" --boot order=scsi0

# Set serial console for cloud image compatibility
qm set "${TEMPLATE_VMID}" --serial0 socket --vga serial0

# Set cloud-init defaults
qm set "${TEMPLATE_VMID}" \
  --ciuser commander \
  --citype nocloud \
  --sshkeys ~/.ssh/authorized_keys \
  --ipconfig0 ip=dhcp

# Convert to template
qm template "${TEMPLATE_VMID}"

echo "Template ${TEMPLATE_VMID} created."
/opt/argobox/proxmox/cloud-init-userdata.yml
#cloud-config
users:
  - name: commander
    groups: sudo
    shell: /bin/bash
    sudo: ALL=(ALL) NOPASSWD:ALL
    lock_passwd: true
    ssh_authorized_keys:
      - ssh-ed25519 AAAA...your-public-key-here commander@argobox

package_update: true
package_upgrade: true
packages:
  - qemu-guest-agent
  - curl
  - wget
  - vim
  - htop
  - net-tools
  - unattended-upgrades

runcmd:
  - systemctl enable --now qemu-guest-agent
  - timedatectl set-timezone America/Chicago
  - sed -i 's/#PermitRootLogin.*/PermitRootLogin no/' /etc/ssh/sshd_config
  - sed -i 's/#PasswordAuthentication.*/PasswordAuthentication no/' /etc/ssh/sshd_config
  - systemctl restart sshd
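The deploy steps below clone a single VM; for several clones, a small wrapper helps. This is a hypothetical helper (VM IDs 111-113 and "last IP octet = VMID" are assumptions); with DRY_RUN=1 it only prints the qm commands:

```shell
TEMPLATE_VMID=9000
GATEWAY="10.42.0.1"
DRY_RUN="${DRY_RUN:-1}"                 # set to 0 on a real Proxmox host

run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

for vmid in 111 112 113; do             # hypothetical VMIDs
  run qm clone "$TEMPLATE_VMID" "$vmid" --name "vm-$vmid" --full
  run qm set "$vmid" --ipconfig0 "ip=10.42.0.${vmid}/24,gw=${GATEWAY}"
  run qm start "$vmid"
done
```

Review the printed commands, then rerun with DRY_RUN=0 to execute them.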

Deploy

chmod +x /opt/argobox/proxmox/create-template.sh
bash /opt/argobox/proxmox/create-template.sh
qm clone 9000 110 --name my-new-vm --full
qm set 110 --ipconfig0 ip=10.42.0.110/24,gw=10.42.0.1
qm start 110

Verify

qm status 110
ssh [email protected]
cloud-init status --wait
dpkg -l | grep qemu-guest-agent

Rollback

qm stop 110
qm destroy 110 --purge

K3s Single-Node to HA Cluster

A 3-node K3s HA cluster with embedded etcd, shared token, and load-balanced API server.

Advanced 45-60 min kubernetes
Prerequisites
  • 3 Linux nodes (2+ CPU, 4GB RAM each)
  • Network connectivity between nodes
  • Ports 6443, 2379-2380, 10250 open
/opt/argobox/k3s/k3s-init.sh
#!/usr/bin/env bash
set -euo pipefail

# First server node — initializes the HA cluster with embedded etcd
K3S_TOKEN="replace-with-a-long-random-token"
NODE_IP="10.42.0.50"
TLS_SAN="10.42.0.49"  # VIP or load balancer address

curl -sfL https://get.k3s.io | sh -s - server \
  --cluster-init \
  --token "${K3S_TOKEN}" \
  --node-ip "${NODE_IP}" \
  --tls-san "${TLS_SAN}" \
  --tls-san "${NODE_IP}" \
  --disable traefik \
  --disable servicelb \
  --write-kubeconfig-mode 644 \
  --etcd-expose-metrics \
  --kube-apiserver-arg default-not-ready-toleration-seconds=30 \
  --kube-apiserver-arg default-unreachable-toleration-seconds=30

echo "First server node initialized. Waiting for node to be ready..."
kubectl wait --for=condition=Ready node/$(hostname) --timeout=120s
/opt/argobox/k3s/k3s-join.sh
#!/usr/bin/env bash
set -euo pipefail

# Additional server nodes — join as control plane peers
K3S_TOKEN="replace-with-a-long-random-token"
FIRST_SERVER="https://10.42.0.50:6443"
NODE_IP="$(hostname -I | awk '{print $1}')"
TLS_SAN="10.42.0.49"

curl -sfL https://get.k3s.io | sh -s - server \
  --server "${FIRST_SERVER}" \
  --token "${K3S_TOKEN}" \
  --node-ip "${NODE_IP}" \
  --tls-san "${TLS_SAN}" \
  --disable traefik \
  --disable servicelb \
  --write-kubeconfig-mode 644 \
  --etcd-expose-metrics \
  --kube-apiserver-arg default-not-ready-toleration-seconds=30 \
  --kube-apiserver-arg default-unreachable-toleration-seconds=30

echo "Server node joined. Verifying cluster membership..."
kubectl get nodes
/opt/argobox/k3s/k3s-agent.sh
#!/usr/bin/env bash
set -euo pipefail

# Worker/agent node — joins the cluster as a workload runner only
K3S_TOKEN="replace-with-a-long-random-token"
K3S_URL="https://10.42.0.49:6443"  # Point at VIP or load balancer
NODE_IP="$(hostname -I | awk '{print $1}')"

curl -sfL https://get.k3s.io | sh -s - agent \
  --server "${K3S_URL}" \
  --token "${K3S_TOKEN}" \
  --node-ip "${NODE_IP}"

echo "Agent node joined the cluster."

Deploy

bash /opt/argobox/k3s/k3s-init.sh          # Run on first server
bash /opt/argobox/k3s/k3s-join.sh           # Run on second and third servers
bash /opt/argobox/k3s/k3s-agent.sh          # Run on any worker nodes
kubectl get nodes -o wide

Verify

kubectl get nodes -o wide
kubectl get pods -A
kubectl get endpoints kubernetes -o yaml   # Should list all server IPs
# Test HA: stop k3s on one server, confirm API still responds
ssh 10.42.0.51 "systemctl stop k3s"
kubectl get nodes                           # Downed node shows NotReady

Rollback

/usr/local/bin/k3s-uninstall.sh             # Run on each server node
/usr/local/bin/k3s-agent-uninstall.sh       # Run on each agent node

Complete Monitoring Stack

Prometheus + Grafana + Alertmanager with node exporters, pre-built dashboards, and Slack/Discord alert routing.

Intermediate 35-45 min monitoring
Prerequisites
  • Docker Engine + Compose v2
  • Ports 3000, 9090, 9093, 9100 available
  • Optional: Slack/Discord webhook URL
/opt/argobox/monitoring/docker-compose.yml
services:
  prometheus:
    image: prom/prometheus:v2.51.0
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=30d
      - --web.enable-lifecycle
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert-rules.yml:/etc/prometheus/alert-rules.yml:ro
      - prometheus_data:/prometheus
      # Needed by the docker-services job (docker_sd_configs) in prometheus.yml
      - /var/run/docker.sock:/var/run/docker.sock:ro
    ports:
      - "9090:9090"
    restart: unless-stopped

  grafana:
    image: grafana/grafana:10.4.1
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD_FILE: /run/secrets/grafana_admin_pw
      GF_USERS_ALLOW_SIGN_UP: "false"
      GF_SERVER_ROOT_URL: http://10.42.0.10:3000
    secrets:
      - grafana_admin_pw
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"
    depends_on:
      - prometheus
    restart: unless-stopped

  alertmanager:
    image: prom/alertmanager:v0.27.0
    command:
      - --config.file=/etc/alertmanager/alertmanager.yml
      - --storage.path=/alertmanager
    volumes:
      - ./alertmanager/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
      - alertmanager_data:/alertmanager
    ports:
      - "9093:9093"
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:v1.7.0
    command:
      - --path.rootfs=/host
      - --collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)
    volumes:
      - /:/host:ro,rslave
    network_mode: host
    pid: host
    restart: unless-stopped

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    ports:
      - "8080:8080"
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped

secrets:
  grafana_admin_pw:
    file: ./secrets/grafana_admin_pw.txt

volumes:
  prometheus_data:
  grafana_data:
  alertmanager_data:
/opt/argobox/monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - /etc/prometheus/alert-rules.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: node-exporter
    static_configs:
      - targets:
          - 10.42.0.10:9100
          - 10.42.0.11:9100
          - 10.42.0.12:9100
        labels:
          env: homelab

  - job_name: cadvisor
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: docker-services
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 30s
    relabel_configs:
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: "true"
        action: keep
      - source_labels: [__meta_docker_container_name]
        target_label: container_name
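The docker-services job above only keeps containers that carry a prometheus-scrape label. A container opts in through its compose labels; a hypothetical service entry (image and name are placeholders) might look like:

```yaml
services:
  myapp:                          # hypothetical service
    image: ghcr.io/acme/myapp:1.0
    labels:
      # surfaces to Prometheus as __meta_docker_container_label_prometheus_scrape
      prometheus-scrape: "true"
```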
/opt/argobox/monitoring/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: [alertname, severity]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: slack-notifications
  routes:
    - match:
        severity: critical
      receiver: slack-critical
      repeat_interval: 1h
    - match:
        severity: warning
      receiver: slack-notifications

receivers:
  - name: slack-notifications
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: "#homelab-alerts"
        title: '{{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.instance }}*: {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

  - name: slack-critical
    slack_configs:
      - api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        channel: "#homelab-critical"
        title: 'CRITICAL: {{ .GroupLabels.alertname }}'
        text: >-
          {{ range .Alerts }}
          *{{ .Labels.instance }}*: {{ .Annotations.description }}
          {{ end }}
        send_resolved: true

inhibit_rules:
  - source_match:
      severity: critical
    target_match:
      severity: warning
    equal: [alertname, instance]
/opt/argobox/monitoring/prometheus/alert-rules.yml
groups:
  - name: node-alerts
    rules:
      - alert: NodeDown
        expr: up{job="node-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Node {{ $labels.instance }} is down"
          description: "Node exporter on {{ $labels.instance }} has been unreachable for more than 2 minutes."

      - alert: HighCpuUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is above 85% for more than 10 minutes (current: {{ $value | printf "%.1f" }}%)."

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 15
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} has less than 15% free space (current: {{ $value | printf "%.1f" }}%)."

      - alert: DiskSpaceCritical
        expr: (node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"} / node_filesystem_size_bytes) * 100 < 5
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Critical disk space on {{ $labels.instance }}"
          description: "Filesystem {{ $labels.mountpoint }} has less than 5% free space (current: {{ $value | printf "%.1f" }}%)."

      - alert: HighMemoryUsage
        expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) * 100 > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 90% for more than 10 minutes (current: {{ $value | printf "%.1f" }}%)."

Deploy

mkdir -p /opt/argobox/monitoring/{prometheus,alertmanager,secrets}
echo "replace-with-strong-password" > /opt/argobox/monitoring/secrets/grafana_admin_pw.txt
chmod 600 /opt/argobox/monitoring/secrets/grafana_admin_pw.txt
cd /opt/argobox/monitoring
docker compose up -d

Verify

docker compose ps
curl -s http://10.42.0.10:9090/api/v1/targets | jq '.data.activeTargets[] | {instance: .labels.instance, health: .health}'
curl -s http://10.42.0.10:3000/api/health
curl -s http://10.42.0.10:9093/api/v2/status | jq .cluster

Rollback

cd /opt/argobox/monitoring
docker compose down -v

GitOps with ArgoCD

ArgoCD managing application deployments from a Git repository with automatic sync and health monitoring.

Advanced 40-55 min kubernetes
Prerequisites
  • Working Kubernetes cluster
  • kubectl access
  • Git repository for app manifests
/opt/argobox/argocd/install.sh
#!/usr/bin/env bash
set -euo pipefail

NAMESPACE="argocd"
ARGOCD_VERSION="7.3.3"

# Create namespace
kubectl create namespace "${NAMESPACE}" --dry-run=client -o yaml | kubectl apply -f -

# Add Helm repo
helm repo add argo https://argoproj.github.io/argo-helm
helm repo update

# Install ArgoCD with custom values
helm upgrade --install argocd argo/argo-cd \
  --namespace "${NAMESPACE}" \
  --version "${ARGOCD_VERSION}" \
  --values /opt/argobox/argocd/argocd-values.yaml \
  --wait --timeout 5m

# Wait for all pods
kubectl wait --for=condition=Ready pods --all \
  -n "${NAMESPACE}" --timeout=300s

# Get initial admin password
echo ""
echo "ArgoCD initial admin password:"
kubectl -n "${NAMESPACE}" get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d
echo ""
echo ""
echo "Access the UI at https://argocd.argobox.com or port-forward:"
echo "  kubectl port-forward svc/argocd-server -n ${NAMESPACE} 8443:443"
/opt/argobox/argocd/argocd-values.yaml
configs:
  params:
    server.insecure: false
  cm:
    url: https://argocd.argobox.com
    exec.enabled: "false"
    admin.enabled: "true"
    timeout.reconciliation: 180s
  repositories:
    private-repo:
      url: https://github.com/argobox/k8s-manifests.git
      type: git
  rbac:
    policy.default: role:readonly
    policy.csv: |
      p, role:admin, applications, *, */*, allow
      p, role:admin, clusters, *, *, allow
      p, role:admin, repositories, *, *, allow
      g, admin, role:admin

server:
  replicas: 2
  ingress:
    enabled: true
    ingressClassName: nginx
    hostname: argocd.argobox.com
    tls: true
    annotations:
      cert-manager.io/cluster-issuer: letsencrypt-prod
      nginx.ingress.kubernetes.io/ssl-passthrough: "true"
      nginx.ingress.kubernetes.io/backend-protocol: HTTPS

controller:
  replicas: 1
  resources:
    requests:
      cpu: 250m
      memory: 512Mi
    limits:
      cpu: "1"
      memory: 1Gi

repoServer:
  replicas: 2
  resources:
    requests:
      cpu: 100m
      memory: 256Mi
    limits:
      cpu: "1"
      memory: 512Mi

applicationSet:
  replicas: 1
/opt/argobox/argocd/application.yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: homelab-apps
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: https://github.com/argobox/k8s-manifests.git
    targetRevision: main
    path: apps
    directory:
      recurse: true
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: false
    syncOptions:
      - CreateNamespace=true
      - PrunePropagationPolicy=foreground
      - PruneLast=true
    retry:
      limit: 3
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m

Deploy

chmod +x /opt/argobox/argocd/install.sh
bash /opt/argobox/argocd/install.sh
kubectl apply -f /opt/argobox/argocd/application.yaml

Verify

kubectl get pods -n argocd
kubectl get applications -n argocd
kubectl -n argocd get application homelab-apps -o jsonpath="{.status.sync.status}"
kubectl -n argocd get application homelab-apps -o jsonpath="{.status.health.status}"
# Or via CLI (run `argocd login argocd.argobox.com` first):
argocd app list --server argocd.argobox.com

Rollback

argocd app delete homelab-apps --server argocd.argobox.com
helm uninstall argocd -n argocd
kubectl delete namespace argocd

Automated Backup Pipeline

Scheduled Restic backups of Docker volumes to a remote repository with automatic pruning and health checks.

Intermediate 25-35 min docker
Prerequisites
  • Docker Engine + Compose v2
  • Restic installed or container image
  • Remote storage target (S3, SFTP, or local path)
/opt/argobox/backup/docker-compose.yml
services:
  restic-backup:
    image: lobaro/restic-backup-docker:latest
    hostname: argobox-backup
    env_file:
      - ./backup.env
    volumes:
      - /opt/argobox:/data/argobox:ro
      - /var/lib/docker/volumes:/data/docker-volumes:ro
      - restic_cache:/root/.cache/restic
      - ./healthcheck.sh:/usr/local/bin/healthcheck.sh:ro
    healthcheck:
      test: ["CMD", "bash", "/usr/local/bin/healthcheck.sh"]
      interval: 1h
      timeout: 30s
      retries: 3
      start_period: 10m
    restart: unless-stopped

  restic-prune:
    image: lobaro/restic-backup-docker:latest
    hostname: argobox-prune
    env_file:
      - ./backup.env
    environment:
      - BACKUP_CRON=0 0 3 * * SUN
      - RESTIC_FORGET_ARGS=--prune --keep-daily 7 --keep-weekly 4 --keep-monthly 6
    volumes:
      - restic_cache:/root/.cache/restic
    entrypoint: /usr/local/bin/backup
    command: ["prune"]
    restart: unless-stopped

volumes:
  restic_cache:
/opt/argobox/backup/backup.env
RESTIC_REPOSITORY=sftp:[email protected]:/backups/argobox
RESTIC_PASSWORD=replace-with-a-strong-restic-password

BACKUP_CRON=0 0 2 * * *
RESTIC_FORGET_ARGS=--keep-daily 7 --keep-weekly 4 --keep-monthly 12 --keep-yearly 2

AWS_ACCESS_KEY_ID=
AWS_SECRET_ACCESS_KEY=

TZ=America/Chicago
/opt/argobox/backup/healthcheck.sh
#!/usr/bin/env bash
set -euo pipefail

# Verify the repository is accessible and the most recent snapshot
# is less than 26 hours old (allows for a daily schedule with margin).
# Requires jq and GNU date inside the backup container.

MAX_AGE_HOURS=26

LATEST_SNAPSHOT=$(restic snapshots --json --latest 1 2>/dev/null)

if [[ -z "${LATEST_SNAPSHOT}" || "${LATEST_SNAPSHOT}" == "[]" ]]; then
  echo "No snapshots found"
  exit 1
fi

SNAPSHOT_TIME=$(echo "${LATEST_SNAPSHOT}" | jq -r '.[0].time')
SNAPSHOT_EPOCH=$(date -d "${SNAPSHOT_TIME}" +%s)
NOW_EPOCH=$(date +%s)
AGE_HOURS=$(( (NOW_EPOCH - SNAPSHOT_EPOCH) / 3600 ))

if (( AGE_HOURS > MAX_AGE_HOURS )); then
  echo "Latest snapshot is ${AGE_HOURS}h old (max: ${MAX_AGE_HOURS}h)"
  exit 1
fi

echo "Healthy: latest snapshot is ${AGE_HOURS}h old"
exit 0
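The freshness check reduces to epoch arithmetic. The same math with canned timestamps (both values hypothetical), assuming GNU date:

```shell
# Same age math as the healthcheck, with fixed inputs for illustration
SNAPSHOT_TIME="2024-01-01T02:00:00Z"   # pretend latest snapshot
NOW="2024-01-02T06:00:00Z"             # pretend current time, 28h later
MAX_AGE_HOURS=26
SNAPSHOT_EPOCH=$(date -u -d "$SNAPSHOT_TIME" +%s)
NOW_EPOCH=$(date -u -d "$NOW" +%s)
AGE_HOURS=$(( (NOW_EPOCH - SNAPSHOT_EPOCH) / 3600 ))
if (( AGE_HOURS > MAX_AGE_HOURS )); then
  echo "stale: ${AGE_HOURS}h old"    # this branch fires here (28 > 26)
else
  echo "fresh: ${AGE_HOURS}h old"
fi
```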

Deploy

mkdir -p /opt/argobox/backup
chmod 600 /opt/argobox/backup/backup.env
chmod +x /opt/argobox/backup/healthcheck.sh
# Initialize the repository (first time only)
restic -r sftp:[email protected]:/backups/argobox init
cd /opt/argobox/backup
docker compose up -d

Verify

docker compose -f /opt/argobox/backup/docker-compose.yml ps
docker compose -f /opt/argobox/backup/docker-compose.yml logs restic-backup --tail 30
restic -r sftp:[email protected]:/backups/argobox snapshots
# Test restore to a temp directory
restic -r sftp:[email protected]:/backups/argobox restore latest --target /tmp/restore-test

Rollback

cd /opt/argobox/backup
docker compose down

VLAN Segmentation for Homelab

Isolated network segments for IoT, management, and services with inter-VLAN routing controlled by firewall rules.

Intermediate 30-40 min networking
Prerequisites
  • Managed switch with VLAN support
  • Router/firewall with VLAN trunking (OPNsense, pfSense, or Linux)
  • Console access to switch
/etc/network/interfaces.d/vlans
# VLAN sub-interfaces on trunk port (parent: eth0)
# Each VLAN gets its own subnet on the 10.42.x.0/24 range

# VLAN 10 — Management (switches, IPMI, Proxmox hosts)
auto eth0.10
iface eth0.10 inet static
    address 10.42.10.1/24
    vlan-raw-device eth0

# VLAN 20 — Services (Docker hosts, K3s nodes, databases)
auto eth0.20
iface eth0.20 inet static
    address 10.42.20.1/24
    vlan-raw-device eth0

# VLAN 30 — IoT (sensors, smart home devices, cameras)
auto eth0.30
iface eth0.30 inet static
    address 10.42.30.1/24
    vlan-raw-device eth0

# VLAN 40 — Guest (internet-only, no LAN access)
auto eth0.40
iface eth0.40 inet static
    address 10.42.40.1/24
    vlan-raw-device eth0
/etc/nftables.d/vlan-policy.conf
#!/usr/sbin/nft -f

table inet vlan_policy {

    chain forward {
        type filter hook forward priority 0; policy drop;

        # Allow established/related connections
        ct state established,related accept

        # Management (VLAN 10) — full access to all VLANs
        iifname "eth0.10" accept

        # Services (VLAN 20) — internet access + respond to management
        iifname "eth0.20" oifname "eth0" accept
        iifname "eth0.20" oifname "eth0.10" ct state established,related accept

        # IoT (VLAN 30) — isolated, internet via explicit allow
        iifname "eth0.30" oifname "eth0" ip daddr != { 10.42.0.0/16 } accept
        iifname "eth0.30" oifname "eth0.30" accept

        # Guest (VLAN 40) — internet-only, no LAN access at all
        iifname "eth0.40" oifname "eth0" ip daddr != { 10.42.0.0/16 } accept

        # Log and drop everything else
        log prefix "VLAN-DROP: " counter drop
    }

    chain input {
        type filter hook input priority 0; policy drop;

        # Loopback
        iif lo accept

        # Allow established/related
        ct state established,related accept

        # Management VLAN can reach the router
        iifname "eth0.10" accept

        # DHCP from all VLANs
        iifname "eth0.*" udp dport { 67, 68 } accept

        # DNS from all VLANs
        iifname "eth0.*" tcp dport 53 accept
        iifname "eth0.*" udp dport 53 accept

        # Drop anything else to the router from non-management VLANs
        log prefix "INPUT-DROP: " counter drop
    }
}
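The `ip daddr != { 10.42.0.0/16 }` match is what keeps IoT and guest traffic internet-only. Because the range sits on an octet boundary, the decision can be sketched in pure bash (a simplification of what nftables does; addresses are examples):

```shell
# Returns success when the destination falls inside 10.42.0.0/16,
# i.e. the traffic the IoT/guest rules refuse to forward
in_lab_range() {
  local a b rest
  IFS=. read -r a b rest <<<"$1"
  [ "$a" = "10" ] && [ "$b" = "42" ]
}

for dst in 8.8.8.8 10.42.10.1 1.1.1.1; do
  if in_lab_range "$dst"; then echo "$dst -> drop"; else echo "$dst -> allow"; fi
done
```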

Deploy

# Load 8021q kernel module if not loaded
modprobe 8021q
echo "8021q" >> /etc/modules-load.d/vlans.conf
# Apply interface configuration
ifup eth0.10 eth0.20 eth0.30 eth0.40
# Enable IP forwarding
sysctl -w net.ipv4.ip_forward=1
echo "net.ipv4.ip_forward=1" >> /etc/sysctl.d/99-routing.conf
# Apply firewall rules
nft -f /etc/nftables.d/vlan-policy.conf

Verify

ip -d link show | grep "vlan protocol"
ip addr show eth0.10 eth0.20 eth0.30 eth0.40
# From a VLAN 30 (IoT) host, verify isolation:
ping -c 2 10.42.10.1    # Should FAIL (management blocked)
ping -c 2 10.42.20.1    # Should FAIL (services blocked)
ping -c 2 8.8.8.8       # Should PASS (internet allowed)
# From a VLAN 10 (management) host:
ping -c 2 10.42.20.1    # Should PASS (management has full access)
nft list table inet vlan_policy

Rollback

ifdown eth0.10 eth0.20 eth0.30 eth0.40
rm /etc/network/interfaces.d/vlans
nft delete table inet vlan_policy
rm /etc/nftables.d/vlan-policy.conf

ZFS Pool Setup + Automatic Snapshots

A mirrored ZFS pool with automatic snapshots, compression, and email alerts on disk errors.

Intermediate 30-40 min storage
Prerequisites
  • 2+ drives of the same size (SATA or NVMe)
  • ZFS kernel module loaded (zfs-dkms or built-in)
  • zfs-utils / zfsutils-linux package installed
  • Mail relay configured for ZED alerts (optional)
/opt/argobox/zfs/setup-zfs-pool.sh
#!/usr/bin/env bash
set -euo pipefail

# ZFS mirror pool creation script
# Requires: 2 drives of the same size, ZFS kernel module loaded
#
# Verify your target drives first:
#   lsblk -d -o NAME,SIZE,MODEL,SERIAL
#   wipefs -a /dev/sdX  (destroys all data — be certain)

POOL_NAME="tank"
DISK1="/dev/sda"
DISK2="/dev/sdb"

echo "Creating mirrored ZFS pool '${POOL_NAME}' with ${DISK1} and ${DISK2}..."
echo "This will DESTROY all data on both drives. Ctrl+C to abort."
sleep 5

# Create mirror pool with recommended settings
zpool create -f \
  -o ashift=12 \
  -o autotrim=on \
  -O compression=zstd \
  -O atime=off \
  -O relatime=on \
  -O xattr=sa \
  -O acltype=posixacl \
  -O dnodesize=auto \
  -O normalization=formD \
  -O mountpoint=/${POOL_NAME} \
  "${POOL_NAME}" mirror "${DISK1}" "${DISK2}"

echo "Pool created. Setting up datasets..."

# Dataset: media — large sequential reads (video, music, photos)
zfs create -o recordsize=1M -o compression=lz4 \
  -o mountpoint=/${POOL_NAME}/media \
  "${POOL_NAME}/media"

# Dataset: backups — general backup storage
zfs create -o recordsize=128k -o compression=zstd \
  -o mountpoint=/${POOL_NAME}/backups \
  "${POOL_NAME}/backups"

# Dataset: docker — container volumes and configs
zfs create -o recordsize=128k -o compression=zstd \
  -o mountpoint=/${POOL_NAME}/docker \
  "${POOL_NAME}/docker"

# Dataset: databases — small random I/O (Postgres, SQLite)
zfs create -o recordsize=64k -o compression=zstd \
  -o logbias=latency \
  -o mountpoint=/${POOL_NAME}/databases \
  "${POOL_NAME}/databases"

echo ""
echo "Pool and datasets created:"
zfs list -o name,used,avail,refer,mountpoint,compression,recordsize \
  -r "${POOL_NAME}"
echo ""
zpool status "${POOL_NAME}"
echo ""
echo "Next steps:"
echo "  1. Set up automatic snapshots (zfs-auto-snapshot.sh)"
echo "  2. Configure ZED for email alerts (zed.rc)"
echo "  3. Schedule a weekly scrub: echo '0 2 * * 0 root zpool scrub ${POOL_NAME}' >> /etc/crontab"
/opt/argobox/zfs/zfs-auto-snapshot.sh
#!/usr/bin/env bash
set -euo pipefail

# ZFS automatic snapshot script with configurable retention
# Install via cron — see bottom of script for suggested schedule.
#
# Usage: zfs-auto-snapshot.sh <label> <keep-count> [dataset]
# Examples:
#   zfs-auto-snapshot.sh hourly 24 tank
#   zfs-auto-snapshot.sh daily 30 tank
#   zfs-auto-snapshot.sh weekly 8 tank
#   zfs-auto-snapshot.sh monthly 12 tank

LABEL="${1:?Usage: zfs-auto-snapshot.sh <label> <keep> [dataset]}"
KEEP="${2:?Usage: zfs-auto-snapshot.sh <label> <keep> [dataset]}"
DATASET="${3:-tank}"
TIMESTAMP="$(date +%Y-%m-%d_%H%M)"
SNAP_NAME="${DATASET}@auto-${LABEL}-${TIMESTAMP}"

# Create recursive snapshot (covers all child datasets)
zfs snapshot -r "${SNAP_NAME}"
echo "Created snapshot: ${SNAP_NAME}"

# Prune old snapshots of this label, keeping the N most recent.
# List only the top-level dataset's snapshots (-d 1) so child-dataset
# snapshots don't count against KEEP, and destroy recursively (-r) to
# match how the snapshots were created.
zfs list -H -t snapshot -o name -S creation -d 1 "${DATASET}" \
  | grep "^${DATASET}@auto-${LABEL}-" \
  | tail -n +$(( KEEP + 1 )) \
  | while read -r old_snap; do
      echo "Destroying old snapshot: ${old_snap}"
      zfs destroy -r "${old_snap}"
    done

echo "Retention: keeping ${KEEP} most recent '${LABEL}' snapshots."

# Suggested cron entries (add to /etc/crontab or /etc/cron.d/zfs-snapshots):
#
# # Hourly — keep 24
# 0 * * * * root /opt/argobox/zfs/zfs-auto-snapshot.sh hourly 24 tank
#
# # Daily at midnight — keep 30
# 0 0 * * * root /opt/argobox/zfs/zfs-auto-snapshot.sh daily 30 tank
#
# # Weekly on Sunday at 1 AM — keep 8
# 0 1 * * 0 root /opt/argobox/zfs/zfs-auto-snapshot.sh weekly 8 tank
#
# # Monthly on the 1st at 2 AM — keep 12
# 0 2 1 * * root /opt/argobox/zfs/zfs-auto-snapshot.sh monthly 12 tank
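The retention logic keeps the first KEEP entries of a newest-first list and destroys the rest. Exercised with a mock list (snapshot names hypothetical):

```shell
# Mock newest-first snapshot list, as `zfs list -S creation` would emit
KEEP=3
SNAPSHOTS="tank@auto-daily-2024-01-05_0000
tank@auto-daily-2024-01-04_0000
tank@auto-daily-2024-01-03_0000
tank@auto-daily-2024-01-02_0000
tank@auto-daily-2024-01-01_0000"

# Everything past the first KEEP lines is selected for destruction
TO_DESTROY=$(printf '%s\n' "$SNAPSHOTS" | tail -n +$(( KEEP + 1 )))
printf 'would destroy:\n%s\n' "$TO_DESTROY"   # the two oldest entries
```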
/etc/zfs/zed.d/zed.rc
##
## ZFS Event Daemon (ZED) configuration
## Sends email alerts on disk errors, scrub results, and pool state changes.
##

# Email recipient for ZED alerts
ZED_EMAIL_ADDR="[email protected]"

# Email sender (requires a working MTA: postfix, msmtp, etc.)
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"

# Minimum seconds between repeated notifications for the same event class
# (throttles repeated checksum/I-O error mails)
ZED_NOTIFY_INTERVAL_SECS=3600

# Notify on all events, including successful ones (e.g. scrub finish),
# not just degraded-pool events
ZED_NOTIFY_VERBOSE=1

# Automatically start a scrub after a resilver completes
ZED_SCRUB_AFTER_RESILVER=1

# Use the system mailer
ZED_EMAIL_PROG="mail"

# Syslog integration
ZED_SYSLOG_TAG="zed"
ZED_SYSLOG_SUBCLASS_INCLUDE="checksum_error|io_error|data_error|scrub_finish|scrub_start|vdev.*"

# Lock file
ZED_LOCKDIR="/var/lock"

Deploy

# Verify ZFS kernel module is loaded
lsmod | grep zfs || sudo modprobe zfs
# Identify your target drives (double-check serials — wrong drive = data loss)
lsblk -d -o NAME,SIZE,MODEL,SERIAL
# Run the pool creation script
chmod +x /opt/argobox/zfs/setup-zfs-pool.sh
sudo bash /opt/argobox/zfs/setup-zfs-pool.sh
# Install the snapshot script
sudo cp /opt/argobox/zfs/zfs-auto-snapshot.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/zfs-auto-snapshot.sh
# Add cron jobs for automatic snapshots
echo "0 * * * * root /usr/local/bin/zfs-auto-snapshot.sh hourly 24 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
echo "0 0 * * * root /usr/local/bin/zfs-auto-snapshot.sh daily 30 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
echo "0 1 * * 0 root /usr/local/bin/zfs-auto-snapshot.sh weekly 8 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
echo "0 2 1 * * root /usr/local/bin/zfs-auto-snapshot.sh monthly 12 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
# Add weekly scrub
echo "0 2 * * 0 root zpool scrub tank" | sudo tee -a /etc/cron.d/zfs-scrub
# Configure ZED for email alerts
sudo cp /opt/argobox/zfs/zed.rc /etc/zfs/zed.d/zed.rc
sudo systemctl restart zfs-zed || sudo systemctl restart zed || sudo rc-service zed restart

Verify

# Check pool health and layout
zpool status tank
# List all datasets and their properties
zfs list -o name,used,avail,refer,mountpoint,compression,recordsize -r tank
# Take a manual snapshot and confirm it exists
sudo zfs snapshot -r tank@manual-test
zfs list -t snapshot -r tank
# Run a scrub and check for errors
sudo zpool scrub tank
zpool status tank  # wait for scrub to complete, check for 0 errors
# Test snapshot rollback (on a non-critical dataset)
echo "test data" | sudo tee /tank/backups/test-file.txt
sudo zfs snapshot tank/backups@rollback-test
sudo rm /tank/backups/test-file.txt
sudo zfs rollback tank/backups@rollback-test
cat /tank/backups/test-file.txt  # should show "test data"
# Clean up test snapshots
sudo zfs destroy tank@manual-test
sudo zfs destroy tank/backups@rollback-test
sudo rm /tank/backups/test-file.txt
# Verify ZED is running for email alerts
sudo systemctl status zfs-zed || sudo systemctl status zed || sudo rc-service zed status

Rollback

# Export the pool (unmounts all datasets, safe)
sudo zpool export tank
# Or permanently destroy (ALL DATA LOST):
# sudo zpool destroy tank
# Remove cron jobs
sudo rm -f /etc/cron.d/zfs-snapshots /etc/cron.d/zfs-scrub

Cert-Manager + Let's Encrypt (DNS-01 & HTTP-01)

Automatic TLS certificates for all Ingress resources via Let's Encrypt with DNS-01 challenge for wildcards and HTTP-01 for standard domains.

Intermediate 25-35 min kubernetes
Prerequisites
  • K3s or K8s cluster with Ingress controller (Traefik or nginx)
  • Domain with DNS hosted on Cloudflare (or other cert-manager supported provider)
  • Helm v3 installed
  • Cloudflare API token with Zone:DNS:Edit and Zone:Zone:Read permissions
/opt/argobox/k8s/cert-manager/cert-manager-values.yaml
# Helm values for cert-manager
# Install with: helm install cert-manager jetstack/cert-manager -f cert-manager-values.yaml -n cert-manager

installCRDs: true

replicaCount: 1

resources:
  requests:
    cpu: 50m
    memory: 128Mi
  limits:
    cpu: 200m
    memory: 256Mi

webhook:
  replicaCount: 1
  resources:
    requests:
      cpu: 25m
      memory: 64Mi
    limits:
      cpu: 100m
      memory: 128Mi

cainjector:
  resources:
    requests:
      cpu: 50m
      memory: 128Mi
    limits:
      cpu: 200m
      memory: 256Mi

prometheus:
  enabled: true
  servicemonitor:
    enabled: true
    namespace: monitoring
    labels:
      release: kube-prometheus

# Log level: 2 = info, 5 = debug
global:
  logLevel: 2
/opt/argobox/k8s/cert-manager/cloudflare-secret.yaml
# Cloudflare API token for DNS-01 challenge
# Create a scoped API token at: https://dash.cloudflare.com/profile/api-tokens
#
# Token permissions needed:
#   Zone > DNS > Edit
#   Zone > Zone > Read
#
# Zone resources:
#   Include > Specific zone > homelab.example.com
#
# Generate the base64 value:
#   echo -n "your-cloudflare-api-token" | base64

apiVersion: v1
kind: Secret
metadata:
  name: cloudflare-api-token
  namespace: cert-manager
type: Opaque
data:
  api-token: <base64-encoded-cloudflare-api-token>
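The encoding step in the comments is easy to get wrong: a stray trailing newline changes the stored value and breaks authentication. A quick round-trip with a placeholder token (never a real one):

```shell
# printf avoids the trailing newline that a bare `echo` would add
TOKEN="example-cloudflare-token"   # placeholder value
B64=$(printf '%s' "$TOKEN" | base64)
DECODED=$(printf '%s' "$B64" | base64 -d)
echo "encoded: ${B64}"
echo "round-trip ok: $([ "$DECODED" = "$TOKEN" ] && echo yes || echo no)"
```

Alternatively, `kubectl create secret generic ... --from-literal=api-token=...` handles the encoding for you.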
/opt/argobox/k8s/cert-manager/clusterissuer.yaml
# ClusterIssuer using both HTTP-01 and DNS-01 solvers
# DNS-01 is required for wildcard certificates
# HTTP-01 works for specific domains when port 80 is reachable

apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-prod-account-key
    solvers:
      # DNS-01 solver via Cloudflare — handles wildcards and private domains
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        selector:
          dnsZones:
            - "homelab.example.com"
      # HTTP-01 solver via Traefik — handles standard public domains
      - http01:
          ingress:
            ingressClassName: traefik
---
# Staging issuer for testing (higher rate limits, fake certs)
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-staging
spec:
  acme:
    server: https://acme-staging-v02.api.letsencrypt.org/directory
    email: [email protected]
    privateKeySecretRef:
      name: letsencrypt-staging-account-key
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
        selector:
          dnsZones:
            - "homelab.example.com"
/opt/argobox/k8s/cert-manager/certificate.yaml
# Wildcard certificate via DNS-01 (covers *.homelab.example.com)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: wildcard-homelab
  namespace: default
spec:
  secretName: wildcard-homelab-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "homelab.example.com"
    - "*.homelab.example.com"
  # Certificate will auto-renew 30 days before expiry
  renewBefore: 720h
---
# Specific certificate via HTTP-01 (for a public-facing service)
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: app-homelab
  namespace: app-prod
spec:
  secretName: app-homelab-tls
  issuerRef:
    name: letsencrypt-prod
    kind: ClusterIssuer
  dnsNames:
    - "app.homelab.example.com"
  renewBefore: 720h

Deploy

# Add the Jetstack Helm repo
helm repo add jetstack https://charts.jetstack.io && helm repo update
# Create the namespace
kubectl create namespace cert-manager --dry-run=client -o yaml | kubectl apply -f -
# Install cert-manager with Helm
helm upgrade --install cert-manager jetstack/cert-manager \
  --namespace cert-manager \
  --values /opt/argobox/k8s/cert-manager/cert-manager-values.yaml \
  --wait --timeout 5m
# Wait for all cert-manager pods to be ready
kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s
# Apply the Cloudflare API token secret
kubectl apply -f /opt/argobox/k8s/cert-manager/cloudflare-secret.yaml
# Apply the ClusterIssuers
kubectl apply -f /opt/argobox/k8s/cert-manager/clusterissuer.yaml
# Apply the Certificate resources
kubectl apply -f /opt/argobox/k8s/cert-manager/certificate.yaml
# To use with any Ingress, add this annotation:
#   cert-manager.io/cluster-issuer: letsencrypt-prod

Verify

# Check cert-manager pods are running
kubectl get pods -n cert-manager
# Check ClusterIssuers are ready
kubectl get clusterissuers -o wide
# Check certificate status (READY should be True)
kubectl get certificates -A
# Inspect a specific certificate request for errors
kubectl describe certificaterequest -n default
# Check for any pending challenges
kubectl get challenges -A
# Verify the TLS secret was created
kubectl get secret wildcard-homelab-tls -n default
# Test the cert from outside the cluster
curl -vI https://app.homelab.example.com 2>&1 | grep -E "subject:|issuer:|expire"

Rollback

# Delete certificates and issuers first
kubectl delete -f /opt/argobox/k8s/cert-manager/certificate.yaml --ignore-not-found
kubectl delete -f /opt/argobox/k8s/cert-manager/clusterissuer.yaml --ignore-not-found
kubectl delete -f /opt/argobox/k8s/cert-manager/cloudflare-secret.yaml --ignore-not-found
# Uninstall cert-manager
helm uninstall cert-manager -n cert-manager
# Remove CRDs (cert-manager CRDs stick around after helm uninstall)
kubectl delete -f https://github.com/cert-manager/cert-manager/releases/latest/download/cert-manager.crds.yaml
kubectl delete namespace cert-manager

Proxmox Cluster + HA + Ceph Storage

A 3-node Proxmox cluster with HA failover, shared storage via Ceph, fencing, and live migration.

Advanced 45-60 min proxmox
Prerequisites
  • 3 Proxmox VE 8.x nodes on the same subnet (10.42.0.0/24)
  • Dedicated cluster network recommended (separate NIC or VLAN for corosync traffic)
  • At least 1 unused OSD disk per node for Ceph
  • All nodes must resolve each other by hostname (/etc/hosts or DNS)
/opt/argobox/proxmox/cluster-setup.sh
#!/usr/bin/env bash
set -euo pipefail

# Proxmox Cluster Setup — Run on NODE 1 only
# Nodes: pve1 (10.42.0.31), pve2 (10.42.0.32), pve3 (10.42.0.33)
# This script creates the cluster on the first node.
# Nodes 2 and 3 join separately (see join commands below).

CLUSTER_NAME="argobox-cluster"
NODE1_IP="10.42.0.31"

echo "Creating Proxmox cluster '${CLUSTER_NAME}' on $(hostname)..."

# Create the cluster (run on node 1 only)
pvecm create "${CLUSTER_NAME}" --link0 "${NODE1_IP}"

echo "Cluster created. Verifying..."
pvecm status

echo ""
echo "To join node 2 (run ON node 2):"
echo "  pvecm add ${NODE1_IP} --link0 10.42.0.32"
echo ""
echo "To join node 3 (run ON node 3):"
echo "  pvecm add ${NODE1_IP} --link0 10.42.0.33"
echo ""
echo "After all nodes join, verify with: pvecm status"
echo "Expected: 3 nodes, all online, quorate=yes"
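Why three nodes matter here: corosync quorum is a strict majority of votes, so a 3-node cluster tolerates exactly one node failure. The arithmetic, assuming the default of one vote per node:

```shell
# Majority quorum for a 3-node cluster (1 vote per node)
NODES=3
QUORUM=$(( NODES / 2 + 1 ))        # 2
TOLERATED=$(( NODES - QUORUM ))    # 1
echo "votes=${NODES} quorum=${QUORUM} tolerated_failures=${TOLERATED}"
```

A 2-node cluster gives quorum=2 and tolerates zero failures, which is why HA setups start at three nodes.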
/opt/argobox/proxmox/ha-group.sh
#!/usr/bin/env bash
set -euo pipefail

# HA Group and VM Assignment — Run after all nodes have joined
# This creates an HA group and assigns VMs to be managed by Proxmox HA.
# If a node fails, the HA manager will restart VMs on surviving nodes.

# Create HA group with all 3 nodes (priority: prefer pve1, then pve2, then pve3)
ha-manager groupadd prod-ha \
  --nodes pve1:2,pve2:1,pve3:1 \
  --restricted 1 \
  --nofailback 0 \
  --comment "Production HA group — all 3 nodes"

echo "HA group 'prod-ha' created."

# Assign VMs to the HA group
# Syntax: ha-manager set <type>:<vmid> --group <group> --state started --max_restart 3 --max_relocate 2
ha-manager set vm:100 --group prod-ha --state started --max_restart 3 --max_relocate 2
ha-manager set vm:101 --group prod-ha --state started --max_restart 3 --max_relocate 2
ha-manager set vm:102 --group prod-ha --state started --max_restart 3 --max_relocate 2

echo "VMs 100, 101, 102 assigned to HA group 'prod-ha'."

# Fencing: Proxmox HA fences failed nodes via a hardware watchdog (IPMI/iLO)
# or the softdog kernel module. Without working fencing, HA cannot safely
# restart VMs after a network partition.
# Note: `pvecm expected` lowers the expected vote count and is an emergency
# recovery tool only. Do not run it as part of normal setup:
#   pvecm expected 2   # only if a node is down and you must regain quorum

echo ""
echo "Verify HA status:"
ha-manager status
echo ""
echo "HA resources:"
ha-manager config
/opt/argobox/proxmox/ceph-setup.sh
#!/usr/bin/env bash
set -euo pipefail

# Ceph Setup on 3-Node Proxmox Cluster
# Run each section on the appropriate node (marked in comments).
# This gives you shared block storage visible from all cluster nodes.
#
# Prerequisites:
#   - Proxmox cluster already formed (pvecm status shows 3 nodes)
#   - Each node has at least 1 unused disk for Ceph OSDs
#   - 10.42.0.0/24 network used for both public and cluster Ceph traffic
#     (dedicated Ceph network recommended in production)

# --- Run on ALL 3 nodes ---
echo "Installing Ceph packages on $(hostname)..."
pveceph install --repository no-subscription

# --- Run on NODE 1 only ---
echo "Initializing Ceph on $(hostname)..."
pveceph init --network 10.42.0.0/24

# --- Run on ALL 3 nodes ---
echo "Creating Ceph monitor on $(hostname)..."
pveceph mon create

# Wait for monitors to reach quorum
echo "Waiting for Ceph monitor quorum..."
sleep 10
ceph mon stat

# --- Run on ALL 3 nodes (adjust disk per node) ---
# List available disks:
#   lsblk -d -o NAME,SIZE,MODEL,SERIAL | grep -v "sda"  (exclude OS disk)
# Replace /dev/sdb with your actual OSD disk on each node:
echo "Creating OSD on $(hostname) using /dev/sdb..."
pveceph osd create /dev/sdb

# --- Run on NODE 1 only (after all OSDs are created) ---
echo "Creating Ceph pool 'vm-pool' for VM storage..."
pveceph pool create vm-pool --pg_num 128 --size 3 --min_size 2

# Add pool as Proxmox storage
pvesm add rbd vm-pool \
  --pool vm-pool \
  --monhost 10.42.0.31,10.42.0.32,10.42.0.33 \
  --content images,rootdir \
  --krbd 0

echo ""
echo "Ceph status:"
ceph -s
echo ""
echo "OSD tree:"
ceph osd tree
echo ""
echo "Pool list:"
ceph osd pool ls detail
echo ""
echo "Ceph setup complete. You can now create VMs on the 'vm-pool' storage."
echo "The pool is accessible from all 3 cluster nodes."
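The `--pg_num 128` above follows the common placement-group sizing rule of thumb: roughly 100 PGs per OSD, divided by the replica count, rounded up to a power of two. As arithmetic (a guideline only; the PG autoscaler can adjust this later):

```shell
# PG sizing rule of thumb for the 3-OSD, size=3 pool above
OSDS=3; REPLICAS=3; TARGET_PER_OSD=100
RAW=$(( OSDS * TARGET_PER_OSD / REPLICAS ))   # 100
PG=1
while (( PG < RAW )); do PG=$(( PG * 2 )); done
echo "pg_num=${PG}"   # 128
```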

Deploy

# --- Phase 1: Create cluster (on node 1) ---
chmod +x /opt/argobox/proxmox/cluster-setup.sh
bash /opt/argobox/proxmox/cluster-setup.sh
# --- Phase 2: Join nodes (run on each additional node) ---
ssh [email protected] "pvecm add 10.42.0.31 --link0 10.42.0.32"
ssh [email protected] "pvecm add 10.42.0.31 --link0 10.42.0.33"
# Wait for all nodes to sync
pvecm status
# --- Phase 3: Set up Ceph (run sections on appropriate nodes) ---
bash /opt/argobox/proxmox/ceph-setup.sh  # see script for per-node instructions
# --- Phase 4: Configure HA ---
chmod +x /opt/argobox/proxmox/ha-group.sh
bash /opt/argobox/proxmox/ha-group.sh
# --- Phase 5: Test live migration ---
qm migrate 100 pve2 --online

Verify

# Cluster health — all 3 nodes should be online
pvecm status
pvecm nodes
# Ceph health — should show HEALTH_OK with 3 OSDs up
ceph -s
ceph osd tree
# HA status — all VMs should show "started"
ha-manager status
# Live migrate a VM and verify it stays running
qm migrate 100 pve2 --online
qm status 100  # should show "running" on pve2
# Simulate node failure — reboot one node and watch HA failover
ssh [email protected] "reboot"
# After the fencing delay (typically 1-2 minutes), VMs from pve2 should restart on pve1 or pve3:
ha-manager status
qm list

Rollback

# Remove HA assignments first
ha-manager remove vm:100
ha-manager remove vm:101
ha-manager remove vm:102
ha-manager groupremove prod-ha
# Destroy Ceph (DESTROYS ALL DATA ON CEPH POOL)
pveceph pool destroy vm-pool
ceph osd out 0 && ceph osd down 0 && ceph osd purge 0 --yes-i-really-mean-it
# Repeat OSD purge for each OSD (1, 2, etc.)
# Remove nodes from cluster (run on each node being removed)
pvecm delnode pve3  # run on pve1
pvecm delnode pve2  # run on pve1

Restore from Restic Backup

Validated restore procedure for Restic backups -- because untested backups are just hopes.

Intermediate 20-30 min docker
Prerequisites
  • Existing Restic repository (see the Automated Backup Pipeline playbook)
  • Target machine with Docker Engine + Compose v2
  • RESTIC_REPOSITORY and RESTIC_PASSWORD environment variables set or available
  • Enough free disk space for the restore (at least 1x the backup size)
/opt/argobox/restore/restore-procedure.sh
#!/usr/bin/env bash
set -euo pipefail

# Restic Restore Procedure
# This script walks through a full restore from a Restic backup.
# Set these before running:
#   export RESTIC_REPOSITORY="sftp:[email protected]:/backups/argobox"
#   export RESTIC_PASSWORD="your-restic-password"

RESTORE_TARGET="/tmp/argobox-restore"

echo "=== Step 1: Check repository health ==="
restic check
echo ""

echo "=== Step 2: List available snapshots ==="
restic snapshots --compact
echo ""

echo "=== Step 3: Show contents of latest snapshot ==="
restic ls latest --long | head -50
echo "(truncated — use 'restic ls latest' for full listing)"
echo ""

echo "=== Step 4: Restore latest snapshot ==="
echo "Restoring to ${RESTORE_TARGET}..."
mkdir -p "${RESTORE_TARGET}"
restic restore latest \
  --target "${RESTORE_TARGET}" \
  --include "/data/argobox" \
  --include "/data/docker-volumes"
echo ""

echo "=== Step 5: Verify restored file count ==="
RESTORED_FILES=$(find "${RESTORE_TARGET}" -type f | wc -l)
echo "Restored files: ${RESTORED_FILES}"
echo ""

echo "Restore complete. Files are in ${RESTORE_TARGET}"
echo ""
echo "To restore a SPECIFIC snapshot instead of latest:"
echo "  restic snapshots                     # find the snapshot ID"
echo "  restic restore abc123 --target /tmp/argobox-restore"
echo ""
echo "To browse a snapshot interactively (FUSE mount):"
echo "  mkdir -p /mnt/restic"
echo "  restic mount /mnt/restic &"
echo "  ls /mnt/restic/snapshots/"
echo "  # Browse freely, then unmount:"
echo "  fusermount -u /mnt/restic"
/opt/argobox/restore/restore-compose.sh
#!/usr/bin/env bash
set -euo pipefail

# Restore Docker Compose services from a Restic backup
# Run AFTER restore-procedure.sh has extracted files to RESTORE_TARGET.

RESTORE_TARGET="/tmp/argobox-restore"
ARGOBOX_ROOT="/opt/argobox"
DOCKER_VOLUMES="/var/lib/docker/volumes"

echo "=== Step 1: Stop running containers ==="
cd "${ARGOBOX_ROOT}"
for dir in stack monitoring backup; do
  if [[ -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" ]]; then
    echo "Stopping ${dir}..."
    docker compose -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" down || true
  fi
done
echo ""

echo "=== Step 2: Restore ArgoBox configs ==="
if [[ -d "${RESTORE_TARGET}/data/argobox" ]]; then
  rsync -av --backup --suffix=".pre-restore" \
    "${RESTORE_TARGET}/data/argobox/" "${ARGOBOX_ROOT}/"
  echo "Configs restored to ${ARGOBOX_ROOT}"
else
  echo "WARNING: No argobox configs found in restore."
fi
echo ""

echo "=== Step 3: Restore Docker volumes ==="
if [[ -d "${RESTORE_TARGET}/data/docker-volumes" ]]; then
  rsync -av --backup --suffix=".pre-restore" \
    "${RESTORE_TARGET}/data/docker-volumes/" "${DOCKER_VOLUMES}/"
  echo "Docker volumes restored."
else
  echo "WARNING: No docker volumes found in restore."
fi
echo ""

echo "=== Step 4: Fix ownership ==="
# Postgres needs its data owned by uid 999
if [[ -d "${DOCKER_VOLUMES}/stack_pgdata" ]]; then
  chown -R 999:999 "${DOCKER_VOLUMES}/stack_pgdata/_data/"
fi
# Grafana needs uid 472
if [[ -d "${DOCKER_VOLUMES}/monitoring_grafana_data" ]]; then
  chown -R 472:472 "${DOCKER_VOLUMES}/monitoring_grafana_data/_data/"
fi
echo ""

echo "=== Step 5: Restart containers ==="
for dir in stack monitoring backup; do
  if [[ -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" ]]; then
    echo "Starting ${dir}..."
    docker compose -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" up -d
  fi
done
echo ""

echo "Restore complete. Run verify-restore.sh to check service health."
/opt/argobox/restore/verify-restore.sh
#!/usr/bin/env bash
set -euo pipefail

# Post-Restore Verification Script
# Checks each service health endpoint after a restore.

PASS=0
FAIL=0

check() {
  local name="$1"
  local cmd="$2"
  echo -n "Checking ${name}... "
  if eval "${cmd}" &>/dev/null; then
    echo "OK"
    PASS=$((PASS + 1))
  else
    echo "FAILED"
    FAIL=$((FAIL + 1))
  fi
}

echo "=== Service Health Checks ==="
echo ""

# Docker containers running
check "Docker containers" "docker compose -f /opt/argobox/stack/docker-compose.yml ps --status running | grep -q 'running'"

# Traefik responding (assumes the API is published on :8080, i.e. --api.insecure=true)
check "Traefik dashboard" "curl -sf http://localhost:8080/api/overview"

# Postgres accepting connections
check "Postgres" "docker exec $(docker compose -f /opt/argobox/stack/docker-compose.yml ps -q postgres 2>/dev/null) pg_isready -U app -d app"

# Grafana API
check "Grafana" "curl -sf http://10.42.0.10:3000/api/health"

# Prometheus targets
check "Prometheus" "curl -sf http://10.42.0.10:9090/api/v1/targets"

# Alertmanager
check "Alertmanager" "curl -sf http://10.42.0.10:9093/api/v2/status"

echo ""
echo "=== Data Integrity Checks ==="
echo ""

# Postgres database integrity
check "Postgres integrity" "docker exec $(docker compose -f /opt/argobox/stack/docker-compose.yml ps -q postgres 2>/dev/null) psql -U app -d app -c 'SELECT 1'"

# Prometheus data directory not empty
check "Prometheus data" "docker exec $(docker compose -f /opt/argobox/monitoring/docker-compose.yml ps -q prometheus 2>/dev/null) ls /prometheus/wal"

# Grafana dashboards exist
check "Grafana dashboards" "curl -sf http://10.42.0.10:3000/api/search | grep -q 'title'"

echo ""
echo "=== Results ==="
echo "Passed: ${PASS}"
echo "Failed: ${FAIL}"

if [[ ${FAIL} -gt 0 ]]; then
  echo "Some checks failed. Investigate before considering the restore complete."
  exit 1
else
  echo "All checks passed. Restore verified."
fi

Deploy

# Set Restic repository credentials
export RESTIC_REPOSITORY="sftp:[email protected]:/backups/argobox"
export RESTIC_PASSWORD="your-restic-password"
# Step 1: Run the restore procedure (extracts files)
chmod +x /opt/argobox/restore/restore-procedure.sh
sudo -E bash /opt/argobox/restore/restore-procedure.sh
# Step 2: Restore and restart Docker services
chmod +x /opt/argobox/restore/restore-compose.sh
sudo bash /opt/argobox/restore/restore-compose.sh
# Step 3: Verify everything came back healthy
chmod +x /opt/argobox/restore/verify-restore.sh
sudo bash /opt/argobox/restore/verify-restore.sh
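Services restarted by step 2 may still be warming up when step 3 runs, so a failed check is not always a failed restore. A small retry helper (a pure-bash sketch; the `retry` name and the playbook scripts shipping without one are assumptions here) lets you poll any of these commands until a deadline:

```shell
# retry <attempts> <delay-seconds> <command...> : re-run a command until it
# succeeds or the attempt budget runs out. Hypothetical helper, not part of
# the restore scripts above.
retry() {
  local attempts="$1" delay="$2" i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    "$@" && return 0
    echo "attempt ${i}/${attempts} failed" >&2
    (( i < attempts )) && sleep "${delay}"
  done
  return 1
}

# Example: wait up to ~50s for Grafana to come back after the restore:
#   retry 10 5 curl -sf http://10.42.0.10:3000/api/health
```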

Verify

# Run the verification script
sudo bash /opt/argobox/restore/verify-restore.sh
# Manual spot checks
docker compose -f /opt/argobox/stack/docker-compose.yml ps
docker compose -f /opt/argobox/monitoring/docker-compose.yml ps
curl -sf http://10.42.0.10:3000/api/health | jq .
# Compare file counts against what Restic reports
restic stats latest  # restore-size (the default mode) reports total file count; raw-data only reports blob size
find /opt/argobox -type f | wc -l

Rollback

# If the restore went wrong, restore from a different snapshot
export RESTIC_REPOSITORY="sftp:[email protected]:/backups/argobox"
export RESTIC_PASSWORD="your-restic-password"
restic snapshots  # pick a different snapshot ID
restic restore <snapshot-id> --target /tmp/argobox-restore-v2
# Then re-run restore-compose.sh with the new RESTORE_TARGET
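If you would rather script the "pick a different snapshot" step than eyeball the table, an awk filter like this works as a sketch. It assumes restic's default table layout, where each data row starts with an 8-character hex short ID and rows are listed oldest-first:

```shell
# previous_snapshot: read `restic snapshots` table output on stdin and print
# the second-newest short ID. Assumes data rows start with an 8-hex-char ID.
previous_snapshot() {
  awk '$1 ~ /^[0-9a-f]+$/ && length($1) == 8 { ids[n++] = $1 } END { if (n >= 2) print ids[n-2] }'
}

# Usage (on a host with the Restic env vars exported):
#   snap=$(restic snapshots | previous_snapshot)
#   restic restore "$snap" --target /tmp/argobox-restore-v2
```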

Home Assistant + MQTT + IoT VLAN Isolation

Home Assistant with Mosquitto MQTT broker, isolated on an IoT VLAN with controlled access to the main network.

Intermediate 35-45 min docker
Prerequisites
  • Docker Engine + Compose v2
  • VLAN-capable switch (pairs with the VLAN segmentation playbook)
  • IoT VLAN (10.42.30.0/24) configured on the network
  • Zigbee/Z-Wave USB coordinator (optional, for Zigbee2MQTT)
/opt/argobox/homeassistant/docker-compose.yml
services:
  homeassistant:
    image: ghcr.io/home-assistant/home-assistant:stable
    container_name: homeassistant
    # network_mode: host is required for mDNS discovery of local devices
    # (Chromecast, Sonos, ESPHome, etc.)
    network_mode: host
    environment:
      TZ: America/Chicago
    volumes:
      - ha_config:/config
      - /run/dbus:/run/dbus:ro
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8123/api/"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

  mosquitto:
    image: eclipse-mosquitto:2
    container_name: mosquitto
    ports:
      - "1883:1883"
      - "9001:9001"
    volumes:
      - ./mosquitto/mosquitto.conf:/mosquitto/config/mosquitto.conf:ro
      - ./mosquitto/acl.conf:/mosquitto/config/acl.conf:ro
      - ./mosquitto/password_file:/mosquitto/config/password_file:ro
      - mosquitto_data:/mosquitto/data
      - mosquitto_log:/mosquitto/log
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "mosquitto_sub", "-h", "localhost", "-t", "\$$SYS/broker/uptime", "-C", "1", "-W", "3"]
      interval: 30s
      timeout: 10s
      retries: 3

  zigbee2mqtt:
    image: koenkk/zigbee2mqtt:latest
    container_name: zigbee2mqtt
    depends_on:
      mosquitto:
        condition: service_healthy
    ports:
      - "8082:8080"
    environment:
      TZ: America/Chicago
    volumes:
      - z2m_data:/app/data
    # Uncomment and set the correct USB device path for your coordinator:
    # devices:
    #   - /dev/ttyUSB0:/dev/ttyACM0
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "wget", "--spider", "-q", "http://localhost:8080"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

volumes:
  ha_config:
  mosquitto_data:
  mosquitto_log:
  z2m_data:
/opt/argobox/homeassistant/mosquitto/mosquitto.conf
# Mosquitto MQTT Broker Configuration
# Docs: https://mosquitto.org/man/mosquitto-conf-5.html

# Persistence — retain messages and subscriptions across restarts
persistence true
persistence_location /mosquitto/data/

# Logging
log_dest stdout
log_type error
log_type warning
log_type notice
log_timestamp true
log_timestamp_format %Y-%m-%dT%H:%M:%S

# Listener on port 1883 (unencrypted, LAN only)
listener 1883
protocol mqtt

# WebSocket listener on port 9001 (for browser-based clients)
listener 9001
protocol websockets

# Authentication — require username/password
allow_anonymous false
password_file /mosquitto/config/password_file

# ACL — per-user topic restrictions
acl_file /mosquitto/config/acl.conf

# Connection limits
max_connections 100
max_inflight_messages 20
max_queued_messages 1000

# Keep-alive
max_keepalive 120
/opt/argobox/homeassistant/mosquitto/acl.conf
# Mosquitto ACL Configuration
# Restricts which users/clients can read/write which MQTT topics.
# Docs: https://mosquitto.org/man/mosquitto-conf-5.html#idm484

# Home Assistant — full read/write access to all topics
user homeassistant
topic readwrite #

# Zigbee2MQTT — full access to its own namespace + homeassistant discovery
user zigbee2mqtt
topic readwrite zigbee2mqtt/#
topic readwrite homeassistant/#

# IoT sensors — can only publish their own state, read commands
# Pattern: sensors/<device-id>/state (write)
#          homeassistant/<device-id>/command (read)
user iot-sensors
topic write sensors/+/state
topic write sensors/+/availability
topic read homeassistant/+/command

# Temperature/humidity sensors
user env-sensors
topic write sensors/environment/+
topic read homeassistant/climate/+/command

# Deny everything else by default (no "topic" line = deny)
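As a mental model for how the broker applies these patterns: `+` matches exactly one topic level, `#` matches everything from that level down. A rough bash approximation (illustrative only; it ignores spec corner cases such as `#` also matching its parent level, and topics containing regex metacharacters):

```shell
# topic_matches <filter> <topic> : approximate MQTT filter matching.
# '+' matches one topic level, '#' matches any remaining levels.
topic_matches() {
  local filter="$1" topic="$2" re
  re="${filter//+/[^/]+}"   # '+' -> one non-slash segment
  re="${re//\#/.*}"         # '#' -> anything to the end
  [[ "$topic" =~ ^${re}$ ]]
}

# e.g. the iot-sensors line `topic write sensors/+/state` permits:
#   topic_matches "sensors/+/state" "sensors/kitchen-temp/state"   # yes
#   topic_matches "sensors/+/state" "sensors/a/b/state"            # no
```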
/opt/argobox/homeassistant/nftables-iot.conf
#!/usr/sbin/nft -f

# IoT VLAN firewall rules
# Allows IoT devices (10.42.30.0/24) to reach ONLY the MQTT broker.
# Blocks all other access to the main LAN and management networks.
# This pairs with the VLAN Segmentation playbook.

table inet iot_isolation {

    chain forward {
        type filter hook forward priority 10; policy accept;

        # Allow IoT VLAN to reach MQTT broker on the Docker host (port 1883)
        iifname "eth0.30" ip daddr 10.42.0.10 tcp dport 1883 accept

        # Allow IoT VLAN to reach Home Assistant (port 8123) for direct integrations
        iifname "eth0.30" ip daddr 10.42.0.10 tcp dport 8123 accept

        # Allow IoT devices to talk to each other (mDNS, local protocols)
        iifname "eth0.30" oifname "eth0.30" accept

        # Allow established/related return traffic
        iifname "eth0.30" ct state established,related accept

        # Block IoT VLAN from reaching management VLAN
        iifname "eth0.30" ip daddr 10.42.10.0/24 drop

        # Block IoT VLAN from reaching services VLAN (except the rules above)
        iifname "eth0.30" ip daddr 10.42.20.0/24 drop

        # Block IoT from reaching Proxmox hosts directly
        iifname "eth0.30" ip daddr { 10.42.0.31, 10.42.0.32, 10.42.0.33 } drop

        # Allow IoT VLAN internet access (for firmware updates, NTP)
        iifname "eth0.30" oifname "eth0" ip daddr != { 10.42.0.0/16 } accept

        # Log and drop anything else from IoT VLAN
        iifname "eth0.30" log prefix "IOT-DROP: " counter drop
    }
}
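When extending these rules, it helps to double-check which source addresses a CIDR actually covers before committing a drop rule. A quick pure-bash membership test (illustrative helper, not an nft feature):

```shell
# ip_in_cidr <ip> <cidr> : succeed if the IPv4 address falls inside the CIDR.
ip_to_int() {
  local IFS=. a b c d
  read -r a b c d <<< "$1"
  echo $(( (a << 24) | (b << 16) | (c << 8) | d ))
}

ip_in_cidr() {
  local ip="$1" cidr="$2"
  local net="${cidr%/*}" bits="${cidr#*/}"
  local mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  (( ($(ip_to_int "$ip") & mask) == ($(ip_to_int "$net") & mask) ))
}

# e.g. confirm a sensor at 10.42.30.17 is covered by the IoT VLAN rules:
#   ip_in_cidr 10.42.30.17 10.42.30.0/24   # yes
#   ip_in_cidr 10.42.10.5  10.42.30.0/24   # no (management VLAN)
```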

Deploy

# Create directory structure
mkdir -p /opt/argobox/homeassistant/mosquitto
# Create MQTT password file (generates hashed passwords)
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
  mosquitto_passwd -c -b /mosquitto/config/password_file homeassistant "replace-with-ha-mqtt-password"
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
  mosquitto_passwd -b /mosquitto/config/password_file zigbee2mqtt "replace-with-z2m-mqtt-password"
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
  mosquitto_passwd -b /mosquitto/config/password_file iot-sensors "replace-with-sensor-password"
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
  mosquitto_passwd -b /mosquitto/config/password_file env-sensors "replace-with-env-sensor-password"
# Set permissions on password file (the eclipse-mosquitto image runs the broker
# as uid 1883, which must be able to read it)
chmod 600 /opt/argobox/homeassistant/mosquitto/password_file
sudo chown 1883:1883 /opt/argobox/homeassistant/mosquitto/password_file
# Start the stack
cd /opt/argobox/homeassistant && docker compose up -d
# Wait for Home Assistant to finish initial setup
echo "Home Assistant is starting at http://10.42.0.10:8123 — initial setup takes 1-2 minutes."
# Configure MQTT integration in Home Assistant:
#   Settings > Devices & Services > Add Integration > MQTT
#   Broker: 10.42.0.10, Port: 1883, User: homeassistant, Password: <from above>
# Apply IoT VLAN firewall rules (if using VLAN segmentation)
sudo nft -f /opt/argobox/homeassistant/nftables-iot.conf

Verify

# Check all containers are running and healthy
docker compose -f /opt/argobox/homeassistant/docker-compose.yml ps
# Test MQTT publish/subscribe (publish retained with -r, otherwise the
# subscribe that runs afterwards never sees the already-delivered message)
docker exec mosquitto mosquitto_pub -h localhost -u homeassistant -P "replace-with-ha-mqtt-password" -t "test/topic" -m "hello" -r
docker exec mosquitto mosquitto_sub -h localhost -u homeassistant -P "replace-with-ha-mqtt-password" -t "test/topic" -C 1 -W 5
# Clear the retained test message
docker exec mosquitto mosquitto_pub -h localhost -u homeassistant -P "replace-with-ha-mqtt-password" -t "test/topic" -r -n
# Verify Home Assistant is responding
curl -sf http://10.42.0.10:8123/api/ | head -1
# Verify Zigbee2MQTT frontend
curl -sf http://10.42.0.10:8082/ | head -1
# Test IoT VLAN isolation (from a device on 10.42.30.0/24):
# ping 10.42.0.10    # Should FAIL (blocked by firewall)
# But MQTT should work:
# mosquitto_pub -h 10.42.0.10 -p 1883 -u iot-sensors -P "<password>" -t "sensors/test/state" -m "25.3"
# Verify the sensor cannot reach Proxmox:
# ping 10.42.0.31   # Should FAIL (IoT blocked from management)
# Check firewall counters
sudo nft list table inet iot_isolation
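On an IoT-side device that has no mosquitto clients installed, bash's `/dev/tcp` gives a quick port-level reachability probe (a sketch: it requires bash and `timeout`, and only exercises the TCP handshake, not MQTT auth):

```shell
# check_tcp <host> <port> : succeed if a TCP connection can be opened.
check_tcp() {
  timeout 2 bash -c ">/dev/tcp/$1/$2" 2>/dev/null
}

# From a device on 10.42.30.0/24 you would expect:
#   check_tcp 10.42.0.10 1883 && echo "MQTT reachable"    # allowed by nft rules
#   check_tcp 10.42.0.10 8123 && echo "HA reachable"      # allowed
#   check_tcp 10.42.0.31 22   || echo "Proxmox blocked"   # dropped, as intended
```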

Rollback

cd /opt/argobox/homeassistant && docker compose down -v
# Remove IoT firewall rules
sudo nft delete table inet iot_isolation
rm -rf /opt/argobox/homeassistant

Longhorn Distributed Storage + Backups

Distributed block storage for K3s with automatic backups to NFS, recurring snapshots, and a tested disaster-recovery path.

Intermediate 30-40 min kubernetes
Prerequisites
  • K3s cluster with 2+ nodes
  • NFS server for backup target (e.g., 10.42.0.20:/backups/longhorn)
  • open-iscsi installed on all worker nodes
  • Helm v3 installed
/opt/argobox/k8s/longhorn/longhorn-values.yaml
# Longhorn Helm values for K3s homelab
# Docs: https://longhorn.io/docs/

defaultSettings:
  # Backup target — NFS share for storing volume backups
  backupTarget: nfs://10.42.0.20:/backups/longhorn

  # Default replica count — 2 for a small cluster (minimum 2 nodes)
  defaultReplicaCount: 2

  # Data locality — prefer keeping data on the node that uses it
  defaultDataLocality: best-effort

  # Storage over-provisioning — allow 200% of physical capacity
  storageOverProvisioningPercentage: 200

  # Minimum storage available before Longhorn stops scheduling
  storageMinimalAvailablePercentage: 15

  # Auto-delete workload pod when volume is detached unexpectedly
  autoDeletePodWhenVolumeDetachedUnexpectedly: true

  # Guaranteed instance manager CPU (millicores per node)
  guaranteedInstanceManagerCPU: 12

  # Replica auto-balance across nodes
  replicaAutoBalance: best-effort

persistence:
  # Set Longhorn as the default StorageClass
  defaultClass: true
  defaultClassReplicaCount: 2
  reclaimPolicy: Retain

ingress:
  enabled: true
  ingressClassName: traefik
  host: longhorn.argobox.com
  tls: true
  tlsSecret: longhorn-tls
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod

longhornUI:
  replicas: 1

# Prometheus ServiceMonitor for metrics collection
metrics:
  serviceMonitor:
    enabled: true
    additionalLabels:
      release: kube-prometheus
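To sanity-check what those two percentages mean in practice, the arithmetic is simple. Using a hypothetical 500 GiB node (the number is illustrative, not from the playbook) with the 200%/15% settings above:

```shell
# Longhorn scheduling headroom for one node (illustrative arithmetic).
node_capacity_gib=500       # hypothetical physical disk on a node
over_provisioning_pct=200   # storageOverProvisioningPercentage
min_available_pct=15        # storageMinimalAvailablePercentage

# Replica allocations may total up to 2x physical capacity, betting that
# thin-provisioned volumes never all fill at once:
echo "Max schedulable: $(( node_capacity_gib * over_provisioning_pct / 100 )) GiB"

# ...but scheduling stops once actual free space drops below 15% of capacity:
echo "Scheduling floor: $(( node_capacity_gib * min_available_pct / 100 )) GiB free"
```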
/opt/argobox/k8s/longhorn/recurring-job.yaml
# Longhorn Recurring Jobs — automatic snapshots and backups
# These run on all volumes with the matching label (or set as default).

# Hourly snapshots — keep 24 (local only, fast rollback)
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: snapshot-hourly
  namespace: longhorn-system
spec:
  name: snapshot-hourly
  task: snapshot
  cron: "0 * * * *"
  retain: 24
  concurrency: 2
  labels:
    recurring-job.longhorn.io/source: system
  groups:
    - default
---
# Daily backups to NFS — keep 7
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-daily
  namespace: longhorn-system
spec:
  name: backup-daily
  task: backup
  cron: "0 2 * * *"
  retain: 7
  concurrency: 1
  labels:
    recurring-job.longhorn.io/source: system
  groups:
    - default
---
# Weekly backups to NFS — keep 4
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
  name: backup-weekly
  namespace: longhorn-system
spec:
  name: backup-weekly
  task: backup
  cron: "0 3 * * 0"
  retain: 4
  concurrency: 1
  labels:
    recurring-job.longhorn.io/source: system
  groups:
    - default
/opt/argobox/k8s/longhorn/restore-test.yaml
# Restore a Longhorn volume from backup
# First, find the backup URL in the Longhorn UI:
#   Backup > select volume > select backup > copy "Backup URL"
# Or via kubectl:
#   kubectl -n longhorn-system get backups.longhorn.io

# Longhorn restores via a dedicated StorageClass: any PVC that uses it is
# provisioned from the backup named in the fromBackup parameter.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-restore
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  # Replace with the actual backup URL from Longhorn
  fromBackup: "nfs://10.42.0.20:/backups/longhorn/default-pvc-abc123?backup=backup-xyz789"
---
# PVC that restores from the backup above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-restore
  resources:
    requests:
      storage: 10Gi
---
# Test pod that mounts the restored volume and verifies data
apiVersion: v1
kind: Pod
metadata:
  name: restore-verify
  namespace: default
spec:
  containers:
    - name: verify
      image: busybox:1.36
      command:
        - /bin/sh
        - -c
        - |
          echo "=== Restore Verification ==="
          echo "Volume contents:"
          ls -la /data/
          echo ""
          echo "File count:"
          find /data -type f | wc -l
          echo ""
          echo "Disk usage:"
          du -sh /data/
          echo ""
          echo "Verification complete. Pod will stay running for inspection."
          echo "Delete with: kubectl delete pod restore-verify"
          sleep 3600
      volumeMounts:
        - name: restored-data
          mountPath: /data
  volumes:
    - name: restored-data
      persistentVolumeClaim:
        claimName: restored-data
  restartPolicy: Never

Deploy

# Install open-iscsi on all worker nodes (required for Longhorn)
ssh [email protected] "sudo apt install -y open-iscsi && sudo systemctl enable --now iscsid"
ssh [email protected] "sudo apt install -y open-iscsi && sudo systemctl enable --now iscsid"
# Add the Longhorn Helm repo
helm repo add longhorn https://charts.longhorn.io && helm repo update
# Create the namespace
kubectl create namespace longhorn-system --dry-run=client -o yaml | kubectl apply -f -
# Install Longhorn with Helm
helm upgrade --install longhorn longhorn/longhorn \
  --namespace longhorn-system \
  --values /opt/argobox/k8s/longhorn/longhorn-values.yaml \
  --wait --timeout 10m
# Wait for all Longhorn pods to be ready
kubectl wait --for=condition=Ready pods --all -n longhorn-system --timeout=300s
# Apply recurring snapshot/backup jobs
kubectl apply -f /opt/argobox/k8s/longhorn/recurring-job.yaml
# Create a test PVC to verify Longhorn is working
kubectl apply -f - <<TESTEOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: longhorn-test
  namespace: default
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: longhorn
  resources:
    requests:
      storage: 1Gi
TESTEOF
# Trigger a manual backup of the test volume from the Longhorn UI
# (Longhorn UI: longhorn.argobox.com > Volume > longhorn-test > Create Backup)
# Test restore from backup
# kubectl apply -f /opt/argobox/k8s/longhorn/restore-test.yaml

Verify

# Check all Longhorn pods are running
kubectl get pods -n longhorn-system
# List Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
# Check Longhorn nodes (all should show "Ready")
kubectl get nodes.longhorn.io -n longhorn-system
# Verify recurring jobs are configured
kubectl get recurringjobs.longhorn.io -n longhorn-system
# Check the Longhorn UI for replica health
echo "Longhorn UI: https://longhorn.argobox.com"
# Verify backups exist on the NFS target
ls -la /backups/longhorn/  # run on the NFS server (10.42.0.20)
# Test the restored volume (if restore-test.yaml was applied)
kubectl logs restore-verify -n default
# Verify the default StorageClass is Longhorn
kubectl get storageclass

Rollback

# Delete test resources
kubectl delete pvc longhorn-test -n default --ignore-not-found
kubectl delete -f /opt/argobox/k8s/longhorn/restore-test.yaml --ignore-not-found
# Delete recurring jobs
kubectl delete -f /opt/argobox/k8s/longhorn/recurring-job.yaml --ignore-not-found
# Uninstall Longhorn (this deletes all Longhorn volumes — data loss)
helm uninstall longhorn -n longhorn-system
# Delete Longhorn CRDs
kubectl delete crd -l longhorn-manager
# Remove Longhorn data from all nodes
# ssh [email protected] "sudo rm -rf /var/lib/longhorn"
# ssh [email protected] "sudo rm -rf /var/lib/longhorn"
kubectl delete namespace longhorn-system