Real syntax, real deployment flow: each playbook includes complete files plus explicit
deploy, verify, and rollback steps.
16 hands-on playbooks · 74 blog posts · 87 journal entries
Source Method
Patterns here are aligned to primary documentation and adapted for ArgoBox-style infrastructure.
Use these as starting templates, then tune hostnames, ports, and auth to your environment.
Password auth disabled, limited attack surface, modern cipher suite, group-restricted login, and automatic ban rules with incremental ban times for repeated auth failures.
Intermediate · 25-35 min · security
Prerequisites
Console/physical recovery path (IPMI, Proxmox console, or physical keyboard)
At least one tested SSH public key (ed25519 preferred)
Root/sudo access
fail2ban package available in repos (apt, dnf, or emerge)
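If you don't yet have a tested ed25519 key, a minimal sketch of generating one before you disable password auth (the key path and comment are placeholders; the empty passphrase is purely for illustration, use a real passphrase in practice):

```shell
# Generate an ed25519 keypair if one doesn't exist yet.
# KEYFILE is a placeholder path; adjust for your environment.
KEYFILE="${KEYFILE:-$HOME/.ssh/id_ed25519_argobox}"
mkdir -p "$(dirname "${KEYFILE}")"
if [ ! -f "${KEYFILE}" ]; then
  # -a 100 increases KDF rounds; -N "" means no passphrase (demo only)
  ssh-keygen -t ed25519 -a 100 -N "" -C "argobox-access" -f "${KEYFILE}"
fi
# Copy the public key to the target host BEFORE disabling password auth:
echo "Next: ssh-copy-id -i ${KEYFILE}.pub <user>@<host>"
```

Test the key with a fresh login in a second terminal before reloading sshd.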
/etc/ssh/sshd_config.d/99-hardening.conf
# ArgoBox SSH hardening — drop-in config
# Place in /etc/ssh/sshd_config.d/ to override defaults
# Test BEFORE reloading: sshd -t
Port 22
AddressFamily inet
ListenAddress 0.0.0.0
# Authentication
PermitRootLogin no
PubkeyAuthentication yes
PasswordAuthentication no
KbdInteractiveAuthentication no
ChallengeResponseAuthentication no
UsePAM yes
AuthenticationMethods publickey
# Restrict login to members of the 'ssh-users' group
# Add your user first: sudo groupadd ssh-users && sudo usermod -aG ssh-users commander
AllowGroups ssh-users
# Modern ciphers only — disable anything CBC or SHA1-based
Ciphers chacha20-poly1305@openssh.com,aes256-gcm@openssh.com,aes128-gcm@openssh.com
MACs hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com
KexAlgorithms sntrup761x25519-sha256@openssh.com,curve25519-sha256,curve25519-sha256@libssh.org
HostKeyAlgorithms ssh-ed25519,rsa-sha2-512,rsa-sha2-256
# Forwarding — disable everything not needed
AllowAgentForwarding no
AllowTcpForwarding no
X11Forwarding no
PermitTunnel no
AllowStreamLocalForwarding no
GatewayPorts no
PermitUserEnvironment no
# Session limits
MaxAuthTries 3
MaxSessions 5
LoginGraceTime 30
ClientAliveInterval 300
ClientAliveCountMax 2
# Logging
LogLevel VERBOSE
/etc/fail2ban/jail.d/sshd.local
[sshd]
enabled = true
backend = systemd
port = ssh
filter = sshd
logpath = %(sshd_log)s
# Ban after 4 failed attempts within 10 minutes
maxretry = 4
findtime = 10m
bantime = 1h
# Incremental bans — each repeat offense doubles the ban
# 1h -> 2h -> 4h -> 8h ... up to maxbantime
bantime.increment = true
bantime.multipliers = 1 2 4 8 16
bantime.maxtime = 4w
bantime.rndtime = 5m
# Action: ban via nftables and send email with whois + log lines
action = %(action_mwl)s
/etc/fail2ban/jail.local
# Global Fail2ban settings — applies to all jails
[DEFAULT]
# Ban method: nftables (preferred) or iptables
banaction = nftables-multiport
banaction_allports = nftables-allports
# Email notifications — set your mail relay and destination
destemail = [email protected]
sender = [email protected]
mta = sendmail
# Default ban parameters (jails can override)
bantime = 1h
findtime = 10m
maxretry = 5
# Ignore local and Tailscale ranges
ignoreip = 127.0.0.1/8 ::1 10.42.0.0/24 100.64.0.0/10
[sshd]
enabled = true
# Traefik auth failures (optional — enable if Traefik exposes basic auth)
[traefik-auth]
enabled = true
port = http,https
filter = traefik-auth
logpath = /var/log/traefik/access.log
maxretry = 5
findtime = 5m
bantime = 30m
bantime.increment = true
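The traefik-auth jail above references a filter that fail2ban does not ship by default. A hedged sketch of /etc/fail2ban/filter.d/traefik-auth.conf, assuming Traefik writes its access log in the default CLF-style format (adjust the regex if you log JSON):

```ini
# /etc/fail2ban/filter.d/traefik-auth.conf
# Matches 401 responses in a CLF-style Traefik access log.
[Definition]
failregex = ^<HOST> \S+ \S+ \[.*\] "\S+ \S+ \S+" 401\b
ignoreregex =
```

Validate against real log lines before enabling the jail: `fail2ban-regex /var/log/traefik/access.log /etc/fail2ban/filter.d/traefik-auth.conf`.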
Deploy
# Create the ssh-users group and add your account
sudo groupadd -f ssh-users
sudo usermod -aG ssh-users commander
# Install fail2ban if not present
sudo apt install -y fail2ban || sudo dnf install -y fail2ban || sudo emerge --ask net-analyzer/fail2ban
# Copy config files into place
sudo cp 99-hardening.conf /etc/ssh/sshd_config.d/
sudo cp sshd.local /etc/fail2ban/jail.d/
sudo cp jail.local /etc/fail2ban/
# Validate sshd config before reloading (CRITICAL — a bad config locks you out)
sudo sshd -t
# Reload sshd to apply changes (the unit is named 'ssh' on Debian/Ubuntu)
sudo systemctl reload sshd || sudo systemctl reload ssh || sudo rc-service sshd reload
# Enable and start fail2ban
sudo systemctl enable --now fail2ban || (sudo rc-update add fail2ban default && sudo rc-service fail2ban start)
# Verify fail2ban picked up the SSH jail
sudo fail2ban-client status sshd
Verify
# Confirm password auth is rejected
ssh -o PreferredAuthentications=password -o PubkeyAuthentication=no [email protected] # should fail immediately
# Confirm pubkey auth still works (run from a machine with your key)
ssh -T [email protected]
# Check which ciphers the server offers (should only show modern ones)
ssh -vv [email protected] 2>&1 | grep "kex:" | head -5
# Verify fail2ban is running with the SSH jail active
sudo fail2ban-client status sshd
# Check auth log for recent activity
sudo journalctl -u sshd --since "1 hour ago" --no-pager | tail -20
# Deliberately trigger a ban (from a test IP, not your current session)
# Run 5 bad password attempts from another machine, then:
sudo fail2ban-client status sshd # should show the test IP in "Banned IP list"
# Confirm unban happens after bantime expires, or manually unban:
sudo fail2ban-client set sshd unbanip <test-ip>
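Unlike the other playbooks on this page, no Rollback section is shown above; a minimal sketch, assuming the file paths used in Deploy (keep a working console session open until you've confirmed SSH access):

```shell
# Remove the hardening drop-in and fall back to distro defaults
sudo rm /etc/ssh/sshd_config.d/99-hardening.conf
sudo sshd -t && sudo systemctl reload sshd
# Stop banning and remove the jail configs
sudo systemctl disable --now fail2ban
sudo rm -f /etc/fail2ban/jail.d/sshd.local /etc/fail2ban/jail.local
```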
LAN subnets exposed through a controlled router node with auto-approved routes, exit node capability, IP forwarding, MagicDNS, and explicit ACL ownership.
Intermediate · 20-30 min · networking
Prerequisites
Tailscale installed on router node (v1.56+)
Admin access to Tailscale admin console or policy file (Settings > Access Controls)
Known LAN CIDR (e.g., 10.42.0.0/24)
Root/sudo access on the router node
Troubleshooting: if routes are advertised but not reachable, check that ip_forward is enabled and the router node firewall allows forwarding between tailscale0 and your LAN interface
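A quick diagnostic for that symptom; it reads the kernel forwarding flags directly (the interface names in the comments are assumptions, substitute your LAN NIC):

```shell
# Both flags should print 1 on a working subnet router / exit node
fwd4=$(cat /proc/sys/net/ipv4/ip_forward)
fwd6=$(cat /proc/sys/net/ipv6/conf/all/forwarding 2>/dev/null || echo "n/a")
echo "ipv4 forward: ${fwd4}"
echo "ipv6 forward: ${fwd6}"
# If forwarding is on but LAN hosts are still unreachable, check that the
# firewall allows forwarding between tailscale0 and the LAN interface:
#   sudo nft list ruleset | grep -A5 'chain forward'
#   sudo iptables -L FORWARD -v -n
```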
/etc/tailscale/bootstrap-subnet-router.sh
#!/usr/bin/env bash
set -euo pipefail
# Tailscale subnet router bootstrap script
# This node will advertise local LAN routes and act as an exit node
# for remote clients who want full internet-via-homelab routing.
LAN_CIDR="10.42.0.0/24"
# Apply sysctl for IP forwarding (also persisted in /etc/sysctl.d/99-tailscale.conf)
sudo sysctl -w net.ipv4.ip_forward=1
sudo sysctl -w net.ipv6.conf.all.forwarding=1
tailscale up \
--ssh \
--advertise-routes="${LAN_CIDR}" \
--advertise-exit-node \
--accept-routes=false \
--accept-dns=true \
--hostname=argobox-subnet-router
echo ""
echo "Subnet router is advertising ${LAN_CIDR} and exit node capability."
echo ""
echo "If autoApprovers is configured in your ACL policy, routes will"
echo "be approved automatically. Otherwise, approve them manually:"
echo " Tailscale Admin Console > Machines > this node > Edit route settings"
echo ""
echo "To use this as an exit node from a remote device:"
echo " tailscale up --exit-node=argobox-subnet-router"
tailscale-acl.json
{
"tagOwners": {
"tag:subnet-router": ["autogroup:admin"],
"tag:exit-node": ["autogroup:admin"],
"tag:server": ["autogroup:admin"]
},
"groups": {
"group:admins": ["[email protected]"]
},
"acls": [
{
"action": "accept",
"src": ["group:admins"],
"dst": ["*:*"],
"comment": "Admins can reach everything on the tailnet and advertised subnets"
},
{
"action": "accept",
"src": ["tag:server"],
"dst": ["tag:server:*"],
"comment": "Servers can talk to each other (inter-node traffic)"
},
{
"action": "accept",
"src": ["group:admins"],
"dst": ["10.42.0.0/24:*"],
"comment": "Admins can reach the entire LAN via subnet router"
}
],
"autoApprovers": {
"routes": {
"10.42.0.0/24": ["tag:subnet-router"],
"comment": "Auto-approve LAN subnet routes from tagged routers"
},
"exitNode": ["tag:exit-node"],
"comment": "Auto-approve exit node capability from tagged nodes"
},
"ssh": [
{
"action": "accept",
"src": ["group:admins"],
"dst": ["tag:server"],
"users": ["commander", "root"]
}
],
"dns": {
"nameservers": ["10.42.0.1"],
"domains": ["argobox.tail"],
"magicDNS": true
}
}
/etc/sysctl.d/99-tailscale.conf
# Required for Tailscale subnet routing and exit node functionality.
# Without these, the kernel will drop forwarded packets silently.
# Apply immediately: sudo sysctl -p /etc/sysctl.d/99-tailscale.conf
net.ipv4.ip_forward = 1
net.ipv6.conf.all.forwarding = 1
Deploy
# Persist IP forwarding settings
sudo cp 99-tailscale.conf /etc/sysctl.d/
sudo sysctl -p /etc/sysctl.d/99-tailscale.conf
# Tag this machine in the admin console as tag:subnet-router and tag:exit-node
# (or use --advertise-tags if your ACL allows self-tagging)
# Apply the ACL policy in Tailscale Admin Console > Access Controls
# Paste the contents of tailscale-acl.json and save
# Run the bootstrap script
chmod +x /etc/tailscale/bootstrap-subnet-router.sh
sudo /etc/tailscale/bootstrap-subnet-router.sh
# Verify the routes appear as approved (not "awaiting approval")
tailscale status
Verify
# Confirm this node is advertising routes and exit node
tailscale status --json | jq "{ routes: .Self.AllowedIPs, exitNode: .Self.ExitNode, online: .Self.Online }"
# List all peers and their status
tailscale status --peers
# From a REMOTE device on the tailnet, test subnet route access:
ping -c 3 10.42.0.1 # Should reach the LAN gateway via subnet route
ssh [email protected] # Should reach a LAN host via subnet route
# Test MagicDNS resolution (from any tailnet device)
dig argobox-subnet-router.argobox.tail
tailscale ping argobox-subnet-router
# Verify IP forwarding is active on the router node
sysctl net.ipv4.ip_forward # Should show = 1
# Test exit node from a remote client:
# tailscale up --exit-node=argobox-subnet-router
# curl ifconfig.me # Should show the homelab public IP
Rollback
# Remove subnet routes and exit node
tailscale up --advertise-routes= --advertise-exit-node=false --reset
# Disable IP forwarding if no longer needed
sudo rm /etc/sysctl.d/99-tailscale.conf
sudo sysctl -w net.ipv4.ip_forward=0
sudo sysctl -w net.ipv6.conf.all.forwarding=0
tailscale status
#!/usr/bin/env bash
set -euo pipefail
# Worker/agent node — joins the cluster as a workload runner only
K3S_TOKEN="replace-with-a-long-random-token"
K3S_URL="https://10.42.0.49:6443" # Point at VIP or load balancer
NODE_IP="$(hostname -I | awk '{print $1}')"
curl -sfL https://get.k3s.io | sh -s - agent \
--server "${K3S_URL}" \
--token "${K3S_TOKEN}" \
--node-ip "${NODE_IP}"
echo "Agent node joined the cluster."
Deploy
bash /opt/argobox/k3s/k3s-init.sh # Run on first server
bash /opt/argobox/k3s/k3s-join.sh # Run on second and third servers
bash /opt/argobox/k3s/k3s-agent.sh # Run on any worker nodes
kubectl get nodes -o wide
Verify
kubectl get nodes -o wide
kubectl get pods -A
kubectl get endpoints kubernetes -o yaml # Should list all server IPs
# Test HA: stop k3s on one server, confirm API still responds
ssh 10.42.0.51 "systemctl stop k3s"
kubectl get nodes # Downed node shows NotReady
Rollback
/usr/local/bin/k3s-uninstall.sh # Run on each server node
/usr/local/bin/k3s-agent-uninstall.sh # Run on each agent node
kubectl get pods -n argocd
kubectl get applications -n argocd
kubectl -n argocd get application homelab-apps -o jsonpath="{.status.sync.status}"
kubectl -n argocd get application homelab-apps -o jsonpath="{.status.health.status}"
# Or via CLI:
argocd app list --server argocd.argobox.com
#!/usr/bin/env bash
set -euo pipefail
# Verify the repository is accessible and the most recent snapshot
# is less than 26 hours old (allows for a daily schedule with margin)
MAX_AGE_HOURS=26
LATEST_SNAPSHOT=$(restic snapshots --json --latest 1 2>/dev/null)
if [[ -z "${LATEST_SNAPSHOT}" || "${LATEST_SNAPSHOT}" == "[]" ]]; then
echo "No snapshots found"
exit 1
fi
SNAPSHOT_TIME=$(echo "${LATEST_SNAPSHOT}" | jq -r '.[0].time')
SNAPSHOT_EPOCH=$(date -d "${SNAPSHOT_TIME}" +%s)
NOW_EPOCH=$(date +%s)
AGE_HOURS=$(( (NOW_EPOCH - SNAPSHOT_EPOCH) / 3600 ))
if (( AGE_HOURS > MAX_AGE_HOURS )); then
echo "Latest snapshot is ${AGE_HOURS}h old (max: ${MAX_AGE_HOURS}h)"
exit 1
fi
echo "Healthy: latest snapshot is ${AGE_HOURS}h old"
exit 0
Deploy
mkdir -p /opt/argobox/backup
chmod 600 /opt/argobox/backup/backup.env
chmod +x /opt/argobox/backup/healthcheck.sh
# Initialize the repository (first time only)
restic -r sftp:[email protected]:/backups/argobox init
cd /opt/argobox/backup
docker compose up -d
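The backup.env file chmodded above is not included in this excerpt; a hedged sketch of its likely shape (the repository URL is a placeholder, and the backup paths are assumptions matching the restore playbook later on this page):

```shell
# /opt/argobox/backup/backup.env — keep mode 600, contains secrets
RESTIC_REPOSITORY="sftp:user@backup-host:/backups/argobox"
RESTIC_PASSWORD="replace-with-your-restic-password"
# Paths included in each snapshot (assumed; adjust to your layout)
BACKUP_PATHS="/data/argobox /data/docker-volumes"
```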
ip -d link show | grep "vlan protocol"
ip addr show eth0.10 eth0.20 eth0.30 eth0.40
# From a VLAN 30 (IoT) host, verify isolation:
ping -c 2 10.42.10.1 # Should FAIL (management blocked)
ping -c 2 10.42.20.1 # Should FAIL (services blocked)
ping -c 2 8.8.8.8 # Should PASS (internet allowed)
# From a VLAN 10 (management) host:
ping -c 2 10.42.20.1 # Should PASS (management has full access)
nft list table inet vlan_policy
A mirrored ZFS pool with automatic snapshots, compression, and email alerts on disk errors.
Intermediate · 30-40 min · storage
Prerequisites
2+ drives of the same size (SATA or NVMe)
ZFS kernel module loaded (zfs-dkms or built-in)
zfs-utils / zfsutils-linux package installed
Mail relay configured for ZED alerts (optional)
/opt/argobox/zfs/setup-zfs-pool.sh
#!/usr/bin/env bash
set -euo pipefail
# ZFS mirror pool creation script
# Requires: 2 drives of the same size, ZFS kernel module loaded
#
# Verify your target drives first:
# lsblk -d -o NAME,SIZE,MODEL,SERIAL
# wipefs -a /dev/sdX (destroys all data — be certain)
POOL_NAME="tank"
DISK1="/dev/sda"
DISK2="/dev/sdb"
echo "Creating mirrored ZFS pool '${POOL_NAME}' with ${DISK1} and ${DISK2}..."
echo "This will DESTROY all data on both drives. Ctrl+C to abort."
sleep 5
# Create mirror pool with recommended settings
zpool create -f \
-o ashift=12 \
-o autotrim=on \
-O compression=zstd \
-O atime=off \
-O xattr=sa \
-O acltype=posixacl \
-O dnodesize=auto \
-O normalization=formD \
-O mountpoint=/${POOL_NAME} \
"${POOL_NAME}" mirror "${DISK1}" "${DISK2}"
echo "Pool created. Setting up datasets..."
# Dataset: media — large sequential reads (video, music, photos)
zfs create -o recordsize=1M -o compression=lz4 \
-o mountpoint=/${POOL_NAME}/media \
"${POOL_NAME}/media"
# Dataset: backups — general backup storage
zfs create -o recordsize=128k -o compression=zstd \
-o mountpoint=/${POOL_NAME}/backups \
"${POOL_NAME}/backups"
# Dataset: docker — container volumes and configs
zfs create -o recordsize=128k -o compression=zstd \
-o mountpoint=/${POOL_NAME}/docker \
"${POOL_NAME}/docker"
# Dataset: databases — small random I/O (Postgres, SQLite)
zfs create -o recordsize=64k -o compression=zstd \
-o logbias=latency \
-o mountpoint=/${POOL_NAME}/databases \
"${POOL_NAME}/databases"
echo ""
echo "Pool and datasets created:"
zfs list -o name,used,avail,refer,mountpoint,compression,recordsize \
-r "${POOL_NAME}"
echo ""
zpool status "${POOL_NAME}"
echo ""
echo "Next steps:"
echo " 1. Set up automatic snapshots (zfs-auto-snapshot.sh)"
echo " 2. Configure ZED for email alerts (zed.rc)"
echo " 3. Schedule a weekly scrub: echo '0 2 * * 0 root zpool scrub ${POOL_NAME}' >> /etc/crontab"
/opt/argobox/zfs/zfs-auto-snapshot.sh
#!/usr/bin/env bash
set -euo pipefail
# ZFS automatic snapshot script with configurable retention
# Install via cron — see bottom of script for suggested schedule.
#
# Usage: zfs-auto-snapshot.sh <label> <keep-count> [dataset]
# Examples:
# zfs-auto-snapshot.sh hourly 24 tank
# zfs-auto-snapshot.sh daily 30 tank
# zfs-auto-snapshot.sh weekly 8 tank
# zfs-auto-snapshot.sh monthly 12 tank
LABEL="${1:?Usage: zfs-auto-snapshot.sh <label> <keep> [dataset]}"
KEEP="${2:?Usage: zfs-auto-snapshot.sh <label> <keep> [dataset]}"
DATASET="${3:-tank}"
TIMESTAMP="$(date +%Y-%m-%d_%H%M)"
SNAP_NAME="${DATASET}@auto-${LABEL}-${TIMESTAMP}"
# Create recursive snapshot (covers all child datasets)
zfs snapshot -r "${SNAP_NAME}"
echo "Created snapshot: ${SNAP_NAME}"
# Prune old snapshots of this label, keeping the N most recent.
# Match only the top-level dataset's snapshots and destroy with -r so the
# recursive child snapshots created above are pruned along with them.
zfs list -H -t snapshot -o name -S creation -r "${DATASET}" \
| grep "^${DATASET}@auto-${LABEL}-" \
| tail -n +$(( KEEP + 1 )) \
| while read -r old_snap; do
echo "Destroying old snapshot: ${old_snap}"
zfs destroy -r "${old_snap}"
done
echo "Retention: keeping ${KEEP} most recent '${LABEL}' snapshots."
# Suggested cron entries (add to /etc/crontab or /etc/cron.d/zfs-snapshots):
#
# # Hourly — keep 24
# 0 * * * * root /opt/argobox/zfs/zfs-auto-snapshot.sh hourly 24 tank
#
# # Daily at midnight — keep 30
# 0 0 * * * root /opt/argobox/zfs/zfs-auto-snapshot.sh daily 30 tank
#
# # Weekly on Sunday at 1 AM — keep 8
# 0 1 * * 0 root /opt/argobox/zfs/zfs-auto-snapshot.sh weekly 8 tank
#
# # Monthly on the 1st at 2 AM — keep 12
# 0 2 1 * * root /opt/argobox/zfs/zfs-auto-snapshot.sh monthly 12 tank
/etc/zfs/zed.d/zed.rc
##
## ZFS Event Daemon (ZED) configuration
## Sends email alerts on disk errors, scrub results, and pool state changes.
##
# Email recipient for ZED alerts
ZED_EMAIL_ADDR="[email protected]"
# Email sender (requires a working MTA: postfix, msmtp, etc.)
ZED_EMAIL_OPTS="-s '@SUBJECT@' @ADDRESS@"
# Minimum interval between notifications for the same event class
ZED_NOTIFY_INTERVAL_SECS=3600
# Also notify on informational events (e.g. a clean scrub finish),
# not just errors and pool state changes (DEGRADED, FAULTED, REMOVED)
ZED_NOTIFY_VERBOSE=1
# Kick off a scrub automatically after a resilver completes
ZED_SCRUB_AFTER_RESILVER=1
# Use the system mailer
ZED_EMAIL_PROG="mail"
# Syslog integration
ZED_SYSLOG_TAG="zed"
ZED_SYSLOG_SUBCLASS_INCLUDE="checksum_error|io_error|data_error|scrub_finish|scrub_start|vdev.*"
# Lock file
ZED_LOCKDIR="/var/lock"
Deploy
# Verify ZFS kernel module is loaded
lsmod | grep zfs || sudo modprobe zfs
# Identify your target drives (double-check serials — wrong drive = data loss)
lsblk -d -o NAME,SIZE,MODEL,SERIAL
# Run the pool creation script
chmod +x /opt/argobox/zfs/setup-zfs-pool.sh
sudo bash /opt/argobox/zfs/setup-zfs-pool.sh
# Install the snapshot script
sudo cp /opt/argobox/zfs/zfs-auto-snapshot.sh /usr/local/bin/
sudo chmod +x /usr/local/bin/zfs-auto-snapshot.sh
# Add cron jobs for automatic snapshots
echo "0 * * * * root /usr/local/bin/zfs-auto-snapshot.sh hourly 24 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
echo "0 0 * * * root /usr/local/bin/zfs-auto-snapshot.sh daily 30 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
echo "0 1 * * 0 root /usr/local/bin/zfs-auto-snapshot.sh weekly 8 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
echo "0 2 1 * * root /usr/local/bin/zfs-auto-snapshot.sh monthly 12 tank" | sudo tee -a /etc/cron.d/zfs-snapshots
# Add weekly scrub
echo "0 2 * * 0 root zpool scrub tank" | sudo tee -a /etc/cron.d/zfs-scrub
# Configure ZED for email alerts
sudo cp /opt/argobox/zfs/zed.rc /etc/zfs/zed.d/zed.rc
sudo systemctl restart zed || sudo rc-service zed restart
Verify
# Check pool health and layout
zpool status tank
# List all datasets and their properties
zfs list -o name,used,avail,refer,mountpoint,compression,recordsize -r tank
# Take a manual snapshot and confirm it exists
sudo zfs snapshot -r tank@manual-test
zfs list -t snapshot -r tank
# Run a scrub and check for errors
sudo zpool scrub tank
zpool status tank # wait for scrub to complete, check for 0 errors
# Test snapshot rollback (on a non-critical dataset)
echo "test data" | sudo tee /tank/backups/test-file.txt
sudo zfs snapshot tank/backups@rollback-test
sudo rm /tank/backups/test-file.txt
sudo zfs rollback tank/backups@rollback-test
cat /tank/backups/test-file.txt # should show "test data"
# Clean up test snapshots
sudo zfs destroy tank@manual-test
sudo zfs destroy tank/backups@rollback-test
sudo rm /tank/backups/test-file.txt
# Verify ZED is running for email alerts
sudo systemctl status zed || sudo rc-service zed status
Rollback
# Export the pool (unmounts all datasets, safe)
sudo zpool export tank
# Or permanently destroy (ALL DATA LOST):
# sudo zpool destroy tank
# Remove cron jobs
sudo rm -f /etc/cron.d/zfs-snapshots /etc/cron.d/zfs-scrub
# Add the Jetstack Helm repo
helm repo add jetstack https://charts.jetstack.io && helm repo update
# Create the namespace
kubectl create namespace cert-manager --dry-run=client -o yaml | kubectl apply -f -
# Install cert-manager with Helm
helm upgrade --install cert-manager jetstack/cert-manager \
--namespace cert-manager \
--values /opt/argobox/k8s/cert-manager/cert-manager-values.yaml \
--wait --timeout 5m
# Wait for all cert-manager pods to be ready
kubectl wait --for=condition=Ready pods --all -n cert-manager --timeout=120s
# Apply the Cloudflare API token secret
kubectl apply -f /opt/argobox/k8s/cert-manager/cloudflare-secret.yaml
# Apply the ClusterIssuers
kubectl apply -f /opt/argobox/k8s/cert-manager/clusterissuer.yaml
# Apply the Certificate resources
kubectl apply -f /opt/argobox/k8s/cert-manager/certificate.yaml
# To use with any Ingress, add this annotation:
# cert-manager.io/cluster-issuer: letsencrypt-prod
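To illustrate that annotation, a minimal Ingress sketch; the resource name, service name, and port are hypothetical, and the host reuses the one tested in Verify below:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-app
  annotations:
    # cert-manager watches this and issues the TLS secret automatically
    cert-manager.io/cluster-issuer: letsencrypt-prod
spec:
  rules:
    - host: app.homelab.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-app
                port:
                  number: 80
  tls:
    - hosts:
        - app.homelab.example.com
      secretName: example-app-tls
```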
Verify
# Check cert-manager pods are running
kubectl get pods -n cert-manager
# Check ClusterIssuers are ready
kubectl get clusterissuers -o wide
# Check certificate status (READY should be True)
kubectl get certificates -A
# Inspect a specific certificate request for errors
kubectl describe certificaterequest -n default
# Check for any pending challenges
kubectl get challenges -A
# Verify the TLS secret was created
kubectl get secret wildcard-homelab-tls -n default
# Test the cert from outside the cluster
curl -vI https://app.homelab.example.com 2>&1 | grep -E "subject:|issuer:|expire"
A 3-node Proxmox cluster with HA failover, shared storage via Ceph, fencing, and live migration.
Advanced · 45-60 min · proxmox
Prerequisites
3 Proxmox VE 8.x nodes on the same subnet (10.42.0.0/24)
Dedicated cluster network recommended (separate NIC or VLAN for corosync traffic)
At least 1 unused OSD disk per node for Ceph
All nodes must resolve each other by hostname (/etc/hosts or DNS)
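The name-resolution prerequisite can be satisfied with matching /etc/hosts entries on every node; the IPs come from the cluster script below, while the domain suffix is a placeholder:

```
# /etc/hosts on each node (pve1/pve2/pve3 entries must agree everywhere)
10.42.0.31  pve1.argobox.lan  pve1
10.42.0.32  pve2.argobox.lan  pve2
10.42.0.33  pve3.argobox.lan  pve3
```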
/opt/argobox/proxmox/cluster-setup.sh
#!/usr/bin/env bash
set -euo pipefail
# Proxmox Cluster Setup — Run on NODE 1 only
# Nodes: pve1 (10.42.0.31), pve2 (10.42.0.32), pve3 (10.42.0.33)
# This script creates the cluster on the first node.
# Nodes 2 and 3 join separately (see join commands below).
CLUSTER_NAME="argobox-cluster"
NODE1_IP="10.42.0.31"
echo "Creating Proxmox cluster '${CLUSTER_NAME}' on $(hostname)..."
# Create the cluster (run on node 1 only)
pvecm create "${CLUSTER_NAME}" --link0 "${NODE1_IP}"
echo "Cluster created. Verifying..."
pvecm status
echo ""
echo "To join node 2 (run ON node 2):"
echo " pvecm add ${NODE1_IP} --link0 10.42.0.32"
echo ""
echo "To join node 3 (run ON node 3):"
echo " pvecm add ${NODE1_IP} --link0 10.42.0.33"
echo ""
echo "After all nodes join, verify with: pvecm status"
echo "Expected: 3 nodes, all online, quorate=yes"
/opt/argobox/proxmox/ha-group.sh
#!/usr/bin/env bash
set -euo pipefail
# HA Group and VM Assignment — Run after all nodes have joined
# This creates an HA group and assigns VMs to be managed by Proxmox HA.
# If a node fails, the HA manager will restart VMs on surviving nodes.
# Create HA group with all 3 nodes (priority: prefer pve1, then pve2, then pve3)
ha-manager groupadd prod-ha \
--nodes pve1:2,pve2:1,pve3:1 \
--restricted 1 \
--nofailback 0 \
--comment "Production HA group — all 3 nodes"
echo "HA group 'prod-ha' created."
# Assign VMs to the HA group
# Syntax: ha-manager set <type>:<vmid> --group <group> --state started --max_restart 3 --max_relocate 2
ha-manager set vm:100 --group prod-ha --state started --max_restart 3 --max_relocate 2
ha-manager set vm:101 --group prod-ha --state started --max_restart 3 --max_relocate 2
ha-manager set vm:102 --group prod-ha --state started --max_restart 3 --max_relocate 2
echo "VMs 100, 101, 102 assigned to HA group 'prod-ha'."
# Configure fencing (IPMI/iLO recommended for production)
# Without fencing, HA cannot safely restart VMs after a network partition.
# Proxmox HA falls back to the softdog kernel watchdog by default.
# If a node is down and the cluster has lost quorum, you can temporarily
# lower the expected vote count to recover (do not run this preemptively):
# pvecm expected 2
echo ""
echo "Verify HA status:"
ha-manager status
echo ""
echo "HA resources:"
ha-manager config
/opt/argobox/proxmox/ceph-setup.sh
#!/usr/bin/env bash
set -euo pipefail
# Ceph Setup on 3-Node Proxmox Cluster
# Run each section on the appropriate node (marked in comments).
# This gives you shared block storage visible from all cluster nodes.
#
# Prerequisites:
# - Proxmox cluster already formed (pvecm status shows 3 nodes)
# - Each node has at least 1 unused disk for Ceph OSDs
# - 10.42.0.0/24 network used for both public and cluster Ceph traffic
# (dedicated Ceph network recommended in production)
# --- Run on ALL 3 nodes ---
echo "Installing Ceph packages on $(hostname)..."
pveceph install --repository no-subscription
# --- Run on NODE 1 only ---
echo "Initializing Ceph on $(hostname)..."
pveceph init --network 10.42.0.0/24
# --- Run on ALL 3 nodes ---
echo "Creating Ceph monitor on $(hostname)..."
pveceph mon create
# Wait for monitors to reach quorum
echo "Waiting for Ceph monitor quorum..."
sleep 10
ceph mon stat
# --- Run on ALL 3 nodes (adjust disk per node) ---
# List available disks:
# lsblk -d -o NAME,SIZE,MODEL,SERIAL | grep -v "sda" (exclude OS disk)
# Replace /dev/sdb with your actual OSD disk on each node:
echo "Creating OSD on $(hostname) using /dev/sdb..."
pveceph osd create /dev/sdb
# --- Run on NODE 1 only (after all OSDs are created) ---
echo "Creating Ceph pool 'vm-pool' for VM storage..."
pveceph pool create vm-pool --pg_num 128 --size 3 --min_size 2
# Add pool as Proxmox storage
pvesm add rbd vm-pool \
--pool vm-pool \
--monhost 10.42.0.31,10.42.0.32,10.42.0.33 \
--content images,rootdir \
--krbd 0
echo ""
echo "Ceph status:"
ceph -s
echo ""
echo "OSD tree:"
ceph osd tree
echo ""
echo "Pool list:"
ceph osd pool ls detail
echo ""
echo "Ceph setup complete. You can now create VMs on the 'vm-pool' storage."
echo "The pool is accessible from all 3 cluster nodes."
Deploy
# --- Phase 1: Create cluster (on node 1) ---
chmod +x /opt/argobox/proxmox/cluster-setup.sh
bash /opt/argobox/proxmox/cluster-setup.sh
# --- Phase 2: Join nodes (run on each additional node) ---
ssh [email protected] "pvecm add 10.42.0.31 --link0 10.42.0.32"
ssh [email protected] "pvecm add 10.42.0.31 --link0 10.42.0.33"
# Wait for all nodes to sync
pvecm status
# --- Phase 3: Set up Ceph (run sections on appropriate nodes) ---
bash /opt/argobox/proxmox/ceph-setup.sh # see script for per-node instructions
# --- Phase 4: Configure HA ---
chmod +x /opt/argobox/proxmox/ha-group.sh
bash /opt/argobox/proxmox/ha-group.sh
# --- Phase 5: Test live migration ---
qm migrate 100 pve2 --online
Verify
# Cluster health — all 3 nodes should be online
pvecm status
pvecm nodes
# Ceph health — should show HEALTH_OK with 3 OSDs up
ceph -s
ceph osd tree
# HA status — all VMs should show "started"
ha-manager status
# Live migrate a VM and verify it stays running
qm migrate 100 pve2 --online
qm status 100 # should show "running" on pve2
# Simulate node failure — reboot one node and watch HA failover
ssh [email protected] "reboot"
# After ~60 seconds, VMs from pve2 should restart on pve1 or pve3:
ha-manager status
qm list
Rollback
# Remove HA assignments first
ha-manager remove vm:100
ha-manager remove vm:101
ha-manager remove vm:102
ha-manager groupremove prod-ha
# Destroy Ceph (DESTROYS ALL DATA ON CEPH POOL)
pveceph pool destroy vm-pool
ceph osd out 0 && ceph osd down 0 && ceph osd purge 0 --yes-i-really-mean-it
# Repeat OSD purge for each OSD (1, 2, etc.)
# Remove nodes from cluster (run on each node being removed)
pvecm delnode pve3 # run on pve1
pvecm delnode pve2 # run on pve1
Validated restore procedure for Restic backups -- because untested backups are just hopes.
Intermediate · 20-30 min · docker
Prerequisites
Existing Restic repository (see the Automated Backup Pipeline playbook)
Target machine with Docker Engine + Compose v2
RESTIC_REPOSITORY and RESTIC_PASSWORD environment variables set or available
Enough free disk space for the restore (at least 1x the backup size)
/opt/argobox/restore/restore-procedure.sh
#!/usr/bin/env bash
set -euo pipefail
# Restic Restore Procedure
# This script walks through a full restore from a Restic backup.
# Set these before running:
# export RESTIC_REPOSITORY="sftp:[email protected]:/backups/argobox"
# export RESTIC_PASSWORD="your-restic-password"
RESTORE_TARGET="/tmp/argobox-restore"
echo "=== Step 1: Check repository health ==="
restic check
echo ""
echo "=== Step 2: List available snapshots ==="
restic snapshots --compact
echo ""
echo "=== Step 3: Show contents of latest snapshot ==="
restic ls latest --long | head -50
echo "(truncated — use 'restic ls latest' for full listing)"
echo ""
echo "=== Step 4: Restore latest snapshot ==="
echo "Restoring to ${RESTORE_TARGET}..."
mkdir -p "${RESTORE_TARGET}"
restic restore latest \
--target "${RESTORE_TARGET}" \
--include "/data/argobox" \
--include "/data/docker-volumes"
echo ""
echo "=== Step 5: Verify restored file count ==="
RESTORED_FILES=$(find "${RESTORE_TARGET}" -type f | wc -l)
echo "Restored files: ${RESTORED_FILES}"
echo ""
echo "Restore complete. Files are in ${RESTORE_TARGET}"
echo ""
echo "To restore a SPECIFIC snapshot instead of latest:"
echo " restic snapshots # find the snapshot ID"
echo " restic restore abc123 --target /tmp/argobox-restore"
echo ""
echo "To browse a snapshot interactively (FUSE mount):"
echo " mkdir -p /mnt/restic"
echo " restic mount /mnt/restic &"
echo " ls /mnt/restic/snapshots/"
echo " # Browse freely, then unmount:"
echo " fusermount -u /mnt/restic"
/opt/argobox/restore/restore-compose.sh
#!/usr/bin/env bash
set -euo pipefail
# Restore Docker Compose services from a Restic backup
# Run AFTER restore-procedure.sh has extracted files to RESTORE_TARGET.
RESTORE_TARGET="/tmp/argobox-restore"
ARGOBOX_ROOT="/opt/argobox"
DOCKER_VOLUMES="/var/lib/docker/volumes"
echo "=== Step 1: Stop running containers ==="
cd "${ARGOBOX_ROOT}"
for dir in stack monitoring backup; do
if [[ -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" ]]; then
echo "Stopping ${dir}..."
docker compose -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" down || true
fi
done
echo ""
echo "=== Step 2: Restore ArgoBox configs ==="
if [[ -d "${RESTORE_TARGET}/data/argobox" ]]; then
rsync -av --backup --suffix=".pre-restore" \
"${RESTORE_TARGET}/data/argobox/" "${ARGOBOX_ROOT}/"
echo "Configs restored to ${ARGOBOX_ROOT}"
else
echo "WARNING: No argobox configs found in restore."
fi
echo ""
echo "=== Step 3: Restore Docker volumes ==="
if [[ -d "${RESTORE_TARGET}/data/docker-volumes" ]]; then
rsync -av --backup --suffix=".pre-restore" \
"${RESTORE_TARGET}/data/docker-volumes/" "${DOCKER_VOLUMES}/"
echo "Docker volumes restored."
else
echo "WARNING: No docker volumes found in restore."
fi
echo ""
echo "=== Step 4: Fix ownership ==="
# Postgres needs its data owned by uid 999
if [[ -d "${DOCKER_VOLUMES}/stack_pgdata" ]]; then
chown -R 999:999 "${DOCKER_VOLUMES}/stack_pgdata/_data/"
fi
# Grafana needs uid 472
if [[ -d "${DOCKER_VOLUMES}/monitoring_grafana_data" ]]; then
chown -R 472:472 "${DOCKER_VOLUMES}/monitoring_grafana_data/_data/"
fi
echo ""
echo "=== Step 5: Restart containers ==="
for dir in stack monitoring backup; do
if [[ -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" ]]; then
echo "Starting ${dir}..."
docker compose -f "${ARGOBOX_ROOT}/${dir}/docker-compose.yml" up -d
fi
done
echo ""
echo "Restore complete. Run verify-restore.sh to check service health."
/opt/argobox/restore/verify-restore.sh
#!/usr/bin/env bash
set -euo pipefail
# Post-Restore Verification Script
# Checks each service health endpoint after a restore.
PASS=0
FAIL=0
check() {
local name="$1"
local cmd="$2"
echo -n "Checking ${name}... "
if eval "${cmd}" &>/dev/null; then
echo "OK"
PASS=$((PASS + 1))
else
echo "FAILED"
FAIL=$((FAIL + 1))
fi
}
echo "=== Service Health Checks ==="
echo ""
# Docker containers running
check "Docker containers" "docker compose -f /opt/argobox/stack/docker-compose.yml ps --status running | grep -q 'running'"
# Traefik responding
check "Traefik dashboard" "curl -sf http://localhost:8080/api/overview"
# Postgres accepting connections
check "Postgres" "docker exec $(docker compose -f /opt/argobox/stack/docker-compose.yml ps -q postgres 2>/dev/null) pg_isready -U app -d app"
# Grafana API
check "Grafana" "curl -sf http://10.42.0.10:3000/api/health"
# Prometheus targets
check "Prometheus" "curl -sf http://10.42.0.10:9090/api/v1/targets"
# Alertmanager
check "Alertmanager" "curl -sf http://10.42.0.10:9093/api/v2/status"
echo ""
echo "=== Data Integrity Checks ==="
echo ""
# Postgres database integrity
check "Postgres integrity" "docker exec $(docker compose -f /opt/argobox/stack/docker-compose.yml ps -q postgres 2>/dev/null) psql -U app -d app -c 'SELECT 1'"
# Prometheus data directory not empty
check "Prometheus data" "docker exec $(docker compose -f /opt/argobox/monitoring/docker-compose.yml ps -q prometheus 2>/dev/null) ls /prometheus/wal"
# Grafana dashboards exist
check "Grafana dashboards" "curl -sf http://10.42.0.10:3000/api/search | grep -q 'title'"
echo ""
echo "=== Results ==="
echo "Passed: ${PASS}"
echo "Failed: ${FAIL}"
if [[ ${FAIL} -gt 0 ]]; then
echo "Some checks failed. Investigate before considering the restore complete."
exit 1
else
echo "All checks passed. Restore verified."
fi
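Right after `docker compose up -d`, containers often need several seconds before their first health probe passes, so a one-shot check can report false failures. A retry wrapper like the one below (a sketch, not part of the script above — the function name is ours) smooths that out:

```shell
# check_retry <name> <command> [tries] — like check(), but retries with a
# short pause so services still booting after a restore aren't flagged as dead.
check_retry() {
  local name="$1" cmd="$2" tries="${3:-5}"
  local i
  for ((i = 1; i <= tries; i++)); do
    if eval "${cmd}" &>/dev/null; then
      echo "${name}: OK (attempt ${i}/${tries})"
      return 0
    fi
    sleep 2
  done
  echo "${name}: FAILED after ${tries} attempts"
  return 1
}
```

Example: `check_retry "Grafana" "curl -sf http://10.42.0.10:3000/api/health" 10` gives Grafana up to ~20 seconds to come up before counting it as failed.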
Deploy
# Set Restic repository credentials
export RESTIC_REPOSITORY="sftp:[email protected]:/backups/argobox"
export RESTIC_PASSWORD="your-restic-password"
# Step 1: Run the restore procedure (extracts files)
chmod +x /opt/argobox/restore/restore-procedure.sh
sudo -E bash /opt/argobox/restore/restore-procedure.sh
# Step 2: Restore and restart Docker services
chmod +x /opt/argobox/restore/restore-compose.sh
sudo bash /opt/argobox/restore/restore-compose.sh
# Step 3: Verify everything came back healthy
chmod +x /opt/argobox/restore/verify-restore.sh
sudo bash /opt/argobox/restore/verify-restore.sh
Verify
# Run the verification script
sudo bash /opt/argobox/restore/verify-restore.sh
# Manual spot checks
docker compose -f /opt/argobox/stack/docker-compose.yml ps
docker compose -f /opt/argobox/monitoring/docker-compose.yml ps
curl -sf http://10.42.0.10:3000/api/health | jq .
# Compare file counts against what Restic reports
restic stats latest --mode restore-size  # prints total file count + on-disk size
find /opt/argobox -type f | wc -l
Rollback
# If the restore went wrong, restore from a different snapshot
export RESTIC_REPOSITORY="sftp:[email protected]:/backups/argobox"
export RESTIC_PASSWORD="your-restic-password"
restic snapshots # pick a different snapshot ID
restic restore <snapshot-id> --target /tmp/argobox-restore-v2
# Then re-run restore-compose.sh with the new RESTORE_TARGET
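Picking the fallback snapshot can be scripted instead of eyeballed. A small helper (a sketch; assumes `jq` is installed, and the function name is ours) filters `restic snapshots --json` for the newest snapshot taken before a cutoff time:

```shell
# newest_before <cutoff-ISO8601> — reads `restic snapshots --json` on stdin
# and prints the short_id of the most recent snapshot older than <cutoff>.
# ISO 8601 timestamps sort lexicographically, so plain string comparison
# works (assuming all timestamps share the same UTC offset).
newest_before() {
  jq -r --arg cutoff "$1" \
    '[.[] | select(.time < $cutoff)] | sort_by(.time) | last | .short_id'
}
# Usage (cutoff = when the bad restore happened; example value):
# restic snapshots --json | newest_before "2025-06-01T00:00:00Z"
```

The result feeds straight into `restic restore "$(...)" --target /tmp/argobox-restore-v2`.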
# Mosquitto ACL Configuration
# Restricts which users/clients can read/write which MQTT topics.
# Docs: https://mosquitto.org/man/mosquitto-conf-5.html#idm484
# Home Assistant — full read/write access to all topics
user homeassistant
topic readwrite #
# Zigbee2MQTT — full access to its own namespace + homeassistant discovery
user zigbee2mqtt
topic readwrite zigbee2mqtt/#
topic readwrite homeassistant/#
# IoT sensors — can only publish their own state, read commands
# Pattern: sensors/<device-id>/state (write)
# homeassistant/<device-id>/command (read)
user iot-sensors
topic write sensors/+/state
topic write sensors/+/availability
topic read homeassistant/+/command
# Temperature/humidity sensors
user env-sensors
topic write sensors/environment/+
topic read homeassistant/climate/+/command
# Deny everything else by default (no "topic" line = deny)
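The `+` and `#` wildcards above follow standard MQTT matching: `+` spans exactly one topic level, `#` spans all remaining levels. A toy bash sketch (the function is ours, not part of Mosquitto, and it ignores edge cases like `$SYS` topics and `#` matching its parent level) shows the equivalent regexes, which helps when reasoning about what an ACL line actually grants:

```shell
# topic_matches <pattern> <topic> — simplified MQTT wildcard matching.
# '+' matches exactly one topic level, '#' matches the rest of the topic.
topic_matches() {
  local pattern="$1" topic="$2" regex
  regex="${pattern//+/[^/]+}"    # '+' -> one level (anything but '/')
  regex="${regex//'#'/.*}"       # '#' -> anything, including '/'
  [[ "${topic}" =~ ^${regex}$ ]]
}
# topic_matches "sensors/+/state" "sensors/office/state"   -> true
# topic_matches "sensors/+/state" "sensors/a/b/state"      -> false
```

So `topic write sensors/+/state` lets a sensor publish `sensors/office/state` but not `sensors/office/extra/state`, and nothing outside `sensors/`.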
/opt/argobox/homeassistant/nftables-iot.conf
#!/usr/sbin/nft -f
# IoT VLAN firewall rules
# Allows IoT devices (10.42.30.0/24) to reach ONLY the MQTT broker.
# Blocks all other access to the main LAN and management networks.
# This pairs with the VLAN Segmentation playbook.
table inet iot_isolation {
chain forward {
type filter hook forward priority 10; policy accept;
# Allow IoT VLAN to reach MQTT broker on the Docker host (port 1883)
iifname "eth0.30" ip daddr 10.42.0.10 tcp dport 1883 accept
# Allow IoT VLAN to reach Home Assistant (port 8123) for direct integrations
iifname "eth0.30" ip daddr 10.42.0.10 tcp dport 8123 accept
# Allow IoT devices to talk to each other (mDNS, local protocols)
iifname "eth0.30" oifname "eth0.30" accept
# Allow established/related return traffic
iifname "eth0.30" ct state established,related accept
# Block IoT VLAN from reaching management VLAN
iifname "eth0.30" ip daddr 10.42.10.0/24 drop
# Block IoT VLAN from reaching services VLAN (except the rules above)
iifname "eth0.30" ip daddr 10.42.20.0/24 drop
# Block IoT from reaching Proxmox hosts directly
iifname "eth0.30" ip daddr { 10.42.0.31, 10.42.0.32, 10.42.0.33 } drop
# Allow IoT VLAN internet access (for firmware updates, NTP)
iifname "eth0.30" oifname "eth0" ip daddr != { 10.42.0.0/16 } accept
# Log and drop anything else from IoT VLAN
iifname "eth0.30" log prefix "IOT-DROP: " counter drop
}
}
Deploy
# Create directory structure
mkdir -p /opt/argobox/homeassistant/mosquitto
# Create MQTT password file (generates hashed passwords)
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
mosquitto_passwd -c -b /mosquitto/config/password_file homeassistant "replace-with-ha-mqtt-password"
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
mosquitto_passwd -b /mosquitto/config/password_file zigbee2mqtt "replace-with-z2m-mqtt-password"
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
mosquitto_passwd -b /mosquitto/config/password_file iot-sensors "replace-with-sensor-password"
docker run --rm -v /opt/argobox/homeassistant/mosquitto:/mosquitto/config eclipse-mosquitto:2 \
mosquitto_passwd -b /mosquitto/config/password_file env-sensors "replace-with-env-sensor-password"
# Lock down the password file; it must stay readable by the in-container
# mosquitto user (uid 1883 in the eclipse-mosquitto image)
chown 1883:1883 /opt/argobox/homeassistant/mosquitto/password_file
chmod 600 /opt/argobox/homeassistant/mosquitto/password_file
# Start the stack
cd /opt/argobox/homeassistant && docker compose up -d
# Wait for Home Assistant to finish initial setup
echo "Home Assistant is starting at http://10.42.0.10:8123 — initial setup takes 1-2 minutes."
# Configure MQTT integration in Home Assistant:
# Settings > Devices & Services > Add Integration > MQTT
# Broker: 10.42.0.10, Port: 1883, User: homeassistant, Password: <from above>
# Apply IoT VLAN firewall rules (if using VLAN segmentation)
sudo nft -c -f /opt/argobox/homeassistant/nftables-iot.conf  # dry-run syntax check first
sudo nft -f /opt/argobox/homeassistant/nftables-iot.conf
Verify
# Check all containers are running and healthy
docker compose -f /opt/argobox/homeassistant/docker-compose.yml ps
# Test MQTT publish/subscribe — publish retained (-r) so the subscribe that
# follows actually receives the message, then clear the retained flag
docker exec mosquitto mosquitto_pub -h localhost -u homeassistant -P "replace-with-ha-mqtt-password" -t "test/topic" -m "hello" -r
docker exec mosquitto mosquitto_sub -h localhost -u homeassistant -P "replace-with-ha-mqtt-password" -t "test/topic" -C 1 -W 5
docker exec mosquitto mosquitto_pub -h localhost -u homeassistant -P "replace-with-ha-mqtt-password" -t "test/topic" -r -n
# Verify Home Assistant is responding
curl -sf http://10.42.0.10:8123/api/ | head -1
# Verify Zigbee2MQTT frontend
curl -sf http://10.42.0.10:8082/ | head -1
# Test IoT VLAN isolation (from a device on 10.42.30.0/24):
# ping 10.42.0.10 # Should FAIL (blocked by firewall)
# But MQTT should work:
# mosquitto_pub -h 10.42.0.10 -p 1883 -u iot-sensors -P "<password>" -t "sensors/test/state" -m "25.3"
# Verify the sensor cannot reach Proxmox:
# ping 10.42.0.31 # Should FAIL (IoT blocked from management)
# Check firewall counters
sudo nft list table inet iot_isolation
Distributed block storage for K3s with automatic backups to NFS, recurring snapshots, and disaster recovery tested.
Intermediate30-40 minkubernetes
Prerequisites
K3s cluster with 2+ nodes
NFS server for backup target (e.g., 10.42.0.20:/backups/longhorn)
open-iscsi installed on all worker nodes
Helm v3 installed
/opt/argobox/k8s/longhorn/longhorn-values.yaml
# Longhorn Helm values for K3s homelab
# Docs: https://longhorn.io/docs/
defaultSettings:
# Backup target — NFS share for storing volume backups
backupTarget: nfs://10.42.0.20:/backups/longhorn
# Default replica count — 2 for a small cluster (minimum 2 nodes)
defaultReplicaCount: 2
# Data locality — prefer keeping data on the node that uses it
defaultDataLocality: best-effort
# Storage over-provisioning — allow 200% of physical capacity
storageOverProvisioningPercentage: 200
# Minimum storage available before Longhorn stops scheduling
storageMinimalAvailablePercentage: 15
# Auto-delete workload pod when volume is detached unexpectedly
autoDeletePodWhenVolumeDetachedUnexpectedly: true
# Guaranteed instance manager CPU (percentage of each node's total CPU)
guaranteedInstanceManagerCPU: 12
# Replica auto-balance across nodes
replicaAutoBalance: best-effort
persistence:
# Set Longhorn as the default StorageClass
defaultClass: true
defaultClassReplicaCount: 2
reclaimPolicy: Retain
ingress:
enabled: true
ingressClassName: traefik
host: longhorn.argobox.com
tls: true
tlsSecret: longhorn-tls
annotations:
cert-manager.io/cluster-issuer: letsencrypt-prod
longhornUI:
replicas: 1
# Prometheus ServiceMonitor for metrics collection
metrics:
serviceMonitor:
enabled: true
additionalLabels:
release: kube-prometheus
/opt/argobox/k8s/longhorn/recurring-job.yaml
# Longhorn Recurring Jobs — automatic snapshots and backups
# These run on all volumes with the matching label (or set as default).
# Hourly snapshots — keep 24 (local only, fast rollback)
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: snapshot-hourly
namespace: longhorn-system
spec:
name: snapshot-hourly
task: snapshot
cron: "0 * * * *"
retain: 24
concurrency: 2
labels:
recurring-job.longhorn.io/source: system
groups:
- default
---
# Daily backups to NFS — keep 7
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: backup-daily
namespace: longhorn-system
spec:
name: backup-daily
task: backup
cron: "0 2 * * *"
retain: 7
concurrency: 1
labels:
recurring-job.longhorn.io/source: system
groups:
- default
---
# Weekly backups to NFS — keep 4
apiVersion: longhorn.io/v1beta2
kind: RecurringJob
metadata:
name: backup-weekly
namespace: longhorn-system
spec:
name: backup-weekly
task: backup
cron: "0 3 * * 0"
retain: 4
concurrency: 1
labels:
recurring-job.longhorn.io/source: system
groups:
- default
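Taken together, the three jobs above give roughly this recovery window (a back-of-envelope sketch; it assumes the jobs fire on schedule and nothing is pruned early):

```shell
# Recovery window implied by the retain counts configured above
hourly_snaps=24      # snapshot-hourly: retain 24
daily_backups=7      # backup-daily:   retain 7
weekly_backups=4     # backup-weekly:  retain 4
echo "Local snapshots: hourly granularity for the last ${hourly_snaps} hours"
echo "NFS backups:     daily granularity for ${daily_backups} days,"
echo "                 then weekly back to $((weekly_backups * 7)) days"
```

In short: anything inside the last day can be rolled back from a local snapshot in seconds; older restores come from NFS at daily or weekly granularity.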
/opt/argobox/k8s/longhorn/restore-test.yaml
# Restore a Longhorn volume from backup
# First, find the backup URL in the Longhorn UI:
# Backup > select volume > select backup > copy "Backup URL"
# Or via kubectl:
# kubectl -n longhorn-system get backups.longhorn.io
# StorageClass that provisions new volumes FROM a specific backup.
# This is Longhorn's CSI restore path: any PVC that uses this class
# is populated from the backup URL below.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: longhorn-restore
provisioner: driver.longhorn.io
parameters:
  numberOfReplicas: "2"
  # Replace with the actual backup URL from Longhorn
  fromBackup: "nfs://10.42.0.20:/backups/longhorn/default-pvc-abc123?backup=backup-xyz789"
---
# PVC that restores from the backup via the StorageClass above
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: restored-data
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: longhorn-restore
  resources:
    requests:
      storage: 10Gi
---
# Test pod that mounts the restored volume and verifies data
apiVersion: v1
kind: Pod
metadata:
name: restore-verify
namespace: default
spec:
containers:
- name: verify
image: busybox:1.36
command:
- /bin/sh
- -c
- |
echo "=== Restore Verification ==="
echo "Volume contents:"
ls -la /data/
echo ""
echo "File count:"
find /data -type f | wc -l
echo ""
echo "Disk usage:"
du -sh /data/
echo ""
echo "Verification complete. Pod will stay running for inspection."
echo "Delete with: kubectl delete pod restore-verify"
sleep 3600
volumeMounts:
- name: restored-data
mountPath: /data
volumes:
- name: restored-data
persistentVolumeClaim:
claimName: restored-data
restartPolicy: Never
Deploy
# Install open-iscsi on all worker nodes (required for Longhorn)
ssh [email protected] "sudo apt install -y open-iscsi && sudo systemctl enable --now iscsid"
ssh [email protected] "sudo apt install -y open-iscsi && sudo systemctl enable --now iscsid"
# Add the Longhorn Helm repo
helm repo add longhorn https://charts.longhorn.io && helm repo update
# Create the namespace
kubectl create namespace longhorn-system --dry-run=client -o yaml | kubectl apply -f -
# Install Longhorn with Helm
helm upgrade --install longhorn longhorn/longhorn \
--namespace longhorn-system \
--values /opt/argobox/k8s/longhorn/longhorn-values.yaml \
--wait --timeout 10m
# Wait for all Longhorn pods to be ready
kubectl wait --for=condition=Ready pods --all -n longhorn-system --timeout=300s
# Apply recurring snapshot/backup jobs
kubectl apply -f /opt/argobox/k8s/longhorn/recurring-job.yaml
# Create a test PVC to verify Longhorn is working
kubectl apply -f - <<TESTEOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
name: longhorn-test
namespace: default
spec:
accessModes: ["ReadWriteOnce"]
storageClassName: longhorn
resources:
requests:
storage: 1Gi
TESTEOF
# Trigger a manual backup of the test volume from the Longhorn UI
# (Longhorn UI: longhorn.argobox.com > Volume > longhorn-test > Create Backup)
# Test restore from backup
# kubectl apply -f /opt/argobox/k8s/longhorn/restore-test.yaml
Verify
# Check all Longhorn pods are running
kubectl get pods -n longhorn-system
# List Longhorn volumes
kubectl get volumes.longhorn.io -n longhorn-system
# Check Longhorn nodes (all should show "Ready")
kubectl get nodes.longhorn.io -n longhorn-system
# Verify recurring jobs are configured
kubectl get recurringjobs.longhorn.io -n longhorn-system
# Check the Longhorn UI for replica health
echo "Longhorn UI: https://longhorn.argobox.com"
# Verify backups exist on the NFS target
ls -la /backups/longhorn/ # run on the NFS server (10.42.0.20)
# Test the restored volume (if restore-test.yaml was applied)
kubectl logs restore-verify -n default
# Verify the default StorageClass is Longhorn
kubectl get storageclass