The 463-Message Traefik Saga

Date: September 27-30, 2023
Duration: 4 days of intensive troubleshooting
Issue: Everything broke while containerizing the homelab
Root Cause: Docker networks, Traefik labels, firewall rules, DNS (all of it)


The Goal

Simple enough: move from services installed directly on hosts to Docker containers with Traefik as reverse proxy.

What I thought would take an afternoon became a four-day marathon.


Day 1: Optimism

Installed Docker and Docker Compose. Wrote my first docker-compose.yml for Traefik. Started it up.

Container started. Logs looked clean. Port 8080 showed the Traefik dashboard.

Then I tried to actually route traffic to a service.

Nothing.


Day 2: Docker Won’t Pull

Network issues started appearing. Docker couldn’t reliably pull images.

Error response from daemon: Get "https://registry-1.docker.io/v2/": dial tcp: lookup registry-1.docker.io: no such host

Sometimes it worked. Sometimes it failed. The inconsistency made debugging impossible.

Tried:

  • Manual DNS configuration in daemon.json
  • Different DNS servers (8.8.8.8, 1.1.1.1)
  • Retry loops in deployment scripts
  • Partial image cleanup
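
For reference, the daemon.json change is a "dns" key listing resolver addresses, and a retry loop can be sketched in a few lines of bash. The retry helper below, its name, and its arguments are illustrative assumptions, not the original deployment script.

```shell
#!/usr/bin/env bash
# /etc/docker/daemon.json with explicit resolvers looks like:
#   { "dns": ["8.8.8.8", "1.1.1.1"] }
# (restart the Docker daemon after changing it)

# retry <attempts> <delay_seconds> <command...>
# Hypothetical helper: re-run a command until it succeeds or attempts run out.
retry() {
  local attempts=$1 delay=$2 n=1
  shift 2
  until "$@"; do
    if [ "$n" -ge "$attempts" ]; then
      echo "giving up after $n attempts: $*" >&2
      return 1
    fi
    echo "attempt $n failed, retrying in ${delay}s: $*" >&2
    n=$((n + 1))
    sleep "$delay"
  done
}

# Usage (not executed here): retry 5 10 docker pull traefik:latest
```

A retry loop only papers over flaky DNS. It is a stopgap while the resolver problem itself is tracked down.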

By end of day 2: still fighting basic connectivity.


Day 3: The 238-Message Day

This was the day everything exploded. Every fix revealed a new problem.

Container networking: Bridge mode wasn’t exposing ports right. Host mode broke isolation. Overlay seemed overkill for a single host.

Traefik routing: Labels weren’t being detected reliably. Services that did show up in the dashboard still returned 404. The documentation made it look simple. It wasn’t.

Cloudflare DDNS: API tokens weren’t authenticating. Zone permissions were wrong. DNS records weren’t updating.
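
One way to separate a bad token from bad zone permissions is Cloudflare's token verification endpoint. A minimal sketch, assuming bash and curl, with the token held in a CF_API_TOKEN variable (a hypothetical name); the cf_token_ok helper is illustrative, not part of the original DDNS script.

```shell
#!/usr/bin/env bash
# cf_token_ok: succeed only if Cloudflare reports the API token as valid.
# CF_API_TOKEN is an assumed variable name, not taken from the original setup.
cf_token_ok() {
  local response
  response=$(curl -sS \
    -H "Authorization: Bearer ${CF_API_TOKEN:?CF_API_TOKEN is not set}" \
    "https://api.cloudflare.com/client/v4/user/tokens/verify") || return 1
  # The API answers with JSON containing "success": true for a working token.
  printf '%s' "$response" | grep -Eq '"success"[[:space:]]*:[[:space:]]*true'
}

# Usage (not executed here):
#   cf_token_ok && echo "token valid" || echo "token rejected"
```

A token that verifies but still cannot update records points at permissions: it needs DNS edit rights on the zone in question.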

Cronjobs: Script worked manually but not from cron. Environment variables missing. Path issues.
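
Cron starts jobs with a minimal environment, which would explain both symptoms. A minimal sketch of a cron-safe preamble, assuming bash; the load_env and require_env helpers, the env file path, and the variable names are all hypothetical.

```shell
#!/usr/bin/env bash
# Cron provides only a bare PATH, so put the usual directories in front of it.
PATH="/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin${PATH:+:$PATH}"
export PATH

# load_env <file>: source a file of VAR=value lines and export each variable.
load_env() {
  local file=$1
  [ -r "$file" ] || { echo "env file not readable: $file" >&2; return 1; }
  set -a   # auto-export everything defined while sourcing
  # shellcheck disable=SC1090
  . "$file"
  set +a
}

# require_env VAR...: fail loudly if any named variable is unset or empty.
require_env() {
  local name missing=0
  for name in "$@"; do
    if [ -z "${!name:-}" ]; then
      echo "missing required variable: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage inside the cron script (paths and names are placeholders):
#   load_env /home/user/.ddns.env
#   require_env CF_API_TOKEN CF_ZONE_ID || exit 1
```

Redirecting the job's output to a log file in the crontab entry makes the next failure visible instead of silent.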

All of this, simultaneously.


Day 4: The Network That Already Exists

Started seeing this constantly:

Error: network traefik-public already exists

Docker Compose was trying to create a network that was already there, yet the services couldn’t use it. Failed deployments had left orphaned network state behind.

The fix was embarrassingly simple:

networks:
  traefik-public:
    external: true

Tell Compose the network exists externally. Don’t try to create it.
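
With external: true, Compose no longer creates the network, so it has to exist before the stack starts. A minimal sketch of an idempotent pre-deploy step, assuming bash and the Docker CLI; the ensure_network helper is hypothetical.

```shell
#!/usr/bin/env bash
# ensure_network <name>: create the Docker network only if it is missing.
ensure_network() {
  local name=$1
  if docker network inspect "$name" >/dev/null 2>&1; then
    echo "network $name already exists"
  else
    docker network create "$name" >/dev/null
    echo "network $name created"
  fi
}

# Usage (not executed here): ensure_network traefik-public
```

Running this before every deployment is safe because the create step is skipped when the network is already present.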

But that wasn’t the end.


The Firewall Problem

Services were running. Traefik was routing. But nothing was accessible from the LAN.

pfSense was blocking Docker traffic. Default deny on subnets I hadn’t explicitly allowed.

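Docker uses 172.17.0.0/16 by default for its built-in bridge network, and user-defined networks such as traefik-public are allocated their own subnets (usually 172.18.0.0/16 and up). pfSense had no idea these subnets existed. Traffic from containers to the LAN got dropped at the firewall.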

Added rules:

  • Allow Docker host to LAN
  • Allow Docker subnets (172.17.0.0/16) outbound
  • Allow LAN to Docker host ports
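
To see which subnets the firewall actually needs to know about, Docker can report the address range of every network it manages. A sketch, assuming bash and access to the Docker CLI; list_docker_subnets is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# list_docker_subnets: print "<network name>: <subnet(s)>" for every network.
list_docker_subnets() {
  local ids
  ids=$(docker network ls -q)
  [ -n "$ids" ] || return 0
  # Word splitting of the ID list is intended here.
  # shellcheck disable=SC2086
  docker network inspect \
    --format '{{.Name}}: {{range .IPAM.Config}}{{.Subnet}} {{end}}' $ids
}

# Usage (not executed here): list_docker_subnets
```

Networks without IP address management, such as host and none, show up with an empty subnet.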

Suddenly, everything worked.


The Breakthrough

Around message 450 of 463, I hit the Traefik dashboard and saw:

  • All services green
  • Routes resolving correctly
  • SSL certificates generating
  • Containers communicating through the proxy

Four days of incremental progress, one moment of everything clicking into place.


What Finally Worked

Traefik docker-compose.yml:

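version: '3'

services:
  traefik:
    image: traefik:latest
    command:
      # Dashboard and API on :8080 without auth; keep this off the public internet
      - "--api.insecure=true"
      # Discover containers through the Docker socket
      - "--providers.docker=true"
      # Ignore containers unless they carry traefik.enable=true
      - "--providers.docker.exposedbydefault=false"
      # HTTP and HTTPS entrypoints
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
    ports:
      - "80:80"
      - "443:443"
      - "8080:8080"
    volumes:
      # Read-only socket mount so Traefik can watch container events
      - "/var/run/docker.sock:/var/run/docker.sock:ro"
    networks:
      - traefik-public

networks:
  # Shared network created beforehand with: docker network create traefik-public
  traefik-public:
    external: true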

Service labels:

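labels:
  # Opt in, since exposedbydefault=false
  - "traefik.enable=true"
  # Route requests for this hostname to the service
  - "traefik.http.routers.myapp.rule=Host(`app.domain.com`)"
  - "traefik.http.routers.myapp.entrypoints=websecure"
  # Port the app listens on inside the container, not a published host port
  - "traefik.http.services.myapp.loadbalancer.server.port=8080"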

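Key insight: with --providers.docker.exposedbydefault=false, Traefik ignores every container that lacks traefik.enable=true, so each service needs that label explicitly. The service also has to be attached to traefik-public, or Traefik has no network path to it.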


What I Learned

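Docker networks are stateful. Failed deployments leave orphaned networks. Clean up with docker network prune before retrying, keeping in mind that it removes every network with no containers attached, including an idle traefik-public, which then has to be recreated.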

Traefik labels are picky. One typo in a label name and the router is never created, so the service is invisible. The dashboard shows discovered services; use it to verify detection.
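
The same check can be scripted against Traefik's API, which is served on port 8080 when --api.insecure=true is set. A minimal sketch, assuming bash and curl; router_exists is a hypothetical helper, and myapp is the router name from the service labels shown earlier.

```shell
#!/usr/bin/env bash
# router_exists <router> [api_base_url]
# Succeeds if Traefik knows a router with this name from the Docker provider.
router_exists() {
  local name=$1 api=${2:-http://localhost:8080}
  # -f makes curl exit non-zero on an HTTP error such as 404 (router unknown).
  curl -fsS -o /dev/null "${api}/api/http/routers/${name}@docker"
}

# Usage (not executed here):
#   router_exists myapp || echo "myapp not detected, check the labels"
```

A missing router usually means a label typo, a missing traefik.enable=true, or a container that never started.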

Firewalls don’t know about containers. Docker creates its own subnets. Your firewall rules need to account for traffic from those subnets.

External networks are the answer. Create the shared network once with docker network create traefik-public. Reference it as external in every compose file.

The breakthrough comes after the frustration. 463 messages. 4 days. All for a configuration that now seems obvious.


The Before and After

Before:

  • Services installed directly on VMs
  • Manual port management
  • Individual service configs
  • No centralized proxy
  • SSL certificates per service

After:

  • Containerized services
  • Traefik handles all routing
  • One compose file per service
  • Automatic SSL via Let’s Encrypt
  • Single point of ingress

463 messages. 4 days. The modern homelab infrastructure was worth every error message.