Hardening the Build Swarm: Ghost Drones and NAT Masquerade Attacks

Security isn’t about perfectly configuring a firewall once. It’s about designing systems that remain secure even when you make mistakes.

I learned this the hard way when a ghost infiltrated my build swarm. And again when a NAT bug almost rebooted my core infrastructure.


The Mystery Drone Incident

The Discovery

It was a Tuesday morning. I ran my usual status check:

build-swarm status
═══ BUILD SWARM STATUS ═══

Gateway:      ✓ 10.42.0.199
Orchestrator: ✓ 10.42.0.201 (API Online)

Active Drones:
  drone-Izar   │ 16 cores │ Idle
  drone-Tarn   │ 14 cores │ Idle
  drone-mm2    │ 24 cores │ Idle
  drone-lxc    │  4 cores │ Building: sys-libs/glibc

Wait. drone-lxc?

I checked my inventory. I own four drones. None of them are called drone-lxc.

build-swarm drone-info drone-lxc
Drone: drone-lxc
  IP: 10.42.0.184
  Cores: 4
  Registered: 2025-09-12 14:23:17
  Jobs completed: 3
  Last seen: 2s ago

A drone I didn’t recognize had been building packages for three days.

The Investigation

I SSH’d into the mystery IP:

ssh [email protected]

I was in. It was an LXC container on my Tau-Ceti-Lab server.

Then I remembered.

Three days earlier, I’d spun up a test container to verify the drone package dependencies. I installed swarm-drone, configured it, ran a quick test… and forgot about it.

The container auto-started on reboot. The drone service auto-started with the container. mDNS broadcast announced its presence. The gateway auto-discovered it. The orchestrator auto-registered it. And for three days, it had been quietly compiling packages and uploading them to my production binhost.

The packages it compiled were fine. The container was legitimate—just forgotten. But the implications terrified me.

What Could Have Gone Wrong

If that container had been compromised—or malicious:

  • It could have uploaded backdoored packages
  • Those packages would have been signed by my staging process
  • My desktop would have installed them
  • Game over

The build swarm’s “zero configuration” design was a security hole. Any machine that could reach the gateway could join the swarm. No authentication. No approval. Just announce yourself and start building.

The Fix: Gateway Authorization

I added a “pending” state for new drones:

# gateway.py

class DroneRegistry:
    def __init__(self):
        self.approved = {}
        self.pending = {}

    def register(self, drone_id, drone_info):
        if drone_id in self.approved:
            # Known drone, allow
            self.approved[drone_id].update(drone_info)
            return {"status": "approved"}
        else:
            # New drone, hold for approval
            self.pending[drone_id] = drone_info
            return {"status": "pending", "message": "Awaiting admin approval"}

    def approve(self, drone_id):
        if drone_id in self.pending:
            self.approved[drone_id] = self.pending.pop(drone_id)
            return True
        return False

Now the flow is:

  1. Drone connects and announces itself
  2. Gateway puts it in “pending” state
  3. Admin runs swarm-ctl approve <drone-id>
  4. Only then does it receive work

$ swarm-ctl pending

Pending Drones:
  drone-lxc 10.42.0.184 Registered: 2s ago

$ swarm-ctl approve drone-lxc
Drone drone-lxc approved.
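To make the flow concrete, here is the approval logic exercised end to end. This is a standalone sketch: the class is repeated from gateway.py above so the snippet runs on its own, and the drone info dict is illustrative.

```python
# Walkthrough of the approval flow (DroneRegistry repeated from
# gateway.py so this snippet is self-contained).
class DroneRegistry:
    def __init__(self):
        self.approved = {}
        self.pending = {}

    def register(self, drone_id, drone_info):
        if drone_id in self.approved:
            # Known drone, allow
            self.approved[drone_id].update(drone_info)
            return {"status": "approved"}
        # New drone, hold for approval
        self.pending[drone_id] = drone_info
        return {"status": "pending", "message": "Awaiting admin approval"}

    def approve(self, drone_id):
        if drone_id in self.pending:
            self.approved[drone_id] = self.pending.pop(drone_id)
            return True
        return False

registry = DroneRegistry()

# First contact: the unknown drone is parked in "pending"
result = registry.register("drone-lxc", {"ip": "10.42.0.184", "cores": 4})
print(result["status"])  # pending

# Admin signs off (this is what `swarm-ctl approve drone-lxc` calls)
registry.approve("drone-lxc")

# Subsequent registrations from the same ID go straight through
result = registry.register("drone-lxc", {"ip": "10.42.0.184", "cores": 4})
print(result["status"])  # approved
```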

I also added alerts. When a new drone appears, I get a notification via Uptime Kuma webhook.
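The alert hook itself is small: when a drone lands in the pending queue, the gateway fires a webhook. A minimal sketch, assuming a generic JSON webhook endpoint; the URL and payload field names are illustrative, so adapt them to whatever your Uptime Kuma (or ntfy, etc.) instance expects.

```python
import json
import urllib.request

WEBHOOK_URL = "http://uptime.example.lan/api/webhook"  # placeholder endpoint

def build_alert(drone_id, ip):
    """Build the JSON body for a new-drone alert."""
    return {
        "title": "Build swarm: new drone pending",
        "message": f"Drone {drone_id} ({ip}) is awaiting approval",
    }

def notify_new_drone(drone_id, ip, url=WEBHOOK_URL):
    """POST the alert; delivery failures are logged, never fatal to registration."""
    body = json.dumps(build_alert(drone_id, ip)).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    try:
        urllib.request.urlopen(req, timeout=5)
    except OSError as exc:
        print(f"alert delivery failed: {exc}")
```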


The NAT Masquerade Attack

Three weeks after the ghost drone incident, something worse happened.

The Symptom

The build swarm has a “restart stuck drone” feature. If a drone stops responding for 10 minutes, the orchestrator SSHs into it and runs reboot.

One morning, I woke up to find my gateway container had rebooted. Multiple times.

last reboot
reboot   system boot  6.1.67-gentoo    Mon Sep 18 03:14
reboot   system boot  6.1.67-gentoo    Mon Sep 18 02:58
reboot   system boot  6.1.67-gentoo    Mon Sep 18 02:42

Every 16 minutes. All night.

The orchestrator logs showed:

[02:42:15] Drone drone-Tarn unresponsive for 600s
[02:42:16] Initiating SSH reboot to 10.42.0.199
[02:42:17] Reboot command sent

It thought drone-Tarn was at 10.42.0.199. That’s the gateway’s IP, not the drone’s IP.

The Root Cause

drone-Tarn sits at a remote site. It connects to the swarm via Tailscale. But its registration goes through NAT.

Here’s what happened:

  1. drone-Tarn (at remote site) connects to gateway (at home)
  2. The TCP connection traverses NAT
  3. Masquerading rewrites the source address, so the gateway sees the connection coming from 10.42.0.199
  4. Gateway registers the drone with IP 10.42.0.199
  5. drone-Tarn goes offline (network blip)
  6. Orchestrator tries to reboot drone-Tarn
  7. SSH to 10.42.0.199… which is the gateway
  8. Gateway reboots
  9. Orchestrator loses connection, waits, times out
  10. Repeat

The gateway was rebooting itself because it trusted the packet’s source IP.

The Fix: Self-Reported IP

The fix was to stop trusting network headers. Instead, drones explicitly report their preferred contact IP in the registration payload:

{
  "id": "drone-Tarn",
  "cores": 14,
  "advertised_ip": "100.64.0.91"
}

The advertised_ip is the Tailscale IP—stable, reachable from anywhere in the mesh, not subject to NAT.

Gateway logic now:

import ipaddress

TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")  # Tailscale CGNAT range
GATEWAY_IP = "10.42.0.199"

def is_tailscale_subnet(ip):
    return ip is not None and ipaddress.ip_address(ip) in TAILSCALE_NET

def register_drone(request):
    payload = request.json
    source_ip = request.remote_addr  # What we see (may be NAT'd)
    advertised_ip = payload.get('advertised_ip')  # What the drone claims

    # If either end is on the Tailscale subnet, trust the advertised IP
    if is_tailscale_subnet(source_ip) or is_tailscale_subnet(advertised_ip):
        final_ip = advertised_ip
    else:
        # Legacy LAN drones fall back to the source IP
        final_ip = source_ip

    # Sanity check: never register a drone at the gateway's own IP
    if final_ip == GATEWAY_IP:
        logger.error(f"Drone {payload['id']} tried to register as gateway IP!")
        return {"error": "Invalid IP"}, 400

    drone = Drone(id=payload['id'], ip=final_ip)
    drone.save()
    return {"status": "registered", "ip": final_ip}, 200

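On the drone side, the payload is assembled before the request goes out. A sketch of how a drone might determine its own mesh address (`tailscale ip -4` is a real CLI command; the helper names and payload shape are illustrative, matching the registration example above):

```python
import json
import subprocess

def get_tailscale_ip():
    """Ask the local tailscaled for this machine's IPv4 mesh address."""
    out = subprocess.run(
        ["tailscale", "ip", "-4"], capture_output=True, text=True, check=True
    )
    return out.stdout.strip()

def build_registration(drone_id, cores, advertised_ip):
    """Assemble the registration payload the gateway expects."""
    return json.dumps({
        "id": drone_id,
        "cores": cores,
        "advertised_ip": advertised_ip,
    })

# On a real drone: build_registration("drone-Tarn", 14, get_tailscale_ip())
```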
I also added a blocklist. The gateway’s own IPs can never be registered as a drone target:

BLOCKED_IPS = [
    "10.42.0.199",   # Gateway LAN
    "100.64.0.88",   # Gateway Tailscale
    "127.0.0.1",     # Localhost
]
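The check itself is a one-liner, but it is worth normalizing addresses first so variants like `127.0.0.5` don't slip past a string comparison. A sketch using the standard library's ipaddress module (the blocklist mirrors the one above):

```python
import ipaddress

BLOCKED_IPS = {
    ipaddress.ip_address("10.42.0.199"),   # Gateway LAN
    ipaddress.ip_address("100.64.0.88"),   # Gateway Tailscale
}

def is_blocked(ip_str):
    """True if this IP may never be registered as a drone target."""
    try:
        ip = ipaddress.ip_address(ip_str)
    except ValueError:
        return True  # unparseable input is rejected outright
    # is_loopback covers all of 127.0.0.0/8 (and ::1), not just 127.0.0.1
    return ip.is_loopback or ip in BLOCKED_IPS
```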

Why Didn’t I Notice Earlier?

The bug had always existed. It just hadn’t triggered.

drone-Tarn usually connects via Tailscale with a stable IP. The NAT path is a fallback. I only hit the bug when Tailscale had a brief outage and the drone tried to reconnect via the backup path.

Edge cases get you.


Zero Trust in the Homelab

These incidents changed how I think about homelab security.

The Old Model: Trust the Network

“Everything on my LAN is trusted. Firewall keeps the bad guys out.”

This works until:

  • You spin up a test container and forget it
  • A family member’s phone gets compromised
  • A smart device phones home to China
  • You misconfigure NAT and the router becomes a target

The New Model: Trust Nothing

Every connection must authenticate, regardless of source IP.

For the build swarm:

  • Drones must be explicitly approved
  • Drones report their own IP (verified via TLS cert)
  • Orchestrator can’t target critical infrastructure IPs
  • All cross-site traffic goes through Tailscale (encrypted, authenticated)

For the gateway:

  • SSH: key-only, AllowUsers commander
  • API: authenticated, rate-limited
  • Firewall: deny by default, allow specific ports
  • Tailscale ACLs: only accept traffic from known tags

{
  "acls": [
    // Build swarm components can talk to each other
    { "action": "accept", "src": ["tag:swarm"], "dst": ["tag:swarm:*"] },

    // My devices can access everything
    { "action": "accept", "src": ["tag:admin"], "dst": ["*:*"] }

    // Anything not matched above is dropped: Tailscale ACLs are
    // default-deny, so no explicit deny rule exists (or is needed)
  ]
}

Defense in Depth

No single control should be the only thing between an attacker and your infrastructure.

Layer 1: Network

  • Tailscale ACLs restrict who can connect
  • Firewall drops unexpected traffic
  • VLANs separate trusted/untrusted devices

Layer 2: Authentication

  • SSH keys, not passwords
  • Drone approval before receiving work
  • API tokens with limited scope

Layer 3: Authorization

  • Orchestrator can’t target gateway IPs
  • Drones can only upload to staging, not production
  • Package signing requires separate key
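The staging-to-production promotion can enforce that last rule mechanically: nothing moves unless its signature verifies against the key the drones never see. A sketch using a keyed HMAC as a stand-in for the real signing scheme (the key handling and function names here are illustrative, not my actual setup):

```python
import hashlib
import hmac

# Illustrative: this key lives only on the signing host, never on
# the drones. (Stand-in for a real GPG/minisign-style setup.)
SIGNING_KEY = b"replace-with-a-real-secret"

def sign_package(data: bytes) -> str:
    """Produce a detached signature a drone cannot forge without the key."""
    return hmac.new(SIGNING_KEY, data, hashlib.sha256).hexdigest()

def verify_package(data: bytes, signature: str) -> bool:
    """Constant-time check before promoting staging -> production."""
    return hmac.compare_digest(sign_package(data), signature)
```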

Layer 4: Monitoring

  • Alerts on new drone registration
  • Alerts on unexpected reboots
  • Build logs for audit trail

The Current Gateway Hardening

Here’s what the gateway (Altair-Link) looks like now:

SSH Configuration

# /etc/ssh/sshd_config

PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers commander
MaxAuthTries 3
LoginGraceTime 30

# Additional restrictions
AllowAgentForwarding no
AllowTcpForwarding no
X11Forwarding no

Firewall (UFW)

ufw default deny incoming
ufw default allow outgoing

# SSH (from Tailscale only)
ufw allow from 100.64.0.0/10 to any port 22

# Swarm API (from Tailscale only)
ufw allow from 100.64.0.0/10 to any port 8090

# Internal dashboard
ufw allow from 10.42.0.0/24 to any port 3001

ufw enable

Tailscale ACLs

{
  "tagOwners": {
    "tag:gateway": ["[email protected]"],
    "tag:drone": ["[email protected]"],
    "tag:admin": ["[email protected]"]
  },
  "acls": [
    // Admin access to everything
    { "action": "accept", "src": ["tag:admin"], "dst": ["*:*"] },

    // Drones can reach gateway API
    { "action": "accept", "src": ["tag:drone"], "dst": ["tag:gateway:8090"] },

    // Gateway can SSH to drones (for reboot)
    { "action": "accept", "src": ["tag:gateway"], "dst": ["tag:drone:22"] }

    // Anything not matched above is dropped: Tailscale ACLs are default-deny
  ]
}

Automatic Blocks

# In gateway code

class SecurityException(Exception):
    """Raised when a remote action is blocked by policy."""

BLOCKED_ACTIONS = {
    "10.42.0.199": ["reboot", "shutdown", "kill"],   # Gateway LAN
    "10.42.0.1":   ["reboot", "shutdown", "kill"],   # Router
    "100.64.0.88": ["reboot", "shutdown", "kill"],   # Gateway Tailscale
}

def execute_remote_action(target_ip, action):
    if action in BLOCKED_ACTIONS.get(target_ip, []):
        logger.critical(f"Blocked dangerous action {action} on {target_ip}")
        raise SecurityException("Action blocked by policy")
    # ... proceed with action

Lessons Learned

1. Zero Configuration = Zero Security

Automatic discovery is convenient. It’s also a security hole. Every auto-join system needs manual approval or cryptographic verification.

2. Trust Network Headers at Your Peril

Source IPs can be NAT’d, spoofed, or misconfigured. If you’re making security decisions based on IP, you’re vulnerable.

3. Test Your Edge Cases

Both bugs only appeared under unusual conditions. The ghost drone: container auto-restart. The NAT bug: Tailscale failover. Your happy path works. Test what happens when things go wrong.

4. Defense in Depth

Every security control should assume the previous one failed. If an attacker bypasses the firewall, authentication should stop them. If they bypass authentication, authorization should limit damage. If they bypass authorization, monitoring should catch them.

5. Treat Internal Networks as Hostile

The perimeter model is dead. Zero trust means every connection is suspect, even from “inside” the network.


Current Status

Ghost drone risk: Mitigated. All new drones require manual approval. I get alerts on pending registrations.

NAT masquerade risk: Eliminated. Drones self-report Tailscale IPs. Gateway can’t target its own infrastructure.

General posture: Defense in depth. Multiple layers. Assume each layer will fail eventually.

The swarm is more complex now. But it won’t reboot my gateway at 3 AM anymore.


Related: The Build Swarm Architecture, Mastering Tailscale ACLs.