Hardening the Build Swarm: Ghost Drones and NAT Masquerade Attacks
Security isn’t about perfectly configuring a firewall once. It’s about designing systems that remain secure even when you make mistakes.
I learned this the hard way when a ghost infiltrated my build swarm. And again when a NAT bug almost rebooted my core infrastructure.
The Mystery Drone Incident
The Discovery
It was a Tuesday morning. I ran my usual status check:
build-swarm status
═══ BUILD SWARM STATUS ═══
Gateway: ✓ 10.42.0.199
Orchestrator: ✓ 10.42.0.201 (API Online)
Active Drones:
drone-Izar │ 16 cores │ Idle
drone-Tarn │ 14 cores │ Idle
drone-mm2 │ 24 cores │ Idle
drone-lxc │ 4 cores │ Building: sys-libs/glibc
Wait. drone-lxc?
I checked my inventory. I own four drones. None of them are called drone-lxc.
build-swarm drone-info drone-lxc
Drone: drone-lxc
IP: 10.42.0.184
Cores: 4
Registered: 2025-09-12 14:23:17
Jobs completed: 3
Last seen: 2s ago
A drone I didn’t recognize had been building packages for three days.
The Investigation
I SSH’d into the mystery IP:
ssh <user>@10.42.0.184
I was in. It was an LXC container on my Tau-Ceti-Lab server.
Then I remembered.
Three days earlier, I’d spun up a test container to verify the drone package dependencies. I installed swarm-drone, configured it, ran a quick test… and forgot about it.
The container auto-started on reboot. The drone service auto-started with the container. mDNS broadcast announced its presence. The gateway auto-discovered it. The orchestrator auto-registered it. And for three days, it had been quietly compiling packages and uploading them to my production binhost.
The packages it compiled were fine. The container was legitimate—just forgotten. But the implications terrified me.
What Could Have Gone Wrong
If that container had been compromised—or malicious:
- It could have uploaded backdoored packages
- Those packages would have been signed by my staging process
- My desktop would have installed them
- Game over
The build swarm’s “zero configuration” design was a security hole. Any machine that could reach the gateway could join the swarm. No authentication. No approval. Just announce yourself and start building.
The Fix: Gateway Authorization
I added a “pending” state for new drones:
# gateway.py
class DroneRegistry:
    def __init__(self):
        self.approved = {}
        self.pending = {}

    def register(self, drone_id, drone_info):
        if drone_id in self.approved:
            # Known drone, allow
            self.approved[drone_id].update(drone_info)
            return {"status": "approved"}
        else:
            # New drone, hold for approval
            self.pending[drone_id] = drone_info
            return {"status": "pending", "message": "Awaiting admin approval"}

    def approve(self, drone_id):
        if drone_id in self.pending:
            self.approved[drone_id] = self.pending.pop(drone_id)
            return True
        return False
Now the flow is:
- Drone connects and announces itself
- Gateway puts it in “pending” state
- Admin runs swarm-ctl approve <drone-id>
- Only then does it receive work
$ swarm-ctl pending
Pending Drones:
drone-lxc │ 10.42.0.184 │ Registered: 2s ago
$ swarm-ctl approve drone-lxc
Drone drone-lxc approved.
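The property that matters is the last step: a pending drone must never be handed a job. Here is a minimal sketch of that gate, assuming a dispatcher that reads the same approved/pending dictionaries as the registry. The function name pick_drone and the busy flag are hypothetical, not taken from the actual swarm code.

```python
# Hypothetical sketch: only approved, idle drones are eligible for work.
from typing import Optional


def pick_drone(approved: dict) -> Optional[str]:
    """Return the ID of an idle approved drone, or None if there is none.

    The pending set is deliberately not a parameter: a drone that has not
    been approved is invisible to the dispatcher.
    """
    for drone_id, info in sorted(approved.items()):
        if not info.get("busy", False):
            return drone_id
    return None


approved = {
    "drone-Izar": {"cores": 16, "busy": True},
    "drone-Tarn": {"cores": 14, "busy": False},
}
pending = {"drone-lxc": {"cores": 4, "busy": False}}  # idle, but never consulted

chosen = pick_drone(approved)
```

Keeping the pending set out of the dispatcher's inputs means a forgotten test container can announce itself as often as it likes without ever being scheduled.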
I also added alerts. When a new drone appears, I get a notification via Uptime Kuma webhook.
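A rough sketch of what such a notification could look like, assuming a generic HTTP endpoint that accepts a JSON POST. The URL, the payload fields, and the function names are placeholders for illustration; they are not Uptime Kuma's actual API and not the swarm's real alerting code.

```python
import json
import urllib.request

# Placeholder endpoint: substitute whatever URL your notifier exposes.
ALERT_WEBHOOK_URL = "http://alerts.example.invalid/hook"


def build_alert(drone_id: str, ip: str,
                url: str = ALERT_WEBHOOK_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST announcing a pending drone."""
    body = json.dumps({
        "event": "drone_pending",  # hypothetical field names
        "drone_id": drone_id,
        "ip": ip,
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_alert(request: urllib.request.Request, timeout: float = 5.0) -> bool:
    """Send the alert and report success; network errors return False."""
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:
        # URLError and socket timeouts are OSError subclasses.
        return False


alert = build_alert("drone-lxc", "10.42.0.184")
```

Swallowing the network error in send_alert is a deliberate choice in this sketch: a broken notifier should not stop the gateway from parking the drone in the pending state.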
The NAT Masquerade Attack
Three weeks after the ghost drone incident, something worse happened.
The Symptom
The build swarm has a “restart stuck drone” feature. If a drone stops responding for 10 minutes, the orchestrator SSHs into it and runs reboot.
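The detection half of that feature can be sketched in a few lines, assuming the orchestrator keeps a last-seen timestamp per drone. The function name and data layout here are assumptions for illustration, not the orchestrator's real code.

```python
STUCK_AFTER_SECONDS = 600  # 10 minutes, per the description above


def find_stuck_drones(last_seen: dict, now: float,
                      threshold: float = STUCK_AFTER_SECONDS) -> list:
    """Return IDs of drones whose last heartbeat is at least `threshold` seconds old."""
    return sorted(
        drone_id
        for drone_id, seen_at in last_seen.items()
        if now - seen_at >= threshold
    )


# Fixed clock value so the example is deterministic; a real loop would
# read something like time.monotonic() instead.
now = 10_000.0
last_seen = {
    "drone-Izar": now - 2,
    "drone-Tarn": now - 600,
    "drone-mm2": now - 45,
}
stuck = find_stuck_drones(last_seen, now)
```

Note that the check produces a drone ID, not an address. The reboot itself goes to whatever IP the registry holds for that ID.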
One morning, I woke up to find my gateway container had rebooted. Multiple times.
last reboot
reboot system boot 6.1.67-gentoo Mon Sep 18 03:14
reboot system boot 6.1.67-gentoo Mon Sep 18 02:58
reboot system boot 6.1.67-gentoo Mon Sep 18 02:42
Every 16 minutes. All night.
The orchestrator logs showed:
[02:42:15] Drone drone-Tarn unresponsive for 600s
[02:42:16] Initiating SSH reboot to 10.42.0.199
[02:42:17] Reboot command sent
It thought drone-Tarn was at 10.42.0.199. That’s the gateway’s IP, not the drone’s IP.
The Root Cause
drone-Tarn sits at a remote site. It normally connects to the swarm via Tailscale, but that night its registration went through a NAT'd fallback path.
Here’s what happened:
- drone-Tarn (at remote site) connects to gateway (at home)
- The TCP connection traverses NAT
- Gateway sees the connection coming from the NAT router’s IP: 10.42.0.199
- Gateway registers the drone with IP 10.42.0.199
- drone-Tarn goes offline (network blip)
- Orchestrator tries to reboot drone-Tarn
- SSH to 10.42.0.199… which is the gateway
- Gateway reboots
- Orchestrator loses connection, waits, times out
- Repeat
The gateway was being rebooted because it had recorded the packet’s source IP as the drone’s address.
The Fix: Self-Reported IP
The fix was to stop trusting network headers. Instead, drones explicitly report their preferred contact IP in the registration payload:
{
"id": "drone-Tarn",
"cores": 14,
"advertised_ip": "100.64.0.91"
}
The advertised_ip is the Tailscale IP—stable, reachable from anywhere in the mesh, not subject to NAT.
Gateway logic now:
def register_drone(request):
    payload = request.json
    source_ip = request.remote_addr  # What we see (may be NAT'd)
    advertised_ip = payload.get('advertised_ip')  # What the drone claims

    # Use the advertised IP when either address is on the Tailscale subnet.
    # A NAT'd connection shows a LAN source IP, so the advertised IP is checked too.
    if advertised_ip and (is_tailscale_subnet(source_ip) or is_tailscale_subnet(advertised_ip)):
        final_ip = advertised_ip
    else:
        # Legacy LAN drones (or drones that advertise nothing) use source IP
        final_ip = source_ip

    # Sanity check: don't register as gateway IP
    if final_ip == GATEWAY_IP:
        logger.error(f"Drone {payload['id']} tried to register as gateway IP {final_ip}")
        return {"error": "Invalid IP"}, 400

    drone = Drone(id=payload['id'], ip=final_ip)
    drone.save()
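The snippet calls is_tailscale_subnet without showing it. One plausible definition, using the standard ipaddress module and the 100.64.0.0/10 range that Tailscale addresses come from, is sketched below. The helper choose_contact_ip is a hypothetical name that isolates the selection rule so it can be tested without a web framework.

```python
import ipaddress
from typing import Optional

# Tailscale hands out addresses from the CGNAT range 100.64.0.0/10.
TAILSCALE_NET = ipaddress.ip_network("100.64.0.0/10")


def is_tailscale_subnet(ip: Optional[str]) -> bool:
    """True if `ip` parses as an address inside the Tailscale range."""
    if not ip:
        return False
    try:
        return ipaddress.ip_address(ip) in TAILSCALE_NET
    except ValueError:
        # Not a valid IP literal at all.
        return False


def choose_contact_ip(source_ip: str, advertised_ip: Optional[str]) -> str:
    """Mirror the registration rule: prefer the advertised IP when either
    address is on Tailscale, otherwise fall back to the observed source IP."""
    if advertised_ip and (
        is_tailscale_subnet(source_ip) or is_tailscale_subnet(advertised_ip)
    ):
        return advertised_ip
    return source_ip


# The NAT case from the incident: the source looks like the gateway,
# but the drone advertises its Tailscale address.
contact = choose_contact_ip("10.42.0.199", "100.64.0.91")
```

Parsing the address rather than matching on a "100." string prefix matters here, because 100.0.0.0 through 100.63.255.255 are ordinary public addresses outside the Tailscale range.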
I also added a blocklist. The gateway’s own IPs can never be registered as a drone target:
BLOCKED_IPS = [
    "10.42.0.199",  # Gateway LAN
    "100.64.0.88",  # Gateway Tailscale
    "127.0.0.1",    # Localhost
]
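A plain string comparison against that list misses equivalent spellings such as 127.0.0.2 or ::1. Below is a sketch of a stricter check, assuming the same three entries. The name is_blocked_target and the fail-closed behaviour on unparseable input are my assumptions, not the gateway's actual code.

```python
import ipaddress

# Same entries as BLOCKED_IPS above, parsed once into address objects.
BLOCKED_ADDRESSES = frozenset(
    ipaddress.ip_address(ip)
    for ip in ("10.42.0.199", "100.64.0.88", "127.0.0.1")
)


def is_blocked_target(ip: str) -> bool:
    """True if `ip` is a protected address, any loopback address, or unparseable."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return True  # fail closed on anything that is not a valid IP
    return address.is_loopback or address in BLOCKED_ADDRESSES
```

Treating the whole loopback range as blocked costs nothing, since no drone should ever register with a loopback address.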
Why Didn’t I Notice Earlier?
The bug had always existed. It just hadn’t triggered.
drone-Tarn usually connects via Tailscale with a stable IP. The NAT path is a fallback. I only hit the bug when Tailscale had a brief outage and the drone tried to reconnect via the backup path.
Edge cases get you.
Zero Trust in the Homelab
These incidents changed how I think about homelab security.
The Old Model: Trust the Network
“Everything on my LAN is trusted. Firewall keeps the bad guys out.”
This works until:
- You spin up a test container and forget it
- A family member’s phone gets compromised
- A smart device phones home to China
- You misconfigure NAT and the router becomes a target
The New Model: Trust Nothing
Every connection must authenticate, regardless of source IP.
For the build swarm:
- Drones must be explicitly approved
- Drones report their own IP (verified via TLS cert)
- Orchestrator can’t target critical infrastructure IPs
- All cross-site traffic goes through Tailscale (encrypted, authenticated)
For the gateway:
- SSH: key-only, AllowUsers commander
- API: authenticated, rate-limited
- Firewall: deny by default, allow specific ports
- Tailscale ACLs: only accept traffic from known tags
{
  "acls": [
    // Build swarm components can talk to each other
    { "action": "accept", "src": ["tag:swarm"], "dst": ["tag:swarm:*"] },
    // My devices can access everything
    { "action": "accept", "src": ["tag:admin"], "dst": ["*:*"] }
    // Default deny is implicit: Tailscale ACLs only support "accept",
    // and anything no rule accepts is dropped
  ]
}
Defense in Depth
No single control should be the only thing between an attacker and your infrastructure.
Layer 1: Network
- Tailscale ACLs restrict who can connect
- Firewall drops unexpected traffic
- VLANs separate trusted/untrusted devices
Layer 2: Authentication
- SSH keys, not passwords
- Drone approval before receiving work
- API tokens with limited scope
Layer 3: Authorization
- Orchestrator can’t target gateway IPs
- Drones can only upload to staging, not production
- Package signing requires separate key
Layer 4: Monitoring
- Alerts on new drone registration
- Alerts on unexpected reboots
- Build logs for audit trail
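For the unexpected-reboot alert, the check can be as simple as counting boots inside a sliding window. The sketch below assumes boot timestamps have already been collected (for example from last reboot). The function name and the thresholds are illustrative choices, not the values I actually run.

```python
from datetime import datetime, timedelta


def is_reboot_loop(boot_times: list, max_boots: int = 2,
                   window: timedelta = timedelta(hours=1)) -> bool:
    """True if more than `max_boots` boots fall inside any `window`-long span."""
    ordered = sorted(boot_times)
    for i in range(len(ordered) - max_boots):
        # ordered[i] .. ordered[i + max_boots] spans max_boots + 1 boots.
        if ordered[i + max_boots] - ordered[i] <= window:
            return True
    return False


# Times taken from the reboot log shown earlier; the date itself is arbitrary.
boots = [
    datetime(2000, 1, 1, 2, 42),
    datetime(2000, 1, 1, 2, 58),
    datetime(2000, 1, 1, 3, 14),
]
looping = is_reboot_loop(boots)
```

With these thresholds, the three boots at 16-minute intervals span 32 minutes and trip the alert on the third reboot rather than after a whole night of them.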
The Current Gateway Hardening
Here’s what the gateway (Altair-Link) looks like now:
SSH Configuration
# /etc/ssh/sshd_config
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
AllowUsers commander
MaxAuthTries 3
LoginGraceTime 30
# Additional restrictions
AllowAgentForwarding no
AllowTcpForwarding no
X11Forwarding no
Firewall (UFW)
ufw default deny incoming
ufw default allow outgoing
# SSH (from Tailscale only)
ufw allow from 100.64.0.0/10 to any port 22
# Swarm API (from Tailscale only)
ufw allow from 100.64.0.0/10 to any port 8090
# Internal dashboard
ufw allow from 10.42.0.0/24 to any port 3001
ufw enable
Tailscale ACLs
{
  "tagOwners": {
    "tag:gateway": ["[email protected]"],
    "tag:drone": ["[email protected]"],
    "tag:admin": ["[email protected]"]
  },
  "acls": [
    // Admin access to everything
    { "action": "accept", "src": ["tag:admin"], "dst": ["*:*"] },
    // Drones can reach gateway API
    { "action": "accept", "src": ["tag:drone"], "dst": ["tag:gateway:8090"] },
    // Gateway can SSH to drones (for reboot)
    { "action": "accept", "src": ["tag:gateway"], "dst": ["tag:drone:22"] }
    // Default deny is implicit: Tailscale ACLs only support "accept",
    // and anything no rule accepts is dropped
  ]
}
Automatic Blocks
# In gateway code
BLOCKED_ACTIONS = {
    "10.42.0.199": ["reboot", "shutdown", "kill"],
    "10.42.0.1": ["reboot", "shutdown", "kill"],  # Router
    "100.64.0.88": ["reboot", "shutdown", "kill"],
}

def execute_remote_action(target_ip, action):
    if action in BLOCKED_ACTIONS.get(target_ip, []):
        logger.critical(f"Blocked dangerous action {action} on {target_ip}")
        raise SecurityException("Action blocked by policy")
    # ... proceed with action
Lessons Learned
1. Zero Configuration = Zero Security
Automatic discovery is convenient. It’s also a security hole. Every auto-join system needs manual approval or cryptographic verification.
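The swarm uses manual approval. For the cryptographic alternative, one common pattern is a pre-shared key per drone and an HMAC over the registration payload. The sketch below uses only the standard library. The function names, the key, and the canonical-JSON encoding are assumptions for illustration, and it has no replay protection (that would need a timestamp or nonce in the payload).

```python
import hashlib
import hmac
import json


def sign_registration(payload: dict, key: bytes) -> str:
    """Hex HMAC-SHA256 over a canonical JSON encoding of the payload."""
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def verify_registration(payload: dict, signature: str, key: bytes) -> bool:
    """Recompute the signature and compare it in constant time."""
    expected = sign_registration(payload, key)
    return hmac.compare_digest(expected, signature)


key = b"example-key-do-not-reuse"  # placeholder; a real key is random and secret
payload = {"id": "drone-Tarn", "cores": 14, "advertised_ip": "100.64.0.91"}
signature = sign_registration(payload, key)

accepted = verify_registration(payload, signature, key)

# Changing any signed field, such as the advertised IP, invalidates the signature.
tampered = dict(payload, advertised_ip="10.42.0.199")
rejected = not verify_registration(tampered, signature, key)
```

Because the advertised IP is part of the signed message, a machine without the key can neither join nor redirect an existing drone's address.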
2. Trust Network Headers at Your Peril
Source IPs can be NAT’d, spoofed, or misconfigured. If you’re making security decisions based on IP, you’re vulnerable.
3. Test Your Edge Cases
Both bugs only appeared under unusual conditions. The ghost drone: container auto-restart. The NAT bug: Tailscale failover. Your happy path works. Test what happens when things go wrong.
4. Defense in Depth
Every security control should assume the previous one failed. If an attacker bypasses the firewall, authentication should stop them. If they bypass authentication, authorization should limit damage. If they bypass authorization, monitoring should catch them.
5. Treat Internal Networks as Hostile
The perimeter model is dead. Zero trust means every connection is suspect, even from “inside” the network.
Current Status
Ghost drone risk: Mitigated. All new drones require manual approval. I get alerts on pending registrations.
NAT masquerade risk: Eliminated. Drones self-report Tailscale IPs. Gateway can’t target its own infrastructure.
General posture: Defense in depth. Multiple layers. Assume each layer will fail eventually.
The swarm is more complex now. But it won’t reboot my gateway at 3 AM anymore.
Related: The Build Swarm Architecture, Mastering Tailscale ACLs.