The Orchestrator That Rebooted My Workstation
Date: 2026-01-30
Issue: Build swarm orchestrator rebooted my development workstation
Root Cause: LXC container reported host IP instead of container IP
Lesson: Auto-healing is great until it heals the wrong machine
The Setup
I’ve been adding a new drone to the build swarm. Simple task: spin up an LXC container, install the drone service, watch it join the swarm. Done it a dozen times.
The new drone is called sweeper-Carme — a cleanup container that would handle maintenance tasks. The irony is not lost on me.
The Incident
I’m working on something else entirely when my screen goes black. The KDE logout animation. Full shutdown sequence.
My first thought: Power issue? No, the lights are on.
Second thought: Kernel panic? No, this was a clean shutdown.
Third thought: Did something just reboot my machine?
I check the system journal after it comes back up:
```
Jan 30 20:30:51 Galileo-Outpost sshd[12847]: Accepted publickey for root from 100.64.0.18
Jan 30 20:30:53 Galileo-Outpost systemd[1]: Stopping all remaining units...
```
Someone SSH’d in as root from 100.64.0.18 and rebooted my machine.
That IP is Icarus-Orchestrator. The build swarm orchestrator. My own infrastructure just rebooted my development workstation.
The Investigation
Here’s what happened:
- I set up `sweeper-Carme` as an LXC container on my workstation
- The LXC bridge (`lxcbr0`) wasn’t properly configured on the host
- The drone service inside the container couldn’t get proper networking
- When the drone registered with the gateway, it reported the host’s IP instead of its own
- The drone went offline (because its networking was broken)
- The orchestrator’s auto-heal logic kicked in
- It SSH’d to the reported IP, my workstation, and ran `reboot`
The orchestrator did exactly what it was supposed to do. It saw a drone go offline, SSH’d to the IP it had on file, and rebooted it to recover. The problem was the IP was wrong.
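I haven’t confirmed which exact code path the drone used, but the classic way this happens is best-effort IP self-detection: open a UDP socket toward a known peer and read the local address back, with a hostname lookup as the fallback. Here’s a minimal sketch of that pattern; `detect_ip` and the gateway address are illustrative, not the swarm’s real code:

```python
import socket

def detect_ip(gateway_addr: str = "10.42.0.1") -> str:
    """Best-effort self-IP detection (hypothetical sketch)."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            # connect() on a UDP socket only picks a route; no packets are sent
            s.connect((gateway_addr, 53))
            return s.getsockname()[0]
        finally:
            s.close()
    except OSError:
        # Fallback: resolve our own hostname. With a broken bridge and a
        # shared or stale hosts entry, this is where a container can end
        # up reporting an address that isn't its own.
        return socket.gethostbyname(socket.gethostname())
```

Whatever the actual mechanism, the result was the same: the registration carried an IP the drone did not own.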
The Fix
Protected Hosts
First priority: make sure this can never happen again.
I added a protected hosts safeguard to the orchestrator:
"protected_hosts": {
"description": "IPs that should NEVER receive reboot commands",
"ips": ["10.42.0.100", "100.64.0.100", "10.42.0.199", "10.42.0.201"],
"note": "Add workstation, gateway, and critical infrastructure IPs"
}
Now before the orchestrator reboots anything, it checks the protected list:
```python
def _reboot_drone(self, ip):
    if ip in self.protected_hosts:
        self.log.error(f"BLOCKED: Refusing to reboot protected host {ip}!")
        return False
    # ... proceed with reboot
```
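The list gets loaded once at startup, roughly like this, assuming the `swarm.json` layout above (`load_protected_hosts` is a sketch, not the orchestrator’s actual config plumbing):

```python
import json
from pathlib import Path

def load_protected_hosts(config_path: str = "config/swarm.json") -> set[str]:
    """Read the protected_hosts IPs from swarm.json into a set.

    Sketch only. Deliberately no try/except: if the config is missing
    or malformed, I want the orchestrator to crash at startup rather
    than run without the safeguard.
    """
    config = json.loads(Path(config_path).read_text())
    return set(config["protected_hosts"]["ips"])
```

The result lands in `self.protected_hosts` in the orchestrator’s constructor, so every reboot path hits the same set.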
Component Versioning
While I was in there, I also added a component versioning system. The main VERSION file was getting bumped constantly for minor changes, which made it hard to track what actually changed.
Now there’s a two-tier system:
| Component | Version | Purpose |
|---|---|---|
| Main VERSION | 0.4.1 | CLI changes, major features |
| orchestrator | 2.1.0 | Orchestrator-specific changes |
| drone | 2.0.0 | Drone-specific changes |
| gateway | 1.3.0 | Gateway-specific changes |
```
build-swarm bump orchestrator   # Quick iteration on orchestrator
build-swarm bump patch          # For CLI/major changes
```
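The bump logic itself is small. Here’s a sketch of what the component path does, assuming `config/versions.json` maps component names to semver strings (the real `scripts/swarm-coordinator.py` may structure it differently):

```python
import json
from pathlib import Path

def bump_component(name: str, part: str = "minor",
                   path: str = "config/versions.json") -> str:
    """Bump one component's semver in versions.json and return it (sketch)."""
    versions_file = Path(path)
    versions = json.loads(versions_file.read_text())
    major, minor, patch = (int(x) for x in versions[name].split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    versions[name] = f"{major}.{minor}.{patch}"
    versions_file.write_text(json.dumps(versions, indent=2) + "\n")
    return versions[name]
```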
What I Learned
- Container networking must be verified before starting drone services. If the container can’t get a proper IP, it shouldn’t be allowed to register.
- Critical infrastructure needs protection from automated actions. Workstations, gateways, and orchestrators should never be reboot targets.
- Auto-heal features need safeguards. Just because you can SSH and reboot doesn’t mean you should.
- The IP a service reports might not be the IP it’s running on. Especially with containers, VMs, and nested virtualization. The gateway can cheaply cross-check this, as sketched after this list.
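A minimal version of that cross-check, on the gateway side: compare the IP a drone claims against the source address of the registration request, and refuse anything that collides with protected infrastructure. `validate_registration` is a hypothetical sketch, not the gateway’s current code:

```python
def validate_registration(reported_ip: str, source_ip: str,
                          protected_hosts: set[str]) -> bool:
    """Reject drone registrations with suspicious IPs (hypothetical sketch)."""
    if reported_ip != source_ip:
        # The drone doesn't know its own address; don't trust it.
        return False
    if reported_ip in protected_hosts:
        # No drone should ever claim an infrastructure IP.
        return False
    return True
```

With NAT in the path the source-address comparison needs more care, but on a flat swarm network it’s a cheap invariant that would have caught this incident at registration time.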
The Silver Lining
My daughter walked past right after my screen went black. “Did your computer crash again?”
“No,” I said, staring at the login screen. “My own infrastructure attacked me.”
She rolled her eyes and went back to Roblox. Fair enough.
At least I caught it immediately. Imagine if this had happened at 3 AM and the orchestrator kept trying to “heal” my workstation every time it came back up. Infinite reboot loop, powered by my own automation.
The protected hosts fix is deployed. sweeper-Carme is on hold until I properly configure lxcbr0. And I have a new rule: never test container networking while sitting at the machine the container is running on.
Files Changed
| File | Change |
|---|---|
config/swarm.json | Added protected_hosts section |
config/versions.json | NEW - Component versioning |
bin/swarm-orchestrator | Protected hosts check before reboot |
lib/swarm/deploy.py | Deploy protected_hosts.conf to orchestrators |
scripts/swarm-coordinator.py | bump command for component versions |
Next Steps
- Configure lxcbr0 properly on the workstation
- Add hostname verification to auto-reboot logic (see the sketch after this list)
- Consider requiring confirmation for any reboot command
- Maybe don’t run experimental containers on my primary dev machine
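For the hostname-verification item, the idea is to ask the machine who it is before pulling the trigger. A sketch using plain `ssh` via `subprocess` (the helper name is illustrative):

```python
import subprocess

def verify_drone_identity(ip: str, expected_hostname: str) -> bool:
    """SSH in and compare `hostname` output before issuing a reboot (sketch)."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", f"root@{ip}", "hostname"],
        capture_output=True, text=True, timeout=15,
    )
    if result.returncode != 0:
        return False  # can't verify the target, so don't reboot it
    return result.stdout.strip() == expected_hostname
```

Had this been in place, the orchestrator would have seen `Galileo-Outpost` instead of `sweeper-Carme` and aborted.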
Incident documented for posterity. And as a warning to my future self.