The Orchestrator That Rebooted My Workstation
Date: 2026-01-30
Issue: Build swarm orchestrator rebooted my development workstation
Root Cause: LXC container reported host IP instead of container IP
Lesson: Auto-healing is great until it heals the wrong machine
The Setup
I’ve been adding a new drone to the build swarm. Simple task: spin up an LXC container, install the drone service, watch it join the swarm. Done it a dozen times.
The new drone is called sweeper-Carme — a cleanup container that would handle maintenance tasks. The irony is not lost on me.
The Incident
I’m working on something else entirely when my screen goes black. The KDE logout animation. Full shutdown sequence.
My first thought: Power issue? No, the lights are on.
Second thought: Kernel panic? No, this was a clean shutdown.
Third thought: Did something just reboot my machine?
I check the system journal after it comes back up:
```
Jan 30 20:30:51 Galileo-Outpost sshd[12847]: Accepted publickey for root from 100.64.0.18
Jan 30 20:30:53 Galileo-Outpost systemd[1]: Stopping all remaining units...
```
Someone SSH’d in as root from 100.64.0.18 and rebooted my machine.
That IP is Icarus-Orchestrator. The build swarm orchestrator. My own infrastructure just rebooted my development workstation.
The Investigation
Here’s what happened:
- I set up `sweeper-Carme` as an LXC container on my workstation
- The LXC bridge (`lxcbr0`) wasn’t properly configured on the host
- The drone service inside the container couldn’t get proper networking
- When the drone registered with the gateway, it reported the host’s IP instead of its own
- The drone went offline (because its networking was broken)
- The orchestrator’s auto-heal logic kicked in
- It SSH’d to the reported IP, my workstation, and ran `reboot`
The orchestrator did exactly what it was supposed to do. It saw a drone go offline, SSH’d to the IP it had on file, and rebooted it to recover. The problem was the IP was wrong.
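I haven’t confirmed which exact code path the drone used, but the classic way this happens is best-effort IP self-detection: open a UDP socket toward a known peer and read the local address back, with a hostname lookup as the fallback. Here’s a minimal sketch of that pattern; `detect_ip` and the gateway address are illustrative, not the swarm’s real code:

```python
import socket

def detect_ip(gateway_addr: str = "10.42.0.1") -> str:
    """Best-effort self-IP detection (hypothetical sketch)."""
    try:
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            # connect() on a UDP socket only picks a route; no packets are sent
            s.connect((gateway_addr, 53))
            return s.getsockname()[0]
        finally:
            s.close()
    except OSError:
        # Fallback: resolve our own hostname. With a broken bridge and a
        # shared or stale hosts entry, this is where a container can end
        # up reporting an address that isn't its own.
        return socket.gethostbyname(socket.gethostname())
```

Whatever the actual mechanism, the result was the same: the registration carried an IP the drone did not own.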
The Fix
Protected Hosts
First priority: make sure this can never happen again.
I added a protected hosts safeguard to the orchestrator:
"protected_hosts": {
"description": "IPs that should NEVER receive reboot commands",
"ips": ["10.42.0.100", "100.64.0.100", "10.42.0.199", "10.42.0.201"],
"note": "Add workstation, gateway, and critical infrastructure IPs"
}
Now before the orchestrator reboots anything, it checks the protected list:
```python
def _reboot_drone(self, ip):
    if ip in self.protected_hosts:
        self.log.error(f"BLOCKED: Refusing to reboot protected host {ip}!")
        return False
    # ... proceed with reboot
```
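The list gets loaded once at startup, roughly like this, assuming the `swarm.json` layout above (`load_protected_hosts` is a sketch, not the orchestrator’s actual config plumbing):

```python
import json
from pathlib import Path

def load_protected_hosts(config_path: str = "config/swarm.json") -> set[str]:
    """Read the protected_hosts IPs from swarm.json into a set.

    Sketch only. Deliberately no try/except: if the config is missing
    or malformed, I want the orchestrator to crash at startup rather
    than run without the safeguard.
    """
    config = json.loads(Path(config_path).read_text())
    return set(config["protected_hosts"]["ips"])
```

The result lands in `self.protected_hosts` in the orchestrator’s constructor, so every reboot path hits the same set.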
Component Versioning
While I was in there, I also added a component versioning system. The main VERSION file was getting bumped constantly for minor changes, which made it hard to track what actually changed.
Now there’s a two-tier system:
| Component | Version | Purpose |
|---|---|---|
| Main VERSION | 0.4.1 | CLI changes, major features |
| orchestrator | 2.1.0 | Orchestrator-specific changes |
| drone | 2.0.0 | Drone-specific changes |
| gateway | 1.3.0 | Gateway-specific changes |
```
build-swarm bump orchestrator   # Quick iteration on orchestrator
build-swarm bump patch          # For CLI/major changes
```
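The bump logic itself is small. Here’s a sketch of what the component path does, assuming `config/versions.json` maps component names to semver strings (the real `scripts/swarm-coordinator.py` may structure it differently):

```python
import json
from pathlib import Path

def bump_component(name: str, part: str = "minor",
                   path: str = "config/versions.json") -> str:
    """Bump one component's semver in versions.json and return it (sketch)."""
    versions_file = Path(path)
    versions = json.loads(versions_file.read_text())
    major, minor, patch = (int(x) for x in versions[name].split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    versions[name] = f"{major}.{minor}.{patch}"
    versions_file.write_text(json.dumps(versions, indent=2) + "\n")
    return versions[name]
```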
What I Learned
- Container networking must be verified before starting drone services. If the container can’t get a proper IP, it shouldn’t be allowed to register.
- Critical infrastructure needs protection from automated actions. Workstations, gateways, and orchestrators should never be reboot targets.
- Auto-heal features need safeguards. Just because you can SSH and reboot doesn’t mean you should.
- The IP a service reports might not be the IP it’s running on. Especially with containers, VMs, and nested virtualization. The gateway can cheaply cross-check this, as sketched after this list.
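A minimal version of that cross-check, on the gateway side: compare the IP a drone claims against the source address of the registration request, and refuse anything that collides with protected infrastructure. `validate_registration` is a hypothetical sketch, not the gateway’s current code:

```python
def validate_registration(reported_ip: str, source_ip: str,
                          protected_hosts: set[str]) -> bool:
    """Reject drone registrations with suspicious IPs (hypothetical sketch)."""
    if reported_ip != source_ip:
        # The drone doesn't know its own address; don't trust it.
        return False
    if reported_ip in protected_hosts:
        # No drone should ever claim an infrastructure IP.
        return False
    return True
```

With NAT in the path the source-address comparison needs more care, but on a flat swarm network it’s a cheap invariant that would have caught this incident at registration time.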
The Silver Lining
My daughter walked past right after my screen went black. “Did your computer crash again?”
“No,” I said, staring at the login screen. “My own infrastructure attacked me.”
She rolled her eyes and went back to Roblox. Fair enough.
At least I caught it immediately. Imagine if this had happened at 3 AM and the orchestrator kept trying to “heal” my workstation every time it came back up. Infinite reboot loop, powered by my own automation.
The protected hosts fix is deployed. sweeper-Carme is on hold until I properly configure lxcbr0. And I have a new rule: never test container networking while sitting at the machine the container is running on.
Files Changed
| File | Change |
|---|---|
config/swarm.json | Added protected_hosts section |
config/versions.json | NEW - Component versioning |
bin/swarm-orchestrator | Protected hosts check before reboot |
lib/swarm/deploy.py | Deploy protected_hosts.conf to orchestrators |
scripts/swarm-coordinator.py | bump command for component versions |
Next Steps
- Configure lxcbr0 properly on the workstation
- Add hostname verification to auto-reboot logic (see the sketch after this list)
- Consider requiring confirmation for any reboot command
- Maybe don’t run experimental containers on my primary dev machine
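For the hostname-verification item, the idea is to ask the machine who it is before pulling the trigger. A sketch using plain `ssh` via `subprocess` (the helper name is illustrative):

```python
import subprocess

def verify_drone_identity(ip: str, expected_hostname: str) -> bool:
    """SSH in and compare `hostname` output before issuing a reboot (sketch)."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=5", f"root@{ip}", "hostname"],
        capture_output=True, text=True, timeout=15,
    )
    if result.returncode != 0:
        return False  # can't verify the target, so don't reboot it
    return result.stdout.strip() == expected_hostname
```

Had this been in place, the orchestrator would have seen `Galileo-Outpost` instead of `sweeper-Carme` and aborted.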
Incident documented for posterity. And as a warning to my future self.