
The Orchestrator That Rebooted My Workstation

Date: 2026-01-30
Issue: Build swarm orchestrator rebooted my development workstation
Root Cause: LXC container reported host IP instead of container IP
Lesson: Auto-healing is great until it heals the wrong machine


The Setup

I’ve been adding a new drone to the build swarm. Simple task: spin up an LXC container, install the drone service, watch it join the swarm. Done it a dozen times.

The new drone is called sweeper-Carme — a cleanup container that would handle maintenance tasks. The irony is not lost on me.


The Incident

I’m working on something else entirely when my screen goes black. The KDE logout animation. Full shutdown sequence.

My first thought: Power issue? No, the lights are on.

Second thought: Kernel panic? No, this was a clean shutdown.

Third thought: Did something just reboot my machine?

I check the system journal after it comes back up:

Jan 30 20:30:51 Galileo-Outpost sshd[12847]: Accepted publickey for root from 100.64.0.18
Jan 30 20:30:53 Galileo-Outpost systemd[1]: Stopping all remaining units...

Someone SSH’d in as root from 100.64.0.18 and rebooted my machine.

That IP is Icarus-Orchestrator. The build swarm orchestrator. My own infrastructure just rebooted my development workstation.


The Investigation

Here’s what happened:

  1. I set up sweeper-Carme as an LXC container on my workstation
  2. The LXC bridge (lxcbr0) wasn’t properly configured on the host
  3. The drone service inside the container couldn’t get proper networking
  4. When the drone registered with the gateway, it reported the host’s IP instead of its own
  5. The drone went offline (because its networking was broken)
  6. The orchestrator’s auto-heal logic kicked in
  7. It SSH’d to the reported IP — my workstation — and ran reboot

The orchestrator did exactly what it was supposed to do. It saw a drone go offline, SSH’d to the IP it had on file, and rebooted it to recover. The problem was the IP was wrong.
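
Step 4 deserves a closer look, because it’s an easy trap. A common way for a service to discover its own address is to resolve its own hostname, and inside a container with broken bridge networking that lookup can fall through to an /etc/hosts entry carrying the host’s IP. A minimal sketch of the pitfall and a more reliable alternative (the gateway address here is hypothetical):

import socket

# Naive: resolve our own hostname. In a container that inherited the
# host's /etc/hosts, or whose bridge never came up, this can return
# the host's address instead of the container's.
naive_ip = socket.gethostbyname(socket.gethostname())

# More reliable: "connect" a UDP socket toward the gateway and ask the
# kernel which local address it chose. UDP connect sends no packets.
with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
    s.connect(("100.64.0.1", 53))  # hypothetical gateway address
    routed_ip = s.getsockname()[0]

print(f"hostname lookup: {naive_ip}, routed address: {routed_ip}")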


The Fix

Protected Hosts

First priority: make sure this can never happen again.

I added a protected hosts safeguard to the orchestrator:

"protected_hosts": {
  "description": "IPs that should NEVER receive reboot commands",
  "ips": ["10.42.0.100", "100.64.0.100", "10.42.0.199", "10.42.0.201"],
  "note": "Add workstation, gateway, and critical infrastructure IPs"
}

Now before the orchestrator reboots anything, it checks the protected list:

def _reboot_drone(self, ip):
    if ip in self.protected_hosts:
        self.log.error(f"BLOCKED: Refusing to reboot protected host {ip}!")
        return False
    # ... proceed with reboot
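
For context, here’s roughly how the pieces fit together as a self-contained sketch. The class skeleton and config path are illustrative; only the guard itself matches what actually shipped:

import json
import logging

class Orchestrator:
    def __init__(self, config_path="config/swarm.json"):
        # Pull the protected IPs out of the swarm config into a set
        # so the reboot path can check membership cheaply.
        with open(config_path) as f:
            config = json.load(f)
        self.protected_hosts = set(config.get("protected_hosts", {}).get("ips", []))
        self.log = logging.getLogger("orchestrator")

    def _reboot_drone(self, ip):
        # Hard stop: never reboot anything on the protected list,
        # no matter what IP the drone registry has on file.
        if ip in self.protected_hosts:
            self.log.error(f"BLOCKED: Refusing to reboot protected host {ip}!")
            return False
        # ... proceed with reboot
        return True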

Component Versioning

While I was in there, I also added a component versioning system. The main VERSION file was getting bumped constantly for minor changes, which made it hard to track what actually changed.

Now there’s a two-tier system:

Component        Version   Purpose
Main VERSION     0.4.1     CLI changes, major features
orchestrator     2.1.0     Orchestrator-specific changes
drone            2.0.0     Drone-specific changes
gateway          1.3.0     Gateway-specific changes

build-swarm bump orchestrator    # Quick iteration on orchestrator
build-swarm bump patch           # For CLI/major changes
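
The bump command itself is nothing exotic. A sketch of the component path, assuming versions.json maps component names to semver strings; the file layout and function name here are illustrative:

import json

def bump_component(name, part="minor", path="config/versions.json"):
    # versions.json is assumed to look like {"orchestrator": "2.1.0", ...}
    with open(path) as f:
        versions = json.load(f)
    major, minor, patch = (int(x) for x in versions[name].split("."))
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    else:
        patch += 1
    versions[name] = f"{major}.{minor}.{patch}"
    with open(path, "w") as f:
        json.dump(versions, f, indent=2)
    return versions[name]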

What I Learned

  1. Container networking must be verified before starting drone services. If the container can’t get a proper IP, it shouldn’t be allowed to register (a minimal guard is sketched after this list).

  2. Critical infrastructure needs protection from automated actions. Workstations, gateways, and orchestrators should never be reboot targets.

  3. Auto-heal features need safeguards. Just because you can SSH and reboot doesn’t mean you should.

  4. The IP a service reports might not be the IP it’s running on. Especially with containers, VMs, and nested virtualization.
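
To make lesson 1 concrete, the registration guard could be as small as this. The values echo the protected_hosts list above, but treat the names and the list itself as hypothetical, a sketch of the idea rather than the drone’s actual code:

# IPs a drone must never claim as its own; in my setup this would be
# the workstation and gateway addresses. Illustrative values only.
FORBIDDEN_IPS = {"10.42.0.100", "100.64.0.100"}

def safe_to_register(reported_ip):
    # A loopback address means the container bridge never came up;
    # a forbidden address means we'd be impersonating another machine.
    if reported_ip.startswith("127.") or reported_ip in FORBIDDEN_IPS:
        return False
    return True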


The Silver Lining

My daughter walked past right after my screen went black. “Did your computer crash again?”

“No,” I said, staring at the login screen. “My own infrastructure attacked me.”

She rolled her eyes and went back to Roblox. Fair enough.

At least I caught it immediately. Imagine if this had happened at 3 AM and the orchestrator kept trying to “heal” my workstation every time it came back up. Infinite reboot loop, powered by my own automation.

The protected hosts fix is deployed. sweeper-Carme is on hold until I properly configure lxcbr0. And I have a new rule: never test container networking while sitting at the machine the container is running on.


Files Changed

File                          Change
config/swarm.json             Added protected_hosts section
config/versions.json          NEW - Component versioning
bin/swarm-orchestrator        Protected hosts check before reboot
lib/swarm/deploy.py           Deploy protected_hosts.conf to orchestrators
scripts/swarm-coordinator.py  bump command for component versions

Next Steps

  • Configure lxcbr0 properly on the workstation
  • Add hostname verification to auto-reboot logic (sketched below)
  • Consider requiring confirmation for any reboot command
  • Maybe don’t run experimental containers on my primary dev machine
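
For that hostname verification item, the shape I have in mind: before rebooting, SSH in and confirm the box answers to the drone’s registered name. A sketch under the assumption that the registry lookup happens elsewhere, not shipped code:

import subprocess

def verify_target(ip, expected_hostname):
    # Ask the machine at `ip` what it calls itself. If Galileo-Outpost
    # answers when we expected sweeper-Carme, refuse to reboot.
    try:
        result = subprocess.run(
            ["ssh", f"root@{ip}", "hostname"],
            capture_output=True, text=True, timeout=10,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == expected_hostname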

Incident documented for posterity. And as a warning to my future self.