The Reboot Loop That Blamed the Wrong Code
Incident Duration: ~75 minutes (18:30 - 19:45)
System: Alpha-Centauri (10.42.0.199) - Ubuntu 22.04 bare-metal
Status: RESOLVED
Lesson Learned: Don’t assume your code is guilty until the logs prove it
It’s 6:30 PM on a Tuesday, and I’m watching Alpha-Centauri reboot for the fourth time in fifteen minutes. My first thought? “The gateway code is broken. I broke it.”
This is the natural reaction when a system housing your code starts misbehaving. Surely it’s something I did. Maybe that error handling isn’t as graceful as I thought. Maybe there’s a memory leak I missed. Maybe — and this is the paranoid commander in me — the code has achieved sentience and decided to escape via repeated kernel panics.
Spoiler: The code was completely innocent.
The Crime Scene
Here’s what the reboot timeline looked like:
18:30 - System boot
18:33 - Reboot (3 min uptime)
18:34 - System boot
18:35 - Reboot (1 min uptime)
18:36 - System boot
18:39 - Reboot (3 min uptime)
... this continued for an hour
Three minutes. One minute. Three minutes. Like a metronome of server suffering.
When I SSH’d in (during one of those precious 3-minute windows), I found:
- 19,000 AppArmor denied operations every 5 seconds (Netdata getting slapped repeatedly)
- CNI bridge interface flapping like a fish on land
- K3s pods in CrashLoopBackOff with restart counts in the 40s (the commands behind these numbers are sketched below)
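For the record, the numbers above came out of a quick triage along these lines; the profile and interface names (Netdata’s AppArmor profile, cni0) are whatever happens to exist on this box, so treat them as placeholders:

# Count AppArmor denials hitting the kernel log over the last ten minutes
journalctl -k --since "10 min ago" | grep -c 'apparmor="DENIED"'

# See the CNI bridge and veth interfaces that keep appearing and disappearing
ip -o link show | grep -E 'cni0|veth'

# Pod restart counts (k3s bundles its own kubectl)
k3s kubectl get pods -A | grep -i crashloop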
But the most damning evidence? A crash dump with my gateway’s name on it:
/var/crash/_opt_build-swarm_bin_swarm-gateway.0.crash
I stared at that filename for a solid minute, mentally composing my apology to the server.
The Investigation (Where I Learned Humility)
Before accepting my guilt, I did what any paranoid developer would do — I grepped my own codebase for reboot commands:
grep -r "reboot\|shutdown.*-r\|systemctl.*reboot\|/sbin/reboot" \
bin/ lib/ scripts/ --include="*.py" --include="*.sh"
Result: Zero. Absolutely zero reboot commands in the entire gentoo-build-swarm repository.
The gateway code:
- Only calls server.shutdown() on KeyboardInterrupt (that’s the HTTP server, not the system)
- Has no os.system(), subprocess, or shell calls that could reboot
- Has no systemd FailureAction=reboot
- Is literally just a Python HTTP server that routes traffic
The crash dump I found? The gateway was terminated by a signal: most likely SIGKILL from the OOM killer, or simply taken down when the kernel panicked. It wasn’t crashing. It was being executed by the kernel, along with everything else on the system.
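For what it’s worth, the verdict can be read straight out of the apport report; a minimal sketch, assuming the standard Ubuntu /var/crash layout and the field names apport normally writes:

# Apport .crash files are plain-text key/value reports; skip the CoreDump
# field (a huge base64 blob) and grep out the interesting headers
CRASH=/var/crash/_opt_build-swarm_bin_swarm-gateway.0.crash
grep -E '^(ProblemType|Date|ExecutablePath|Signal):' "$CRASH"

# SIGKILL from the OOM killer leaves no core dump, so check the previous
# boot's kernel log for it separately
journalctl -k -b -1 | grep -i 'out of memory'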
My code wasn’t the murderer. It was a victim.
The Actual Culprit: K3s
Here’s the real chain of events:
1. K3s pods started crash-looping: openwebui and quartz-vault were restarting every few seconds
2. CNI network interfaces thrashed: pod death means network namespace teardown, pod restart means recreate. Hundreds of times per minute (counted in the sketch after this list).
3. Kernel panicked: the network stack couldn’t handle the rapid interface changes
4. System auto-rebooted: Ubuntu’s default kernel.panic=10 means “wait 10 seconds after a panic, then reboot”
5. K3s started again and immediately started crash-looping
6. Return to step 1
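Two quick checks make this chain concrete; a sketch, assuming the usual k3s bridge name cni0 and persistent journald logs:

# Bridge-port churn from pods dying and restarting, in the previous boot
journalctl -k -b -1 | grep -c 'cni0.*entered'

# The setting that turns every panic into a silent reboot ten seconds later
sysctl kernel.panic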
The resource situation was also dire. An 8GB RAM system trying to run:
- K3s control plane (~1GB)
- Multiple pods (2-4GB when running, more when crash-looping)
- Netdata (~500MB)
- My gateway (~20MB — look, it’s efficient!)
- Other services
When pods crash-loop, they don’t gracefully release resources. They thrash. And this system was drowning.
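Seeing the squeeze is a one-liner or two; nothing here is specific to this stack beyond the process names you would expect near the top:

# Headline memory picture
free -h

# Who is actually holding the RAM (RSS in kilobytes, biggest first)
ps -eo rss,comm --sort=-rss | head -n 10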
The Fixes
Fix 1: Stop the Auto-Reboot Loop
sysctl -w kernel.panic=0
echo 'kernel.panic = 0' >> /etc/sysctl.conf
Now if the kernel panics, it’ll halt instead of creating an infinite boot loop. This gives me time to actually investigate instead of racing against a 3-minute timer.
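A quick sanity check that the change is live and that the persisted line parses cleanly:

# Runtime value: 0 means halt on panic, N means reboot N seconds after one
sysctl kernel.panic

# Re-apply /etc/sysctl.conf and /etc/sysctl.d/ without rebooting
sysctl --system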
Fix 2: Disable K3s
systemctl stop k3s
No pods = no crash loops = no CNI thrashing = no kernel panics = no reboots.
The system immediately stabilized. Uptime shot past 7 minutes — the longest it had been up in an hour.
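One caveat: stop only lasts until the next boot, and k3s leaves its containers and CNI plumbing behind. If the goal is to keep it down while investigating, something like this works (k3s-killall.sh is installed by the standard k3s installer, but check the path on your setup):

# Keep k3s from coming back after a reboot
systemctl disable --now k3s

# Tear down the leftover containers and CNI interfaces it spawned
/usr/local/bin/k3s-killall.sh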
Fix 3: Harden the Gateway Service
Even though the gateway wasn’t causing reboots, I made it more resilient:
[Service]
Restart=always
RestartSec=10
StartLimitBurst=10
TimeoutStartSec=30
Now it can survive host instability better and won’t spam restart attempts.
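For completeness, this is roughly how a drop-in like that gets applied; the unit name swarm-gateway.service is my guess from the binary path, so substitute whatever the service is actually called:

# Opens an editor on /etc/systemd/system/swarm-gateway.service.d/override.conf
systemctl edit swarm-gateway.service

# systemctl edit reloads units itself on newer systemd; daemon-reload is a
# harmless belt-and-braces step before restarting under the new policy
systemctl daemon-reload
systemctl restart swarm-gateway.service

Worth noting: StartLimitBurst only means anything relative to StartLimitIntervalSec (default 10 seconds), so with RestartSec=10 that burst will likely never trip unless the interval is raised as well.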
Lessons Learned
1. Kernel Panic Defaults Can Create Boot Loops
Ubuntu’s kernel.panic=10 is great for production servers with monitoring. It’s terrible for development machines where nobody’s watching. The system will happily reboot forever until someone notices.
Recommendation: Set kernel.panic=0 on dev/test systems.
2. K3s on 8GB RAM is Risky
Kubernetes is heavy. K3s is lighter, but “lighter than Kubernetes” still means “heavy.” Running multiple pods alongside other services on 8GB is asking for trouble.
Better options:
- Dedicated VM with resource limits
- LXC containers instead of K8s for simpler services
- More RAM (16GB+ if K3s is required)
3. Check Logs Before Blaming Your Code
The debugging order should be:
1. uptime: how long has this been happening?
2. last reboot: what’s the pattern?
3. journalctl -b -1: what happened before the last reboot?
4. dmesg: kernel messages
5. /var/crash/: crash dumps
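Strung together as a copy-paste triage, it looks roughly like this (the -b -1 forms assume persistent journald storage, which Ubuntu 22.04 normally has):

uptime                           # how long since the last boot?
last reboot | head -n 10         # is there a pattern to the reboots?
journalctl -b -1 -p warning      # warnings and errors from the previous boot
# dmesg only covers the current boot, so use journalctl -k for the one that crashed
journalctl -k -b -1 | tail -n 50 # kernel messages right before it went down
ls -lt /var/crash/               # crash dumps, and whose names are on them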
Don’t assume your code is guilty until the logs prove it.
The Blame Assignment
After all the investigation:
- NOT the swarm-gateway code (zero reboot commands)
- NOT the gateway systemd service (no failure actions)
- K3s pod crashes (primary cause)
- Default kernel panic setting (enabled the loop)
- Resource contention (contributing factor)
The gateway was just running in a hostile environment. Other services were causing system-wide instability, and my code got blamed because its name showed up in a crash dump.
Sometimes the server reboots aren’t your fault. Sometimes you’re just collateral damage.
Incident resolved at 19:45. Analysis documented while the system maintained a stable 99.9% uptime for the next 12 hours. The gateway ran flawlessly, because it always had been.