The Reboot Loop That Blamed the Wrong Code
Incident Duration: ~75 minutes (18:30 - 19:45)
System: Alpha-Centauri (10.42.0.199) - Ubuntu 22.04 bare-metal
Status: RESOLVED
Lesson Learned: Don’t assume your code is guilty until the logs prove it
It’s 6:30 PM on a Tuesday, and I’m watching Alpha-Centauri reboot for the fourth time in fifteen minutes. My first thought? “The gateway code is broken. I broke it.”
This is the natural reaction when a system housing your code starts misbehaving. Surely it’s something I did. Maybe that error handling isn’t as graceful as I thought. Maybe there’s a memory leak I missed. Maybe — and this is the paranoid commander in me — the code has achieved sentience and decided to escape via repeated kernel panics.
Spoiler: The code was completely innocent.
The Crime Scene
Here’s what the reboot timeline looked like:
18:30 - System boot
18:33 - Reboot (3 min uptime)
18:34 - System boot
18:35 - Reboot (1 min uptime)
18:36 - System boot
18:39 - Reboot (3 min uptime)
... this continued for an hour
Three minutes. One minute. Three minutes. Like a metronome of server suffering.
When I SSH’d in (during one of those precious 3-minute windows), I found:
- 19,000 AppArmor denied operations every 5 seconds (Netdata getting slapped repeatedly)
- CNI bridge interface flapping like a fish on land
- K3s pods in CrashLoopBackOff with restart counts in the 40s (the commands behind these numbers are sketched below)
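For the record, the numbers above came out of a quick triage along these lines; the profile and interface names (Netdata’s AppArmor profile, cni0) are whatever happens to exist on this box, so treat them as placeholders:

# Count AppArmor denials hitting the kernel log over the last ten minutes
journalctl -k --since "10 min ago" | grep -c 'apparmor="DENIED"'

# See the CNI bridge and veth interfaces that keep appearing and disappearing
ip -o link show | grep -E 'cni0|veth'

# Pod restart counts (k3s bundles its own kubectl)
k3s kubectl get pods -A | grep -i crashloop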
But the most damning evidence? A crash dump with my gateway’s name on it:
/var/crash/_opt_build-swarm_bin_swarm-gateway.0.crash
I stared at that filename for a solid minute, mentally composing my apology to the server.
The Investigation (Where I Learned Humility)
Before accepting my guilt, I did what any paranoid developer would do — I grepped my own codebase for reboot commands:
grep -r "reboot\|shutdown.*-r\|systemctl.*reboot\|/sbin/reboot" \
bin/ lib/ scripts/ --include="*.py" --include="*.sh"
Result: Zero. Absolutely zero reboot commands in the entire gentoo-build-swarm repository.
The gateway code:
- Only calls server.shutdown() on KeyboardInterrupt (that’s the HTTP server, not the system)
- Has no os.system(), subprocess, or shell calls that could reboot
- Has no systemd FailureAction=reboot
- Is literally just a Python HTTP server that routes traffic
The crash dump I found? The gateway was terminated by a signal: most likely SIGKILL from the OOM killer, or simply taken down when the kernel panicked. It wasn’t crashing. It was being executed by the kernel, along with everything else on the system.
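For what it’s worth, the verdict can be read straight out of the apport report; a minimal sketch, assuming the standard Ubuntu /var/crash layout and the field names apport normally writes:

# Apport .crash files are plain-text key/value reports; skip the CoreDump
# field (a huge base64 blob) and grep out the interesting headers
CRASH=/var/crash/_opt_build-swarm_bin_swarm-gateway.0.crash
grep -E '^(ProblemType|Date|ExecutablePath|Signal):' "$CRASH"

# SIGKILL from the OOM killer leaves no core dump, so check the previous
# boot's kernel log for it separately
journalctl -k -b -1 | grep -i 'out of memory'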
My code wasn’t the murderer. It was a victim.
The Actual Culprit: K3s
Here’s the real chain of events:
1. K3s pods started crash-looping: openwebui and quartz-vault were restarting every few seconds
2. CNI network interfaces thrashed: pod death means network namespace teardown, pod restart means recreate. Hundreds of times per minute (counted in the sketch after this list).
3. Kernel panicked: the network stack couldn’t handle the rapid interface changes
4. System auto-rebooted: Ubuntu’s default kernel.panic=10 means “wait 10 seconds after a panic, then reboot”
5. K3s started again and immediately started crash-looping
6. Return to step 1
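Two quick checks make this chain concrete; a sketch, assuming the usual k3s bridge name cni0 and persistent journald logs:

# Bridge-port churn from pods dying and restarting, in the previous boot
journalctl -k -b -1 | grep -c 'cni0.*entered'

# The setting that turns every panic into a silent reboot ten seconds later
sysctl kernel.panic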
The resource situation was also dire. An 8GB RAM system trying to run:
- K3s control plane (~1GB)
- Multiple pods (2-4GB when running, more when crash-looping)
- Netdata (~500MB)
- My gateway (~20MB — look, it’s efficient!)
- Other services
When pods crash-loop, they don’t gracefully release resources. They thrash. And this system was drowning.
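Seeing the squeeze is a one-liner or two; nothing here is specific to this stack beyond the process names you would expect near the top:

# Headline memory picture
free -h

# Who is actually holding the RAM (RSS in kilobytes, biggest first)
ps -eo rss,comm --sort=-rss | head -n 10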
The Fixes
Fix 1: Stop the Auto-Reboot Loop
sysctl -w kernel.panic=0
echo 'kernel.panic = 0' >> /etc/sysctl.conf
Now if the kernel panics, it’ll halt instead of creating an infinite boot loop. This gives me time to actually investigate instead of racing against a 3-minute timer.
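A quick sanity check that the change is live and that the persisted line parses cleanly:

# Runtime value: 0 means halt on panic, N means reboot N seconds after one
sysctl kernel.panic

# Re-apply /etc/sysctl.conf and /etc/sysctl.d/ without rebooting
sysctl --system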
Fix 2: Disable K3s
systemctl stop k3s
No pods = no crash loops = no CNI thrashing = no kernel panics = no reboots.
The system immediately stabilized. Uptime shot past 7 minutes — the longest it had been up in an hour.
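One caveat: stop only lasts until the next boot, and k3s leaves its containers and CNI plumbing behind. If the goal is to keep it down while investigating, something like this works (k3s-killall.sh is installed by the standard k3s installer, but check the path on your setup):

# Keep k3s from coming back after a reboot
systemctl disable --now k3s

# Tear down the leftover containers and CNI interfaces it spawned
/usr/local/bin/k3s-killall.sh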
Fix 3: Harden the Gateway Service
Even though the gateway wasn’t causing reboots, I made it more resilient:
[Service]
Restart=always
RestartSec=10
StartLimitBurst=10
TimeoutStartSec=30
Now it can survive host instability better and won’t spam restart attempts.
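For completeness, this is roughly how a drop-in like that gets applied; the unit name swarm-gateway.service is my guess from the binary path, so substitute whatever the service is actually called:

# Opens an editor on /etc/systemd/system/swarm-gateway.service.d/override.conf
systemctl edit swarm-gateway.service

# systemctl edit reloads units itself on newer systemd; daemon-reload is a
# harmless belt-and-braces step before restarting under the new policy
systemctl daemon-reload
systemctl restart swarm-gateway.service

Worth noting: StartLimitBurst only means anything relative to StartLimitIntervalSec (default 10 seconds), so with RestartSec=10 that burst will likely never trip unless the interval is raised as well.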
Lessons Learned
1. Kernel Panic Defaults Can Create Boot Loops
Ubuntu’s kernel.panic=10 is great for production servers with monitoring. It’s terrible for development machines where nobody’s watching. The system will happily reboot forever until someone notices.
Recommendation: Set kernel.panic=0 on dev/test systems.
2. K3s on 8GB RAM is Risky
Kubernetes is heavy. K3s is lighter, but “lighter than Kubernetes” still means “heavy.” Running multiple pods alongside other services on 8GB is asking for trouble.
Better options:
- Dedicated VM with resource limits
- LXC containers instead of K8s for simpler services
- More RAM (16GB+ if K3s is required)
3. Check Logs Before Blaming Your Code
The debugging order should be:
1. uptime: how long has this been happening?
2. last reboot: what’s the pattern?
3. journalctl -b -1: what happened before the last reboot?
4. dmesg: kernel messages
5. /var/crash/: crash dumps
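Strung together as a copy-paste triage, it looks roughly like this (the -b -1 forms assume persistent journald storage, which Ubuntu 22.04 normally has):

uptime                           # how long since the last boot?
last reboot | head -n 10         # is there a pattern to the reboots?
journalctl -b -1 -p warning      # warnings and errors from the previous boot
# dmesg only covers the current boot, so use journalctl -k for the one that crashed
journalctl -k -b -1 | tail -n 50 # kernel messages right before it went down
ls -lt /var/crash/               # crash dumps, and whose names are on them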
Don’t assume your code is guilty until the logs prove it.
The Blame Assignment
After all the investigation:
- NOT the swarm-gateway code (zero reboot commands)
- NOT the gateway systemd service (no failure actions)
- K3s pod crashes (primary cause)
- Default kernel panic setting (enabled the loop)
- Resource contention (contributing factor)
The gateway was just running in a hostile environment. Other services were causing system-wide instability, and my code got blamed because its name showed up in a crash dump.
Sometimes the server reboots aren’t your fault. Sometimes you’re just collateral damage.
Incident resolved at 19:45. Analysis documented while the system maintained a stable 99.9% uptime for the next 12 hours. The gateway ran flawlessly, because it always had been.