The Reboot Loop That Blamed the Wrong Code
Incident Duration: ~1 hour (18:30 - 19:45)
System: Alpha-Centauri (10.42.0.199) - Ubuntu 22.04 bare-metal
Status: RESOLVED
Lesson Learned: Don't assume your code is guilty until the logs prove it
It's 6:30 PM on a Tuesday, and I'm watching Alpha-Centauri reboot for the fourth time in fifteen minutes. My first thought? "The gateway code is broken. I broke it."
This is the natural reaction when a system housing your code starts misbehaving. Surely it's something I did. Maybe that error handling isn't as graceful as I thought. Maybe there's a memory leak I missed. Maybe, and this is the paranoid commander in me, the code has achieved sentience and decided to escape via repeated kernel panics.
Spoiler: The code was completely innocent.
The Crime Scene
Here's what the reboot timeline looked like:

```
18:30 - System boot
18:33 - Reboot (3 min uptime)
18:34 - System boot
18:35 - Reboot (1 min uptime)
18:36 - System boot
18:39 - Reboot (3 min uptime)
... this continued for an hour
```
Three minutes. One minute. Three minutes. Like a metronome of server suffering.
When I SSH'd in (during one of those precious 3-minute windows), I found:
- 19,000 AppArmor denied operations every 5 seconds (Netdata getting slapped repeatedly)
- CNI bridge interface flapping like a fish on land
- K3s pods in `CrashLoopBackOff` with restart counts in the 40s
But the most damning evidence? A crash dump with my gateway's name on it:

```
/var/crash/_opt_build-swarm_bin_swarm-gateway.0.crash
```
I stared at that filename for a solid minute, mentally composing my apology to the server.
The Investigation (Where I Learned Humility)
Before accepting my guilt, I did what any paranoid developer would do: I grepped my own codebase for reboot commands:

```shell
grep -r "reboot\|shutdown.*-r\|systemctl.*reboot\|/sbin/reboot" \
  bin/ lib/ scripts/ --include="*.py" --include="*.sh"
```
Result: Zero. Absolutely zero reboot commands in the entire gentoo-build-swarm repository.
The gateway code:

- Only calls `server.shutdown()` on KeyboardInterrupt (that's the HTTP server, not the system)
- Has no `os.system()`, `subprocess`, or shell calls that could reboot
- Has no systemd `FailureAction=reboot`
- Is literally just a Python HTTP server that routes traffic
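The distinction matters: `server.shutdown()` on Python's `http.server` stops the request loop and nothing else. A minimal sketch of the pattern, not the actual gateway code (handler and port are illustrative):

```python
import threading
from http.server import HTTPServer, BaseHTTPRequestHandler

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b"ok")

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick any free port
thread = threading.Thread(target=server.serve_forever, daemon=True)
thread.start()

try:
    ...  # a real gateway would sit here routing traffic
finally:
    server.shutdown()      # stops serve_forever() -- the HTTP server only
    server.server_close()  # releases the socket; the OS keeps running
```

Nothing in that code path can touch `systemctl`, `/sbin/reboot`, or anything else with system-wide reach.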
The crash dump I found? The gateway was terminated by a signal, likely SIGKILL from the OOM killer or a kernel panic. It wasn't crashing. It was being executed by the kernel, along with everything else on the system.
My code wasnât the murderer. It was a victim.
The Actual Culprit: K3s
Here's the real chain of events:

1. K3s pods started crash-looping: `openwebui` and `quartz-vault` were restarting every few seconds
2. CNI network interfaces thrashed: pod death means network namespace teardown, pod restart means recreate. Hundreds of times per minute.
3. Kernel panicked: the network stack couldn't handle the rapid interface changes
4. System auto-rebooted: Ubuntu's default `kernel.panic=10` means "reboot after 10 seconds of panic"
5. K3s started again, and immediately started crash-looping
6. Return to step 1
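Restart counts are the quickest way to confirm a loop like this: `k3s kubectl get pods -A --no-headers` prints one pod per line with the restart count in the fifth column. A hedged sketch of pulling the worst offenders out of that output (the sample lines are invented, not the real pods' output):

```python
def restart_counts(kubectl_output: str) -> dict[str, int]:
    """Map pod name -> restart count from `kubectl get pods -A --no-headers`."""
    counts = {}
    for line in kubectl_output.splitlines():
        fields = line.split()
        if len(fields) >= 5:
            # columns: NAMESPACE NAME READY STATUS RESTARTS [AGE ...]
            name, restarts = fields[1], fields[4]
            counts[name] = int(restarts)
    return counts

# Invented sample output for illustration
sample = """default  openwebui-abc  0/1  CrashLoopBackOff  44  (10s ago)  3m
default  quartz-vault-xyz  0/1  CrashLoopBackOff  41  (5s ago)  3m
kube-system  coredns-123  1/1  Running  0  3m
"""

looping = {pod: n for pod, n in restart_counts(sample).items() if n > 5}
print(looping)  # {'openwebui-abc': 44, 'quartz-vault-xyz': 41}
```

Anything with a restart count climbing by the minute is thrashing the CNI on every cycle.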
The resource situation was also dire. An 8GB RAM system trying to run:
- K3s control plane (~1GB)
- Multiple pods (2-4GB when running, more when crash-looping)
- Netdata (~500MB)
- My gateway (~20MB; look, it's efficient!)
- Other services
When pods crash-loop, they don't gracefully release resources. They thrash. And this system was drowning.
The Fixes
Fix 1: Stop the Auto-Reboot Loop
```shell
sysctl -w kernel.panic=0
echo 'kernel.panic = 0' >> /etc/sysctl.conf
```
Now if the kernel panics, it'll halt instead of creating an infinite boot loop. This gives me time to actually investigate instead of racing against a 3-minute timer.
Fix 2: Disable K3s
```shell
systemctl stop k3s
```
No pods = no crash loops = no CNI thrashing = no kernel panics = no reboots.
The system immediately stabilized. Uptime shot past 7 minutes, the longest it had been up in an hour.
Fix 3: Harden the Gateway Service
Even though the gateway wasn't causing reboots, I made it more resilient:
```ini
[Service]
Restart=always
RestartSec=10
StartLimitBurst=10
TimeoutStartSec=30
```
Now it can survive host instability better and won't spam restart attempts.
Lessons Learned
1. Kernel Panic Defaults Can Create Boot Loops
Ubuntu's `kernel.panic=10` is great for production servers with monitoring. It's terrible for development machines where nobody's watching. The system will happily reboot forever until someone notices.
Recommendation: Set `kernel.panic=0` on dev/test systems.
2. K3s on 8GB RAM is Risky
Kubernetes is heavy. K3s is lighter, but "lighter than Kubernetes" still means "heavy." Running multiple pods alongside other services on 8GB is asking for trouble.
Better options:
- Dedicated VM with resource limits
- LXC containers instead of K8s for simpler services
- More RAM (16GB+ if K3s is required)
3. Check Logs Before Blaming Your Code
The debugging order should be:
1. `uptime` - how long has this been happening?
2. `last reboot` - what's the pattern?
3. `journalctl -b -1` - what happened before the last reboot?
4. `dmesg` - kernel messages
5. `/var/crash/` - crash dumps
Don't assume your code is guilty until the logs prove it.
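That checklist can live in a tiny script so there's nothing to remember mid-incident. A sketch that just runs the commands in order, cheapest question first (the command list mirrors the steps above; nothing here is gateway-specific):

```python
import shutil
import subprocess

# Triage commands in the order they answer questions
TRIAGE = [
    ("how long has this been happening?", ["uptime"]),
    ("what's the reboot pattern?", ["last", "reboot"]),
    ("what happened before the last reboot?", ["journalctl", "-b", "-1", "-n", "50"]),
    ("kernel messages", ["dmesg", "--level=err,warn"]),
    ("crash dumps", ["ls", "-lt", "/var/crash/"]),
]

def triage() -> None:
    for question, cmd in TRIAGE:
        print(f"== {question} ({' '.join(cmd)})")
        if shutil.which(cmd[0]) is None:
            print("   (command not available on this host)")
            continue
        subprocess.run(cmd, check=False)  # best effort; keep going on errors

if __name__ == "__main__":
    triage()
```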
The Blame Assignment
After all investigation:
- NOT the swarm-gateway code (zero reboot commands)
- NOT the gateway systemd service (no failure actions)
- K3s pod crashes (primary cause)
- Default kernel panic setting (enabled the loop)
- Resource contention (contributing factor)
The gateway was just running in a hostile environment. Other services were causing system-wide instability, and my code got blamed because its name showed up in a crash dump.
Sometimes the server reboots aren't your fault. Sometimes you're just collateral damage.
Incident resolved at 19:45. Analysis documented while system maintained a stable 99.9% uptime for the next 12 hours. The gateway ran flawlessly, because it always had been.