Two Weeks, One Password

The Plan

February 25. I have 1,556 packages queued for a Gentoo @world update. I’d been building binary packages on the build swarm — 66 CPU cores across 5 drones compiling everything into a binhost. The plan was simple: point my workstation at the binhost and install 1,556 pre-compiled packages. Should take an hour, maybe two.

I kicked off the emerge at 11:41 PM on March 1. Then went to bed.

The Crash

March 2, around 10:02 AM. The emerge was still running, since some packages need to compile locally even with the binhost. Then it hit the X11 libraries.

Here’s the thing about @world updates: emerge unmerges the old version before installing the new one. If you’re running KDE (which uses X11 libraries), and emerge unmerges libX11… your display server loses its libraries mid-session.

The kernel locked up. X11 tried to access libraries that no longer existed on disk. The GPU driver couldn’t render. Process after process hit D-state waiting for I/O that would never complete. Hard lockup.

I should have run this from a TTY. Not a GUI. That’s the first lesson and it’s the obvious one. Boot to text mode, run the update, reboot when done. I know this. I just didn’t do it.

The Silent Completion

March 4. I booted into my OpenSUSE dual-boot partition (btrfs, on a separate NVMe partition; this is the thing that saved me), mounted the Gentoo root, and checked the emerge log.

1772607677:  *** Finished. Cleaning up...
1772607677:  *** exiting successfully.
1772607679:  *** terminating.

It… finished. 1,319 of the 1,556 packages installed. The emerge survived the kernel lockup, kept running in the background (or recovered on a subsequent boot — the logs aren’t entirely clear), and completed.
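The number at the start of each emerge.log line is a Unix timestamp, so the finish time can be read straight off the log. A minimal sketch, assuming GNU date (the -d @N form is a GNU extension); epoch_to_utc is a made-up helper name, not something from Portage:

```shell
# Convert an emerge.log epoch timestamp to a readable UTC time.
# Assumes GNU coreutils date; epoch_to_utc is a hypothetical helper.
epoch_to_utc() {
    date -u -d "@$1" '+%Y-%m-%d %H:%M:%S'
}

# Timestamp from the "exiting successfully" line in the log excerpt.
epoch_to_utc 1772607677   # prints 2026-03-04 07:01:17
```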

I didn’t expect that. I was prepared for a half-installed disaster. Instead, almost everything was fine.

Almost.

The Real Problem

March 6. I try to log in to Gentoo. SDDM loads. I type my password. Rejected.

I try the TTY. Same thing. Password rejected.

But the password isn’t wrong. I know the password. I just used it 30 seconds ago to log in to OpenSUSE on the other partition.

Chroot into Gentoo from OpenSUSE:

sudo mount /dev/nvme0n1p7 /mnt/gentoo
sudo chroot /mnt/gentoo /bin/bash

Check the password hash in /etc/shadow. It’s SHA512-CRYPT — that $6$ prefix. Standard for modern Linux.

Try to verify it manually… and it fails. The hash doesn’t match.

That’s when I found it. libxcrypt — the library that handles password hashing — was recompiled during the @world update. But it lost SHA512-CRYPT support. The new libxcrypt can’t verify $6$ hashes. It doesn’t error. It doesn’t warn. It just says “wrong password” for every attempt, because it literally doesn’t know how to check a SHA512 hash anymore.

1,319 packages installed. One of them broke authentication.
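One way to do the manual check without going through the system's libxcrypt at all is to recompute the hash with OpenSSL and compare. This is a sketch under assumptions: it needs an openssl binary with passwd -6 support (OpenSSL 1.1.1 or later), check_sha512_hash is a hypothetical helper name, and it only handles the plain $6$salt$digest form with no rounds= option.

```shell
# Check a password against a $6$ (SHA512-CRYPT) hash using OpenSSL
# instead of the system crypt library.
# check_sha512_hash is a hypothetical helper; no rounds= support.
check_sha512_hash() (
    hash=$1
    password=$2
    # A $6$ hash splits on "$" into: "", "6", salt, digest.
    salt=$(printf '%s' "$hash" | cut -d'$' -f3)
    # Recompute with the same salt; read the password from stdin so it
    # does not show up in the process list.
    candidate=$(printf '%s\n' "$password" | openssl passwd -6 -salt "$salt" -stdin)
    [ "$candidate" = "$hash" ]
)

# Demo with a throwaway hash generated on the spot.
demo_hash=$(printf '%s\n' 'not-my-real-password' | openssl passwd -6 -salt examplesalt -stdin)
if check_sha512_hash "$demo_hash" 'not-my-real-password'; then
    echo "hash verifies"
else
    echo "hash does NOT verify"
fi
```

If this verifies on the rescue system while login still fails on the broken one, that points at the crypt library rather than the stored hash.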

The OpenSUSE Lifeline

Thank god for dual-boot. From OpenSUSE, I could mount the Gentoo root and fix things via chroot without needing to boot Gentoo at all.

March 6, first round of fixes:

Disabled the display manager. Removed display-manager from Gentoo’s default runlevel so it boots to TTY instead of SDDM. Can’t log in to SDDM if the password system is broken.

Enabled passwordless login. Modified /etc/shadow and PAM config with nullok rules. This gets me a TTY session without needing a working password hash.

Rebuilt the LD cache. Ran ldconfig inside the chroot to make sure the dynamic linker could find all the new libraries.

Backed up everything. Shadow file, PAM config, the original emerge log. If something gets worse, I want rollback options.
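The backup step is the easiest one to script. A rough sketch of the idea, not the commands I actually typed: backup_files and the destination layout are invented for illustration, and the paths in the usage comment assume the Gentoo root is mounted at /mnt/gentoo.

```shell
# Copy files into a timestamped backup directory before editing them.
# backup_files and the directory naming are illustrative assumptions.
backup_files() (
    dest_root=$1; shift
    dest="$dest_root/backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dest" || exit 1
    for f in "$@"; do
        if [ -e "$f" ]; then
            # -p keeps mode and timestamps, which matters for shadow.
            cp -p "$f" "$dest/"
        else
            echo "skipped (not found): $f" >&2
        fi
    done
    # Print the directory so the caller can log where the copies went.
    printf '%s\n' "$dest"
)

# Example call (destination path is hypothetical):
# backup_files /root/recovery /mnt/gentoo/etc/shadow /mnt/gentoo/var/log/emerge.log
```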

The Recovery Script

March 9. The manual fixes got me a TTY login, but the system still wasn’t production-ready. libxcrypt still couldn’t do SHA512. KDE wouldn’t launch. The display manager was disabled. I needed to automate the full recovery because I didn’t want to type 50 commands from a TTY with no history and no tab completion.

So I wrote a script with Claude’s help — I described each recovery phase, it generated the Bash, I tested from OpenSUSE via chroot. It grew to 28KB.

Eight phases:

1. Diagnostics: capture system state before changing anything. User info, environment, installed packages, PAM config, TTY status. Write it all to a log.
2. Process cleanup: check for D-state processes stuck from the original crash. Kill them if found.
3. libxcrypt rebuild: the core fix. Recompile with SHA512-CRYPT support. Up to an hour for a source compile.
4. Password reset: set a new password non-interactively. Write the hash directly to /etc/shadow instead of using passwd, because passwd is interactive and requires a working TTY that might not cooperate.
5. PAM verification: confirm that PAM can actually authenticate with the new password. Test with su before declaring success.
6. Display manager: re-enable SDDM in the default runlevel.
7. System state snapshot: capture process list, memory, disk, dmesg. Compare with the phase 1 diagnostics.
8. Summary: human-readable report of what was done and what failed.
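The phase structure boils down to a small runner that calls one function per phase, records failures, and keeps going. This is a stripped-down sketch of that pattern rather than an excerpt from the real 28KB script; run_phase, FAILED_PHASES, and the two placeholder phases are names invented here.

```shell
# Skeleton of a phase runner: each phase is a function, failures are
# recorded, and the run continues. Phase bodies are placeholders.
FAILED_PHASES=""

run_phase() {
    phase_name=$1; shift
    echo "=== Phase: $phase_name ==="
    if "$@"; then
        echo "[ OK ] $phase_name"
    else
        echo "[FAIL] $phase_name"
        FAILED_PHASES="$FAILED_PHASES $phase_name"
    fi
}

phase_diagnostics() { uname -r >/dev/null; }   # placeholder that succeeds
phase_verify_pam()  { false; }                 # placeholder that fails

run_phase diagnostics phase_diagnostics
run_phase pam-verify  phase_verify_pam

# The summary lists whatever failed along the way.
echo "Failed phases:${FAILED_PHASES:- none}"
```

Keeping the failure list in one variable means the summary phase has nothing to recompute; it only prints what the earlier phases recorded.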

The Logging System

The script creates an isolated log directory:

/tmp/gentoo-recovery-logs-20260310-143022/
├── main.log          # Full execution timeline
├── diagnostics.log   # System state before changes
├── system-state.log  # Process snapshots, memory, dmesg
├── error-bundle.txt  # Quick summary of failures
└── pivots.log        # What went wrong and what was tried instead

Five log files. If the system freezes during recovery — and D-state processes can lock up the entire kernel — I can reboot into OpenSUSE, mount the Gentoo partition, and read the logs to see exactly where it stopped.

The pivots.log was an afterthought that turned out to be the most useful. When a step fails, the script doesn’t just log “FAILED.” It logs what failed, why, and what it’s trying instead. “libxcrypt emerge timed out after 3600s, falling back to manual configure/make/install.” That kind of thing.
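A pivot entry is just a structured log line with three parts. Here is a minimal sketch of that kind of helper, assuming a LOG_DIR variable and function names (log_main, log_pivot) that I made up for the example; the line format is also an assumption.

```shell
# Logging helpers in the spirit of main.log and pivots.log.
# LOG_DIR, log_main, and log_pivot are assumptions for this sketch.
LOG_DIR="${LOG_DIR:-/tmp/gentoo-recovery-logs-$(date +%Y%m%d-%H%M%S)}"
mkdir -p "$LOG_DIR"

log_main() {
    printf '%s %s\n' "$(date '+%Y-%m-%d %H:%M:%S')" "$*" >> "$LOG_DIR/main.log"
}

# Record what failed, why, and what is being tried instead.
log_pivot() {
    printf '%s FAILED: %s | WHY: %s | TRYING: %s\n' \
        "$(date '+%Y-%m-%d %H:%M:%S')" "$1" "$2" "$3" >> "$LOG_DIR/pivots.log"
    # Mirror a short note into the main timeline as well.
    log_main "pivot: $1 -> $3"
}

log_pivot "libxcrypt emerge" "timed out after 3600s" "manual configure/make/install"
```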

Timeouts

Every operation has a timeout. Default is 5 minutes. Emerge operations get 1 hour. If something exceeds its timeout, the script kills the process and moves to the next phase.

This is critical. The March 2 crash ended with processes hanging indefinitely. A recovery script that can also hang indefinitely is worse than no script at all.
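The per-operation limit maps naturally onto coreutils timeout(1). A sketch of a wrapper, assuming GNU timeout is available; run_with_timeout is a hypothetical name and the 10-second kill grace period is an arbitrary choice for the example.

```shell
# Run a command with a time limit using coreutils timeout(1).
# run_with_timeout is a hypothetical wrapper name.
run_with_timeout() {
    limit=$1; shift
    # SIGTERM at the limit, then SIGKILL 10s later if still alive.
    timeout -k 10 "$limit" "$@"
    status=$?
    # timeout exits with 124 when the limit was hit.
    if [ "$status" -eq 124 ]; then
        echo "TIMEOUT after ${limit}s: $*" >&2
    fi
    return "$status"
}

# Demo: a 5-second sleep with a 1-second limit.
run_with_timeout 1 sleep 5 || echo "moving on to the next phase"
```

One caveat: a process stuck in uninterruptible sleep ignores signals until its I/O returns, so a timeout reduces the risk of a hang but cannot rule it out.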

Non-Interactive Everything

No prompts. No “press Enter to continue.” No interactive password entry.

Previous recovery attempts had frozen because passwd was waiting for keyboard input that never came — TTY issues after a crash can make stdin unreliable. The script avoids all interactive commands. Passwords are set via direct shadow file manipulation. Configuration changes use sed and echo, not interactive editors.
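Direct shadow manipulation can be done without prompts and without truncating the file halfway. The following is a sketch of the idea, not the script's actual code: set_shadow_hash is a hypothetical helper, it takes the file path as an argument so it can be tried on a copy first, and it expects a hash that was already generated elsewhere (for example on the rescue system).

```shell
# Replace the hash field for one user in a shadow-format file, no prompts.
# set_shadow_hash is a hypothetical helper; try it on a copy first.
set_shadow_hash() (
    file=$1 user=$2 hash=$3
    tmp=$(mktemp "${file}.XXXXXX") || exit 1
    # Copy first so the temp file inherits the original mode (and owner
    # when run as root); the redirect below keeps that metadata.
    cp -p "$file" "$tmp" || exit 1
    days=$(( $(date +%s) / 86400 ))   # "last changed" in days since epoch
    awk -F: -v OFS=: -v u="$user" -v h="$hash" -v d="$days" '
        $1 == u {
            $2 = h; $3 = d
            # Pad a truncated entry out to the nine required fields.
            for (i = NF + 1; i <= 9; i++) $i = ""
        }
        { print }
    ' "$file" > "$tmp" || exit 1
    # Rename into place so a crash never leaves a half-written file.
    mv "$tmp" "$file"
)

# Example call (path and hash variable are hypothetical):
# set_shadow_hash /mnt/gentoo/etc/shadow argo "$new_hash"
```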

The Shadow Surprise

March 10. Time to actually boot Gentoo and run the recovery script.

Boot. TTY login prompt. Type argo, press Enter (passwordless).

Login incorrect

What.

Back to OpenSUSE. Mount Gentoo. Check /etc/shadow:

argo:

That’s it. Just argo: and a newline. No hash field. No expiry fields. No nothing. The shadow entry has one field out of the required nine.

The shadow format requires nine colon-separated fields:

username:hash:lastchanged:minimum:maximum:warn:inactive:expire:reserved

My entry was just argo: — PAM couldn’t parse it. Not “wrong password” — “user doesn’t exist in shadow.” A different error entirely from the libxcrypt issue.

How did this happen? Best guess: the @world update crash interrupted a shadow file write. Or one of my earlier recovery attempts accidentally truncated the entry. The file wasn’t corrupted in a way that was obvious — grep argo /etc/shadow returned a result. It just wasn’t a valid result.
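Since grep alone could not tell a valid entry from a truncated one, a field count is the more useful check. A minimal sketch with awk; check_shadow_fields is a made-up name and the file path is passed in explicitly.

```shell
# Report shadow entries that do not have exactly nine colon-separated
# fields. check_shadow_fields is a hypothetical helper name.
check_shadow_fields() {
    awk -F: '
        NF != 9 { printf "line %d (%s): %d fields\n", NR, $1, NF; bad = 1 }
        END     { exit bad + 0 }
    ' "$1"
}

# Example against the mounted Gentoo root:
# check_shadow_fields /mnt/gentoo/etc/shadow || echo "shadow needs attention"
```

It exits non-zero when any line is malformed, so it can gate later steps. shadow-utils also ships pwck, whose -r flag runs a broader read-only consistency check.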

The Fix

argo:!:17876:0:99999:7:::

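Nine fields. ! in the hash field locks the account for password auth. One caveat: pam_unix's nullok only lets an empty hash field through, so a truly passwordless login needs that field blank rather than !. 17876 is the last password change in days since the epoch. The rest are standard defaults.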
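The third field is easy to compute or decode by hand. A small sketch, assuming GNU date for the reverse conversion; both function names are invented for the example.

```shell
# The third shadow field counts days since 1970-01-01 (UTC).
# Helper names are made up; days_to_date assumes GNU date.
today_in_shadow_days() {
    echo $(( $(date +%s) / 86400 ))
}

days_to_date() {
    date -u -d "@$(( $1 * 86400 ))" '+%Y-%m-%d'
}

days_to_date 17876        # prints 2018-12-11
today_in_shadow_days      # value to use for a password changed today
```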

Verified with:

chroot /mnt/gentoo /bin/bash -c "su - argo -c 'whoami'"

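Returns argo. The entry parses and the account resolves again. (Run as root, su skips the password prompt, so this confirms the account lookup rather than the password path.)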

The Timeline

Date	What
Feb 25	Planning started, binhost prepared
Mar 1 23:41	Emerge started
Mar 2 ~10:02	Crash — X11 libs unmerged while KDE running
Mar 4	Discovered emerge completed (1,319 packages)
Mar 6	Found libxcrypt root cause, applied initial fixes
Mar 9	Deployed 28KB recovery script
Mar 10	Found shadow corruption, fixed, login restored

Nine days from emerge start to working login, and about two weeks counting from the February 25 planning. For what was supposed to be a one-hour binary package install.

What I’d Do Differently

Run @world updates from a TTY. Not a GUI. Disable the display manager first. This is Gentoo 101 and I ignored it.

# The right way (for next time)
sudo rc-update del display-manager default
sudo reboot
# [login to TTY]
sudo emerge @world --keep-going
sudo emerge @preserved-rebuild
sudo revdep-rebuild
sudo rc-update add display-manager default
sudo reboot

Validate the shadow file after any recovery operation. A nine-field colon-separated format is easy to validate programmatically. The recovery script now checks this.

Keep the dual-boot partition. OpenSUSE saved this entire recovery. Without it, I’d be booting from a live USB every time I needed chroot access. The dual-boot adds complexity but it’s paid for itself three times over.

Don’t procrastinate on broken systems. The emerge crashed on March 2. I didn’t seriously investigate until March 6. Four days where the system sat broken because “I’ll fix it this weekend.” The fix was faster than the procrastination.

The Status

Gentoo is… still not fully recovered as I write this. The recovery script is deployed and tested. The shadow file is fixed. Login works. But I haven’t run the full 8-phase script on a booted Gentoo system yet. That’s the next reboot.

The script is ready. The logs are configured. The timeouts are set. I’ve been working on this recovery for two weeks and the actual execution will probably take 30 minutes.

14 days of prep. 30 minutes of runtime. That ratio feels about right for infrastructure work.
