user@argobox:~/journal/2025-04-01-the-namespace-that-wouldnt-die
$ cat entry.md

The Namespace That Wouldn’t Die

Date: April 1, 2025
Duration: 51 minutes of namespace therapy
Issue: Two namespaces stuck in “Terminating” for 15 days
Root Cause: Stale API group with lingering finalizers


The Symptom

kubectl get namespaces
# cattle-system     Terminating   15d
# cert-manager      Terminating   15d
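
(Aside: if the cluster has more namespaces than patience, a field selector trims the list to just the stuck ones; status.phase is one of the few fields namespaces can be filtered on.)

kubectl get namespaces --field-selector status.phase=Terminating
# cattle-system     Terminating   15d
# cert-manager      Terminating   15d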

Two namespaces. Fifteen days in limbo. I’d tried to uninstall Rancher and cert-manager weeks ago. The applications were gone. But their namespaces refused to leave.


The Standard Fixes (That Didn’t Work)

Force delete:

kubectl delete namespace cattle-system --grace-period=0 --force
# Warning: Immediate deletion does not wait for confirmation...
# namespace "cattle-system" force deleted

Ran it. Got the confirmation. Checked namespaces.

Still there.

Remove finalizers via patch:

kubectl patch namespace cattle-system --type json -p '[{"op": "remove", "path": "/spec/finalizers"}]'
# namespace/cattle-system patched (no change)

No change. The finalizers weren’t in /spec/finalizers.


The Discovery

Dug into the namespace JSON:

kubectl get namespace cattle-system -o json
{
  "metadata": {
    "finalizers": [
      "controller.cattle.io/namespace-auth"
    ]
  }
}

The finalizer was in /metadata/finalizers, not /spec/finalizers. Different path.
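
A quicker way to pull that field without reading the full JSON (swap in .spec.finalizers to check the other location):

kubectl get namespace cattle-system -o jsonpath='{.metadata.finalizers}'
# ["controller.cattle.io/namespace-auth"]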

But that wasn’t the only problem.


The Real Issue

When I tried operations on the namespace, I kept seeing:

Discovery failed for some groups, 1 failing:
unable to retrieve the complete list of server APIs:
ext.cattle.io/v1: stale GroupVersion discovery: ext.cattle.io/v1

The Kubernetes API server was still trying to discover resources in an API group (ext.cattle.io/v1) that no longer existed. The Rancher CRDs had been partially deleted, leaving behind a ghost API group.

Every time I tried to delete the namespace, Kubernetes tried to list resources in that API group, failed, and gave up.
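
Two checks make the ghost visible from the client side. The first reproduces the discovery error (it lands on stderr) and lists any cattle resources still being served; the second only matters if the group was registered as an aggregated APIService rather than through CRDs, which varies by Rancher version:

kubectl api-resources | grep -i cattle
kubectl get apiservices | grep cattle.io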


The Fix

Target the correct finalizer path:

kubectl patch namespace cattle-system --type json -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'
# namespace/cattle-system patched

Check namespaces:

kubectl get namespaces | grep cattle-system
# (nothing)

Gone. After 15 days. One command.

Same for cert-manager:

kubectl patch namespace cert-manager --type json -p='[{"op": "remove", "path": "/metadata/finalizers/0"}]'

Both namespaces finally deleted.
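
One caveat: /metadata/finalizers/0 only removes the first entry in the array. If a namespace is carrying several finalizers, replacing the whole list is less fiddly. A sketch of the same fix generalized (removing finalizers skips whatever cleanup they were supposed to guard, so be sure the owning controller really is gone):

for ns in cattle-system cert-manager; do
  kubectl patch namespace "$ns" --type json \
    -p='[{"op": "replace", "path": "/metadata/finalizers", "value": []}]'
done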


Why Force Delete Didn’t Work

kubectl delete --force doesn’t remove finalizers. It just tells Kubernetes “don’t wait for graceful termination.”

But the namespace controller won’t actually delete a namespace until all finalizers are removed. The finalizer is a marker saying “someone needs to clean something up first.” Force delete is irrelevant — the finalizer is still blocking.
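
You can see the half-deleted state directly: after the force delete the namespace carries a deletionTimestamp, but the finalizer is still sitting in metadata, so the controller keeps waiting:

kubectl get namespace cattle-system -o jsonpath='{.metadata.deletionTimestamp}{"  "}{.metadata.finalizers}{"\n"}'
# <timestamp from two weeks ago>  ["controller.cattle.io/namespace-auth"]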


Why the Standard Patch Didn’t Work

Kubernetes namespaces can have finalizers in two places:

  • /spec/finalizers — where Kubernetes core finalizers usually live
  • /metadata/finalizers — where custom controllers (like Rancher) put their finalizers

The Rancher finalizer was in /metadata/finalizers. The standard patch command targeted /spec/finalizers. Wrong path, no removal.
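
For completeness: the spec path is a special case anyway. The built-in kubernetes finalizer in /spec/finalizers is meant to be cleared through the namespace’s finalize subresource rather than patched directly. A sketch of that route (needs jq), for the cases where that finalizer is the one stuck:

kubectl get namespace cattle-system -o json \
  | jq '.spec.finalizers = []' \
  | kubectl replace --raw "/api/v1/namespaces/cattle-system/finalize" -f -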


The Stale API Group Problem

The deeper issue was the orphaned CRD. Rancher creates custom resources like:

  • cattle.io
  • management.cattle.io
  • ext.cattle.io

When I deleted Rancher but not all its CRDs, the API server kept trying to discover resources in these groups. When discovery failed, namespace operations that needed to enumerate all resources would hang or error.

If you’re uninstalling Rancher or any CRD-heavy operator, clean up CRDs first:

kubectl get crd | grep cattle
kubectl delete crd <each-cattle-crd>
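
Or, as a one-liner (eyeball the grep matches before piping anything into delete; this takes the whole cattle.io family with it):

kubectl get crd -o name | grep 'cattle\.io$' | xargs kubectl delete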


What the Session Also Fixed

While debugging the namespace issue, I discovered my filebrowser and OpenWebUI were resetting after reboots.

Root cause: HostPath volumes with permission issues. The Postgres container ran as user 999 (postgres), but the host directory was owned by a different UID after reboots.

Fix: Created a systemd service to fix permissions on boot:

[Unit]
Description=Fix PostgreSQL data directory permissions
After=local-fs.target

[Service]
Type=oneshot
ExecStart=/bin/chown -R 999:999 /mnt/postgres_data/openwebui/
RemainAfterExit=yes

[Install]
WantedBy=multi-user.target
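
Wiring it up is the usual systemd dance; the unit file name below is just a placeholder for whatever you save it as:

sudo cp fix-postgres-permissions.service /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl enable --now fix-postgres-permissions.service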

Ugly, but it works. The real fix would be converting to proper PVCs, but that’s a project for another day.


The Lessons

Force delete doesn’t remove finalizers. It just skips graceful termination.

Check /metadata/finalizers, not just /spec/finalizers. Custom controllers often use the metadata path.

Stale CRDs break API discovery. Clean up CRDs before assuming an operator is fully removed.

HostPath volumes have permission problems. Container UIDs don’t always match host UIDs after reboots.

Kubernetes namespaces can haunt you for weeks. 15 days. One wrong JSON path.


The namespace finally died. But it took longer than some of my actual production deployments.