Troubleshooting
Common issues and fixes for Arcturus-Prime development, deployment, and infrastructure
This page covers the issues that come up repeatedly when developing, deploying, or operating Arcturus-Prime. If something breaks and you don’t know where to start, check here first.
Admin Area Returns Plain “Not Found”
Symptom: Every /admin/* route returns a plain text “Not Found” response — no styled 404 page, no layout, no navigation. Just the words “Not Found” on a blank page. /admin/sandbox and /admin/deployments still work.
Cause: DEMO_MODE is set to a truthy value (1, true, yes, or on) in the Cloudflare Pages production environment variables. When demo mode is active, the middleware blocks all /admin/* routes except /admin/sandbox and /admin/deployments.
Code path:
- src/middleware.ts, line ~96: if (isDemoMode()) { ... }
- src/lib/demo-mode.ts: isDemoMode() reads the DEMO_MODE env var
- isRestrictedInDemoRoute() matches any route starting with /admin that isn't in the allowlist (/admin/sandbox, /admin/deployments)
- Middleware returns new Response('Not Found', { status: 404 }) — the plain text response
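As a mental model, the check can be sketched like this. The function names come from the code path above, but the bodies are assumptions, not the repo's actual code:

```javascript
// Hypothetical sketch of the demo-mode gate; internals assumed.
const DEMO_ALLOWLIST = ['/admin/sandbox', '/admin/deployments'];

function isDemoMode(env) {
  // Any of these values counts as truthy, per the symptom description
  const raw = String(env.DEMO_MODE ?? '').trim().toLowerCase();
  return ['1', 'true', 'yes', 'on'].includes(raw);
}

function isRestrictedInDemoRoute(pathname) {
  return (
    pathname.startsWith('/admin') &&
    !DEMO_ALLOWLIST.some((p) => pathname === p || pathname.startsWith(p + '/'))
  );
}
```

Under this sketch, any truthy DEMO_MODE turns every non-allowlisted /admin route into the plain 404.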
Fix: Remove DEMO_MODE from the CF Pages production environment, or set it to an empty string / false. The admin area will work again on the next deployment.
CF Pages Dashboard → Settings → Environment Variables → Production
→ Delete or set DEMO_MODE to empty/false
Why this keeps happening: The credentials vault (credentials.md) lists DEMO_MODE: true in both Production and Preview environment variable sections. Any automated process or AI session that syncs environment variables from the vault to CF Pages will silently enable demo mode on production, locking out the entire admin area. This has caused multiple outages.
Prevention:
- DEMO_MODE must be absent or false in the CF Pages production environment
- The credentials vault should mark DEMO_MODE as demo-subdomain-only with a clear warning
- After any bulk env var sync to CF Pages, verify DEMO_MODE is not truthy in production
- The diagnostic giveaway is the plain text “Not Found” — a real Astro 404 would show a styled page with navigation
ViewTransitions and DOMContentLoaded
Symptom: JavaScript functionality works on initial page load but breaks when navigating between pages. Loading spinners get stuck. Interactive components stop responding after the first client-side navigation.
Cause: Arcturus-Prime uses Astro’s <ViewTransitions /> (Client Router) in both CosmicLayout.astro and BaseLayout.astro. When ViewTransitions handles a navigation, it performs a client-side page swap rather than a full page reload. This means DOMContentLoaded only fires once — on the initial full page load — and never fires again on subsequent navigations.
Fix: Replace all DOMContentLoaded listeners with astro:page-load:
// WRONG — breaks on client-side navigation
document.addEventListener('DOMContentLoaded', () => {
init();
});
// ALSO WRONG — the readyState pattern doesn't help
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', init);
} else {
init();
}
// CORRECT — fires on both initial load and ViewTransitions navigations
document.addEventListener('astro:page-load', () => {
init();
});
This applies to ALL script tags: <script>, <script is:inline>, and <script define:vars> in pages, components, and layouts.
History: This bug affected 37 files across 23 pages, 6 components, and 3 layouts when ViewTransitions was added. Every file had to be audited and updated. The /blog/ page was the most visible breakage — the post list showed a permanent loading spinner when navigated to via a client-side link.
How to check: Search the codebase for DOMContentLoaded:
grep -r "DOMContentLoaded" src/
If that returns any matches in .astro, .js, or .ts files, they need to be updated.
Script Scope: is:inline define:vars vs Module Scripts
Symptom: A JavaScript variable defined in a <script is:inline define:vars> block throws ReferenceError: _variableName is not defined at runtime, even though the variable is clearly declared in the same component file.
Cause: Astro components can contain multiple <script> blocks, and each one is a separate JavaScript scope. A <script is:inline define:vars={{ ... }}> block runs as a separate inline script in the browser — its const/let/var declarations are not visible to a separate <script> (module) block in the same .astro file.
<!-- Block 1: inline script with server data -->
<script is:inline define:vars={{ myData }}>
const parsed = JSON.parse(myData);
// ❌ 'parsed' only exists in THIS script block
</script>
<!-- Block 2: module script -->
<script>
console.log(parsed); // ❌ ReferenceError: parsed is not defined
</script>
Fix: Bridge the two scopes via window. Store data from the inline block on window, then read it in the module block:
<!-- Block 1: inline script — parse and expose globally -->
<script is:inline define:vars={{ myData }}>
window.MY_DATA = JSON.parse(myData);
</script>
<!-- Block 2: module script — read from window -->
<script>
const parsed = window.MY_DATA;
</script>
Why this pattern exists: define:vars is the only way to inject server-side (frontmatter) values into client-side scripts. The inline block handles the SSR→client handoff, while the module block handles imports (like import { TendrilGraph } from '@tendril/graph') which is:inline scripts cannot use.
History: This bug broke the KnowledgeGraph on /blog/ after the Tendril config system refactor (2026-02-28, commit 31bf382). Config sections were parsed as const variables in the define:vars block but referenced in the module script where TendrilGraph is imported. The fix stores all parsed config on window.GRAPH_CONFIG in the inline block and reads it back in the module block.
How to check: If a component has both <script is:inline define:vars> and a separate <script>, verify that no const/let/var from the first block is referenced in the second. All cross-block data must go through window.
Inline Script Type Annotations Causing Unexpected token ':'
Symptom: /admin intermittently throws:
Uncaught SyntaxError: Failed to execute 'replaceWith' on 'Element': Unexpected token ':'
Follow-on effects include broken module initializers (for example, knowledge graph container errors) because page scripts fail during client-side navigation.
Cause: TypeScript syntax (for example (r: any) => ...) was included inside a browser-executed inline script. Inline scripts in .astro pages run as plain JavaScript in the browser, so TS-only annotations are invalid and crash at parse time.
Fix: Remove TS annotations from inline scripts and keep the logic JavaScript-safe with null guards.
// WRONG inside inline browser script
(data.results || []).filter((r: any) => !r.synced)
// CORRECT
(data.results || []).filter((r) => !r.synced)
Where this occurred: src/pages/admin/index.astro in the syncToGitHub client-side function.
Prevention:
- Treat <script is:inline> blocks as browser JavaScript only.
- Keep TypeScript types in frontmatter, .ts files, or framework-compiled script contexts.
- If a parser error appears from ClientRouter.*.js, inspect the originating inline script for TS syntax.
/admin/system-test Repeated DOM Errors (style / appendChild)
Symptom: Console spam during or after client-side navigation with errors like:
Cannot read properties of undefined (reading 'style')
Cannot read properties of null (reading 'appendChild')
In some cases this repeats many times and degrades the admin page experience.
Cause: System-test client code executed without strict route/DOM guards. Under ViewTransitions timing, the script can run when required elements are not present yet (or no longer present), and direct DOM mutations then throw.
Fix:
- Scope initialization to the intended route (/admin/system-test or /system-test).
- Verify required elements exist before mutating UI.
- Make the renderer tolerant to partial/missing JSON fields.
- Register the astro:page-load handler once to avoid duplicate binding.
Where this was fixed: src/pages/admin/system-test.astro (commit 3dde152 on 2026-02-28).
Prevention: For all inline page scripts, treat element lookups as optional and return early if the page root is missing. This is mandatory when using Astro ViewTransitions.
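The guard logic reduces to a pure check, sketched here with the route names from the fix above; how the result is wired into the page script is an assumption:

```javascript
// Hedged sketch: decide whether system-test init should run at all.
function shouldInitSystemTest(pathname, rootElement) {
  const onRoute = pathname === '/admin/system-test' || pathname === '/system-test';
  if (!onRoute) return false;     // wrong page, never touch the DOM
  if (!rootElement) return false; // page root missing or already swapped out
  return true;
}
```

In the page script this would be called once from an astro:page-load listener, passing window.location.pathname and the result of a document.getElementById lookup for the page root.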
Console Error Spam: Null Crashes from View Transitions Race Condition
Symptom: The browser console fills with hundreds of errors when navigating between admin pages:
Cannot set properties of null (setting 'innerHTML')
Cannot set properties of null (setting 'textContent')
Cannot read properties of null (reading 'classList')
Errors reference functions like loadStatus, loadPipeline, loadStats, loadQueue, loadFollowUps.
Cause: Async race condition between in-flight fetch() callbacks and View Transitions DOM swap. The sequence:
- User visits /admin/jobs — setInterval(loadAll, 10000) starts polling
- An interval tick fires, and multiple fetch() calls go out
- User navigates to /admin/health via View Transitions
- astro:before-swap fires — the interval is cleared
- But the already-in-flight fetch() calls resolve after the DOM swap
- Callbacks try document.getElementById('startStopBtn').innerHTML = ...
- The element doesn't exist on the new page — crash
Clearing the interval only prevents future ticks. It does nothing for requests already in flight.
Fix: Null-guard every getElementById() call in polling functions. Two patterns:
// Pattern 1: Early bail on critical element
async function loadStatus() {
const data = await apiFetch('status');
if (!data) return;
const btn = document.getElementById('startStopBtn');
if (!btn) return; // Page navigated away — bail
btn.innerHTML = data.running ? 'Stop' : 'Start';
}
// Pattern 2: Safe setter for many independent elements
const set = (id, val) => {
const el = document.getElementById(id);
if (el) el.textContent = val;
};
set('statTotal', s.total_applications || 0);
set('statToday', s.today || 0);
Where this was fixed: jobs.astro (5 functions, ~28 calls), email.astro (loadStats), servers/index.astro (added missing interval cleanup). Commit 9fcf10e on 2026-02-28.
Prevention: Every polling function that accesses DOM elements must null-check the result of getElementById(). See the View Transitions doc for the full defensive pattern.
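Beyond null guards, the in-flight requests themselves can be cancelled at swap time with an AbortController. This is a sketch of that complementary pattern; the listener wiring is an assumption, not code taken from jobs.astro:

```javascript
// Hedged sketch: abort in-flight polling fetches when the page swaps.
let controller = null;
let timerId = null;

function startPolling(loadAll, intervalMs = 10000) {
  controller = new AbortController();
  // Each tick passes the signal so loadAll can forward it to fetch()
  timerId = setInterval(() => loadAll(controller.signal), intervalMs);
}

function stopPolling() {
  clearInterval(timerId);             // stops future ticks
  if (controller) controller.abort(); // cancels requests already in flight
}

// Presumed wiring in the page script:
// document.addEventListener('astro:before-swap', stopPolling);
```

Aborted fetches reject with an AbortError instead of resolving, so their callbacks never run against the swapped-out DOM. The null guards above remain necessary for responses that arrive before the abort fires.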
Build Failure: Wrong Relative Import Path in Subdirectory Pages
Symptom: CF Pages build fails with:
Could not resolve "../../layouts/CosmicLayout.astro" from "src/pages/admin/servers/index.astro"
Cause: Pages directly in src/pages/admin/ are 2 directory levels below src/, so they use ../../layouts/CosmicLayout.astro. But pages in subdirectories like src/pages/admin/servers/ are 3 levels deep and need ../../../layouts/CosmicLayout.astro. Using the wrong depth resolves to a nonexistent path (src/pages/layouts/ instead of src/layouts/).
Fix: Count the directory depth from the file to src/ and use the correct number of ../ segments:
| File location | Depth from src/ | Layout import |
|---|---|---|
| src/pages/admin/health.astro | 2 | ../../layouts/CosmicLayout.astro |
| src/pages/admin/servers/index.astro | 3 | ../../../layouts/CosmicLayout.astro |
| src/pages/api/admin/health-check.ts | 3 | ../../../config/... or ../../../lib/... |
Where this occurred: Commit 53bdc9b rewrote servers/index.astro from standalone HTML to use CosmicLayout, but copied the import path from a sibling file one level up. Fixed in commit 3b2f130 on 2026-02-28.
Prevention: When creating pages in subdirectories under admin/, verify the import path resolves correctly. This error passes npm run dev silently but breaks the CF Pages production build.
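The rule is mechanical: one ../ per directory between src/ and the file. A hypothetical helper (illustration only, not part of the repo) makes the counting explicit:

```javascript
// Hypothetical helper: derive the ../ prefix for a page's layout import.
// Depth = number of directories between src/ and the file itself.
function layoutImportPath(pagePath) {
  const parts = pagePath.replace(/^src\//, '').split('/');
  const depth = parts.length - 1; // exclude the filename
  return '../'.repeat(depth) + 'layouts/CosmicLayout.astro';
}
```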
Build Failures: SSR Bundle Exclusions
Symptom: npm run build fails with errors related to better-sqlite3 or @argonaut/core. Errors typically mention native modules, missing bindings, or import resolution failures.
Cause: These packages contain native Node.js modules that can’t be bundled into the Cloudflare Workers SSR output. They need to be excluded from the Vite bundling process.
Fix: Ensure these packages are listed in the external configuration in astro.config.mjs:
// astro.config.mjs
export default defineConfig({
vite: {
build: {
rollupOptions: {
external: ['better-sqlite3', '@argonaut/core']
}
},
optimizeDeps: {
exclude: ['better-sqlite3', '@argonaut/core']
}
}
});
Both rollupOptions.external and optimizeDeps.exclude need to list these packages. The first prevents them from being bundled in the production build; the second prevents Vite from pre-bundling them in development.
When this happens: Usually after adding a new dependency that transitively depends on a native module, or after updating Astro/Vite versions that change bundling behavior.
Cloudflare Access Auth Issues
Symptom: Admin pages return 401 or 403 errors. Users who should have admin access are denied. The auth middleware rejects valid Cloudflare Access tokens.
Cause: Usually one of the Cloudflare Access environment variables is wrong or missing.
Checklist:
- Check CF_ACCESS_AUD: This is the Application Audience tag from your Cloudflare Access application. It's a 64-character hex string. If it doesn't match exactly, JWT validation fails silently.
  # Verify the secret is set in production
  npx wrangler secret list
  # Look for CF_ACCESS_AUD in the output
- Check CF_ACCESS_TEAM_DOMAIN: Must match your Cloudflare Access team domain exactly, including the .cloudflareaccess.com suffix.
- Check ADMIN_EMAILS: The email in the JWT must appear in this comma-separated list. Check for whitespace issues — "[email protected], [email protected]" (note the space after the comma) might not match "[email protected]" depending on how the comparison is implemented.
- Check ALLOW_CF_AUTH_COOKIE_FALLBACK: If the JWT isn't coming through the CF-Access-Jwt-Assertion header (common with iframes or embedded contexts), the middleware needs this set to true to check the CF_Authorization cookie as a fallback.
- Token expiration: Cloudflare Access tokens have a TTL. If a user has been idle for a long time, their token may have expired. A page refresh triggers a new auth flow.
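The ADMIN_EMAILS whitespace pitfall is easy to neutralize by normalizing both sides before comparing. A sketch of such a check (an assumed helper, not necessarily what the middleware does; the addresses in the test are hypothetical):

```javascript
// Hedged sketch: whitespace- and case-tolerant ADMIN_EMAILS matching.
function isAdminEmail(adminEmailsVar, jwtEmail) {
  const allow = String(adminEmailsVar || '')
    .split(',')
    .map((e) => e.trim().toLowerCase())
    .filter(Boolean);
  return allow.includes(String(jwtEmail || '').trim().toLowerCase());
}
```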
Local dev note: Authentication is bypassed in local development (npm run dev). If auth works locally but fails in production, the issue is definitely in the production secrets.
API Proxy Timeouts
Symptom: API calls to /api/proxy/*, /api/swarm/*, or /api/gateway/* return timeouts or 502 errors on the live site.
Cause: The proxy routes forward requests to Docker services running on Altair-Link (10.42.0.199). If those services are down, the proxy has nothing to forward to.
Diagnostic steps:
- Check if Altair-Link is reachable:
  ping 10.42.0.199
- Check Docker services on Altair-Link:
  ssh [email protected] "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
- Check specific services:
  # Gateway
  curl -s http://10.42.0.199:8090/health
  # Command Center
  curl -s http://10.42.0.199:8093/api/v1/services/public/build-swarm
  # Prometheus
  curl -s http://10.42.0.199:9090/-/ready
  # Loki
  curl -s http://10.42.0.199:3100/ready
- Check Dozzle for container logs: Dozzle runs on Altair-Link at port 9999 and provides a web interface for viewing Docker container logs:
  http://10.42.0.199:9999
- Restart the specific service:
  ssh [email protected] "cd /path/to/service && docker compose restart"
Common cause: Altair-Link runs a lot of Docker containers. Occasionally a container will OOM or hit a resource limit and stop responding. The Docker restart: unless-stopped policy usually handles crashes, but hung containers (alive but not responding) need a manual restart.
Content Not Rendering
Symptom: A new content file doesn’t appear on the site, or renders with missing fields / wrong layout.
Cause: Frontmatter schema mismatch. Every content collection (posts, journal, docs) has a schema defined in src/content/config.ts. If the frontmatter in your file doesn’t match the schema, Astro either silently drops the file or throws a build error.
Checklist:
- Check required fields: Open src/content/config.ts and check which fields are required for your content collection. Missing required fields cause silent failures — the file just doesn't appear.
- Check field types: A pubDate field expecting a Date type won't accept a string like "2026-02-23" — it needs to be unquoted YAML (pubDate: 2026-02-23). Similarly, tags must be an array, not a comma-separated string.
- Check file location: Content must be in the correct directory. A blog post in src/content/journal/ won't appear in the blog collection.
- Check the slug: If the filename contains characters that Astro can't convert to a valid URL slug, the content may be silently skipped. Stick to lowercase alphanumeric characters and hyphens.
- Check for YAML errors: Invalid YAML in frontmatter (missing closing ---, tabs instead of spaces, unquoted special characters) causes silent parse failures. Run the file through a YAML validator if unsure.
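Putting the field-type rules together, a frontmatter shape that avoids the pitfalls above might look like this. The field names are illustrative; the authoritative schema lives in src/content/config.ts:

```yaml
---
title: "Example Post"   # quoted string is fine for text fields
pubDate: 2026-02-23     # unquoted YAML date, not a quoted string
tags:                   # an array, not a comma-separated string
  - astro
  - troubleshooting
---
```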
Quick test: Run npm run build and check the output. Astro reports content collection errors during build, even if npm run dev silently swallows them.
Dev Server vs Production Differences
Several things behave differently in local development vs production on Cloudflare Workers.
| Behavior | Local Dev (npm run dev) | Production (Cloudflare Workers) |
|---|---|---|
| Authentication | Bypassed — no CF Access | Full Cloudflare Access flow |
| Content source | Local filesystem (node:fs) | Gitea API via GITEA_* env vars |
| KV storage | Miniflare local KV | Cloudflare Workers KV |
| API proxy | Vite dev server proxy | Cloudflare Worker fetch handler |
| Node.js APIs | Available (node:fs, node:path) | Not available (Workers runtime) |
| Env vars | .dev.vars file | wrangler secret |
The big one: Code that uses node:fs to read files works locally but fails in production. If you’re adding a new feature that reads files, it must use the Gitea API in production and node:fs only in development. The content layer handles this abstraction for content collections, but custom file reading needs conditional logic.
// Pattern for dual-mode file reading
async function readRepoFile(path) {
  if (import.meta.env.DEV) {
    // Local development — read from the filesystem
    const { readFile } = await import('node:fs/promises');
    return readFile(path, 'utf-8');
  }
  // Production — fetch the raw file from the Gitea API
  const response = await fetch(`${GITEA_API_URL}/repos/${owner}/${repo}/raw/${path}`);
  return response.text();
}
Tailscale Connectivity
Symptom: Andromeda hosts (Tarn-Host, Meridian-Host, etc.) are unreachable. Build drones on the remote network can’t connect to the orchestrator. Tailscale pings fail.
Checklist:
- Check Tailscale is running:
  tailscale status
  Look for the target host in the output. If it's missing, the host's Tailscale daemon might be down.
- Check subnet routers are advertising: The Milky Way and Andromeda networks are connected via Tailscale subnet routing. The subnet routers on each network need to be running and advertising their local subnets:
  # On the subnet router
  tailscale status --peers
  # Look for the subnet routes in the output
- Check if the subnet routes are approved: In the Tailscale admin console, subnet routes need to be explicitly approved. A new subnet route shows as "pending" until an admin approves it.
- Direct connectivity test:
  # Ping via Tailscale IP
  tailscale ping 100.64.0.27.91 # drone-Tarn
  # Check if it's a relay connection
  tailscale ping --verbose 100.64.0.57.110 # dr-Meridian-Host
- DERP relay fallback: If direct connections fail, Tailscale falls back to DERP relay servers. This adds latency (~100ms+) but should still work. If even DERP is failing, there's likely a firewall blocking UDP/41641 on one side.
Common cause: The remote Andromeda site occasionally has network interruptions — router restarts, ISP hiccups, or configuration changes. Check whether any Andromeda hosts are online before assuming a Tailscale problem.
Docker Container Restarts
Symptom: A Docker service on Altair-Link keeps restarting or has restarted unexpectedly.
Diagnostic:
- Check container status and restart count:
  ssh [email protected] "docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'"
- Check container logs via Dozzle: Open http://10.42.0.199:9999 in a browser. Dozzle provides a real-time log viewer for all Docker containers. Find the problematic container and check its recent logs.
- Check container logs via CLI:
  ssh [email protected] "docker logs --tail 100 container-name"
- Check system resources:
  ssh [email protected] "docker stats --no-stream"
  Look for containers hitting memory limits or consuming excessive CPU.
- Check Docker daemon logs:
  ssh [email protected] "journalctl -u docker --since '1 hour ago'"
Common causes:
- OOM kill: The container exceeds its memory limit. Check docker inspect container-name | grep Memory for limits, and the container logs for OOM messages.
- Health check failure: Containers with health checks get restarted if the check fails consecutively. Check whether the service inside the container is responding to health checks.
- Dependent service down: A container that depends on another service (like a database) will crash-loop if that service is unavailable.
Build Swarm Connectivity
Symptom: Build swarm commands fail. build-swarm status shows the orchestrator as offline or drones as unreachable.
Check the orchestrator:
The orchestrator runs on Izar-Host at 10.42.0.201:8080. If it’s unreachable, nothing in the swarm works.
# Direct check
curl -s http://10.42.0.201:8080/api/status
# Via gateway
curl -s http://10.42.0.199:8090/health
If the orchestrator is down:
# SSH to Izar-Host and check the container
ssh [email protected] "docker ps -a | grep swarm"
ssh [email protected] "docker logs swarm-orchestrator --tail 50"
ssh [email protected] "cd /opt/swarm-orchestrator && docker compose up -d"
Check individual drones:
# Check drone status via orchestrator API
curl -s http://10.42.0.201:8080/api/drones | python3 -m json.tool
# Ping drones directly
ping 10.42.0.203 # drone-Izar-Host
ping 10.42.0.194 # Tau-Host
tailscale ping 100.64.0.27.91 # drone-Tarn
tailscale ping 100.64.0.57.110 # dr-Meridian-Host
Common causes:
- Orchestrator container crashed — restart it on Izar-Host
- Gateway container crashed — restart it on Altair-Link
- Andromeda drones unreachable — check Tailscale connectivity (see above)
- dr-Meridian-Host offline — someone restarted Meridian-Host’s Unraid server at the remote site
- Drone configuration drift — a drone updated its Portage config without syncing with the fleet
Health Monitor Diagnostics
The /admin/health page monitors 12 services across 3 categories (AI, Infrastructure, External APIs). The API (/api/admin/health-check) uses a service registry architecture — all service metadata (icon, env vars, endpoint, per-error-code hints) lives server-side. The frontend is a pure data renderer with zero hardcoded service knowledge.
Diagnostic Panels
Every non-online service card shows an inline diagnostic panel with:
- Error code and meaning: e.g., “HTTP 401 — Unauthorized — token invalid or missing”
- Fix hint: Per-service, per-error-code guidance from the API (e.g., “Relay secret mismatch. Verify FORGE_RELAY_SECRET matches the Workbench VM config.”)
- Environment variables: Which CF Pages env vars control this service
- Endpoint: Which URL path was probed
Self-Healing
The page automatically retries failed services up to 3 times with 8-second delays. Each retry re-probes only the failed service via the ?service=id query parameter. A countdown indicator shows when the next retry will fire. Services that recover get a green flash animation.
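The behavior described above amounts to a bounded retry loop. A sketch with assumed names (the real page code in /admin/health may differ):

```javascript
// Hedged sketch of the self-healing loop: up to 3 attempts with a
// delay before each re-probe, bailing out as soon as the service recovers.
async function retryService(probe, { attempts = 3, delayMs = 8000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    const result = await probe(); // re-probes only this one service
    if (result && result.status === 'online') return result;
  }
  return null; // still failing after all retries
}
```

Here probe stands in for a fetch of /api/admin/health-check?service=id for the failed service; the { status: 'online' } response shape is an assumption.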
Single-Service Retry
Click the retry button on any failed service card to re-probe just that service. The API supports GET /api/admin/health-check?service=forge-relay for targeted re-checks without re-probing all 12 services.
Instructions Panel
The collapsible instructions panel at the top explains monitoring intervals, diagnostic panels, self-healing behavior, and how to configure services (env vars in CF Pages dashboard).
Common degraded states and their causes:
| Service | Typical Error | Cause | Fix |
|---|---|---|---|
| Forge Relay | HTTP 401 | Wrong auth header type | Must use Authorization: Bearer <secret> (not X-Relay-Secret) |
| Pentest Sentinel | HTTP 401 | Wrong auth header or path | Uses X-Api-Key header at /api/health (not Bearer at /health) |
| Meridian-Host | HTTP 404 | Wrong health endpoint path | Health is at /api/health (not /health) |
| Tarn-Host AdminBox | HTTP 404 | Wrong health endpoint path | Health is at /api/health (not /health) |
| Uptime Kuma | HTTP 403/404 | Wrong slug or URL | Slug is public (not heartbeat), URL should be https://status.Arcturus-Prime.com |
| ElevenLabs | HTTP 401 | Endpoint needs elevated scope | Use /v1/voices instead of /v1/user/subscription |
General Debugging Tips
Check Dozzle first: http://10.42.0.199:9999 is the fastest way to see what’s happening inside Docker containers on Altair-Link. It’s a web-based log viewer that shows real-time output from all containers.
Use npm run build over npm run dev for validation: The dev server is more lenient than the production build. Things that work in dev can fail in build — especially content schema validation, SSR compatibility, and import resolution.
Check the Cloudflare dashboard: For production issues, the Cloudflare Workers dashboard shows request logs, error rates, and execution times. If the Workers function is throwing errors, the dashboard will show you the stack traces.
Check Prometheus/Grafana: Monitoring runs on Altair-Link — Prometheus at 10.42.0.199:9090 and Grafana at 10.42.0.199:3000. Historical metrics can show when an issue started and correlate it with other events (deploys, container restarts, resource spikes).