Troubleshooting
Common issues and fixes for Arcturus-Prime development, deployment, and infrastructure
This page covers the issues that come up repeatedly when developing, deploying, or operating Arcturus-Prime. If something breaks and you don’t know where to start, check here first.
Admin Area Returns Plain “Not Found”
Symptom: Every /admin/* route returns a plain text “Not Found” response — no styled 404 page, no layout, no navigation. Just the words “Not Found” on a blank page. /admin/sandbox and /admin/deployments still work.
Cause: DEMO_MODE is set to a truthy value (1, true, yes, or on) in the Cloudflare Pages production environment variables. When demo mode is active, the middleware blocks all /admin/* routes except /admin/sandbox and /admin/deployments.
Code path:
- src/middleware.ts, line ~96: if (isDemoMode()) { ... }
- src/lib/demo-mode.ts: isDemoMode() reads the DEMO_MODE env var
- isRestrictedInDemoRoute() matches any route starting with /admin that isn't in the allowlist (/admin/sandbox, /admin/deployments)
- Middleware returns new Response('Not Found', { status: 404 }) — the plain text response
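As a mental model, the check can be sketched like this. The function names come from the code path above, but the bodies are assumptions, not the repo's actual code:

```javascript
// Hypothetical sketch of the demo-mode gate; internals assumed.
const DEMO_ALLOWLIST = ['/admin/sandbox', '/admin/deployments'];

function isDemoMode(env) {
  // Any of these values counts as truthy, per the symptom description
  const raw = String(env.DEMO_MODE ?? '').trim().toLowerCase();
  return ['1', 'true', 'yes', 'on'].includes(raw);
}

function isRestrictedInDemoRoute(pathname) {
  return (
    pathname.startsWith('/admin') &&
    !DEMO_ALLOWLIST.some((p) => pathname === p || pathname.startsWith(p + '/'))
  );
}
```

Under this sketch, any truthy DEMO_MODE turns every non-allowlisted /admin route into the plain 404.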
Fix: Remove DEMO_MODE from the CF Pages production environment, or set it to an empty string / false. The admin area will work again on the next deployment.
CF Pages Dashboard → Settings → Environment Variables → Production
→ Delete or set DEMO_MODE to empty/false
Why this keeps happening: The credentials vault (credentials.md) lists DEMO_MODE: true in both Production and Preview environment variable sections. Any automated process or AI session that syncs environment variables from the vault to CF Pages will silently enable demo mode on production, locking out the entire admin area. This has caused multiple outages.
Prevention:
- DEMO_MODE must be absent or false in the CF Pages production environment
- The credentials vault should mark DEMO_MODE as demo-subdomain-only with a clear warning
- After any bulk env var sync to CF Pages, verify DEMO_MODE is not truthy in production
- The diagnostic giveaway is the plain text “Not Found” — a real Astro 404 would show a styled page with navigation
ViewTransitions and DOMContentLoaded
Symptom: JavaScript functionality works on initial page load but breaks when navigating between pages. Loading spinners get stuck. Interactive components stop responding after the first client-side navigation.
Cause: Arcturus-Prime uses Astro’s <ViewTransitions /> (Client Router) in both CosmicLayout.astro and BaseLayout.astro. When ViewTransitions handles a navigation, it performs a client-side page swap rather than a full page reload. This means DOMContentLoaded only fires once — on the initial full page load — and never fires again on subsequent navigations.
Fix: Replace all DOMContentLoaded listeners with astro:page-load:
// WRONG — breaks on client-side navigation
document.addEventListener('DOMContentLoaded', () => {
init();
});
// ALSO WRONG — the readyState pattern doesn't help
if (document.readyState === 'loading') {
document.addEventListener('DOMContentLoaded', init);
} else {
init();
}
// CORRECT — fires on both initial load and ViewTransitions navigations
document.addEventListener('astro:page-load', () => {
init();
});
This applies to ALL script tags: <script>, <script is:inline>, and <script define:vars> in pages, components, and layouts.
History: This bug affected 37 files across 23 pages, 6 components, and 3 layouts when ViewTransitions was added. Every file had to be audited and updated. The /blog/ page was the most visible breakage — the post list showed a permanent loading spinner when navigated to via a client-side link.
How to check: Search the codebase for DOMContentLoaded:
grep -r "DOMContentLoaded" src/
If that returns any matches in .astro, .js, or .ts files, they need to be updated.
Script Scope: is:inline define:vars vs Module Scripts
Symptom: A JavaScript variable defined in a <script is:inline define:vars> block throws ReferenceError: _variableName is not defined at runtime, even though the variable is clearly declared in the same component file.
Cause: Astro components can contain multiple <script> blocks, and each one is a separate JavaScript scope. A <script is:inline define:vars={{ ... }}> block runs as a separate inline script in the browser — its const/let/var declarations are not visible to a separate <script> (module) block in the same .astro file.
<!-- Block 1: inline script with server data -->
<script is:inline define:vars={{ myData }}>
const parsed = JSON.parse(myData);
// ❌ 'parsed' only exists in THIS script block
</script>
<!-- Block 2: module script -->
<script>
console.log(parsed); // ❌ ReferenceError: parsed is not defined
</script>
Fix: Bridge the two scopes via window. Store data from the inline block on window, then read it in the module block:
<!-- Block 1: inline script — parse and expose globally -->
<script is:inline define:vars={{ myData }}>
window.MY_DATA = JSON.parse(myData);
</script>
<!-- Block 2: module script — read from window -->
<script>
const parsed = window.MY_DATA;
</script>
Why this pattern exists: define:vars is the only way to inject server-side (frontmatter) values into client-side scripts. The inline block handles the SSR→client handoff, while the module block handles imports (like import { TendrilGraph } from '@tendril/graph') which is:inline scripts cannot use.
History: This bug broke the KnowledgeGraph on /blog/ after the Tendril config system refactor (2026-02-28, commit 31bf382). Config sections were parsed as const variables in the define:vars block but referenced in the module script where TendrilGraph is imported. The fix stores all parsed config on window.GRAPH_CONFIG in the inline block and reads it back in the module block.
How to check: If a component has both <script is:inline define:vars> and a separate <script>, verify that no const/let/var from the first block is referenced in the second. All cross-block data must go through window.
Inline Script Type Annotations Causing Unexpected token ':'
Symptom: /admin intermittently throws:
Uncaught SyntaxError: Failed to execute 'replaceWith' on 'Element': Unexpected token ':'
Follow-on effects include broken module initializers (for example, knowledge graph container errors) because page scripts fail during client-side navigation.
Cause: TypeScript syntax (for example (r: any) => ...) was included inside a browser-executed inline script. Inline scripts in .astro pages run as plain JavaScript in the browser, so TS-only annotations are invalid and crash at parse time.
Fix: Remove TS annotations from inline scripts and keep the logic JavaScript-safe with null guards.
// WRONG inside inline browser script
(data.results || []).filter((r: any) => !r.synced)
// CORRECT
(data.results || []).filter((r) => !r.synced)
Where this occurred: src/pages/admin/index.astro in the syncToGitHub client-side function.
Prevention:
- Treat <script is:inline> blocks as browser JavaScript only.
- Keep TypeScript types in frontmatter, .ts files, or framework-compiled script contexts.
- If a parser error appears from ClientRouter.*.js, inspect the originating inline script for TS syntax.
/admin/system-test Repeated DOM Errors (style / appendChild)
Symptom: Console spam during or after client-side navigation with errors like:
Cannot read properties of undefined (reading 'style')
Cannot read properties of null (reading 'appendChild')
In some cases this repeats many times and degrades the admin page experience.
Cause: System-test client code executed without strict route/DOM guards. Under ViewTransitions timing, the script can run when required elements are not present yet (or no longer present), and direct DOM mutations then throw.
Fix:
- Scope initialization to the intended route (/admin/system-test or /system-test).
- Verify required elements exist before mutating UI.
- Make the renderer tolerant to partial/missing JSON fields.
- Register the astro:page-load handler once to avoid duplicate binding.
Where this was fixed: src/pages/admin/system-test.astro (commit 3dde152 on 2026-02-28).
Prevention: For all inline page scripts, treat element lookups as optional and return early if the page root is missing. This is mandatory when using Astro ViewTransitions.
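The guard logic reduces to a pure check, sketched here with the route names from the fix above; how the result is wired into the page script is an assumption:

```javascript
// Hedged sketch: decide whether system-test init should run at all.
function shouldInitSystemTest(pathname, rootElement) {
  const onRoute = pathname === '/admin/system-test' || pathname === '/system-test';
  if (!onRoute) return false;     // wrong page, never touch the DOM
  if (!rootElement) return false; // page root missing or already swapped out
  return true;
}
```

In the page script this would be called once from an astro:page-load listener, passing window.location.pathname and the result of a document.getElementById lookup for the page root.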
Console Error Spam: Null Crashes from View Transitions Race Condition
Symptom: The browser console fills with hundreds of errors when navigating between admin pages:
Cannot set properties of null (setting 'innerHTML')
Cannot set properties of null (setting 'textContent')
Cannot read properties of null (reading 'classList')
Errors reference functions like loadStatus, loadPipeline, loadStats, loadQueue, loadFollowUps.
Cause: Async race condition between in-flight fetch() callbacks and View Transitions DOM swap. The sequence:
- User visits /admin/jobs — setInterval(loadAll, 10000) starts polling
- An interval tick fires, and multiple fetch() calls go out
- User navigates to /admin/health via View Transitions
- astro:before-swap fires — the interval is cleared
- But the already-in-flight fetch() calls resolve after the DOM swap
- Callbacks try document.getElementById('startStopBtn').innerHTML = ...
- The element doesn't exist on the new page — crash
Clearing the interval only prevents future ticks. It does nothing for requests already in flight.
Fix: Null-guard every getElementById() call in polling functions. Two patterns:
// Pattern 1: Early bail on critical element
async function loadStatus() {
const data = await apiFetch('status');
if (!data) return;
const btn = document.getElementById('startStopBtn');
if (!btn) return; // Page navigated away — bail
btn.innerHTML = data.running ? 'Stop' : 'Start';
}
// Pattern 2: Safe setter for many independent elements
const set = (id, val) => {
const el = document.getElementById(id);
if (el) el.textContent = val;
};
set('statTotal', s.total_applications || 0);
set('statToday', s.today || 0);
Where this was fixed: jobs.astro (5 functions, ~28 calls), email.astro (loadStats), servers/index.astro (added missing interval cleanup). Commit 9fcf10e on 2026-02-28.
Prevention: Every polling function that accesses DOM elements must null-check the result of getElementById(). See the View Transitions doc for the full defensive pattern.
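Beyond null guards, the in-flight requests themselves can be cancelled at swap time with an AbortController. This is a sketch of that complementary pattern; the listener wiring is an assumption, not code taken from jobs.astro:

```javascript
// Hedged sketch: abort in-flight polling fetches when the page swaps.
let controller = null;
let timerId = null;

function startPolling(loadAll, intervalMs = 10000) {
  controller = new AbortController();
  // Each tick passes the signal so loadAll can forward it to fetch()
  timerId = setInterval(() => loadAll(controller.signal), intervalMs);
}

function stopPolling() {
  clearInterval(timerId);             // stops future ticks
  if (controller) controller.abort(); // cancels requests already in flight
}

// Presumed wiring in the page script:
// document.addEventListener('astro:before-swap', stopPolling);
```

Aborted fetches reject with an AbortError instead of resolving, so their callbacks never run against the swapped-out DOM. The null guards above remain necessary for responses that arrive before the abort fires.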
Build Failure: Wrong Relative Import Path in Subdirectory Pages
Symptom: CF Pages build fails with:
Could not resolve "../../layouts/CosmicLayout.astro" from "src/pages/admin/servers/index.astro"
Cause: Pages directly in src/pages/admin/ are 2 directory levels below src/, so they use ../../layouts/CosmicLayout.astro. But pages in subdirectories like src/pages/admin/servers/ are 3 levels deep and need ../../../layouts/CosmicLayout.astro. Using the wrong depth resolves to a nonexistent path (src/pages/layouts/ instead of src/layouts/).
Fix: Count the directory depth from the file to src/ and use the correct number of ../ segments:
| File location | Depth from src/ | Layout import |
|---|---|---|
| src/pages/admin/health.astro | 2 | ../../layouts/CosmicLayout.astro |
| src/pages/admin/servers/index.astro | 3 | ../../../layouts/CosmicLayout.astro |
| src/pages/api/admin/health-check.ts | 3 | ../../../config/... or ../../../lib/... |
Where this occurred: Commit 53bdc9b rewrote servers/index.astro from standalone HTML to use CosmicLayout, but copied the import path from a sibling file one level up. Fixed in commit 3b2f130 on 2026-02-28.
Prevention: When creating pages in subdirectories under admin/, verify the import path resolves correctly. This error passes npm run dev silently but breaks the CF Pages production build.
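The rule is mechanical: one ../ per directory between src/ and the file. A hypothetical helper (illustration only, not part of the repo) makes the counting explicit:

```javascript
// Hypothetical helper: derive the ../ prefix for a page's layout import.
// Depth = number of directories between src/ and the file itself.
function layoutImportPath(pagePath) {
  const parts = pagePath.replace(/^src\//, '').split('/');
  const depth = parts.length - 1; // exclude the filename
  return '../'.repeat(depth) + 'layouts/CosmicLayout.astro';
}
```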
Build Failures: SSR Bundle Exclusions
Symptom: npm run build fails with errors related to better-sqlite3 or @argonaut/core. Errors typically mention native modules, missing bindings, or import resolution failures.
Cause: These packages contain native Node.js modules that can’t be bundled into the Cloudflare Workers SSR output. They need to be excluded from the Vite bundling process.
Fix: Ensure these packages are listed in the external configuration in astro.config.mjs:
// astro.config.mjs
export default defineConfig({
vite: {
build: {
rollupOptions: {
external: ['better-sqlite3', '@argonaut/core']
}
},
optimizeDeps: {
exclude: ['better-sqlite3', '@argonaut/core']
}
}
});
Both rollupOptions.external and optimizeDeps.exclude need to list these packages. The first prevents them from being bundled in the production build; the second prevents Vite from pre-bundling them in development.
When this happens: Usually after adding a new dependency that transitively depends on a native module, or after updating Astro/Vite versions that change bundling behavior.
Cloudflare Access Auth Issues
Symptom: Admin pages return 401 or 403 errors. Users who should have admin access are denied. The auth middleware rejects valid Cloudflare Access tokens.
Cause: Usually one of the Cloudflare Access environment variables is wrong or missing.
Checklist:
- Check CF_ACCESS_AUD: This is the Application Audience tag from your Cloudflare Access application. It's a 64-character hex string. If it doesn't match exactly, JWT validation fails silently.
  # Verify the secret is set in production
  npx wrangler secret list
  # Look for CF_ACCESS_AUD in the output
- Check CF_ACCESS_TEAM_DOMAIN: Must match your Cloudflare Access team domain exactly, including the .cloudflareaccess.com suffix.
- Check ADMIN_EMAILS: The email in the JWT must appear in this comma-separated list. Check for whitespace issues — "[email protected], [email protected]" (note the space after the comma) might not match "[email protected]" depending on how the comparison is implemented.
- Check ALLOW_CF_AUTH_COOKIE_FALLBACK: If the JWT isn't coming through the CF-Access-Jwt-Assertion header (common with iframes or embedded contexts), the middleware needs this set to true to check the CF_Authorization cookie as a fallback.
- Token expiration: Cloudflare Access tokens have a TTL. If a user has been idle for a long time, their token may have expired. A page refresh triggers a new auth flow.
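The ADMIN_EMAILS whitespace pitfall is easy to neutralize by normalizing both sides before comparing. A sketch of such a check (an assumed helper, not necessarily what the middleware does; the addresses in the test are hypothetical):

```javascript
// Hedged sketch: whitespace- and case-tolerant ADMIN_EMAILS matching.
function isAdminEmail(adminEmailsVar, jwtEmail) {
  const allow = String(adminEmailsVar || '')
    .split(',')
    .map((e) => e.trim().toLowerCase())
    .filter(Boolean);
  return allow.includes(String(jwtEmail || '').trim().toLowerCase());
}
```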
Local dev note: Authentication is bypassed in local development (npm run dev). If auth works locally but fails in production, the issue is definitely in the production secrets.
API Proxy Timeouts
Symptom: API calls to /api/proxy/*, /api/swarm/*, or /api/gateway/* return timeouts or 502 errors on the live site.
Cause: The proxy routes forward requests to Docker services running on Altair-Link (10.42.0.199). If those services are down, the proxy has nothing to forward to.
Diagnostic steps:
- Check if Altair-Link is reachable:
  ping 10.42.0.199
- Check Docker services on Altair-Link:
  ssh [email protected] "docker ps --format 'table {{.Names}}\t{{.Status}}\t{{.Ports}}'"
- Check specific services:
  # Gateway
  curl -s http://10.42.0.199:8090/health
  # Command Center
  curl -s http://10.42.0.199:8093/api/v1/services/public/build-swarm
  # Prometheus
  curl -s http://10.42.0.199:9090/-/ready
  # Loki
  curl -s http://10.42.0.199:3100/ready
- Check Dozzle for container logs: Dozzle runs on Altair-Link at port 9999 and provides a web interface for viewing Docker container logs:
  http://10.42.0.199:9999
- Restart the specific service:
  ssh [email protected] "cd /path/to/service && docker compose restart"
Common cause: Altair-Link runs a lot of Docker containers. Occasionally a container will OOM or hit a resource limit and stop responding. The Docker restart: unless-stopped policy usually handles crashes, but hung containers (alive but not responding) need a manual restart.
Content Not Rendering
Symptom: A new content file doesn’t appear on the site, or renders with missing fields / wrong layout.
Cause: Frontmatter schema mismatch. Every content collection (posts, journal, docs) has a schema defined in src/content/config.ts. If the frontmatter in your file doesn’t match the schema, Astro either silently drops the file or throws a build error.
Checklist:
- Check required fields: Open src/content/config.ts and check which fields are required for your content collection. Missing required fields cause silent failures — the file just doesn't appear.
- Check field types: A pubDate field expecting a Date type won't accept a string like "2026-02-23" — it needs to be unquoted YAML (pubDate: 2026-02-23). Similarly, tags must be an array, not a comma-separated string.
- Check file location: Content must be in the correct directory. A blog post in src/content/journal/ won't appear in the blog collection.
- Check the slug: If the filename contains characters that Astro can't convert to a valid URL slug, the content may be silently skipped. Stick to lowercase alphanumeric characters and hyphens.
- Check for YAML errors: Invalid YAML in frontmatter (missing closing ---, tabs instead of spaces, unquoted special characters) causes silent parse failures. Run the file through a YAML validator if unsure.
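Putting the field-type rules together, a frontmatter shape that avoids the pitfalls above might look like this. The field names are illustrative; the authoritative schema lives in src/content/config.ts:

```yaml
---
title: "Example Post"   # quoted string is fine for text fields
pubDate: 2026-02-23     # unquoted YAML date, not a quoted string
tags:                   # an array, not a comma-separated string
  - astro
  - troubleshooting
---
```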
Quick test: Run npm run build and check the output. Astro reports content collection errors during build, even if npm run dev silently swallows them.
Dev Server vs Production Differences
Several things behave differently in local development vs production on Cloudflare Workers.
| Behavior | Local Dev (npm run dev) | Production (Cloudflare Workers) |
|---|---|---|
| Authentication | Bypassed — no CF Access | Full Cloudflare Access flow |
| Content source | Local filesystem (node:fs) | Gitea API via GITEA_* env vars |
| KV storage | Miniflare local KV | Cloudflare Workers KV |
| API proxy | Vite dev server proxy | Cloudflare Worker fetch handler |
| Node.js APIs | Available (node:fs, node:path) | Not available (Workers runtime) |
| Env vars | .dev.vars file | wrangler secret |
The big one: Code that uses node:fs to read files works locally but fails in production. If you’re adding a new feature that reads files, it must use the Gitea API in production and node:fs only in development. The content layer handles this abstraction for content collections, but custom file reading needs conditional logic.
// Pattern for dual-mode file reading
async function readRepoFile(path) {
  if (import.meta.env.DEV) {
    // Local development — read from the filesystem
    const { readFile } = await import('node:fs/promises');
    return readFile(path, 'utf-8');
  }
  // Production — fetch the raw file from the Gitea API
  const response = await fetch(`${GITEA_API_URL}/repos/${owner}/${repo}/raw/${path}`);
  return response.text();
}
Tailscale Connectivity
Symptom: Andromeda hosts (Tarn-Host, Meridian-Host, etc.) are unreachable. Build drones on the remote network can’t connect to the orchestrator. Tailscale pings fail.
Checklist:
- Check Tailscale is running:
  tailscale status
  Look for the target host in the output. If it's missing, the host's Tailscale daemon might be down.
- Check subnet routers are advertising: The Milky Way and Andromeda networks are connected via Tailscale subnet routing. The subnet routers on each network need to be running and advertising their local subnets:
  # On the subnet router
  tailscale status --peers
  # Look for the subnet routes in the output
- Check if the subnet routes are approved: In the Tailscale admin console, subnet routes need to be explicitly approved. A new subnet route shows as "pending" until an admin approves it.
- Direct connectivity test:
  # Ping via Tailscale IP
  tailscale ping 100.64.0.27.91 # drone-Tarn
  # Check if it's a relay connection
  tailscale ping --verbose 100.64.0.57.110 # dr-Meridian-Host
- DERP relay fallback: If direct connections fail, Tailscale falls back to DERP relay servers. This adds latency (~100ms+) but should still work. If even DERP is failing, there's likely a firewall blocking UDP/41641 on one side.
Common cause: The remote Andromeda site occasionally has network interruptions — router restarts, ISP hiccups, or configuration changes. Check whether any Andromeda hosts are online before assuming a Tailscale problem.
Docker Container Restarts
Symptom: A Docker service on Altair-Link keeps restarting or has restarted unexpectedly.
Diagnostic:
- Check container status and restart count:
  ssh [email protected] "docker ps -a --format 'table {{.Names}}\t{{.Status}}\t{{.RunningFor}}'"
- Check container logs via Dozzle: Open http://10.42.0.199:9999 in a browser. Dozzle provides a real-time log viewer for all Docker containers. Find the problematic container and check its recent logs.
- Check container logs via CLI:
  ssh [email protected] "docker logs --tail 100 container-name"
- Check system resources:
  ssh [email protected] "docker stats --no-stream"
  Look for containers hitting memory limits or consuming excessive CPU.
- Check Docker daemon logs:
  ssh [email protected] "journalctl -u docker --since '1 hour ago'"
Common causes:
- OOM kill: The container exceeds its memory limit. Check docker inspect container-name | grep Memory for limits, and the container logs for OOM messages.
- Health check failure: Containers with health checks get restarted if the check fails consecutively. Check whether the service inside the container is responding to health checks.
- Dependent service down: A container that depends on another service (like a database) will crash-loop if that service is unavailable.
Build Swarm Connectivity
Symptom: Build swarm commands fail. build-swarm status shows the orchestrator as offline or drones as unreachable.
Check the orchestrator:
The orchestrator runs on Izar-Host at 10.42.0.201:8080. If it’s unreachable, nothing in the swarm works.
# Direct check
curl -s http://10.42.0.201:8080/api/status
# Via gateway
curl -s http://10.42.0.199:8090/health
If the orchestrator is down:
# SSH to Izar-Host and check the container
ssh [email protected] "docker ps -a | grep swarm"
ssh [email protected] "docker logs swarm-orchestrator --tail 50"
ssh [email protected] "cd /opt/swarm-orchestrator && docker compose up -d"
Check individual drones:
# Check drone status via orchestrator API
curl -s http://10.42.0.201:8080/api/drones | python3 -m json.tool
# Ping drones directly
ping 10.42.0.203 # drone-Izar-Host
ping 10.42.0.194 # Tau-Host
tailscale ping 100.64.0.27.91 # drone-Tarn
tailscale ping 100.64.0.57.110 # dr-Meridian-Host
Common causes:
- Orchestrator container crashed — restart it on Izar-Host
- Gateway container crashed — restart it on Altair-Link
- Andromeda drones unreachable — check Tailscale connectivity (see above)
- dr-Meridian-Host offline — someone restarted Meridian-Host’s Unraid server at the remote site
- Drone configuration drift — a drone updated its Portage config without syncing with the fleet
Health Monitor Diagnostics
The /admin/health page monitors 12 services across 3 categories (AI, Infrastructure, External APIs). The API (/api/admin/health-check) uses a service registry architecture — all service metadata (icon, env vars, endpoint, per-error-code hints) lives server-side. The frontend is a pure data renderer with zero hardcoded service knowledge.
Diagnostic Panels
Every non-online service card shows an inline diagnostic panel with:
- Error code and meaning: e.g., “HTTP 401 — Unauthorized — token invalid or missing”
- Fix hint: Per-service, per-error-code guidance from the API (e.g., “Relay secret mismatch. Verify FORGE_RELAY_SECRET matches the Workbench VM config.”)
- Environment variables: Which CF Pages env vars control this service
- Endpoint: Which URL path was probed
Self-Healing
The page automatically retries failed services up to 3 times with 8-second delays. Each retry re-probes only the failed service via the ?service=id query parameter. A countdown indicator shows when the next retry will fire. Services that recover get a green flash animation.
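The behavior described above amounts to a bounded retry loop. A sketch with assumed names (the real page code in /admin/health may differ):

```javascript
// Hedged sketch of the self-healing loop: up to 3 attempts with a
// delay before each re-probe, bailing out as soon as the service recovers.
async function retryService(probe, { attempts = 3, delayMs = 8000 } = {}) {
  for (let i = 0; i < attempts; i++) {
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    const result = await probe(); // re-probes only this one service
    if (result && result.status === 'online') return result;
  }
  return null; // still failing after all retries
}
```

Here probe stands in for a fetch of /api/admin/health-check?service=id for the failed service; the { status: 'online' } response shape is an assumption.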
Single-Service Retry
Click the retry button on any failed service card to re-probe just that service. The API supports GET /api/admin/health-check?service=forge-relay for targeted re-checks without re-probing all 12 services.
Instructions Panel
The collapsible instructions panel at the top explains monitoring intervals, diagnostic panels, self-healing behavior, and how to configure services (env vars in CF Pages dashboard).
Common degraded states and their causes:
| Service | Typical Error | Cause | Fix |
|---|---|---|---|
| Forge Relay | HTTP 401 | Wrong auth header type | Must use Authorization: Bearer <secret> (not X-Relay-Secret) |
| Pentest Sentinel | HTTP 401 | Wrong auth header or path | Uses X-Api-Key header at /api/health (not Bearer at /health) |
| Meridian-Host | HTTP 404 | Wrong health endpoint path | Health is at /api/health (not /health) |
| Tarn-Host AdminBox | HTTP 404 | Wrong health endpoint path | Health is at /api/health (not /health) |
| Uptime Kuma | HTTP 403/404 | Wrong slug or URL | Slug is public (not heartbeat), URL should be https://status.Arcturus-Prime.com |
| ElevenLabs | HTTP 401 | Endpoint needs elevated scope | Use /v1/voices instead of /v1/user/subscription |
General Debugging Tips
Check Dozzle first: http://10.42.0.199:9999 is the fastest way to see what’s happening inside Docker containers on Altair-Link. It’s a web-based log viewer that shows real-time output from all containers.
Use npm run build over npm run dev for validation: The dev server is more lenient than the production build. Things that work in dev can fail in build — especially content schema validation, SSR compatibility, and import resolution.
Check the Cloudflare dashboard: For production issues, the Cloudflare Workers dashboard shows request logs, error rates, and execution times. If the Workers function is throwing errors, the dashboard will show you the stack traces.
Check Prometheus/Grafana: Monitoring runs on Altair-Link — Prometheus at 10.42.0.199:9090 and Grafana at 10.42.0.199:3000. Historical metrics can show when an issue started and correlate it with other events (deploys, container restarts, resource spikes).