Scripts Reference
Complete reference for all utility scripts in the Arcturus-Prime scripts/ directory
The scripts/ directory contains the utility tooling that keeps Arcturus-Prime running — content pipelines, security scanning, image handling, sanitization, and publishing automation. This page documents every script, what it does, and how to use it.
Content Pipeline Scripts
spider-site-blueprint.js
Purpose: Crawls the Arcturus-Prime site and generates a manifest of all routes, components, and their relationships.
node scripts/spider-site-blueprint.js
Output: data/site-manifest.json
The spider walks every route in the Astro project, records which components are used on each page, maps the content collections to their routes, and outputs a structured manifest. This manifest is used by other tools (like the embedding pipeline) to know what content exists and where it lives.
The spider doesn’t hit the live site — it analyzes the source code and Astro configuration to determine routes. It reads astro.config.mjs, scans src/pages/, and resolves dynamic routes from content collections.
When to run: After adding new pages, changing routes, or modifying content collection schemas. The manifest should be regenerated before running build-embeddings.js so the embedding pipeline has an up-to-date list of content.
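As an illustrative sketch (not the script's actual code), the static half of that mapping turns a file path under src/pages/ into a URL route; dynamic routes resolved from content collections are more involved and handled separately by the real spider. The helper name is hypothetical:

```javascript
// Sketch: map a static file under src/pages/ to its URL route,
// following Astro's file-based routing conventions.
function fileToRoute(file) {
  return (
    '/' +
    file
      .replace(/^src\/pages\//, '')      // drop the pages prefix
      .replace(/\.(astro|md|mdx)$/, '')  // drop the file extension
      .replace(/(^|\/)index$/, '')       // index files map to their directory
  );
}
```

For example, `src/pages/docs/scripts.md` resolves to `/docs/scripts`, and `src/pages/index.astro` resolves to `/`.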
build-embeddings.js
Purpose: Generates RAG (Retrieval-Augmented Generation) embeddings for all site content.
node scripts/build-embeddings.js
Requires: OPENROUTER_API_KEY environment variable.
This is the RAG pipeline. It reads all publishable content from src/content/ (posts, journal, docs, projects, learn), splits it into chunks (400 words per chunk, 50-word overlap between chunks), sends each chunk to OpenRouter for embedding generation, and writes the output to public/embeddings-index.json.
Chunking strategy:
- Chunk size: 400 words — large enough to preserve context, small enough to be specific
- Overlap: 50 words — prevents information loss at chunk boundaries
- Boundary respect: Chunks break at word boundaries to avoid mid-word splits
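The strategy above can be sketched as a sliding word window; the function name and parameter defaults mirror the description, not the script's actual API:

```javascript
// Sketch of the chunking strategy: fixed-size word windows with
// overlap, splitting only at word boundaries.
function chunkWords(text, size = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With 400/50, the window advances 350 words per chunk, so the last 50 words of each chunk reappear at the start of the next one.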
The output includes a contentHash (SHA-256 of all source content files) for staleness detection by validate-embeddings.js.
When to run: Manually via npm run build:embeddings when content changes. Not part of the automatic build pipeline (removed from prebuild on 2026-03-01 to reduce deploy times by 30-120s).
auto-embeddings.js
Purpose: Smart wrapper around build-embeddings.js that only rebuilds when content changes.
# Auto-decide
node scripts/auto-embeddings.js
# Check only — exit code 1 if stale, 0 if fresh
node scripts/auto-embeddings.js --check
# Force rebuild even if content unchanged
node scripts/auto-embeddings.js --force
Requires: OPENROUTER_API_KEY environment variable (skips gracefully without it).
Computes a SHA-256 hash of all content files in src/content/ and compares it against the contentHash stored in the existing public/embeddings-index.json. If the hashes match, the rebuild is skipped (~1 second). If they differ, the full build-embeddings.js pipeline runs (~20 seconds).
Not integrated into the build pipeline. Removed from prebuild on 2026-03-01 because the API calls added 30-120s to every deploy. Run manually with npm run build:embeddings or node scripts/auto-embeddings.js when content changes. The validate-embeddings.js postbuild hook warns if embeddings are stale.
validate-embeddings.js
Purpose: Postbuild validation — warns if the public RAG embeddings are stale or missing.
node scripts/validate-embeddings.js
Checks performed:
- Verifies `embeddings-index.json` exists in `public/` or `dist/`
- Warns if the index is older than 7 days
- Warns if `contentHash` doesn't match current content
Advisory only — never blocks the build. Outputs warnings with suggested fix commands.
Integrated into: postbuild hook in package.json.
session-to-blog.js
Purpose: Converts a raw session transcript into a formatted blog post.
node scripts/session-to-blog.js path/to/session-transcript.md
Takes a session transcript (typically from a development or debugging session) and transforms it into a blog post with proper frontmatter, narrative structure, and the Arcturus-Prime voice. The script handles:
- Extracting key events and decisions from the session
- Reorganizing chronological notes into a narrative arc
- Adding frontmatter with appropriate tags and metadata
- Flagging sections that need human review (marked with `<!-- REVIEW -->`)
The output is a draft, not a finished post. It gets the structure right but always needs a human pass for voice, accuracy, and personality.
quick-publish.js
Purpose: One-command publish pipeline — sanitize, validate, and prepare content for publishing.
node scripts/quick-publish.js src/content/posts/2026-02-23-new-post.md
Runs the full pre-publish workflow:
- Validates frontmatter schema against `src/content/config.ts`
- Runs sanitization (calls `sanitize_journal_entry.py`)
- Checks for broken internal links
- Validates image references exist
- Outputs a summary of changes and any issues found
If validation passes with no issues, the content is ready to commit and deploy. If issues are found, they’re listed with specific line numbers and suggested fixes.
Sanitization Scripts
sanitize_journal_entry.py
Purpose: Applies the Galactic Identity System sanitization to a content file.
python3 scripts/sanitize_journal_entry.py src/content/journal/2026-02-23-entry.md
Reads identity_map.json from the repository root, applies every regex pattern in order to the target file, and writes the result back. This converts real hostnames to star-system names, real IPs to mapped IPs, and real usernames to their fictional equivalents.
The script is idempotent — running it on an already-sanitized file produces no changes, because the sanitized names don’t match the real-name patterns.
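A minimal sketch of that behavior, using a hypothetical two-entry map (the real identity_map.json has its own names and patterns). Because no replacement value matches any real-name pattern, a second pass finds nothing to change:

```javascript
// Sketch of the sanitization pass: apply every mapping pattern in
// order. Entries here are invented for illustration.
const identityMap = [
  { pattern: /\bhomelab-nas\b/g, replacement: 'vega-prime' },
  { pattern: /\b192\.168\.1\.10\b/g, replacement: '10.99.0.10' },
];

function sanitize(text) {
  return identityMap.reduce(
    (t, { pattern, replacement }) => t.replace(pattern, replacement),
    text
  );
}
```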
Important: This script only handles text content. Screenshots, images, and embedded media with visible real names or IPs must be sanitized manually.
See Galactic Identity System for the full mapping reference.
convert_session_to_journal.py
Purpose: Converts a raw session log into a structured journal entry.
python3 scripts/convert_session_to_journal.py path/to/raw-session.md
Takes unstructured session notes (timestamps, commands, output, observations) and formats them into a journal entry with proper frontmatter, sections, and the standard journal template structure. This is the first step in the session-to-content pipeline — raw notes become a journal entry, which can then optionally be expanded into a blog post.
generate_journal_from_transcript.py
Purpose: AI-assisted journal entry generation from a raw transcript.
python3 scripts/generate_journal_from_transcript.py path/to/transcript.jsonl
Similar to convert_session_to_journal.py, but handles JSONL-format transcripts (raw Claude Code conversation exports) and uses AI assistance to identify the key moments, decisions, and outcomes from the transcript. The output is a draft journal entry that captures the session’s essential content.
The AI assistance helps with:
- Identifying which parts of a long transcript are worth documenting
- Extracting commands and their outcomes
- Summarizing debugging sequences into concise narratives
- Suggesting appropriate tags and categories
Content Import and Sync
import-from-obsidian.js
Purpose: Imports content from an Obsidian vault into Arcturus-Prime’s content collections.
node scripts/import-from-obsidian.js ~/Documents/Arcturus-Prime-technical-vault/
Reads Obsidian markdown files, converts Obsidian-specific syntax (wikilinks, callouts, embeds) to standard markdown, maps frontmatter fields to Arcturus-Prime’s content schema, and places the files in the appropriate content collection (src/content/posts/, src/content/journal/, etc.).
Handles:
- `[[wikilinks]]` → standard markdown links
- Obsidian callouts → HTML/component equivalents
- `![[embeds]]` → inline content or figure references
- Tag format differences (`#tag` → YAML array)
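Wikilink conversion can be sketched as a single regex pass, assuming slugs come from lowercasing and hyphenating the page name (the real importer's slug rules may differ, and it also handles callouts and embeds):

```javascript
// Sketch: convert [[Page Name]] and aliased [[Page Name|label]]
// wikilinks into standard markdown links.
function convertWikilinks(md) {
  return md.replace(/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/g, (_, page, label) => {
    const slug = page.trim().toLowerCase().replace(/\s+/g, '-');
    return `[${label || page}](/${slug}/)`;
  });
}
```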
sync-tendril.js
Purpose: Cross-system content synchronization for the Tendril knowledge graph.
node scripts/sync-tendril.js
Ensures content metadata used by the Tendril knowledge graph is consistent across all content collections. Checks for:
- Missing `tags` in frontmatter
- Broken `related_posts` references (pointing to slugs that don't exist)
- Orphaned tags (used once, potentially a typo)
- Category consistency
Outputs a report of issues found. Doesn’t modify files automatically — it reports what needs fixing and you fix it manually.
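The `related_posts` check can be sketched like this: every referenced slug must exist among the collection's own slugs. The entry shape is an assumption for illustration:

```javascript
// Sketch: report related_posts references that point at slugs
// which don't exist in the collection.
function findBrokenRelated(entries) {
  const slugs = new Set(entries.map((e) => e.slug));
  const issues = [];
  for (const e of entries) {
    for (const ref of e.related_posts || []) {
      if (!slugs.has(ref)) issues.push({ slug: e.slug, missing: ref });
    }
  }
  return issues;
}
```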
Security and Validation
scan-content-claims.js
Purpose: Security auditing for content — checks for sensitive information that shouldn’t be published.
node scripts/scan-content-claims.js
Scans all content files for:
- Unsanitized IP addresses (10.42.0.x, 192.168.20.x patterns)
- Real hostnames that should be sanitized
- API keys, tokens, or credentials
- Full MAC addresses (should be partially obscured)
- Email addresses that aren’t public
- File paths containing real usernames
Returns a list of findings with file paths, line numbers, and the matched pattern. Run this before any deploy to catch accidental leaks.
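The unsanitized-IP check can be sketched with the two private ranges named above; the real scanner covers many more patterns (keys, MACs, emails, paths):

```javascript
// Sketch: flag IPs in the ranges that should have been sanitized.
const IP_PATTERNS = [
  /\b10\.42\.0\.\d{1,3}\b/g,
  /\b192\.168\.20\.\d{1,3}\b/g,
];

function findUnsanitizedIPs(text) {
  const hits = [];
  for (const re of IP_PATTERNS) {
    for (const m of text.matchAll(re)) {
      hits.push({ match: m[0], index: m.index });
    }
  }
  return hits;
}
```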
parse-identity-verification.js
Purpose: Identity and PII detection across content files.
node scripts/parse-identity-verification.js
A more thorough scanner than scan-content-claims.js, focused specifically on personally identifiable information. Checks for:
- Real names that should be sanitized (user mapping)
- Physical addresses or location details beyond city-level
- Phone numbers
- Account identifiers
- Any pattern matching known PII formats
This is the “paranoia pass” — run it when you’re about to publish something that discusses real people or real locations.
find-broken-links.py
Purpose: Link validation across all content files.
python3 scripts/find-broken-links.py
Crawls all content files, extracts internal and external links, and validates them:
- Internal links: Checks that the target route exists in the site manifest
- External links: Makes HEAD requests to verify the URL is reachable (with rate limiting)
- Anchor links: Validates that `#section` anchors point to real headings
Outputs a report of broken links grouped by file. External link checking can be slow (rate-limited to avoid hammering third-party servers), so there’s a --internal-only flag for quick runs.
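The internal-only pass can be sketched as a lookup against the route list from the site manifest; the manifest shape here is an assumption:

```javascript
// Sketch: an internal link is broken when its route (fragment and
// trailing slash stripped) isn't in the manifest's route list.
function findBrokenInternalLinks(links, manifest) {
  const routes = new Set(manifest.routes); // e.g. ['/', '/docs/scripts']
  return links.filter((href) => {
    if (!href.startsWith('/')) return false; // external links handled elsewhere
    const route = href.replace(/[#?].*$/, '').replace(/\/$/, '') || '/';
    return !routes.has(route);
  });
}
```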
Tag Management
consolidate_tags.py
Purpose: Tag cleanup and normalization across all content.
python3 scripts/consolidate_tags.py
Analyzes all tags used across content collections and identifies:
- Near-duplicates (e.g., “docker” vs “Docker” vs “docker-compose”)
- Tags used only once (potential typos)
- Tags that should be merged (e.g., “networking” and “network”)
- Inconsistent casing
Outputs a suggested consolidation plan. With the --apply flag, it modifies frontmatter directly (use with care — review the plan first).
# Dry run (default) — show suggestions
python3 scripts/consolidate_tags.py
# Apply changes
python3 scripts/consolidate_tags.py --apply
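Near-duplicate detection can be sketched by grouping tags that normalize to the same key (lowercased, spaces and underscores unified to hyphens); the real script applies additional heuristics:

```javascript
// Sketch: tags that normalize to the same key are merge candidates.
function findTagDuplicates(tags) {
  const groups = new Map();
  for (const tag of tags) {
    const key = tag.toLowerCase().replace(/[_\s]+/g, '-');
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(tag);
  }
  return [...groups.values()]
    .filter((g) => g.size > 1)
    .map((g) => [...g]);
}
```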
Image Handling
fetch-images.js
Purpose: Downloads remote images referenced in content and stores them locally.
node scripts/fetch-images.js
Scans content files for external image URLs, downloads them to public/images/, and updates the content to reference the local copies. This ensures the site doesn’t depend on external image hosts that might go down or change URLs.
Images are organized by content type:
- `public/images/posts/` for blog post images
- `public/images/journal/` for journal entry images
- `public/images/docs/` for documentation images
optimize-images.sh
Purpose: Bulk image optimization for web delivery.
bash scripts/optimize-images.sh
Processes all images in public/images/ through optimization:
- JPEG: Optimized with mozjpeg at quality 80
- PNG: Optimized with pngquant and oxipng
- WebP: Generated as alternative format for browsers that support it
- AVIF: Generated as alternative format for modern browsers
The script preserves originals and generates optimized versions alongside them. The Astro image pipeline (<Image /> component) selects the best format based on browser support.
When to run: After adding new images via fetch-images.js or manual placement. Not part of the regular build — run it manually when new images land.
Running Order
For a full content pipeline run (new content from session to published post):
1. `convert_session_to_journal.py` or `generate_journal_from_transcript.py` — Raw notes to journal
2. `session-to-blog.js` — Journal to blog post (if applicable)
3. `sanitize_journal_entry.py` — Apply Galactic Identity System
4. `fetch-images.js` — Download any external images
5. `optimize-images.sh` — Optimize images for web
6. `scan-content-claims.js` — Security audit
7. `parse-identity-verification.js` — PII check
8. `find-broken-links.py` — Validate links
9. `quick-publish.js` — Final validation and preparation
10. `spider-site-blueprint.js` — Update site manifest (after deploy)
Not every step is needed every time. A quick journal entry might only need steps 3, 6, and 9. A major blog post with external images gets the full pipeline.
RAG embeddings are a manual step — run npm run build:embeddings when content changes. The validate-embeddings.js postbuild hook warns if embeddings are stale but never blocks the build.