Scripts Reference
Complete reference for all utility scripts in the Arcturus-Prime scripts/ directory
The scripts/ directory contains the utility tooling that keeps Arcturus-Prime running — content pipelines, security scanning, image handling, sanitization, and publishing automation. This page documents every script, what it does, and how to use it.
Content Pipeline Scripts
spider-site-blueprint.js
Purpose: Crawls the Arcturus-Prime site and generates a manifest of all routes, components, and their relationships.
node scripts/spider-site-blueprint.js
Output: data/site-manifest.json
The spider walks every route in the Astro project, records which components are used on each page, maps the content collections to their routes, and outputs a structured manifest. This manifest is used by other tools (like the embedding pipeline) to know what content exists and where it lives.
The spider doesn’t hit the live site — it analyzes the source code and Astro configuration to determine routes. It reads astro.config.mjs, scans src/pages/, and resolves dynamic routes from content collections.
When to run: After adding new pages, changing routes, or modifying content collection schemas. The manifest should be regenerated before running build-embeddings.js so the embedding pipeline has an up-to-date list of content.
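As an illustrative sketch (not the script's actual code), the static half of that mapping turns a file path under src/pages/ into a URL route; dynamic routes resolved from content collections are more involved and handled separately by the real spider. The helper name is hypothetical:

```javascript
// Sketch: map a static file under src/pages/ to its URL route,
// following Astro's file-based routing conventions.
function fileToRoute(file) {
  return (
    '/' +
    file
      .replace(/^src\/pages\//, '')      // drop the pages prefix
      .replace(/\.(astro|md|mdx)$/, '')  // drop the file extension
      .replace(/(^|\/)index$/, '')       // index files map to their directory
  );
}
```

For example, `src/pages/docs/scripts.md` resolves to `/docs/scripts`, and `src/pages/index.astro` resolves to `/`.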
build-embeddings.js
Purpose: Generates RAG (Retrieval-Augmented Generation) embeddings for all site content.
node scripts/build-embeddings.js
Requires: OPENROUTER_API_KEY environment variable.
This is the RAG pipeline. It reads all publishable content from src/content/ (posts, journal, docs, projects, learn), splits it into chunks (400 words per chunk, 50-word overlap between chunks), sends each chunk to OpenRouter for embedding generation, and writes the output to public/embeddings-index.json.
Chunking strategy:
- Chunk size: 400 words — large enough to preserve context, small enough to be specific
- Overlap: 50 words — prevents information loss at chunk boundaries
- Boundary respect: Chunks break at word boundaries to avoid mid-word splits
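The strategy above can be sketched as a sliding word window; the function name and parameter defaults mirror the description, not the script's actual API:

```javascript
// Sketch of the chunking strategy: fixed-size word windows with
// overlap, splitting only at word boundaries.
function chunkWords(text, size = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap; // how far the window advances each time
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

With 400/50, the window advances 350 words per chunk, so the last 50 words of each chunk reappear at the start of the next one.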
The output includes a contentHash (SHA-256 of all source content files) for staleness detection by validate-embeddings.js.
When to run: Manually via npm run build:embeddings when content changes. Not part of the automatic build pipeline (removed from prebuild on 2026-03-01 to reduce deploy times by 30-120s).
auto-embeddings.js
Purpose: Smart wrapper around build-embeddings.js that only rebuilds when content changes.
# Auto-decide
node scripts/auto-embeddings.js
# Check only — exit code 1 if stale, 0 if fresh
node scripts/auto-embeddings.js --check
# Force rebuild even if content unchanged
node scripts/auto-embeddings.js --force
Requires: OPENROUTER_API_KEY environment variable (skips gracefully without it).
Computes a SHA-256 hash of all content files in src/content/ and compares it against the contentHash stored in the existing public/embeddings-index.json. If the hashes match, the rebuild is skipped (~1 second). If they differ, the full build-embeddings.js pipeline runs (~20 seconds).
Not integrated into the build pipeline. Removed from prebuild on 2026-03-01 because the API calls added 30-120s to every deploy. Run manually with npm run build:embeddings or node scripts/auto-embeddings.js when content changes. The validate-embeddings.js postbuild hook warns if embeddings are stale.
validate-embeddings.js
Purpose: Postbuild validation — warns if the public RAG embeddings are stale or missing.
node scripts/validate-embeddings.js
Checks performed:
- Verifies `embeddings-index.json` exists in `public/` or `dist/`
- Warns if the index is older than 7 days
- Warns if `contentHash` doesn't match current content
Advisory only — never blocks the build. Outputs warnings with suggested fix commands.
Integrated into: postbuild hook in package.json.
session-to-blog.js
Purpose: Converts a raw session transcript into a formatted blog post.
node scripts/session-to-blog.js path/to/session-transcript.md
Takes a session transcript (typically from a development or debugging session) and transforms it into a blog post with proper frontmatter, narrative structure, and the Arcturus-Prime voice. The script handles:
- Extracting key events and decisions from the session
- Reorganizing chronological notes into a narrative arc
- Adding frontmatter with appropriate tags and metadata
- Flagging sections that need human review (marked with `<!-- REVIEW -->`)
The output is a draft, not a finished post. It gets the structure right but always needs a human pass for voice, accuracy, and personality.
quick-publish.js
Purpose: One-command publish pipeline — sanitize, validate, and prepare content for publishing.
node scripts/quick-publish.js src/content/posts/2026-02-23-new-post.md
Runs the full pre-publish workflow:
- Validates frontmatter schema against `src/content/config.ts`
- Runs sanitization (calls `sanitize_journal_entry.py`)
- Checks for broken internal links
- Validates image references exist
- Outputs a summary of changes and any issues found
If validation passes with no issues, the content is ready to commit and deploy. If issues are found, they’re listed with specific line numbers and suggested fixes.
Sanitization Scripts
sanitize_journal_entry.py
Purpose: Applies the Galactic Identity System sanitization to a content file.
python3 scripts/sanitize_journal_entry.py src/content/journal/2026-02-23-entry.md
Reads identity_map.json from the repository root, applies every regex pattern in order to the target file, and writes the result back. This converts real hostnames to star-system names, real IPs to mapped IPs, and real usernames to their fictional equivalents.
The script is idempotent — running it on an already-sanitized file produces no changes, because the sanitized names don’t match the real-name patterns.
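A minimal sketch of that behavior, using a hypothetical two-entry map (the real identity_map.json has its own names and patterns). Because no replacement value matches any real-name pattern, a second pass finds nothing to change:

```javascript
// Sketch of the sanitization pass: apply every mapping pattern in
// order. Entries here are invented for illustration.
const identityMap = [
  { pattern: /\bhomelab-nas\b/g, replacement: 'vega-prime' },
  { pattern: /\b192\.168\.1\.10\b/g, replacement: '10.99.0.10' },
];

function sanitize(text) {
  return identityMap.reduce(
    (t, { pattern, replacement }) => t.replace(pattern, replacement),
    text
  );
}
```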
Important: This script only handles text content. Screenshots, images, and embedded media with visible real names or IPs must be sanitized manually.
See Galactic Identity System for the full mapping reference.
convert_session_to_journal.py
Purpose: Converts a raw session log into a structured journal entry.
python3 scripts/convert_session_to_journal.py path/to/raw-session.md
Takes unstructured session notes (timestamps, commands, output, observations) and formats them into a journal entry with proper frontmatter, sections, and the standard journal template structure. This is the first step in the session-to-content pipeline — raw notes become a journal entry, which can then optionally be expanded into a blog post.
generate_journal_from_transcript.py
Purpose: AI-assisted journal entry generation from a raw transcript.
python3 scripts/generate_journal_from_transcript.py path/to/transcript.jsonl
Similar to convert_session_to_journal.py, but handles JSONL-format transcripts (raw Claude Code conversation exports) and uses AI assistance to identify the key moments, decisions, and outcomes from the transcript. The output is a draft journal entry that captures the session’s essential content.
The AI assistance helps with:
- Identifying which parts of a long transcript are worth documenting
- Extracting commands and their outcomes
- Summarizing debugging sequences into concise narratives
- Suggesting appropriate tags and categories
Content Import and Sync
import-from-obsidian.js
Purpose: Imports content from an Obsidian vault into Arcturus-Prime’s content collections.
node scripts/import-from-obsidian.js ~/Documents/Arcturus-Prime-technical-vault/
Reads Obsidian markdown files, converts Obsidian-specific syntax (wikilinks, callouts, embeds) to standard markdown, maps frontmatter fields to Arcturus-Prime’s content schema, and places the files in the appropriate content collection (src/content/posts/, src/content/journal/, etc.).
Handles:
- `[[wikilinks]]` → standard markdown links
- Obsidian callouts → HTML/component equivalents
- `![[embeds]]` → inline content or figure references
- Tag format differences (`#tag` → YAML array)
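Wikilink conversion can be sketched as a single regex pass, assuming slugs come from lowercasing and hyphenating the page name (the real importer's slug rules may differ, and it also handles callouts and embeds):

```javascript
// Sketch: convert [[Page Name]] and aliased [[Page Name|label]]
// wikilinks into standard markdown links.
function convertWikilinks(md) {
  return md.replace(/\[\[([^\]|]+)(?:\|([^\]]+))?\]\]/g, (_, page, label) => {
    const slug = page.trim().toLowerCase().replace(/\s+/g, '-');
    return `[${label || page}](/${slug}/)`;
  });
}
```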
sync-tendril.js
Purpose: Cross-system content synchronization for the Tendril knowledge graph.
node scripts/sync-tendril.js
Ensures content metadata used by the Tendril knowledge graph is consistent across all content collections. Checks for:
- Missing `tags` in frontmatter
- Broken `related_posts` references (pointing to slugs that don't exist)
- Orphaned tags (used once, potentially a typo)
- Category consistency
Outputs a report of issues found. Doesn’t modify files automatically — it reports what needs fixing and you fix it manually.
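The `related_posts` check can be sketched like this: every referenced slug must exist among the collection's own slugs. The entry shape is an assumption for illustration:

```javascript
// Sketch: report related_posts references that point at slugs
// which don't exist in the collection.
function findBrokenRelated(entries) {
  const slugs = new Set(entries.map((e) => e.slug));
  const issues = [];
  for (const e of entries) {
    for (const ref of e.related_posts || []) {
      if (!slugs.has(ref)) issues.push({ slug: e.slug, missing: ref });
    }
  }
  return issues;
}
```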
Security and Validation
scan-content-claims.js
Purpose: Security auditing for content — checks for sensitive information that shouldn’t be published.
node scripts/scan-content-claims.js
Scans all content files for:
- Unsanitized IP addresses (10.42.0.x, 192.168.20.x patterns)
- Real hostnames that should be sanitized
- API keys, tokens, or credentials
- Full MAC addresses (should be partially obscured)
- Email addresses that aren’t public
- File paths containing real usernames
Returns a list of findings with file paths, line numbers, and the matched pattern. Run this before any deploy to catch accidental leaks.
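The unsanitized-IP check can be sketched with the two private ranges named above; the real scanner covers many more patterns (keys, MACs, emails, paths):

```javascript
// Sketch: flag IPs in the ranges that should have been sanitized.
const IP_PATTERNS = [
  /\b10\.42\.0\.\d{1,3}\b/g,
  /\b192\.168\.20\.\d{1,3}\b/g,
];

function findUnsanitizedIPs(text) {
  const hits = [];
  for (const re of IP_PATTERNS) {
    for (const m of text.matchAll(re)) {
      hits.push({ match: m[0], index: m.index });
    }
  }
  return hits;
}
```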
parse-identity-verification.js
Purpose: Identity and PII detection across content files.
node scripts/parse-identity-verification.js
A more thorough scanner than scan-content-claims.js, focused specifically on personally identifiable information. Checks for:
- Real names that should be sanitized (user mapping)
- Physical addresses or location details beyond city-level
- Phone numbers
- Account identifiers
- Any pattern matching known PII formats
This is the “paranoia pass” — run it when you’re about to publish something that discusses real people or real locations.
find-broken-links.py
Purpose: Link validation across all content files.
python3 scripts/find-broken-links.py
Crawls all content files, extracts internal and external links, and validates them:
- Internal links: Checks that the target route exists in the site manifest
- External links: Makes HEAD requests to verify the URL is reachable (with rate limiting)
- Anchor links: Validates that `#section` anchors point to real headings
Outputs a report of broken links grouped by file. External link checking can be slow (rate-limited to avoid hammering third-party servers), so there’s a --internal-only flag for quick runs.
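The internal-only pass can be sketched as a lookup against the route list from the site manifest; the manifest shape here is an assumption:

```javascript
// Sketch: an internal link is broken when its route (fragment and
// trailing slash stripped) isn't in the manifest's route list.
function findBrokenInternalLinks(links, manifest) {
  const routes = new Set(manifest.routes); // e.g. ['/', '/docs/scripts']
  return links.filter((href) => {
    if (!href.startsWith('/')) return false; // external links handled elsewhere
    const route = href.replace(/[#?].*$/, '').replace(/\/$/, '') || '/';
    return !routes.has(route);
  });
}
```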
Tag Management
consolidate_tags.py
Purpose: Tag cleanup and normalization across all content.
python3 scripts/consolidate_tags.py
Analyzes all tags used across content collections and identifies:
- Near-duplicates (e.g., “docker” vs “Docker” vs “docker-compose”)
- Tags used only once (potential typos)
- Tags that should be merged (e.g., “networking” and “network”)
- Inconsistent casing
Outputs a suggested consolidation plan. With the --apply flag, it modifies frontmatter directly (use with care — review the plan first).
# Dry run (default) — show suggestions
python3 scripts/consolidate_tags.py
# Apply changes
python3 scripts/consolidate_tags.py --apply
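Near-duplicate detection can be sketched by grouping tags that normalize to the same key (lowercased, spaces and underscores unified to hyphens); the real script applies additional heuristics:

```javascript
// Sketch: tags that normalize to the same key are merge candidates.
function findTagDuplicates(tags) {
  const groups = new Map();
  for (const tag of tags) {
    const key = tag.toLowerCase().replace(/[_\s]+/g, '-');
    if (!groups.has(key)) groups.set(key, new Set());
    groups.get(key).add(tag);
  }
  return [...groups.values()]
    .filter((g) => g.size > 1)
    .map((g) => [...g]);
}
```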
Image Handling
fetch-images.js
Purpose: Downloads remote images referenced in content and stores them locally.
node scripts/fetch-images.js
Scans content files for external image URLs, downloads them to public/images/, and updates the content to reference the local copies. This ensures the site doesn’t depend on external image hosts that might go down or change URLs.
Images are organized by content type:
- `public/images/posts/` for blog post images
- `public/images/journal/` for journal entry images
- `public/images/docs/` for documentation images
optimize-images.sh
Purpose: Bulk image optimization for web delivery.
bash scripts/optimize-images.sh
Processes all images in public/images/ through optimization:
- JPEG: Optimized with mozjpeg at quality 80
- PNG: Optimized with pngquant and oxipng
- WebP: Generated as alternative format for browsers that support it
- AVIF: Generated as alternative format for modern browsers
The script preserves originals and generates optimized versions alongside them. The Astro image pipeline (<Image /> component) selects the best format based on browser support.
When to run: After adding new images via fetch-images.js or manual placement. Not part of the regular build — run it manually when new images land.
Running Order
For a full content pipeline run (new content from session to published post):
1. `convert_session_to_journal.py` or `generate_journal_from_transcript.py` — Raw notes to journal
2. `session-to-blog.js` — Journal to blog post (if applicable)
3. `sanitize_journal_entry.py` — Apply Galactic Identity System
4. `fetch-images.js` — Download any external images
5. `optimize-images.sh` — Optimize images for web
6. `scan-content-claims.js` — Security audit
7. `parse-identity-verification.js` — PII check
8. `find-broken-links.py` — Validate links
9. `quick-publish.js` — Final validation and preparation
10. `spider-site-blueprint.js` — Update site manifest (after deploy)
Not every step is needed every time. A quick journal entry might only need steps 3, 6, and 9. A major blog post with external images gets the full pipeline.
RAG embeddings are a manual step — run npm run build:embeddings when content changes. The validate-embeddings.js postbuild hook warns if embeddings are stale but never blocks the build.