RAG Pipeline
Embedding generation, chunking strategy, hybrid search, SQLite storage, and admin management for the Arcturus-Prime RAG system
The Arcturus-Prime RAG (Retrieval-Augmented Generation) pipeline builds a searchable knowledge base from all site content, enabling AI features to provide contextually grounded responses. The pipeline chunks content, generates vector embeddings, stores them in both a JSON index and SQLite databases, and serves queries through a hybrid search combining vector similarity with BM25 keyword matching.
Embedding Generation
Build Script
Embeddings are built via scripts/build-embeddings.js, the primary pipeline script that processes all content collections and outputs the embedding index. The script is designed to run both locally and in CI/CD.
```sh
# Full rebuild of all collections
node scripts/build-embeddings.js

# Rebuild specific collection
node scripts/build-embeddings.js --collection=posts

# Incremental update (only changed files)
node scripts/build-embeddings.js --incremental

# Dry run (show what would be processed)
node scripts/build-embeddings.js --dry-run
```
Embedding Model
The pipeline uses the openai/text-embedding-3-small model accessed through OpenRouter. This model produces 1536-dimensional vectors and provides strong semantic similarity performance for English text content.
When OpenRouter is unavailable or for local-only builds, the pipeline falls back to Ollama running on localhost:11434 (Capella-Outpost at 10.42.0.100) with the nomic-embed-text model (768-dimensional vectors). The index stores the model name and dimension count, so the search system knows how to handle vectors from either model.
Batch Processing
Embedding requests are batched for efficiency. The pipeline sends 50 text chunks per API call to the embedding endpoint. This reduces HTTP overhead and takes advantage of batch pricing on the provider side.
The batching flow:
- Collect all chunks from the current collection
- Split into batches of 50
- Send each batch to the embedding API
- Collect results and match embeddings to their source chunks
- Write the batch results to the output index
- Log progress, e.g. `[posts] Batch 3/12: 50 chunks embedded (1.2s)`
Rate limiting is handled automatically. If the provider returns HTTP 429, the pipeline waits for the Retry-After duration and resumes. A maximum of 3 retries per batch is attempted before the batch is marked as failed and the pipeline continues with the next batch.
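The batching and retry flow above can be sketched as follows. This is a minimal illustration, not the actual `scripts/build-embeddings.js` code; the helper names (`splitIntoBatches`, `embedBatchWithRetry`) and the error shape carrying `status` and `retryAfterMs` are assumptions.

```javascript
const BATCH_SIZE = 50;
const MAX_RETRIES = 3;

// Split the collected chunks into batches of 50.
function splitIntoBatches(chunks, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < chunks.length; i += size) {
    batches.push(chunks.slice(i, i + size));
  }
  return batches;
}

// `callApi` is any async function that resolves with embeddings for a batch,
// or throws an error carrying `status` (and optionally `retryAfterMs`) on 429.
async function embedBatchWithRetry(batch, callApi) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await callApi(batch);
    } catch (err) {
      // Only retry on rate limiting, and give up after the last attempt.
      if (err.status !== 429 || attempt === MAX_RETRIES) throw err;
      await new Promise((r) => setTimeout(r, err.retryAfterMs ?? 1000));
    }
  }
}
```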
Chunking Strategy
Content is split into chunks before embedding to ensure each vector represents a focused semantic unit. The chunking parameters are tuned for the typical length and structure of Arcturus-Prime content.
Parameters
- Chunk size: 400 words
- Overlap: 50 words between adjacent chunks
- Minimum chunk size: 100 words (shorter fragments are merged with adjacent chunks)
Chunking Algorithm
1. **Frontmatter extraction** — YAML frontmatter is parsed and stored as metadata, not included in the chunk text. The title, tags, and description are prepended to the first chunk as context.
2. **Heading-aware splitting** — the chunker respects Markdown headings. It prefers to split at heading boundaries when possible, keeping each section together if it fits within the chunk size. This preserves semantic coherence within chunks.
3. **Code block preservation** — code blocks (fenced with triple backticks) are never split mid-block. If a code block exceeds the chunk size, it becomes its own chunk regardless of the size limit. This ensures code examples remain intact and usable in retrieved context.
4. **Overlap generation** — after splitting, each chunk (except the first) is prepended with the last 50 words of the previous chunk. This overlap ensures that concepts spanning a chunk boundary are represented in at least one chunk.
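A minimal word-based sketch of the split-with-overlap step, using the parameters above (400-word chunks, 50-word overlap). The real chunker is also heading-aware and code-block-aware and merges fragments under 100 words; this illustration omits those rules, and the function name is hypothetical.

```javascript
// Sliding-window split: each chunk after the first starts 50 words
// before the previous chunk ended, which is equivalent to prepending
// the last 50 words of the previous chunk.
function chunkWords(text, chunkSize = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push(words.slice(start, end).join(' '));
    if (end === words.length) break;
    start = end - overlap;
  }
  return chunks;
}
```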
Chunk Metadata
Each chunk carries metadata for filtering and attribution:
```json
{
  "id": "posts/homelab-proxmox-setup/chunk-3",
  "collection": "posts",
  "slug": "homelab-proxmox-setup",
  "title": "Setting Up Proxmox VE on Bare Metal",
  "chunkIndex": 3,
  "totalChunks": 8,
  "headingPath": ["Installation", "Network Configuration"],
  "wordCount": 387,
  "tags": ["proxmox", "homelab", "networking"]
}
```
The headingPath field records the Markdown heading hierarchy at the point of the chunk, enabling the search UI to show users exactly where in a document the result came from.
Output: Embeddings Index
JSON Index
The primary output is /public/embeddings-index.json, a JSON file containing all chunks with their embeddings and metadata. This file is served statically and used by the client-side search (for the public site search feature) and by API endpoints that need quick access to the full index without a database query.
The JSON structure:
```json
{
  "model": "openai/text-embedding-3-small",
  "dimension": 1536,
  "builtAt": "2026-02-23T03:00:00Z",
  "collections": {
    "posts": { "chunks": 342, "documents": 47 },
    "journal": { "chunks": 891, "documents": 156 },
    "docs": { "chunks": 234, "documents": 31 },
    "projects": { "chunks": 87, "documents": 12 },
    "learn": { "chunks": 156, "documents": 18 }
  },
  "chunks": [
    {
      "id": "posts/homelab-proxmox-setup/chunk-0",
      "text": "Setting Up Proxmox VE on Bare Metal...",
      "embedding": [0.023, -0.041, 0.017, ...],
      "metadata": { ... }
    }
  ]
}
```
Collection Priority Weighting
Not all collections are weighted equally in search results. The pipeline assigns priority weights that influence ranking:
| Collection | Weight | Rationale |
|---|---|---|
| docs | 1.5 | Documentation is authoritative and should rank highest |
| posts | 1.2 | Blog posts are the primary content and frequently queried |
| learn | 1.1 | Learning content is structured and educational |
| projects | 1.0 | Project pages are reference material |
| journal | 0.8 | Journal entries are personal and less broadly relevant |
Weights are applied as multipliers to the similarity score during search. A docs chunk with similarity 0.80 would score 1.20 (0.80 * 1.5), while a journal chunk with the same similarity would score 0.64 (0.80 * 0.8).
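The multiplier step can be expressed in a few lines. The weight values come from the table above; the function name is a hypothetical stand-in for whatever the search code actually calls it.

```javascript
// Collection priority weights from the table above.
const COLLECTION_WEIGHTS = {
  docs: 1.5,
  posts: 1.2,
  learn: 1.1,
  projects: 1.0,
  journal: 0.8,
};

// Apply the weight as a multiplier to a similarity score;
// unknown collections fall back to a neutral 1.0.
function weightedScore(similarity, collection) {
  return similarity * (COLLECTION_WEIGHTS[collection] ?? 1.0);
}
```

With this, a `docs` chunk at similarity 0.80 scores about 1.20 and a `journal` chunk at the same similarity scores 0.64, matching the worked example above.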
SQLite Storage
Argonaut RAG Stores
The Argonaut agent maintains its own SQLite-backed RAG stores for server-side search that does not depend on the static JSON index:
packages/argonaut/data/rag-store.db — the general knowledge base containing:
- `documents` table — source document metadata (collection, slug, title, updated timestamp)
- `chunks` table — chunk text, metadata, and embedding vectors (stored as BLOB)
- `fts_chunks` — FTS5 virtual table for BM25 full-text search over chunk text
- `embeddings_meta` — model name, dimension, and build timestamp
packages/argonaut/data/rag-store-blog.db — dedicated store for blog and journal content. Separated from the general store to allow independent reindexing of the high-churn content (journals are added frequently) without touching the more stable docs and project content.
SQLite Schema
```sql
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  collection TEXT NOT NULL,
  slug TEXT NOT NULL,
  title TEXT,
  tags TEXT,                -- JSON array
  updated_at TEXT NOT NULL,
  chunk_count INTEGER NOT NULL
);

CREATE TABLE chunks (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL REFERENCES documents(id),
  chunk_index INTEGER NOT NULL,
  text TEXT NOT NULL,
  heading_path TEXT,        -- JSON array
  word_count INTEGER NOT NULL,
  embedding BLOB NOT NULL   -- Float32Array as binary
);

CREATE VIRTUAL TABLE fts_chunks USING fts5(
  text, content=chunks, content_rowid=rowid
);
```
Embeddings are stored as raw Float32Array binary blobs for minimal storage overhead and fast loading. The SQLite database uses WAL mode for concurrent read/write access.
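The BLOB round trip described above amounts to reinterpreting the same bytes: a `Float32Array`'s underlying buffer goes into the `embedding` column as-is and is viewed as float32 again on read. A sketch (helper names are assumptions, not the store's actual API):

```javascript
// Pack an embedding (plain number array or Float32Array) into a Buffer
// suitable for a SQLite BLOB column: 4 bytes per dimension.
function embeddingToBlob(embedding) {
  const floats = Float32Array.from(embedding);
  return Buffer.from(floats.buffer, floats.byteOffset, floats.byteLength);
}

// View the BLOB bytes as float32 again without copying.
function blobToEmbedding(blob) {
  return new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4);
}
```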
Hybrid Search
The search system combines two retrieval methods for robust results.
Vector Similarity Search
The vector search computes cosine similarity between the query embedding and all stored chunk embeddings:
- The query text is embedded using the same model as the index
- Cosine similarity is computed against every chunk embedding
- Results are sorted by similarity score descending
- The top-K results (default K=10) are returned
For the JSON index (used by public search), the cosine similarity computation runs in the browser using JavaScript. For the SQLite stores (used by Argonaut), the computation runs server-side in Node.js with optimized Float32Array operations.
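The core operation in both environments is plain cosine similarity over two vectors. A minimal sketch of the computation:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Works on plain arrays
// or Float32Arrays of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```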
BM25 Keyword Search
BM25 search uses SQLite’s FTS5 extension for full-text keyword matching:
- The query text is tokenized
- FTS5 performs BM25-ranked full-text search over the `fts_chunks` virtual table
- Results are sorted by BM25 score descending
- The top-K results are returned
BM25 search excels at finding exact matches for technical terms, error messages, command names, and proper nouns that vector search may treat as semantically similar to unrelated content.
Fusion
Results from both search methods are merged using Reciprocal Rank Fusion (RRF):
RRF_score(chunk) = sum(1 / (k + rank_in_list)) for each list containing chunk
Where k is a constant (default: 60) that controls how much weight is given to rank position. Chunks that appear in both the vector and BM25 result sets receive a combined score that boosts them above chunks appearing in only one set.
After fusion, the collection priority weight is applied, and the final top-K results are returned as context for the AI model.
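The RRF formula above can be implemented directly. This sketch takes ranked lists of chunk ids (best first) and returns the fused ordering; it shows only the fusion step, not the collection weighting applied afterward.

```javascript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) for every
// chunk it contains, so chunks present in both lists get a combined boost.
function rrfFuse(lists, k = 60) {
  const scores = new Map();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```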
Search Parameters
The search API accepts configurable parameters:
| Parameter | Default | Description |
|---|---|---|
| `k` | 10 | Number of results to return |
| `vectorWeight` | 0.7 | Weight for vector search scores in fusion |
| `bm25Weight` | 0.3 | Weight for BM25 scores in fusion |
| `collections` | all | Filter to specific collections |
| `minScore` | 0.1 | Minimum score threshold for inclusion |
RAG Manifest
The manifest file at packages/argonaut/data/rag-manifest.json tracks the index state:
```json
{
  "version": 2,
  "model": "openai/text-embedding-3-small",
  "dimension": 1536,
  "lastFullBuild": "2026-02-20T03:00:00Z",
  "lastIncrementalUpdate": "2026-02-23T01:30:00Z",
  "documents": {
    "posts/homelab-proxmox-setup": {
      "lastModified": "2026-02-15T10:00:00Z",
      "chunkCount": 8,
      "indexedAt": "2026-02-20T03:01:23Z"
    }
  },
  "stats": {
    "totalDocuments": 264,
    "totalChunks": 1710,
    "indexSizeMB": 42.3
  }
}
```
The manifest enables the --incremental flag in the build script. When running incrementally, the script compares each file’s lastModified timestamp against the manifest and only re-embeds files that have changed. This reduces a full rebuild from several minutes to seconds for typical daily updates.
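The per-file decision can be sketched as a timestamp comparison against the manifest shape shown above (the function name is hypothetical; the real script may compare differently):

```javascript
// Decide whether a document needs re-embedding: it does if the manifest
// has never seen it, or if its source file is newer than the recorded
// lastModified timestamp.
function needsReindex(manifest, docId, lastModified) {
  const entry = manifest.documents[docId];
  if (!entry) return true; // new document, never indexed
  return new Date(lastModified) > new Date(entry.lastModified);
}
```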
Auto-Update System
The public RAG index (public/embeddings-index.json) auto-updates on every deploy through a three-part system that ensures Argonaut always has current knowledge without wasting API calls on unchanged content.
Content Hash Tracking
Every embeddings build computes a SHA-256 hash of all content files across the five indexed collections (posts, journal, docs, projects, learn). The hash is stored in the index metadata as contentHash:
```json
{
  "version": 1,
  "model": "openai/text-embedding-3-small",
  "dimensions": 1536,
  "builtAt": "2026-02-25T00:09:27.113Z",
  "contentHash": "1b576eca4753db1b...",
  "totalChunks": 676,
  "chunks": [...]
}
```
The hash covers file paths and contents, so renaming, moving, or editing any content file changes the hash. Adding or removing files also changes it.
Prebuild: Smart Auto-Rebuild
scripts/auto-embeddings.js runs as the first step of npm run build (via the prebuild hook). It:
- Computes the current content hash from all `src/content/` files
- Reads the stored `contentHash` from the existing `embeddings-index.json`
- If the hashes match — skips (no API calls, no rebuild, ~1 second)
- If the hashes differ — runs `build-embeddings.js` to regenerate (~20 seconds)
- If no API key is available — skips gracefully with a warning (won’t break dev builds)
```sh
# Check mode (no rebuild, exit code 1 if stale)
node scripts/auto-embeddings.js --check

# Force rebuild even if content unchanged
node scripts/auto-embeddings.js --force

# Normal mode (auto-decides)
node scripts/auto-embeddings.js
```
This runs automatically on every Cloudflare Pages deploy. The OPENROUTER_API_KEY secret is available in the CF Pages build environment.
Postbuild: Freshness Validation
scripts/validate-embeddings.js runs after astro build (via the postbuild hook). It verifies:
- The embeddings index exists in the build output
- The `contentHash` matches current content (warns if stale)
- The `builtAt` timestamp is within 7 days (warns if old)
Validation is advisory — it warns but never blocks the build.
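The 7-day freshness check is a simple timestamp comparison. A sketch (the function name is an assumption, not the script's actual API):

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// True if the index was built within the last 7 days; the validator
// only warns when this returns false, it never fails the build.
function isFresh(builtAt, now = Date.now()) {
  return now - new Date(builtAt).getTime() <= SEVEN_DAYS_MS;
}
```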
Three-Tier RAG Isolation
The RAG system is organized into three isolated tiers with distinct privacy levels, content sources, and access controls:
| Aspect | Public | Safe | Private |
|---|---|---|---|
| Storage | embeddings-index.json (static JSON) | rag-store-blog.db (SQLite) | rag-store.db (SQLite) |
| Content | src/content/ (published site content) | 8 sanitized vaults (technical) | All 10 vaults including personal |
| Embedding model | OpenRouter text-embedding-3-small | Ollama nomic-embed-text | Ollama nomic-embed-text |
| Update trigger | Auto on deploy (content hash) | Manual ingest from admin UI or CLI | Manual ingest from admin UI or CLI |
| Chat interface | /ask, ChatWidget | /admin/argonaut/chat (Safe mode) | /admin/argonaut/chat (Full), /admin/personal |
| Privacy | Fully public | Admin-only, sanitized, safe for any AI provider | Admin-only, raw content, local providers only |
| Sanitization | N/A (already public) | identity_map.json + secret detection at ingest | None — passwords and secrets preserved |
Safe tier vaults (8): dev-vault, Arcturus-Prime-technical, argo-os-docs, knowledge-vault-sanitized, jobspy, build-swarm, tendril, laforceit-vault. Custom vaults can be added via the admin UI.
Private tier includes all Safe vaults plus: main (personal, ~2900 files) and knowledge-base (~3000 files). Contains raw, unsanitized content — passwords, API keys, and personal data are preserved so you can query them.
Safe tier sanitization: At ingest time, content is sanitized in-memory using identity_map.json (148 patterns for hostnames, IPs, users, paths) plus 22 secret-detection regexes (API keys, passwords, SSH keys, connection strings, bearer tokens, emails). Original vault files are never modified. This makes the Safe tier database safe to expose through any AI provider, including cloud services that may train on input.
The public tier never has access to vault content, private notes, or unsanitized data. For complete operational instructions, see the RAG Operations Guide.
Admin UI (/admin/rag)
The RAG admin page at /admin/rag provides a three-tier dashboard for managing all RAG stores. Each tier is displayed as a distinct card with stats, vault sources, actions, and chat links.
Public RAG Tier Card
Shows live status of the public RAG index by reading a lightweight metadata sidecar (embeddings-meta.json, ~200 bytes) instead of the 15 MB index file.
| Field | Description |
|---|---|
| Model | Embedding model used (text-embedding-3-small) |
| Dimensions | Embedding vector size (1536) |
| Total Chunks | Number of content chunks in the index |
| Size | File size of the full index |
| Built | Relative time since last build (hover for absolute) |
| Content Hash | Truncated SHA-256 of all source content |
Collection pills show per-collection chunk counts (posts, journal, docs, projects, learn).
Freshness indicator: green “Fresh” if built within 12 hours, yellow “Stale” between 12 and 48 hours, red “Old” beyond 48 hours.
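The indicator logic reduces to bucketing the index age against those two thresholds. A sketch (the function name is hypothetical):

```javascript
// Map index age to the dashboard label: <12 h "Fresh",
// 12-48 h "Stale", >48 h "Old".
function freshnessLabel(builtAt, now = Date.now()) {
  const hours = (now - new Date(builtAt).getTime()) / 3600000;
  if (hours < 12) return 'Fresh';
  if (hours <= 48) return 'Stale';
  return 'Old';
}
```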
Rebuild & Deploy button triggers a Cloudflare Pages deploy via CF_DEPLOY_HOOK_URL. The deploy runs the full build pipeline including auto-embeddings.js, which checks the content hash and only regenerates embeddings if content has actually changed.
Safe RAG Tier Card
Displays SQLite store statistics from rag-store-blog.db — document count, chunk count, embedded count, and per-collection breakdowns. In dev mode, shows vault sources with live file counts, vault management (add/remove custom vaults), and an “Ingest New Docs” button. Includes an “Audit for Leaks” button that scans all chunks for secrets that may have slipped past sanitization. When adding vaults, a warning banner explains that Safe tier content will be sanitized via identity_map.json.
Private RAG Tier Card
Same layout as Safe tier but for rag-store.db. Includes all Safe vaults plus the personal and knowledge-base vaults. Shows a “Local Only” privacy badge. Ingest button available in dev mode only. No audit scan — Private tier content is raw by design.
Embedding Service Status
Compact Ollama status indicator in the page header showing online/offline state and available models.
Vault Scanning
In dev mode, the dashboard performs a vault scan on load — counting .md files per vault directory to show source coverage. This uses the vault_scan API action which recursively counts markdown files without reading content.