RAG Pipeline
Embedding generation, chunking strategy, hybrid search, SQLite storage, and admin management for the Arcturus-Prime RAG system
The Arcturus-Prime RAG (Retrieval-Augmented Generation) pipeline builds a searchable knowledge base from all site content, enabling AI features to provide contextually grounded responses. The pipeline chunks content, generates vector embeddings, stores them in both a JSON index and SQLite databases, and serves queries through a hybrid search combining vector similarity with BM25 keyword matching.
Embedding Generation
Build Script
Embeddings are built via scripts/build-embeddings.js, the primary pipeline script that processes all content collections and outputs the embedding index. The script is designed to run both locally and in CI/CD.
```sh
# Full rebuild of all collections
node scripts/build-embeddings.js

# Rebuild specific collection
node scripts/build-embeddings.js --collection=posts

# Incremental update (only changed files)
node scripts/build-embeddings.js --incremental

# Dry run (show what would be processed)
node scripts/build-embeddings.js --dry-run
```
Embedding Model
The pipeline uses the openai/text-embedding-3-small model accessed through OpenRouter. This model produces 1536-dimensional vectors and provides strong semantic similarity performance for English text content.
When OpenRouter is unavailable or for local-only builds, the pipeline falls back to Ollama running on localhost:11434 (Capella-Outpost at 10.42.0.100) with the nomic-embed-text model (768-dimensional vectors). The index stores the model name and dimension count, so the search system knows how to handle vectors from either model.
Batch Processing
Embedding requests are batched for efficiency. The pipeline sends 50 text chunks per API call to the embedding endpoint. This reduces HTTP overhead and takes advantage of batch pricing on the provider side.
The batching flow:
- Collect all chunks from the current collection
- Split into batches of 50
- Send each batch to the embedding API
- Collect results and match embeddings to their source chunks
- Write the batch results to the output index
- Log progress, e.g. `[posts] Batch 3/12: 50 chunks embedded (1.2s)`
Rate limiting is handled automatically. If the provider returns HTTP 429, the pipeline waits for the Retry-After duration and resumes. A maximum of 3 retries per batch is attempted before the batch is marked as failed and the pipeline continues with the next batch.
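The batching and retry flow above can be sketched as follows. This is a minimal illustration, not the actual `scripts/build-embeddings.js` code; the helper names (`splitIntoBatches`, `embedBatchWithRetry`) and the error shape carrying `status` and `retryAfterMs` are assumptions.

```javascript
const BATCH_SIZE = 50;
const MAX_RETRIES = 3;

// Split the collected chunks into batches of 50.
function splitIntoBatches(chunks, size = BATCH_SIZE) {
  const batches = [];
  for (let i = 0; i < chunks.length; i += size) {
    batches.push(chunks.slice(i, i + size));
  }
  return batches;
}

// `callApi` is any async function that resolves with embeddings for a batch,
// or throws an error carrying `status` (and optionally `retryAfterMs`) on 429.
async function embedBatchWithRetry(batch, callApi) {
  for (let attempt = 1; attempt <= MAX_RETRIES; attempt++) {
    try {
      return await callApi(batch);
    } catch (err) {
      // Only retry on rate limiting, and give up after the last attempt.
      if (err.status !== 429 || attempt === MAX_RETRIES) throw err;
      await new Promise((r) => setTimeout(r, err.retryAfterMs ?? 1000));
    }
  }
}
```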
Chunking Strategy
Content is split into chunks before embedding to ensure each vector represents a focused semantic unit. The chunking parameters are tuned for the typical length and structure of Arcturus-Prime content.
Parameters
- Chunk size: 400 words
- Overlap: 50 words between adjacent chunks
- Minimum chunk size: 100 words (shorter fragments are merged with adjacent chunks)
Chunking Algorithm
1. **Frontmatter extraction** — YAML frontmatter is parsed and stored as metadata, not included in the chunk text. The title, tags, and description are prepended to the first chunk as context.
2. **Heading-aware splitting** — the chunker respects Markdown headings. It prefers to split at heading boundaries when possible, keeping each section together if it fits within the chunk size. This preserves semantic coherence within chunks.
3. **Code block preservation** — code blocks (fenced with triple backticks) are never split mid-block. If a code block exceeds the chunk size, it becomes its own chunk regardless of the size limit. This ensures code examples remain intact and usable in retrieved context.
4. **Overlap generation** — after splitting, each chunk (except the first) is prepended with the last 50 words of the previous chunk. This overlap ensures that concepts spanning a chunk boundary are represented in at least one chunk.
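A minimal word-based sketch of the split-with-overlap step, using the parameters above (400-word chunks, 50-word overlap). The real chunker is also heading-aware and code-block-aware and merges fragments under 100 words; this illustration omits those rules, and the function name is hypothetical.

```javascript
// Sliding-window split: each chunk after the first starts 50 words
// before the previous chunk ended, which is equivalent to prepending
// the last 50 words of the previous chunk.
function chunkWords(text, chunkSize = 400, overlap = 50) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  let start = 0;
  while (start < words.length) {
    const end = Math.min(start + chunkSize, words.length);
    chunks.push(words.slice(start, end).join(' '));
    if (end === words.length) break;
    start = end - overlap;
  }
  return chunks;
}
```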
Chunk Metadata
Each chunk carries metadata for filtering and attribution:
```json
{
  "id": "posts/homelab-proxmox-setup/chunk-3",
  "collection": "posts",
  "slug": "homelab-proxmox-setup",
  "title": "Setting Up Proxmox VE on Bare Metal",
  "chunkIndex": 3,
  "totalChunks": 8,
  "headingPath": ["Installation", "Network Configuration"],
  "wordCount": 387,
  "tags": ["proxmox", "homelab", "networking"]
}
```
The headingPath field records the Markdown heading hierarchy at the point of the chunk, enabling the search UI to show users exactly where in a document the result came from.
Output: Embeddings Index
JSON Index
The primary output is /public/embeddings-index.json, a JSON file containing all chunks with their embeddings and metadata. This file is served statically and used by the client-side search (for the public site search feature) and by API endpoints that need quick access to the full index without a database query.
The JSON structure:
```json
{
  "model": "openai/text-embedding-3-small",
  "dimension": 1536,
  "builtAt": "2026-02-23T03:00:00Z",
  "collections": {
    "posts": { "chunks": 342, "documents": 47 },
    "journal": { "chunks": 891, "documents": 156 },
    "docs": { "chunks": 234, "documents": 31 },
    "projects": { "chunks": 87, "documents": 12 },
    "learn": { "chunks": 156, "documents": 18 }
  },
  "chunks": [
    {
      "id": "posts/homelab-proxmox-setup/chunk-0",
      "text": "Setting Up Proxmox VE on Bare Metal...",
      "embedding": [0.023, -0.041, 0.017, ...],
      "metadata": { ... }
    }
  ]
}
```
Collection Priority Weighting
Not all collections are weighted equally in search results. The pipeline assigns priority weights that influence ranking:
| Collection | Weight | Rationale |
|---|---|---|
| docs | 1.5 | Documentation is authoritative and should rank highest |
| posts | 1.2 | Blog posts are the primary content and frequently queried |
| learn | 1.1 | Learning content is structured and educational |
| projects | 1.0 | Project pages are reference material |
| journal | 0.8 | Journal entries are personal and less broadly relevant |
Weights are applied as multipliers to the similarity score during search. A docs chunk with similarity 0.80 would score 1.20 (0.80 * 1.5), while a journal chunk with the same similarity would score 0.64 (0.80 * 0.8).
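The multiplier step can be expressed in a few lines. The weight values come from the table above; the function name is a hypothetical stand-in for whatever the search code actually calls it.

```javascript
// Collection priority weights from the table above.
const COLLECTION_WEIGHTS = {
  docs: 1.5,
  posts: 1.2,
  learn: 1.1,
  projects: 1.0,
  journal: 0.8,
};

// Apply the weight as a multiplier to a similarity score;
// unknown collections fall back to a neutral 1.0.
function weightedScore(similarity, collection) {
  return similarity * (COLLECTION_WEIGHTS[collection] ?? 1.0);
}
```

With this, a `docs` chunk at similarity 0.80 scores about 1.20 and a `journal` chunk at the same similarity scores 0.64, matching the worked example above.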
SQLite Storage
Argonaut RAG Stores
The Argonaut agent maintains its own SQLite-backed RAG stores for server-side search that does not depend on the static JSON index:
packages/argonaut/data/rag-store.db — the general knowledge base containing:
- `documents` table — source document metadata (collection, slug, title, updated timestamp)
- `chunks` table — chunk text, metadata, and embedding vectors (stored as BLOB)
- `fts_chunks` — FTS5 virtual table for BM25 full-text search over chunk text
- `embeddings_meta` — model name, dimension, and build timestamp
packages/argonaut/data/rag-store-blog.db — dedicated store for blog and journal content. Separated from the general store to allow independent reindexing of the high-churn content (journals are added frequently) without touching the more stable docs and project content.
SQLite Schema
```sql
CREATE TABLE documents (
  id TEXT PRIMARY KEY,
  collection TEXT NOT NULL,
  slug TEXT NOT NULL,
  title TEXT,
  tags TEXT,                -- JSON array
  updated_at TEXT NOT NULL,
  chunk_count INTEGER NOT NULL
);

CREATE TABLE chunks (
  id TEXT PRIMARY KEY,
  document_id TEXT NOT NULL REFERENCES documents(id),
  chunk_index INTEGER NOT NULL,
  text TEXT NOT NULL,
  heading_path TEXT,        -- JSON array
  word_count INTEGER NOT NULL,
  embedding BLOB NOT NULL   -- Float32Array as binary
);

CREATE VIRTUAL TABLE fts_chunks USING fts5(
  text, content=chunks, content_rowid=rowid
);
```
Embeddings are stored as raw Float32Array binary blobs for minimal storage overhead and fast loading. The SQLite database uses WAL mode for concurrent read/write access.
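The BLOB round trip described above amounts to reinterpreting the same bytes: a `Float32Array`'s underlying buffer goes into the `embedding` column as-is and is viewed as float32 again on read. A sketch (helper names are assumptions, not the store's actual API):

```javascript
// Pack an embedding (plain number array or Float32Array) into a Buffer
// suitable for a SQLite BLOB column: 4 bytes per dimension.
function embeddingToBlob(embedding) {
  const floats = Float32Array.from(embedding);
  return Buffer.from(floats.buffer, floats.byteOffset, floats.byteLength);
}

// View the BLOB bytes as float32 again without copying.
function blobToEmbedding(blob) {
  return new Float32Array(blob.buffer, blob.byteOffset, blob.byteLength / 4);
}
```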
Hybrid Search
The search system combines two retrieval methods for robust results.
Vector Similarity Search
The vector search computes cosine similarity between the query embedding and all stored chunk embeddings:
- The query text is embedded using the same model as the index
- Cosine similarity is computed against every chunk embedding
- Results are sorted by similarity score descending
- The top-K results (default K=10) are returned
For the JSON index (used by public search), the cosine similarity computation runs in the browser using JavaScript. For the SQLite stores (used by Argonaut), the computation runs server-side in Node.js with optimized Float32Array operations.
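The core operation in both environments is plain cosine similarity over two vectors. A minimal sketch of the computation:

```javascript
// Cosine similarity: dot(a, b) / (|a| * |b|). Works on plain arrays
// or Float32Arrays of equal length.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```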
BM25 Keyword Search
BM25 search uses SQLite’s FTS5 extension for full-text keyword matching:
- The query text is tokenized
- FTS5 performs BM25-ranked full-text search over the `fts_chunks` virtual table
- Results are sorted by BM25 score descending
- The top-K results are returned
BM25 search excels at finding exact matches for technical terms, error messages, command names, and proper nouns that vector search may treat as semantically similar to unrelated content.
Fusion
Results from both search methods are merged using Reciprocal Rank Fusion (RRF):
RRF_score(chunk) = sum(1 / (k + rank_in_list)) for each list containing chunk
Where k is a constant (default: 60) that controls how much weight is given to rank position. Chunks that appear in both the vector and BM25 result sets receive a combined score that boosts them above chunks appearing in only one set.
After fusion, the collection priority weight is applied, and the final top-K results are returned as context for the AI model.
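The RRF formula above can be implemented directly. This sketch takes ranked lists of chunk ids (best first) and returns the fused ordering; it shows only the fusion step, not the collection weighting applied afterward.

```javascript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank) for every
// chunk it contains, so chunks present in both lists get a combined boost.
function rrfFuse(lists, k = 60) {
  const scores = new Map();
  for (const list of lists) {
    list.forEach((id, index) => {
      const rank = index + 1; // ranks are 1-based
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```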
Search Parameters
The search API accepts configurable parameters:
| Parameter | Default | Description |
|---|---|---|
| `k` | 10 | Number of results to return |
| `vectorWeight` | 0.7 | Weight for vector search scores in fusion |
| `bm25Weight` | 0.3 | Weight for BM25 scores in fusion |
| `collections` | all | Filter to specific collections |
| `minScore` | 0.1 | Minimum score threshold for inclusion |
RAG Manifest
The manifest file at packages/argonaut/data/rag-manifest.json tracks the index state:
```json
{
  "version": 2,
  "model": "openai/text-embedding-3-small",
  "dimension": 1536,
  "lastFullBuild": "2026-02-20T03:00:00Z",
  "lastIncrementalUpdate": "2026-02-23T01:30:00Z",
  "documents": {
    "posts/homelab-proxmox-setup": {
      "lastModified": "2026-02-15T10:00:00Z",
      "chunkCount": 8,
      "indexedAt": "2026-02-20T03:01:23Z"
    }
  },
  "stats": {
    "totalDocuments": 264,
    "totalChunks": 1710,
    "indexSizeMB": 42.3
  }
}
```
The manifest enables the --incremental flag in the build script. When running incrementally, the script compares each file’s lastModified timestamp against the manifest and only re-embeds files that have changed. This reduces a full rebuild from several minutes to seconds for typical daily updates.
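The per-file decision can be sketched as a timestamp comparison against the manifest shape shown above (the function name is hypothetical; the real script may compare differently):

```javascript
// Decide whether a document needs re-embedding: it does if the manifest
// has never seen it, or if its source file is newer than the recorded
// lastModified timestamp.
function needsReindex(manifest, docId, lastModified) {
  const entry = manifest.documents[docId];
  if (!entry) return true; // new document, never indexed
  return new Date(lastModified) > new Date(entry.lastModified);
}
```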
Auto-Update System
The public RAG index (public/embeddings-index.json) auto-updates on every deploy through a three-part system that ensures Argonaut always has current knowledge without wasting API calls on unchanged content.
Content Hash Tracking
Every embeddings build computes a SHA-256 hash of all content files across the five indexed collections (posts, journal, docs, projects, learn). The hash is stored in the index metadata as contentHash:
```json
{
  "version": 1,
  "model": "openai/text-embedding-3-small",
  "dimensions": 1536,
  "builtAt": "2026-02-25T00:09:27.113Z",
  "contentHash": "1b576eca4753db1b...",
  "totalChunks": 676,
  "chunks": [...]
}
```
The hash covers file paths and contents, so renaming, moving, or editing any content file changes the hash. Adding or removing files also changes it.
Prebuild: Smart Auto-Rebuild
scripts/auto-embeddings.js runs as the first step of npm run build (via the prebuild hook). It:
- Computes the current content hash from all `src/content/` files
- Reads the stored `contentHash` from the existing `embeddings-index.json`
- If the hashes match — skips (no API calls, no rebuild, ~1 second)
- If the hashes differ — runs `build-embeddings.js` to regenerate (~20 seconds)
- If no API key is available — skips gracefully with a warning (won’t break dev builds)
```sh
# Check mode (no rebuild, exit code 1 if stale)
node scripts/auto-embeddings.js --check

# Force rebuild even if content unchanged
node scripts/auto-embeddings.js --force

# Normal mode (auto-decides)
node scripts/auto-embeddings.js
```
This runs automatically on every Cloudflare Pages deploy. The OPENROUTER_API_KEY secret is available in the CF Pages build environment.
Postbuild: Freshness Validation
scripts/validate-embeddings.js runs after astro build (via the postbuild hook). It verifies:
- The embeddings index exists in the build output
- The `contentHash` matches current content (warns if stale)
- The `builtAt` timestamp is within 7 days (warns if old)
Validation is advisory — it warns but never blocks the build.
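The 7-day freshness check is a simple timestamp comparison. A sketch (the function name is an assumption, not the script's actual API):

```javascript
const SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000;

// True if the index was built within the last 7 days; the validator
// only warns when this returns false, it never fails the build.
function isFresh(builtAt, now = Date.now()) {
  return now - new Date(builtAt).getTime() <= SEVEN_DAYS_MS;
}
```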
Three-Tier RAG Isolation
The RAG system is organized into three isolated tiers with distinct privacy levels, content sources, and access controls:
| Aspect | Public | Safe | Private |
|---|---|---|---|
| Storage | embeddings-index.json (static JSON) | rag-store-blog.db (SQLite) | rag-store.db (SQLite) |
| Content | src/content/ (published site content) | 8 sanitized vaults (technical) | All 10 vaults including personal |
| Embedding model | OpenRouter text-embedding-3-small | Ollama nomic-embed-text | Ollama nomic-embed-text |
| Update trigger | Auto on deploy (content hash) | Manual ingest from admin UI or CLI | Manual ingest from admin UI or CLI |
| Chat interface | /ask, ChatWidget | /admin/argonaut/chat (Safe mode) | /admin/argonaut/chat (Full), /admin/personal |
| Privacy | Fully public | Admin-only, sanitized, safe for any AI provider | Admin-only, raw content, local providers only |
| Sanitization | N/A (already public) | identity_map.json + secret detection at ingest | None — passwords and secrets preserved |
Safe tier vaults (8): dev-vault, Arcturus-Prime-technical, argo-os-docs, knowledge-vault-sanitized, jobspy, build-swarm, tendril, laforceit-vault. Custom vaults can be added via the admin UI.
Private tier includes all Safe vaults plus: main (personal, ~2900 files) and knowledge-base (~3000 files). Contains raw, unsanitized content — passwords, API keys, and personal data are preserved so you can query them.
Safe tier sanitization: At ingest time, content is sanitized in-memory using identity_map.json (148 patterns for hostnames, IPs, users, paths) plus 22 secret-detection regexes (API keys, passwords, SSH keys, connection strings, bearer tokens, emails). Original vault files are never modified. This makes the Safe tier database safe to expose through any AI provider, including cloud services that may train on input.
The public tier never has access to vault content, private notes, or unsanitized data. For complete operational instructions, see the RAG Operations Guide.
Admin UI (/admin/rag)
The RAG admin page at /admin/rag provides a three-tier dashboard for managing all RAG stores. Each tier is displayed as a distinct card with stats, vault sources, actions, and chat links.
Public RAG Tier Card
Shows live status of the public RAG index by reading a lightweight metadata sidecar (embeddings-meta.json, ~200 bytes) instead of the 15 MB index file.
| Field | Description |
|---|---|
| Model | Embedding model used (text-embedding-3-small) |
| Dimensions | Embedding vector size (1536) |
| Total Chunks | Number of content chunks in the index |
| Size | File size of the full index |
| Built | Relative time since last build (hover for absolute) |
| Content Hash | Truncated SHA-256 of all source content |
Collection pills show per-collection chunk counts (posts, journal, docs, projects, learn).
Freshness indicator: green “Fresh” if built within 12 hours, yellow “Stale” between 12 and 48 hours, red “Old” beyond 48 hours.
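The indicator logic reduces to bucketing the index age against those two thresholds. A sketch (the function name is hypothetical):

```javascript
// Map index age to the dashboard label: <12 h "Fresh",
// 12-48 h "Stale", >48 h "Old".
function freshnessLabel(builtAt, now = Date.now()) {
  const hours = (now - new Date(builtAt).getTime()) / 3600000;
  if (hours < 12) return 'Fresh';
  if (hours <= 48) return 'Stale';
  return 'Old';
}
```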
Rebuild & Deploy button triggers a Cloudflare Pages deploy via CF_DEPLOY_HOOK_URL. The deploy runs the full build pipeline including auto-embeddings.js, which checks the content hash and only regenerates embeddings if content has actually changed.
Safe RAG Tier Card
Displays SQLite store statistics from rag-store-blog.db — document count, chunk count, embedded count, and per-collection breakdowns. In dev mode, shows vault sources with live file counts, vault management (add/remove custom vaults), and an “Ingest New Docs” button. Includes an “Audit for Leaks” button that scans all chunks for secrets that may have slipped past sanitization. When adding vaults, a warning banner explains that Safe tier content will be sanitized via identity_map.json.
Private RAG Tier Card
Same layout as Safe tier but for rag-store.db. Includes all Safe vaults plus the personal and knowledge-base vaults. Shows a “Local Only” privacy badge. Ingest button available in dev mode only. No audit scan — Private tier content is raw by design.
Embedding Service Status
Compact Ollama status indicator in the page header showing online/offline state and available models.
Vault Scanning
In dev mode, the dashboard performs a vault scan on load — counting .md files per vault directory to show source coverage. This uses the vault_scan API action which recursively counts markdown files without reading content.