RAG Data Landscape
Complete map of all vault sources, embedding databases, model variants, and backup archives powering the four-tier RAG system
Complete map of every data source, embedding database, and backup in the RAG system. Updated as new vaults are added or models change.
Vault Sources
Source data originates from Obsidian vaults, Arcturus-Prime content directories, and one external source (legal paperwork). These are the raw inputs to the ingestion pipeline.
Obsidian Vaults (~/Vaults/)
| Vault | Collection Name | Files | Size | Tier(s) |
|---|---|---|---|---|
| Arcturus-Prime-technical | Arcturus-Prime-technical | 1,801 | 80 MB | Knowledge, Vaults, Private |
| knowledge-vault-sanitized | knowledge-sanitized | 1,020 | 35 MB | Knowledge, Vaults, Private |
| argo-os-docs | argo-os-docs | 624 | 9.3 MB | Knowledge, Vaults, Private |
| dev-vault | dev-vault | 860 | 1.2 GB | Knowledge, Vaults, Private |
| ai-context | ai-context | 70 | 332 KB | Knowledge, Vaults, Private |
| build-swarm | build-swarm | 23 | 152 KB | Knowledge, Vaults, Private |
| career | career | 19 | 144 KB | Knowledge, Vaults, Private |
| tendril | tendril | 18 | 284 KB | Knowledge, Vaults, Private |
| jobspy | jobspy | 11 | 68 KB | Knowledge, Vaults, Private |
| laforceit-vault | laforceit | 8 | 40 KB | Knowledge, Vaults, Private |
| main | personal | 4,442 | 5.8 GB | Vaults, Private |
| conversation-archive | conversation-archive | 3,044 | 147 MB | Vaults, Private |
| test | test-conversations | 2 | 80 MB | Vaults, Private |
Arcturus-Prime Content (src/content/)
| Directory | Collection Name | Files | Size | Tier(s) |
|---|---|---|---|---|
| src/content/docs | Arcturus-Prime-docs | 108 | 1.3 MB | Knowledge, Vaults, Private |
| src/content/posts | Arcturus-Prime-posts | 76 | 972 KB | Knowledge, Vaults, Private |
| src/content/journal | Arcturus-Prime-journal | 73 | 572 KB | Knowledge, Vaults, Private |
| src/content/projects | Arcturus-Prime-projects | 8 | 56 KB | Knowledge, Vaults, Private |
| src/content/configurations | Arcturus-Prime-configs | 1 | 12 KB | Knowledge, Vaults, Private |
| src/content/learn | Arcturus-Prime-learn | 3 | 24 KB | Knowledge, Vaults, Private |
External Sources
| Source | Collection Name | Files | Size | Tier(s) |
|---|---|---|---|---|
| Legal Paperwork | legal-paperwork | 3,624 | 17 GB | Private only |
Vaults NOT in Any Tier
These exist on disk but are intentionally excluded from RAG ingestion:
| Vault | Location | Reason |
|---|---|---|
| ~/Vaults/Instructions | N/A | Prompt templates, not knowledge |
| ~/Vaults/RAG | N/A | Meta/config for RAG itself |
| ~/Vaults/Arcturus-Prime | N/A | Contains credentials |
Embedding Databases
All SQLite databases live in packages/argonaut/data/ (gitignored).
Active Databases
| Database | Tier | Model | Dims | Docs | Chunks | Size |
|---|---|---|---|---|---|---|
| rag-store-blog.db | Knowledge | qwen3-embedding:0.6b | 1024 | 3,778 | 33,101 | 290 MB |
| rag-store-vaults.db | Vaults | nomic-embed-text | 768 | 8,297 | 132,151 | 1.1 GB |
| rag-store.db | Private | nomic-embed-text | 768 | 10,440 | 166,183 | 1.5 GB |
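Because the active stores use two different vector sizes, a query has to be embedded with the same model as the store it targets. The following is a minimal sketch of that lookup, not the Argonaut implementation: the `STORES` table mirrors the rows above, while the `StoreInfo` type and the `checkQueryVector` helper are hypothetical names introduced for illustration.

```typescript
// Hypothetical registry mirroring the Active Databases table.
type Tier = "knowledge" | "vaults" | "private";

interface StoreInfo {
  file: string;  // file name under packages/argonaut/data/
  model: string; // embedding model used to build the store
  dims: number;  // length of each stored vector
}

const STORES: Record<Tier, StoreInfo> = {
  knowledge: { file: "rag-store-blog.db", model: "qwen3-embedding:0.6b", dims: 1024 },
  vaults: { file: "rag-store-vaults.db", model: "nomic-embed-text", dims: 768 },
  private: { file: "rag-store.db", model: "nomic-embed-text", dims: 768 },
};

// Throws if a query vector cannot be compared against the tier's stored vectors.
function checkQueryVector(tier: Tier, vector: number[]): StoreInfo {
  const store = STORES[tier];
  if (vector.length !== store.dims) {
    throw new Error(
      `${tier} expects ${store.dims}-dim vectors from ${store.model}, got ${vector.length}`
    );
  }
  return store;
}
```

A check like this fails fast when, for example, a 1024-dim qwen3 query vector is sent to the 768-dim Vaults store.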
Backup / Comparison Databases
| Database | Tier | Model | Dims | Purpose |
|---|---|---|---|---|
| rag-store-blog-nomic.db | Knowledge | nomic-embed-text | 768 | A/B comparison baseline |
| rag-store-vaults-nomic.db | Vaults | nomic-embed-text | 768 | Pre-upgrade backup |
| rag-store-private-nomic.db | Private | nomic-embed-text | 768 | Pre-upgrade backup |
Public Tier (Deployed)
| File | Chunks | Size | Model | Dims |
|---|---|---|---|---|
| public/embeddings-index.json | 775 | 16.1 MB | OpenRouter text-embedding-3-small | 1536 |
Embedding Models
Installed on Local GPU (RTX 4070 Ti)
| Model | Tag | Size | Dimensions | Context | MTEB Retrieval | Speed |
|---|---|---|---|---|---|---|
| qwen3-embedding | :0.6b | 639 MB | 1024 | 32K | 61.82 | ~8 chunks/s |
| qwen3-embedding | :latest (8b) | 4.7 GB | 4096 | 32K | 66.27 | ~2 chunks/s |
| nomic-embed-text | :latest | 274 MB | 768 | 8K | 49.01 | ~25 chunks/s |
Important: The :latest tag for qwen3-embedding maps to the 8b model (4096-dim). Always use :0.6b explicitly to get the 1024-dim model.
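One way to guard against this tag mix-up is to reject unpinned qwen3 references before any embedding request is made. The sketch below is illustrative only: `EXPECTED_DIMS` copies the dimensions from the table above, `resolveDims` is a hypothetical helper, and treating a bare model name as `:latest` is an assumption about how the model runtime resolves tags.

```typescript
// Dimensions per explicit model tag, copied from the table above.
const EXPECTED_DIMS: Record<string, number> = {
  "qwen3-embedding:0.6b": 1024,
  "qwen3-embedding:latest": 4096, // resolves to the 8b model
  "nomic-embed-text:latest": 768,
};

// Returns the vector size for a model reference, refusing the ambiguous qwen3 forms.
function resolveDims(modelRef: string): number {
  // Assumption: a reference without a tag is resolved as ":latest".
  const ref = modelRef.includes(":") ? modelRef : `${modelRef}:latest`;
  if (ref === "qwen3-embedding:latest") {
    throw new Error('Pin "qwen3-embedding:0.6b"; ":latest" is the 4096-dim 8b model');
  }
  const dims = EXPECTED_DIMS[ref];
  if (dims === undefined) {
    throw new Error(`Unknown embedding model: ${ref}`);
  }
  return dims;
}
```

Calling `resolveDims("qwen3-embedding")` throws, while `resolveDims("qwen3-embedding:0.6b")` returns 1024.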
Benchmark Results (10-query test)
| Store | Model | Avg Top-1 Score | Avg Search Time |
|---|---|---|---|
| Knowledge (33K chunks) | qwen3-0.6b | 0.809 | 3,031 ms |
| Private (166K chunks) | nomic | 0.814 | 48,142 ms |
| Vaults (132K chunks) | nomic | 0.770 | 27,177 ms |
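qwen3-0.6b delivers comparable relevance (0.809 average top-1 score vs 0.770 to 0.814 for nomic) with far lower search times. The speedup most likely comes from the smaller store (33K chunks vs 132K to 166K), not from vector size: qwen3 vectors are 1024-dim, larger than nomic's 768. Because each store holds different content, these numbers are not a like-for-like model comparison.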
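The scale difference between the stores can be made concrete with a rough cost estimate. The sketch below assumes search is a brute-force cosine scan over every chunk, which the source does not state; `cosine` and `scanCost` are hypothetical helpers, and the chunk counts and dimensions come from the tables above.

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Multiply-adds for one dot product per chunk in a full scan (assumed strategy).
function scanCost(chunks: number, dims: number): number {
  return chunks * dims;
}

const knowledgeCost = scanCost(33_101, 1024); // Knowledge store, qwen3
const privateCost = scanCost(166_183, 768);   // Private store, nomic

// Ratio of raw vector arithmetic between the two stores.
console.log((privateCost / knowledgeCost).toFixed(1));
```

Under this assumption the Private store needs roughly 3.8 times the vector arithmetic of the Knowledge store, while the measured search time is roughly 16 times higher (48,142 ms vs 3,031 ms). That gap suggests other costs, such as reading a much larger database file, also contribute.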
Archive on AllShare
All databases and source mirrors are archived on /mnt/AllShare/rag/ (2.0 TB NTFS3 partition, ~1.5 TB free).
/mnt/AllShare/rag/
├── manifest.json # Full inventory of all collections and databases
├── databases/ # All .db files (active + backups)
│ ├── rag-store-blog.db # Knowledge tier (qwen3)
│ ├── rag-store-blog-nomic.db # Knowledge tier (nomic backup)
│ ├── rag-store-vaults.db # Vaults tier
│ ├── rag-store-vaults-nomic.db # Vaults nomic backup
│ ├── rag-store.db # Private tier
│ └── rag-store-private-nomic.db # Private nomic backup
└── sources/ # Mirrored vault sources
    ├── dev-vault/ # 1.2 GB
    ├── Arcturus-Prime-technical/ # 80 MB
    ├── personal/ # 5.8 GB
    ├── legal-paperwork/ # 17 GB
    └── ... (20 collections total)
Policy: Never delete backups from AllShare unless explicitly instructed.
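A quick completeness check of the archive can be sketched as follows. It only compares file names: `EXPECTED_DBS` is taken from the tree above, and `missingDatabases` is a hypothetical helper that accepts a directory listing rather than touching the file system itself.

```typescript
// Database files the archive is expected to hold, per the tree above.
const EXPECTED_DBS: string[] = [
  "rag-store-blog.db",
  "rag-store-blog-nomic.db",
  "rag-store-vaults.db",
  "rag-store-vaults-nomic.db",
  "rag-store.db",
  "rag-store-private-nomic.db",
];

// Returns the expected database files that are absent from a directory listing.
function missingDatabases(listing: string[]): string[] {
  const present = new Set(listing);
  return EXPECTED_DBS.filter((name) => !present.has(name));
}
```

In practice the listing could come from Node's `fs.readdirSync("/mnt/AllShare/rag/databases")`; an empty result means every database named above is present.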
Tier Composition
How tiers build on each other
Public (775 chunks)
└── Blog posts, journal, docs, projects, learn
└── Embedded with OpenRouter text-embedding-3-small (cloud)
└── Deployed as static JSON to CF Pages CDN
Knowledge / Safe (33,101 chunks)
└── All 10 Obsidian knowledge vaults
└── All 6 Arcturus-Prime content directories
└── Sanitized via identity_map.json (148 patterns)
└── Embedded with qwen3-embedding:0.6b (local GPU)
└── Safe for external AI providers
Vaults (132,151 chunks)
└── Everything in Knowledge
└── + personal vault (5.8 GB)
└── + conversation-archive (old knowledge base, 147 MB)
└── + test conversations (80 MB)
└── + Arcturus-Prime configs + learn
└── NOT sanitized — raw content
└── Embedded with nomic-embed-text (local GPU)
└── Local access only
Private / Full (166,183 chunks)
└── Everything in Vaults
└── + legal-paperwork (17 GB, 3,624 files)
└── NOT sanitized — passwords, keys preserved
└── Embedded with nomic-embed-text (local GPU)
└── Local access only
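The nesting of the three local tiers can be expressed as data, with each tier listing only what it adds to the tier beneath it. This is a sketch under stated assumptions, not the project's code: collection names and tier membership follow the Tier(s) columns of the source tables, and `TIER_ADDS` and `collectionsFor` are hypothetical names. The Public tier is omitted because it is built separately from a subset of the Arcturus-Prime content.

```typescript
type TierName = "knowledge" | "vaults" | "private";

interface TierDef {
  base?: TierName; // tier this one builds on, if any
  adds: string[];  // collections introduced at this tier
}

// Membership taken from the Tier(s) columns of the source tables.
const TIER_ADDS: Record<TierName, TierDef> = {
  knowledge: {
    adds: [
      "Arcturus-Prime-technical", "knowledge-sanitized", "argo-os-docs", "dev-vault",
      "ai-context", "build-swarm", "career", "tendril", "jobspy", "laforceit",
      "Arcturus-Prime-docs", "Arcturus-Prime-posts", "Arcturus-Prime-journal",
      "Arcturus-Prime-projects", "Arcturus-Prime-configs", "Arcturus-Prime-learn",
    ],
  },
  vaults: { base: "knowledge", adds: ["personal", "conversation-archive", "test-conversations"] },
  private: { base: "vaults", adds: ["legal-paperwork"] },
};

// Resolves the full collection list for a tier by walking down its base tiers.
function collectionsFor(tier: TierName): string[] {
  const def = TIER_ADDS[tier];
  const inherited = def.base ? collectionsFor(def.base) : [];
  return [...inherited, ...def.adds];
}
```

Resolving `collectionsFor("private")` yields 20 collections, which matches the collection total noted for the AllShare archive.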
Build & Re-embed Commands
cd ~/Development/Arcturus-Prime
# Build specific tier (ingest + embed)
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier knowledge
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier vaults
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier private
# Embed-only (skip file scanning)
npx tsx packages/argonaut/scripts/build-blog-rag.ts --tier knowledge --embed-only
# Re-embed with different model (creates a copy)
npx tsx packages/argonaut/scripts/re-embed-db.ts \
  --source rag-store-blog.db \
  --output rag-store-blog-nomic.db \
  --model nomic-embed-text
# Benchmark/compare databases
npx tsx packages/argonaut/scripts/test-rag-search.ts --benchmark
npx tsx packages/argonaut/scripts/test-rag-search.ts --compare --query "tailscale vpn"
npx tsx packages/argonaut/scripts/test-rag-search.ts --list # show all discovered DBs
Configuration
Vault sources are defined in packages/argonaut/src/rag/vault-config.ts. Custom vaults can be added via data/vault-config.json:
{
  "knowledge": [
    { "collection": "my-vault", "path": "/path/to/vault", "sourceType": "vault" }
  ],
  "private": []
}
The build script auto-discovers custom vaults and includes them in the appropriate tier.
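Validating the custom config before a build can catch malformed entries early. The sketch below is a hedged illustration: the `CustomVault` shape is inferred from the JSON example above, `parseVaultConfig` is a hypothetical helper rather than part of `vault-config.ts`, and tier keys are left as free-form strings because the source only shows `knowledge` and `private`.

```typescript
// Shape inferred from the JSON example above; field semantics are assumptions.
interface CustomVault {
  collection: string;
  path: string;
  sourceType: string;
}

type CustomVaultConfig = Record<string, CustomVault[]>;

// Parses vault-config.json text and rejects entries missing a required string field.
function parseVaultConfig(json: string): CustomVaultConfig {
  const raw = JSON.parse(json) as Record<string, unknown>;
  const config: CustomVaultConfig = {};
  for (const [tier, entries] of Object.entries(raw)) {
    if (!Array.isArray(entries)) {
      throw new Error(`"${tier}" must be an array`);
    }
    config[tier] = entries.map((entry, i) => {
      for (const key of ["collection", "path", "sourceType"] as const) {
        if (typeof entry?.[key] !== "string") {
          throw new Error(`${tier}[${i}] is missing string field "${key}"`);
        }
      }
      return entry as CustomVault;
    });
  }
  return config;
}
```

Passing the example file's contents returns one `knowledge` entry for `my-vault` and an empty `private` list; an entry without a `path` raises an error naming the tier and index.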