How Many Dimensions Is That Vector?
The Problem
I have three RAG databases. Same documents, same 294,252 chunks, same 22 collections. Different embedding models.
| Database | Model | Dimensions | Size |
|---|---|---|---|
| rag-store-vaults.db | qwen3-embedding:8b | 4,096 | 5.8 GB |
| rag-store-vaults-nomic.db | nomic-embed-text | 768 | 2.3 GB |
| rag-store-vaults-qwen06b.db | qwen3-embedding:0.6b | 1,024 | 2.4 GB |
Three models because I wanted to compare quality vs. speed. The 8B parameter model produces 4,096-dimensional vectors and takes ~23 seconds per query across 294K chunks. The nomic model produces 768-dimensional vectors and takes ~9 seconds. The 0.6B model sits in the middle at 1,024 dimensions.
The CLI to search looks like:
```bash
node dist/cli/index.js search "query" \
--db /mnt/AllShare/rag/databases/rag-store-vaults.db \
--model qwen3-embedding:8b
```
You have to specify the model. Get it wrong, and you’re sending a 768-dimensional query vector against a database full of 4,096-dimensional vectors. The cosine similarity math still runs — it just returns garbage. No error. No warning. Just irrelevant results scored at near-zero similarity, which looks like “nothing matched your query” rather than “you’re comparing apples to hypercubes.”
I got this wrong more often than I want to admit. Switched between databases for testing, forgot to switch the model flag, spent 10 minutes wondering why my query about Gentoo recovery returned results about Traefik proxy configuration.
The Fix: Read the Vectors
Each embedding vector is stored as a BLOB in SQLite. The vector dimensions are fixed per model — every chunk embedded by nomic is exactly 768 floats. Every chunk embedded by qwen8b is exactly 4,096 floats.
So: grab one non-null embedding from the database. Count the dimensions. Map it to a model.
```typescript
function detectEmbeddingModel(db: Database): string {
  const row = db.prepare(
    'SELECT embedding FROM chunks WHERE embedding IS NOT NULL LIMIT 1'
  ).get() as { embedding?: Buffer } | undefined;
  if (!row?.embedding) return 'unknown';
  // The BLOB is raw float32 data: 4 bytes per dimension. (Passing the
  // Buffer straight to new Float32Array() would copy it byte-by-byte
  // and report 4x the real dimension count.)
  const dims = row.embedding.byteLength / 4;
  switch (dims) {
    case 768: return 'nomic-embed-text';
    case 1024: return 'qwen3-embedding:0.6b';
    case 4096: return 'qwen3-embedding:8b';
    default: return `unknown-${dims}d`;
  }
}
```
Now the CLI works without the --model flag. Open the database, read the first embedding, infer the model, embed the query with the same model, search. If you do specify --model, it validates that your choice matches what’s stored.
Dimension Validation
The auto-detection solved the search side. But the ingestion side had the same problem.
If you embed 100,000 chunks with nomic (768d) and then accidentally re-embed some chunks with qwen8b (4,096d), you get a database with mixed dimensions. Some vectors are 768 floats. Some are 4,096 floats. The cosine similarity function doesn’t check — it’ll compare a 768d vector against a 4,096d vector by treating the extra dimensions as zero.
The results aren’t wrong in a way that’s obvious. They’re wrong in a way that makes you think the embedding quality is bad. “Maybe nomic just doesn’t handle this query well.” No — the query vector has 768 dimensions and the stored vector has 4,096, and you’re computing similarity across a dimensional mismatch.
Added validation to the embedding pipeline:
```typescript
function setEmbeddings(chunks: Chunk[], embeddings: Float32Array[]): void {
  const expectedDims = detectDimensionsFromExisting(db);
  for (const embedding of embeddings) {
    if (expectedDims && embedding.length !== expectedDims) {
      throw new Error(
        `Dimension mismatch: expected ${expectedDims}, got ${embedding.length}. ` +
        `Wrong embedding model?`
      );
    }
  }
  // ... store embeddings
}
```
Now if you try to embed with the wrong model, it fails immediately instead of silently corrupting the database.
Batch Embedding
While I was in the embedding code, I noticed the embedder was calling Ollama one chunk at a time:
```typescript
// Before: one HTTP request per chunk
for (const chunk of chunks) {
  const response = await ollama.embed({ model, input: chunk.text });
  chunk.embedding = response.embeddings[0];
}
```
Ollama’s /api/embed endpoint accepts an array of inputs. One HTTP request, multiple embeddings back. The model loads once, processes a batch, returns them all.
```typescript
// After: batch embedding
const texts = chunks.map(c => c.text);
const response = await ollama.embed({ model, input: texts });
// response.embeddings is an array matching the input order
chunks.forEach((chunk, i) => { chunk.embedding = response.embeddings[i]; });
```
For test batches of 1,000 chunks, the batch approach is roughly 3x faster. Less HTTP overhead, better GPU utilization, fewer model load/unload cycles.
The Performance Picture
The real bottleneck isn’t embedding — it’s search. Every query scans all 294,252 vectors with brute-force cosine similarity. O(n) with no index.
| Model | Dimensions | Query Time | Relative |
|---|---|---|---|
| nomic | 768 | ~9 seconds | 1x |
| qwen06b | 1,024 | ~12 seconds | 1.3x |
| qwen8b | 4,096 | ~23 seconds | 2.5x |
nomic is 2.5x faster than qwen8b, and in my testing, the quality difference is negligible. For interactive queries during writing — where I’m fact-checking dates or looking up specific incident details — 9 seconds is tolerable. 23 seconds means I lose my train of thought.
The real fix is an approximate nearest neighbor index (HNSW or IVF). That would bring queries under 3 seconds regardless of dimension count. But that requires either compiling the sqlite-vss extension or building an in-memory index at startup. It’s on the list.
For now, auto-detection at least prevents the “wrong model, garbage results” failure mode that ate the bulk of my debugging time.
The Composite Index
One more thing. The vector search query was:

```sql
SELECT * FROM chunks WHERE embedding IS NOT NULL AND active = 1
```
No index on that combination. SQLite was doing a full table scan, checking both conditions on every row, before even getting to the cosine similarity math.
Added a composite index:

```sql
CREATE INDEX idx_chunks_active_embedded
ON chunks(active, embedding)
WHERE embedding IS NOT NULL;
```
This doesn’t help with the O(n) vector comparison — you still compare against every embedded chunk. But it eliminates the “find all active embedded chunks” scan that happens before the comparison starts. Small improvement.
The 9-second nomic query is now about 8 seconds. Not transformative. But free.
The Three Fixes
Auto-detection, dimension validation, batch embedding. Combined, they fixed three categories of silent failure:
- Querying with the wrong model → garbage results, no error
- Embedding with the wrong model → corrupted database, no error
- Single-chunk embedding → 3x slower than necessary, no error
The common thread: everything “worked.” No crashes, no exceptions, no error messages. Just quietly wrong results, quietly corrupted data, and quietly wasted time.
The most dangerous bugs are the ones where everything looks fine.