ChromaDB Version Hell: When Your Vector Database Speaks a Different Dialect

The Problem

I have a legal research RAG system. Four ChromaDB collections. 105,904 documents indexed — Colorado statutes, case law, regulations, court rules. The embeddings work. The semantic search works. Everything is fine.

Until I try to access the database from a different machine.

The collections were indexed on one host running ChromaDB 1.5.2. My development environment was running ChromaDB 0.5.23. Same Python code. Same collection names. Same database path mounted over the network.

chromadb.errors.InvalidCollectionException

That’s when I learned: ChromaDB’s HNSW binary index format changed between versions, and there is no migration path.

The HNSW Binary Format Problem

ChromaDB uses HNSW (Hierarchical Navigable Small World) graphs for approximate nearest neighbor search. The index is stored as binary files on disk. When ChromaDB 1.x shipped, they changed the binary serialization format.

There’s no error message that says “hey, this index was written by a different version.” You just get crashes, corrupted results, or opaque exceptions. The database opens. The collections list. But the moment you try to query vectors, everything breaks.

0.4.x: Works. Original binary format. But has numpy 2.x incompatibility — if your system has numpy >= 2.0, chromadb 0.4.x crashes on import with a C extension error. You need to pin numpy < 2.0, which conflicts with everything else in your Python environment.

0.5.x / 0.6.x: Introduced a _type key in collection metadata. If your collections were created without this key (by an older version), you get:

KeyError: '_type'

Deep in the deserialization code. No migration script. No “run this to fix your metadata.” Just a KeyError that makes the collection unreadable.

1.5.x: New HNSW binary format. Reads its own indexes fine. Cannot read indexes created by 0.4.x or 0.5.x. And in some configurations, segfaults instead of raising a clean error.

Yes. Segfaults. The Python process just dies.

The Migration Matrix

Source Version   Target Version   Result
0.4.x            0.5.x            KeyError: '_type'
0.4.x            1.5.x            Segfault or corrupt reads
0.5.x            1.5.x            HNSW format mismatch
1.5.x            0.5.x            Incompatible downgrade

There is no in-place upgrade. You can’t run a migration script. The official answer is “re-index everything.”

For a 105,904-document collection, re-indexing means re-computing all embeddings. Depending on your embedding model, that’s hours of compute.

What I Actually Did

I matched the version. The indexes were created with ChromaDB 1.5.2, so every environment that touches those files needs to run 1.5.2.

# In the project's virtualenv
pip install chromadb==1.5.2

And then it worked. All four collections loaded. Semantic search returned results. No segfaults, no KeyErrors, no format mismatches.

The “fix” was: don’t run different versions against the same data.

The numpy Trap

If you’re on ChromaDB 0.4.x and thinking “I’ll just stay on the old version,” you’ll hit this:

ImportError: numpy.core.multiarray failed to import

numpy 2.0 changed its C API. ChromaDB 0.4.x’s compiled extensions expect the old API. You can pin numpy<2, but numpy 2.x has been the default for a while now. Every time you create a fresh virtualenv or update dependencies, numpy pulls in 2.x and ChromaDB 0.4.x breaks.

So you can’t stay on old ChromaDB (numpy breaks it) and you can’t upgrade ChromaDB (the index format breaks it). You’re stuck.

Unless you re-index.

The Right Approach

If I were starting fresh with ChromaDB today:

Pin versions explicitly. Not just chromadb>=0.4 in your requirements.txt. Pin the exact version: chromadb==1.5.2. And document why.
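
Concretely, that's one line plus a comment explaining the constraint (the version here is the one from this post):

```text
# requirements.txt
chromadb==1.5.2  # on-disk HNSW indexes were written by 1.5.2; re-index before changing this
```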

Store the ChromaDB version alongside the data. I now have a VERSION file in the ChromaDB data directory that records which version created the indexes. Any code that opens the database checks this file first.
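
A sketch of that VERSION-file convention — the file name and helper names are mine, but the idea is just "write the creating version next to the data, verify it on open":

```python
# Record which chromadb version created the data directory, and read it back
# before opening the database. Filename and function names are illustrative.
from pathlib import Path

def write_version_marker(data_dir: str, version: str) -> None:
    """Drop a VERSION file next to the ChromaDB binary index files."""
    Path(data_dir, "VERSION").write_text(version + "\n")

def read_version_marker(data_dir: str) -> str:
    """Return the recorded version, or fail loudly if nobody recorded one."""
    marker = Path(data_dir, "VERSION")
    if not marker.exists():
        raise FileNotFoundError(
            f"No VERSION file in {data_dir}; cannot tell which chromadb wrote it"
        )
    return marker.read_text().strip()
```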

Plan for re-indexing. The embeddings are the expensive part. I cache the raw embedding vectors separately — if I need to rebuild a ChromaDB collection, I can load pre-computed vectors instead of re-running the embedding model on 105,904 documents.
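
A minimal version of that cache, assuming a simple JSON-lines layout (the file format and function names are my choice, not anything ChromaDB prescribes):

```python
# Cache raw embedding vectors outside ChromaDB so a rebuild never has to
# re-run the embedding model. One JSON record per document, one per line.
import json
from pathlib import Path

def save_embeddings(cache_file: str, ids: list[str],
                    vectors: list[list[float]]) -> None:
    """Persist id -> vector pairs as JSON lines."""
    with open(cache_file, "w") as f:
        for doc_id, vec in zip(ids, vectors):
            f.write(json.dumps({"id": doc_id, "embedding": vec}) + "\n")

def load_embeddings(cache_file: str) -> tuple[list[str], list[list[float]]]:
    """Reload the cache, ready to hand to a fresh collection."""
    ids, vectors = [], []
    with open(cache_file) as f:
        for line in f:
            rec = json.loads(line)
            ids.append(rec["id"])
            vectors.append(rec["embedding"])
    return ids, vectors
```

Rebuilding a collection under a new ChromaDB version then becomes a pass of collection.add(ids=..., embeddings=...) over the cached vectors — disk I/O instead of hours of model inference.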

Use server mode, not embedded. ChromaDB has a client-server mode where the server manages the index and clients connect over HTTP. This means one process, one version, accessing the data. No version mismatches from different environments.
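
In practice that looks like the following — the chroma run CLI and chromadb.HttpClient are real ChromaDB features, but the host, port, and path values here are placeholders:

```shell
# One server process owns the data directory, so exactly one chromadb
# version ever touches the binary index files:
chroma run --path ./chroma_data --host 127.0.0.1 --port 8000

# Clients in any environment connect over HTTP instead of opening the
# files directly, e.g. in Python:
#   client = chromadb.HttpClient(host="127.0.0.1", port=8000)
```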

The Broader Issue

ChromaDB isn’t alone here. Vector databases are young. Schema migrations, binary format stability, version compatibility — these are problems that traditional databases solved decades ago. PostgreSQL can read indexes from versions years apart. SQLite databases created in 2004 still open today.

Vector databases haven’t gotten there yet. And the documentation doesn’t warn you. ChromaDB’s migration guide (if you can find it) doesn’t mention that the HNSW binary format changed, or that downgrading segfaults, or that the _type key was silently added in 0.5.x.

You find out the way I did: it stops working and you spend an afternoon figuring out why.

The Checklist

If you’re using ChromaDB and planning to upgrade, or moving data between environments:

  1. Check the ChromaDB version on the machine that created the indexes
  2. Check the ChromaDB version on the machine that will read them
  3. If they don’t match exactly, expect problems
  4. If you must upgrade, re-index from cached embeddings (not from the ChromaDB collection)
  5. Pin your ChromaDB version in requirements.txt — == not >=
  6. If using numpy, verify compatibility with your ChromaDB version before updating

It’s not complicated. It’s just not documented. And the failure modes (segfaults, KeyErrors deep in deserialization) don’t point you toward “version mismatch” as the root cause.

Hopefully this saves someone an afternoon.