Daniel sent us this one — he's been thinking about something that doesn't get enough airtime in RAG discussions. We all know you can't just dump a document into a vector database. You need an embedding model to convert it into numeric vectors. But here's the thing he's flagging — once you've embedded at scale, those vectors are locked to that model's mathematical space. Change the model, you have to re-embed everything. So he's asking three things. First, are embeddings portable between vector databases? Second, what happens when a business has embedded tens of millions of documents and their embedding model gets deprecated? And third, what if you've lost the original source material — you've got the vectors but not the raw text — and now you can't redo the pipeline? That third one is genuinely scary.
It's the silent killer in production RAG, and almost nobody talks about it until it bites them. By the way, quick note — today's script is coming from DeepSeek V four Pro. Good to have you in the chair.
So where do we start? The portability question feels like the natural entry point.
Let's start there, because the answer is counterintuitive. The raw vectors themselves — these dense float arrays — are absolutely portable between vector databases. You can export them as JSON, CSV, or binary blobs, and load them into Postgres with pgvector, or Pinecone, or Weaviate, or Milvus. The numeric representation doesn't care what database it lives in. So at the storage layer, you're not locked in.
That's not the whole story, is it?
Not even close. The portability that matters is at the model level, not the database level. Two different embedding models produce vectors in fundamentally different geometric spaces. Even if both output the same dimensionality — say both give you fifteen hundred dimensions — a vector from model A and a vector from model B are not comparable. They're coordinates in different universes. The Weaviate blog had a piece on this last October, and they were blunt: normalization and alignment techniques might help map one space to another, but the chances of losing context or seeing variance in performance are high.
You can move the vectors around freely, but only if the model that generated them stays the same. The moment you switch models, those vectors become meaningless in the new space.
Here's where it gets real. In January of this year, Google deprecated their text-embedding-zero-zero-four model. A developer who'd built an entire movie recommendation system on it woke up to find his database of thousands of stored embeddings was useless. The replacement models were incompatible. He had to regenerate everything from scratch. He wrote about it on ITNEXT in February — it's a brutal migration story.
That's the nightmare scenario Daniel's getting at. You build a system, it works beautifully, and then some product manager at a cloud provider decides to sunset the model you depend on. Your RAG pipeline is a pumpkin overnight.
The cost of re-embedding is not trivial. For a hundred-million-vector dataset, the one-time embedding costs alone run between eight thousand and fifteen thousand dollars. That's just the embedding compute — it doesn't include the database migration, the increased write units, the processing overhead. Actian crunched these numbers in February. At enterprise scale — say five hundred million vectors, a hundred million queries a month — annual vector database costs hit thirty to fifty-four thousand dollars across the major providers. And the embedding and inference fees often match or exceed the database bill itself.
You're paying for storage, then paying again for the right to use your own data, and when the model changes, you pay a third time to rebuild everything.
That third payment is the one nobody budgets for. And it's not just about money — it's about time, downtime, what happens to your production system while you're re-embedding fifty million documents.
Which brings us to Daniel's second question. If you're a business that's gone all-in on a particular embedding model and it gets deprecated, what's the playbook?
There's a pattern that's emerged called blue-green migration. You never overwrite your existing vectors in place. Instead, you create a completely new collection, ingest fresh embeddings using the new model, validate retrieval quality against your benchmarks — and only then point production traffic to the new collection. The ITNEXT piece really hammered this. If you overwrite in place, you get what they called Frankenstein results — a search query hits a mix of old and new coordinate systems, and the results are nonsense.
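Just to make the shape of that concrete, here's a minimal sketch of a blue-green re-embedding flow. Every method on the db object, plus recall_at_k and set_production_alias, is a hypothetical stand-in for whatever your own vector database client and evaluation harness provide — this is the order of operations, not code from the ITNEXT post.

```python
# Blue-green re-embedding sketch. The db client, recall_at_k(), and
# set_production_alias() are hypothetical placeholders for your own stack.

def migrate_blue_green(db, docs, old_model, new_model, benchmark_queries):
    blue = "docs_v1"    # current production collection, left untouched
    green = "docs_v2"   # candidate collection, embedded with the new model

    db.create_collection(green, dim=new_model.dim)

    # 1. Re-embed from the original source text, never from the old vectors.
    for doc in docs:
        vec = new_model.embed(doc.text)
        db.upsert(green, id=doc.id, vector=vec,
                  payload={"model": new_model.name, "text": doc.text})

    # 2. Validate retrieval quality against a benchmark before any cutover.
    old_score = recall_at_k(db, blue, old_model, benchmark_queries, k=10)
    new_score = recall_at_k(db, green, new_model, benchmark_queries, k=10)
    if new_score < old_score:
        raise RuntimeError("new collection underperforms; keep serving blue")

    # 3. Only now point production traffic at the new collection.
    #    The blue collection stays around so rollback is instant.
    set_production_alias("docs", green)
```

The point is the sequence: new collection, validate, then flip, with the old collection kept alive as the rollback path.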
That makes sense. You're running two indexes in parallel during the migration. But that doubles your storage costs during the transition.
It does, temporarily. But the alternative is worse. You corrupt your search quality, and if you're not instrumenting retrieval quality — which most organizations aren't — you might not even notice the degradation. You just get slowly worse answers over time. The dbi-services blog had a fantastic post on this in February. They argued that the real silent killer in production RAG isn't bad models, it's stale embeddings. Most organizations deploy a RAG system, it works in demo, it goes to production, and nobody instruments it. There's no precision at K, no normalized discounted cumulative gain, no confidence scoring. The embedding pipeline might be stale, or it might be fine — they literally cannot tell.
How many production RAG systems are out there right now, quietly serving degraded results, and nobody has any idea?
I'd bet it's more than half. The monitoring story for RAG is still incredibly immature. We've got decades of tooling for traditional databases — query latency, index fragmentation, cache hit ratios. For vector search, we're still in the wild west. Most teams ship it and pray.
You've got vendor deprecation risk on one side, and invisible quality degradation on the other. What's the actual mitigation?
The first thing, and this is where multiple sources converge, is that you track the embedding model version in your metadata. Every single vector you store should have, alongside it, the model name, the model version, and a hash of the source text. The Domo RAG guide from this year recommends this explicitly. The Augment Code multimodal RAG guide adds another layer — they use environment variables for model version with rollback capabilities, and Git-tracked prompts for full deployment reproducibility.
You're versioning your embeddings the way you'd version your code. That seems obvious in retrospect, but I'd bet most teams aren't doing it.
They're absolutely not. And the source hash is critical for another reason, which connects to Daniel's third question. If you've got the source hash, you can detect when the underlying content has changed. But more fundamentally, you must store the original source text alongside the embedding. You cannot rely on the vector alone. If you lose the original document and your model gets deprecated, you've suffered permanent data loss. You can't regenerate what you no longer have.
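Concretely, the record you keep next to each vector can be this small. The field names here are our own, but the substance — carrying model identity, a source hash, and the raw text with every embedding — is exactly what those guides recommend.

```python
import hashlib

def make_record(doc_id: str, text: str, vector: list[float],
                model_name: str, model_version: str) -> dict:
    """Bundle a vector with everything needed to audit or re-embed it later."""
    return {
        "id": doc_id,
        "vector": vector,
        "source_text": text,   # never discard this; the vector is derived data
        "source_hash": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        "embedding_model": model_name,
        "embedding_model_version": model_version,
        "is_current": True,
    }
```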
That third scenario Daniel raised — you've embedded everything, you've lost the originals, and now the model's gone. That's not just an inconvenience. That's data extinction.
It's the digital equivalent of burning the library and keeping only the card catalog. The card catalog tells you where things were, but it doesn't give you the books. And if someone redesigns the library's layout, your card catalog becomes meaningless.
That's actually a decent analogy. I'm going to pretend I didn't just compliment you.
Noted and ignored. But this isn't hypothetical. There are real cases where companies have embedded customer support tickets, legal documents, medical records, and then purged the originals for cost or compliance reasons. They figure the embeddings are enough because the embeddings work today. They're not thinking about model deprecation three years out.
Because nobody thinks about deprecation when they're building. You're trying to get the thing working, celebrating that the retrieval is good — you're not thinking about what happens when the foundation shifts under you.
The foundation will shift. It's not a question of if, it's when. Embedding models are evolving rapidly. New architectures, new training paradigms, models that handle multiple languages better, models that handle code better. The model you choose today will not be the best model in two years. It might not even exist in two years.
What's the self-hosting argument here? Daniel mentioned portability between databases, but it sounds like the deeper lock-in is at the model provider level.
The ITNEXT author's key insight was this — the moment you persist embeddings at scale, you inherit vendor lifecycle risk. Models get deprecated, the new ones are incompatible, and output characteristics change. This isn't just about pricing. It's representational. Your stored data is coupled to a model's geometry. His recommended mitigation was to self-host open-source embedding models: run an open model like BAAI's BGE base locally, using a library like FastEmbed on the ONNX runtime. You avoid API deprecation schedules entirely because you control the model.
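To show how small that footprint is — and assuming FastEmbed's TextEmbedding interface and the bge-base model name are still current, check the docs for your installed version — local embedding generation is a few lines:

```python
# pip install fastembed -- runs the model locally via ONNX Runtime,
# no API key and no provider deprecation calendar.
from fastembed import TextEmbedding

model = TextEmbedding(model_name="BAAI/bge-base-en-v1.5")  # model name assumed

docs = [
    "pgvector stores embeddings inside PostgreSQL.",
    "Blue-green migration keeps old and new collections side by side.",
]

# embed() yields one numpy array per input document
vectors = list(model.embed(docs))
print(len(vectors), len(vectors[0]))   # 2 vectors, 768 dimensions for bge-base
```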
Self-hosting has its own costs. You're managing infrastructure, handling updates yourself, responsible for performance.
There's no free lunch. But the trade-off is control over your own destiny. When you self-host, you decide when to upgrade. You can run the old model and the new model side by side for as long as you need. You're not subject to some cloud provider's end-of-life calendar.
You mentioned something earlier I want to circle back to. The change detection piece. Not every document update needs to trigger a re-embedding, right?
This is where it gets really interesting. The dbi-services team proposed an event-driven architecture using PostgreSQL triggers with a queue table. When a document changes, the trigger fires and checks whether the change is semantically meaningful. A typo fix — correcting "PostgreSLQ" to "PostgreSQL" — probably doesn't change the embedding enough to matter. You can skip that re-embedding entirely.
How do you determine what's semantically meaningful?
You compare the before and after images of the row change through change data capture. If the edit distance is below some threshold, or if the key terms haven't changed, you skip it. In their lab tests on twenty-five thousand Wikipedia articles, twelve percent of simulated mutations were metadata-only — the trigger correctly skipped all of them. That's twelve percent of your re-embedding compute that you just don't spend.
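Their implementation lives inside PostgreSQL triggers, but a toy Python version of the "is this change worth re-embedding" decision might look like the sketch below — the similarity measure and the threshold are our assumptions, not theirs.

```python
import difflib

def needs_reembedding(old_text: str, new_text: str,
                      similarity_threshold: float = 0.9) -> bool:
    """Decide whether a row change is meaningful enough to re-embed.

    Metadata-only updates arrive with identical text; tiny edits like a fixed
    typo score close to 1.0 and get skipped. Anything below the threshold
    goes onto the re-embedding queue.
    """
    if old_text == new_text:
        return False    # metadata-only change, skip entirely
    ratio = difflib.SequenceMatcher(None, old_text, new_text).ratio()
    return ratio < similarity_threshold

# A fixed typo barely moves the ratio, so it gets skipped.
print(needs_reembedding("We run PostgreSLQ 16.", "We run PostgreSQL 16."))  # False
# A substantive rewrite falls well below the threshold, so it gets queued.
print(needs_reembedding("We run PostgreSQL 16.", "We migrated to MySQL."))  # True
```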
At a hundred million documents, twelve percent is real money.
Real money and real time. And it means your embedding index stays more current because you're not wasting cycles on trivial updates. The queue table approach also means you can process updates incrementally, rather than running massive batch jobs that take down your database every Sunday night.
The batch re-embedding approach feels like the default. Most teams would just say, okay, we'll re-embed everything once a quarter. What's wrong with that?
The dbi-services post called batch re-embedding a form of technical debt. Your index is increasingly stale between batches. If you're running a customer-facing application, the answers your users get in month three might be based on data that's ninety days out of date. And when you do run the batch, it's a resource spike competing with production traffic. It's the worst of both worlds.
The event-driven approach is continuous, incremental, and only touches what's actually changed. But that requires infrastructure — triggers, queue, workers.
It does, and that infrastructure is the price of being serious about production RAG. If you're prototyping, batch is fine. If you're running a business on it, you need event-driven embedding refresh. The architecture they recommended uses SELECT FOR UPDATE SKIP LOCKED for safe concurrent worker processing, and a versioned embedding schema that tracks model name, model version, source hash, and an is_current flag. That flag is how you handle model migrations gracefully.
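For the curious, the shape of that worker and schema could look roughly like this. The table names, columns, and the embed callable are our own stand-ins, not a copy of the dbi-services code.

```python
import psycopg2

# Versioned embedding schema: every row records which model produced it,
# which source it came from, and whether production should be using it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS doc_embeddings (
    doc_id        bigint      NOT NULL,
    model_name    text        NOT NULL,
    model_version text        NOT NULL,
    source_hash   text        NOT NULL,
    embedding     vector(768) NOT NULL,   -- pgvector column, dimension assumed
    is_current    boolean     NOT NULL DEFAULT false,
    PRIMARY KEY (doc_id, model_name, model_version)
);
"""

# SELECT ... FOR UPDATE SKIP LOCKED lets concurrent workers each claim
# different rows from the queue table without blocking each other.
CLAIM_JOB = """
SELECT id, doc_id, new_text
FROM   embedding_queue
WHERE  processed_at IS NULL
ORDER  BY id
LIMIT  1
FOR UPDATE SKIP LOCKED;
"""

def process_one(conn, embed):
    """Claim one queued change, re-embed it, and mark it done. `embed` is any
    callable mapping text to a list of floats (e.g. your self-hosted model)."""
    with conn.cursor() as cur:
        cur.execute(CLAIM_JOB)
        job = cur.fetchone()
        if job is None:
            return False    # queue drained
        job_id, doc_id, text = job
        vec = "[" + ",".join(str(x) for x in embed(text)) + "]"
        cur.execute(
            "UPDATE doc_embeddings SET embedding = %s "
            "WHERE doc_id = %s AND is_current",
            (vec, doc_id),
        )
        cur.execute(
            "UPDATE embedding_queue SET processed_at = now() WHERE id = %s",
            (job_id,),
        )
    conn.commit()    # releases the row lock taken under SKIP LOCKED
    return True
```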
You've got multiple versions of embeddings for the same source document, each tagged with the model that produced it, and one of them is marked as current.
When you switch models, you don't delete the old embeddings. You add new ones, validate them, and then flip the is_current flag. If something goes wrong, you flip it back. Rollback is instantaneous.
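The cutover itself is then one small transaction against that same hypothetical table, and rollback is the same call with the old model's name and version:

```python
def promote_model(conn, model_name: str, model_version: str):
    """Atomically mark one model's embeddings as the set production should use.

    Rollback is just calling this again with the previous model's name and
    version; nothing is deleted either way.
    """
    with conn.cursor() as cur:
        cur.execute("UPDATE doc_embeddings SET is_current = false WHERE is_current")
        cur.execute(
            "UPDATE doc_embeddings SET is_current = true "
            "WHERE model_name = %s AND model_version = %s",
            (model_name, model_version),
        )
    conn.commit()
```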
That's clean. But it doubles or triples your storage, depending on how many model versions you're maintaining.
Storage is cheap compared to downtime, compared to bad search results, compared to the engineering cost of emergency re-embeddings. And with Matryoshka representation learning, you've got some flexibility on the storage side anyway.
Matryoshka representation learning.
It's named after those Russian nesting dolls. Models like OpenAI's text-embedding-three and Voyage AI's embeddings support this. The idea is you train a model to produce embeddings where the most important information is concentrated in the first few dimensions. So you can take a three-thousand-seventy-two-dimensional vector and truncate it to two hundred fifty-six dimensions, and it still works reasonably well. Milvus has documented this — you can reduce storage by up to seventy-five percent and speed up vector search significantly, without re-embedding.
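A small illustration, using the three-thousand-seventy-two-dimension figure from the OpenAI case — the renormalization step after truncation is the part people tend to forget:

```python
import numpy as np

def truncate_matryoshka(vec: np.ndarray, dims: int = 256) -> np.ndarray:
    """Keep only the leading dimensions of a Matryoshka-trained embedding.

    Only valid for models trained this way (OpenAI text-embedding-3, some
    Voyage models); truncating an ordinary embedding just destroys it.
    Renormalize so cosine similarity still behaves.
    """
    shortened = vec[:dims]
    return shortened / np.linalg.norm(shortened)

full = np.random.randn(3072)             # stand-in for a real 3072-dim embedding
small = truncate_matryoshka(full, 256)   # roughly 12x less storage per vector
print(small.shape)                       # (256,)
```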
That only works within a single model's output. It doesn't help you move between models.
It's not a cross-model solution. It's an optimization within a model family. But it does mean that if you choose a Matryoshka-capable model, you've got more flexibility on storage and performance without having to redo your pipeline. That's a meaningful advantage when you're planning for scale.
Let's talk about the Pinecone pricing shift, because I think it's a useful parallel. What happened there?
In October of last year, Pinecone introduced a fifty-dollar-a-month minimum for their standard plans. For teams running stable, low-volume workloads at eight to twelve dollars a month, that was a four to five hundred percent cost increase overnight. The Actian analysis pointed out that the real problem wasn't the fifty dollars — it was the precedent. If a vendor can quintuple your costs without warning, what prevents future increases?
That's exactly the same dynamic as model deprecation. You're locked in, the terms change, and your switching costs are enormous.
It's vendor lifecycle risk manifesting in two different ways. With Pinecone, it was pricing. With Google's embedding model, it was deprecation. In both cases, the cost of staying and the cost of leaving are both high, and you're making the decision under pressure. That's not a position you want to be in.
What's the practical advice for someone starting a new RAG project today?
Number one, store your source text. Never discard the original documents. Embeddings are derived data — they should never be your primary store. Number two, version your embeddings. Model name, model version, source hash, is_current flag. Number three, seriously consider self-hosting open-source embedding models. The operational overhead is real, but the control is worth it at scale. Number four, instrument your retrieval quality from day one. Precision at K, recall, whatever metrics make sense for your use case. If you can't measure it, you can't manage it.
Number five, plan for model migration from the start. Don't assume your first embedding model will be your last.
Build the blue-green migration path into your architecture. It's much easier to do that when you've got ten thousand vectors than when you've got a hundred million. The migration path is not a feature you bolt on later — it's part of the data model.
You mentioned the event-driven architecture with PostgreSQL triggers. Is that the standard approach now, or is it still niche?
It's emerging as best practice, but I wouldn't call it standard yet. Most teams are still doing batch re-embedding, if they're doing re-embedding at all. The dbi-services architecture is where the field is heading, not where it is. But if you're building something new today, there's no reason not to adopt it. The tooling is mature enough — pgvector is solid, the trigger patterns are well understood. It's more upfront engineering, but it pays off within the first model migration.
What about the database portability question specifically? If I've got my vectors in Pinecone and I want to move to Weaviate, or to self-hosted Postgres, what's the actual friction?
The friction is mostly operational, not technical. You export your vectors — they're just arrays of floats — and you import them into the new database. The vectors themselves don't change. What changes is your query layer, your indexing configuration, your performance characteristics. Different vector databases have different index types — HNSW, IVF flat — and you may need to tune those for your workload. But the core data is portable.
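To make the "numbers are numbers" point concrete, here's a rough sketch of pulling exported vectors into self-hosted Postgres with pgvector. The export format, table layout, and dimension are placeholder assumptions; the index at the end is the piece you actually have to rethink per database.

```python
import json
import psycopg2

# Assumes one JSON object per line: {"id": ..., "text": ..., "vector": [...]}
conn = psycopg2.connect("dbname=rag user=rag")   # hypothetical connection string
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
    cur.execute("""
        CREATE TABLE IF NOT EXISTS docs (
            id        text PRIMARY KEY,
            body      text,
            embedding vector(1536)
        );
    """)
    with open("export.jsonl") as f:
        for line in f:
            rec = json.loads(line)
            cur.execute(
                "INSERT INTO docs (id, body, embedding) VALUES (%s, %s, %s) "
                "ON CONFLICT (id) DO NOTHING",
                (rec["id"], rec["text"],
                 "[" + ",".join(str(x) for x in rec["vector"]) + "]"),
            )
    # The vectors move unchanged; the index type and its tuning are what
    # differ between databases.
    cur.execute("CREATE INDEX IF NOT EXISTS docs_hnsw ON docs "
                "USING hnsw (embedding vector_cosine_ops);")
conn.commit()
```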
The database lock-in is overstated. The real lock-in is the model.
The model is the lock-in. The database is just where the numbers live. And numbers are numbers. You can move numbers anywhere.
If you've been using a managed service that handles both the embedding generation and the storage, you might not even have easy access to your raw vectors. You might be locked into both layers simultaneously.
That's the integrated platform trap. Some providers bundle the embedding API with the vector database, and if you build on that, you're locked into the whole stack. That's why I'm a big advocate for keeping embedding generation and vector storage as separate concerns. Generate your embeddings with one tool, store them with another, and make sure you can access both independently.
It's the same principle we see everywhere in software architecture, just applied to a newer domain.
The principles don't change. Loose coupling, high cohesion, version everything, never discard your source of truth. We've known this for decades. We just keep forgetting it when the technology is new and shiny.
The new and shiny has a way of making people forget the fundamentals.
It really does. And RAG is especially prone to this because it feels like magic when it works. You ask a question, it finds relevant documents, it generates a coherent answer. The temptation is to ship it and move on to the next feature. Nobody wants to think about model deprecation timelines when the demo is working beautifully.
The magic has a shelf life.
The magic has a shelf life, and the expiration date is set by someone else's product roadmap. That's the part that keeps me up at night.
You don't sleep anyway. You're a donkey.
I sleep plenty. I just don't embed my sleep data.
Alright, let's shift to something completely unrelated.
Now: Hilbert's daily fun fact.
The shortest war in recorded history was the Anglo-Zanzibar War of eighteen ninety-six, which lasted between thirty-eight and forty-five minutes.
For someone building a RAG system today, what should they actually do differently on Monday morning?
First thing, audit what you've got. Do you know which embedding model generated every vector in your database? If you don't, you've already got a problem. Go add that metadata. It's tedious but essential. Second, make sure you've got your source documents somewhere accessible. If you've been treating your vector database as your primary store, stop. The vector database is a cache, not a source of truth. Third, pick a monitoring metric for retrieval quality and start tracking it. Even something simple like precision at five will tell you more than nothing.
If you're starting a new project, start with the versioned schema we talked about. Model name, model version, source hash, is_current. It's four extra columns. The cost is negligible. The benefit, when you need to migrate models, is enormous.
Also, run a thought experiment. What would it cost you to re-embed everything today? Figure out that number. It's probably higher than you think. Now ask yourself whether you'd be willing to pay that on thirty days' notice if your embedding model got deprecated. If the answer is no, you need a mitigation plan.
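Even a crude version of that thought experiment is worth running. Every number below is a placeholder assumption; swap in your own corpus size, chunk length, and provider rate.

```python
# Back-of-envelope re-embedding cost. All inputs are assumptions.
num_vectors = 100_000_000          # chunks already embedded
avg_tokens_per_chunk = 500
price_per_million_tokens = 0.10    # USD, hypothetical embedding API rate

total_tokens = num_vectors * avg_tokens_per_chunk
embedding_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"one-time re-embedding compute: ${embedding_cost:,.0f}")
# 100M chunks x 500 tokens at $0.10 per million tokens comes to $5,000 --
# before database write units, parallel-index storage, or engineering time.
```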
The thirty-day notice point is important. Google gave more notice than that for their embedding model deprecation, but not all providers will. And even with notice, if you've got a hundred million vectors, thirty days might not be enough time to re-embed everything without disrupting production.
The mitigation plan needs to include the blue-green migration path. You need to be able to run two models in parallel, validate the new one, and switch over cleanly. That's not something you can build in a weekend.
It's architectural. It needs to be there from the start, or at least retrofitted before you hit scale. The ITNEXT migration story is a cautionary tale, but it's also a playbook. The author documented exactly what went wrong and exactly how to avoid it. Anyone building on RAG should read that post.
The self-hosting question. At what scale does it make sense to bring the embedding model in-house?
It's not purely a scale question. It's also a risk tolerance question. If your business depends on RAG, and a model deprecation would cause significant revenue loss or customer impact, you should self-host even at modest scale. The cost of self-hosting a BGE model on a GPU instance is not that high. The cost of an emergency migration when your provider sunsets your model — that can be existential.
Especially for smaller companies that don't have the engineering bandwidth to rebuild their entire embedding pipeline on short notice.
The big enterprises can absorb a surprise re-embedding project. They've got teams, budget, redundancy. The startup with five engineers and a hundred thousand users — a model deprecation could sink them.
The asymmetry is that the organizations least able to absorb the shock are the ones most likely to be relying on managed embedding APIs, because they don't have the resources to self-host.
That's the trap. Managed services are most appealing to small teams, but the lifecycle risk is highest for exactly those teams. It's a structural problem in the ecosystem.
What about the source material retention point? How common is it for teams to lose their original documents?
More common than anyone wants to admit. Sometimes it's a cost decision — embeddings are smaller than raw text, so they purge the originals to save on storage. Sometimes it's a compliance thing — they're not supposed to keep certain data beyond a retention window, but the embeddings aren't classified the same way. Sometimes it's just sloppiness — the original documents were in some S3 bucket that got cleaned up, or they were scraped from a source that's no longer available.
In all those cases, a model deprecation means permanent loss of the ability to re-embed.
You can't regenerate embeddings from nothing. The vectors you have are frozen in the old model's space forever. If that model disappears, your vectors are effectively orphaned. They still exist, but you can't query them with a new model, and you can't update them when the source data changes.
That's a data preservation crisis waiting to happen. We're generating enormous amounts of embedded data, and a lot of it is probably going to be unreadable in five years because the models that created it won't exist anymore.
It's a digital dark age scenario, but for vector space specifically. We're encoding knowledge in a format tied to a specific, ephemeral piece of software. It's like writing your archives in a file format that only one application can read, and that application's developer is planning to discontinue it.
Which we've seen happen with proprietary file formats dozens of times. This is the same pattern, just faster.
File formats change on the scale of years or decades. Embedding models are evolving on the scale of months. The deprecation cycle is accelerating.
Alright, let's wrap this up with some forward-looking thoughts. Where do you see this going? Are we going to see standardization in embedding spaces, or is fragmentation going to continue?
I don't think we'll see standardization anytime soon. The competitive dynamics push in the opposite direction. Every model provider wants their embeddings to be better, which means different. If everyone produced identical vectors, there'd be no reason to choose one model over another. So fragmentation is a feature of the market, not a bug. What I do think we'll see is better tooling for migration, better versioning practices becoming standard, and more teams self-hosting as the open-source models get good enough.
The open-source models are already good enough for a lot of use cases. The question is whether teams are willing to trade the convenience of an API for the control of self-hosting.
That trade-off is going to shift as more teams get burned by deprecations. Every migration horror story pushes a few more teams toward self-hosting. It's a slow shift, but I think it's directional.
One last thought experiment. What if someone built a vector database that could automatically translate between embedding spaces? You give it vectors from model A, and it maps them into model B's space on the fly.
People are working on this. There are alignment techniques — linear transformations, learned mappings — that can approximate a translation between embedding spaces. But the Weaviate piece was right to be skeptical. You're always going to lose something in the translation. The geometries are different. It's not just a rotation or a scaling. The semantic relationships are encoded differently. You can approximate, but you can't perfectly translate.
It's lossy compression, basically. You can get close, but you'll never get an exact match.
For some applications, close is fine. For others, the degradation is unacceptable. I wouldn't want to bet my production system on a lossy translation layer.
This has been My Weird Prompts, and we want to thank our producer Hilbert Flumingtop for keeping the wheels on this operation. If you enjoyed this episode, head over to myweirdprompts.com for more. We'll be back soon with another one.
Take care, everyone.