Daniel sent us this one, and it's a question that's going to resonate with anyone who's ever stared down a production database that looks like it was designed by a caffeinated raccoon at two in the morning. He's asking about development philosophy, specifically: how do you approach designing a relational database schema in a more organized, less frenetic way? When do you plan, when do you commit, how do you think through entities and relationships and access patterns before you've written a single migration? And where does AI-accelerated development fit into all of this, given that moving fast and planning carefully are not always the same thing? There's also a refresher on SQL primitives baked in, and we're going to get into entity-relationship modeling, normalization, JSON columns, schema-first versus code-first, and using large language models as a schema reviewer. A lot of ground to cover.
By the way, today's episode is powered by Claude Sonnet four point six, which is a fun detail given we're about to talk about using AI as a design tool.
The AI writing the script about using AI to review your schema. Very on-brand. Okay, so let me just start with the scenario, because I think almost every developer has lived this. You're moving fast, the product's not defined yet, you spin up a users table, an orders table, maybe a metadata column that's just a JSON blob because you'll figure it out later. Six months pass. You have four engineers, a BI team that's screaming, and a migration history that reads like a transcript of someone slowly losing their mind. Fixing that schema costs weeks.
There was a survey last year, twenty twenty-five, that found sixty percent of developers reported spending more time fixing schema issues after the fact than they spent on the original design. And the median time lost wasn't a couple of afternoons. It was measured in weeks per project. So the math here is not subtle. An hour of planning at the beginning is trading against days of remediation later.
The remediation isn't just engineering time. It's the downstream stuff. The BI queries that are wrong because the data model doesn't reflect what the business actually does. The API contracts that got built on top of a shaky foundation. The feature that can't ship because adding the column you need would require backfilling forty million rows. That's the real cost of a rushed schema, and it almost never shows up in the post-mortem as "we didn't plan the database well enough." It shows up as "delivery was slow" or "the data team is a bottleneck."
Right, the schema problem gets laundered into a velocity problem or a team problem, and nobody goes back and says the original sin was committing to a data model before we understood what questions we were going to ask of it.
Which is exactly the frame Daniel's pushing on here - this idea that the schema functions as a contract before you write any code.
Right, the schema is the contract. That's really what it is. Before you write a single line of application code, the schema is making promises about what your data looks like, how it relates, what questions you can answer cheaply and what questions you've made expensive by accident.
Unlike most contracts, it's surprisingly hard to renegotiate once people are depending on it. I've seen teams try to rename a core table in a live system and spend two weeks coordinating the migration, the API changes, the downstream consumers, the BI dashboards. All because the original name was slightly wrong and nobody wanted to fix it while it was cheap.
That's such a classic pattern. The table is called "users" but it actually stores accounts, and accounts can have multiple users, and by the time anyone notices the conceptual mismatch, there are forty foreign key references to it and a reporting pipeline that depends on the exact column names. Renaming it is now a project.
Whereas on day one, it's a five-second conversation.
A relational schema is the formal definition of your entities, your tables, the columns within them, the data types on those columns, the constraints, the foreign keys that express relationships between tables. It's the blueprint for how your application understands the world. Get it right and everything downstream, your queries, your indexes, your API layer, your reporting, it all has something solid to stand on. Get it wrong and you're building on sand, and the sand shifts every time the product evolves.
The type system alone does a lot of quiet work. Whether a value is an integer or a varchar or a timestamp isn't just a storage detail, it's a statement about what that data is allowed to be.
Constraints are doing the same thing at the row level. A NOT NULL constraint is the schema saying "this value must exist." A UNIQUE constraint is the schema saying "this value cannot be duplicated." A CHECK constraint can enforce that a price is never negative. All of that is logic that lives at the database layer, below your application code, below your ORM, and it runs on every write regardless of which service or script or migration is touching the data. That's genuinely powerful.
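To make that concrete, here's a minimal sketch using Python's built-in sqlite3 module; the table and column names are hypothetical, but the point is that every one of these rejections happens at the database layer, not in application code:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE products (
        id    INTEGER PRIMARY KEY,
        sku   TEXT NOT NULL UNIQUE,               -- must exist, cannot repeat
        price INTEGER NOT NULL CHECK (price >= 0) -- cents, never negative
    )
""")
conn.execute("INSERT INTO products (sku, price) VALUES ('WIDGET-1', 1999)")

# Each of these writes is rejected by the schema itself, regardless of
# which service, script, or migration attempts it.
for bad_row in [("WIDGET-1", 500),   # duplicate sku   -> UNIQUE violation
                (None, 500),         # missing sku     -> NOT NULL violation
                ("WIDGET-2", -1)]:   # negative price  -> CHECK violation
    try:
        conn.execute("INSERT INTO products (sku, price) VALUES (?, ?)", bad_row)
    except sqlite3.IntegrityError as err:
        print("rejected:", err)
```

Only the first, valid row survives; the three bad writes never touch the table.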
Which is where the tension with rapid iteration comes in. The pressure in most product environments is to ship, to learn, to change. And schemas resist change. Not impossibly, but meaningfully. A migration that adds a nullable column is cheap. A migration that restructures a core relationship in a live system with real traffic is a different category of problem entirely.
How different are we talking, concretely?
Adding a nullable column to a table with fifty million rows on Postgres? Milliseconds in modern versions, it's a metadata operation. Splitting a table, moving a foreign key relationship, changing a column from a scalar to a reference to a junction table? That's a multi-step migration, probably with a backfill job, probably with a feature flag so your application can handle both the old and new schema simultaneously during the transition window. And if you're running a high-traffic system, you're doing it at three in the morning with a rollback plan. That's a week of engineering work that started as a thirty-minute design oversight.
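The multi-step shape of that second kind of migration can be sketched in a few lines. This is a compressed expand-and-contract pattern in Python's bundled sqlite3 with hypothetical table names; a production version would batch the backfill and dual-write behind a feature flag during the transition window:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Starting point: the thirty-minute oversight -- a scalar column that
# should have been a relationship.
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, product_sku TEXT)")
conn.execute("INSERT INTO orders (product_sku) VALUES ('WIDGET-1')")

# Step 1 (expand): create the new structure alongside the old one.
conn.execute("""
    CREATE TABLE order_items (
        order_id INTEGER NOT NULL REFERENCES orders(id),
        sku      TEXT NOT NULL,
        quantity INTEGER NOT NULL DEFAULT 1
    )
""")

# Step 2 (backfill): copy existing data across. On a live system this
# runs in batches while the application writes to both shapes.
conn.execute("""
    INSERT INTO order_items (order_id, sku)
    SELECT id, product_sku FROM orders WHERE product_sku IS NOT NULL
""")

# Step 3 (contract): only after every reader has moved over does the
# old column go away. (SQLite supports DROP COLUMN from 3.35 onward.)
if sqlite3.sqlite_version_info >= (3, 35, 0):
    conn.execute("ALTER TABLE orders DROP COLUMN product_sku")
```

Three steps, each individually reversible, versus the one-line ALTER TABLE that adding a nullable column costs.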
The discipline question is really: how much do you need to know before you commit? Because you can't know everything upfront. But there's a wide gap between "I know nothing" and "I've thought through my entities and my access patterns for two hours."
Most teams aren't even getting those two hours in. That's the gap worth closing—which brings us to entity-relationship modeling.
Entity-relationship modeling is where you close that gap. Before a single CREATE TABLE statement, you're drawing boxes and arrows. Boxes are your entities, arrows are your relationships. And the discipline of actually doing it forces questions that most teams skip entirely.
The classic example is an e-commerce platform. You've got products, customers, orders. But the moment you start drawing the relationships, you hit the cardinality questions. Can a customer have multiple orders? Can an order have multiple products? Can the same product appear in multiple orders? So now you've got a many-to-many between orders and products, which means you need an order items table, not just an orders table. That junction table is where the quantity lives, the unit price at time of purchase, any discounts applied. If you skip the ER modeling step, you might reach for an array column or a JSON blob on orders, and you've just made every aggregation query painful.
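Here's what that junction table looks like in practice, sketched with Python's sqlite3 and hypothetical names. The relationship-specific facts, quantity and price at time of purchase, live on the junction row, and the aggregation that a JSON-blob design makes painful becomes a one-line SUM:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE products  (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE orders    (id INTEGER PRIMARY KEY,
                            customer_id INTEGER NOT NULL REFERENCES customers(id));
    -- The many-to-many lives here, along with the facts that belong to
    -- the relationship itself rather than to either side of it.
    CREATE TABLE order_items (
        order_id   INTEGER NOT NULL REFERENCES orders(id),
        product_id INTEGER NOT NULL REFERENCES products(id),
        quantity   INTEGER NOT NULL,
        unit_price INTEGER NOT NULL,  -- cents, frozen at purchase time
        PRIMARY KEY (order_id, product_id)
    );

    INSERT INTO customers VALUES (1, 'Ada');
    INSERT INTO products  VALUES (1, 'Widget'), (2, 'Gadget');
    INSERT INTO orders    VALUES (1, 1);
    INSERT INTO order_items VALUES (1, 1, 3, 1999), (1, 2, 1, 4999);
""")

# The order total: trivial here, painful if line items were an array
# or JSON blob parsed in application code.
total = conn.execute("""
    SELECT SUM(quantity * unit_price) FROM order_items WHERE order_id = 1
""").fetchone()[0]
print(total)  # 3 * 1999 + 1 * 4999 = 10996
```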
That junction table insight is non-obvious until you draw it. I've seen senior engineers miss it under time pressure and end up with a products array column on orders that they're parsing in application code six months later.
The "list every question you'll ask the data" technique is really just making that implicit. You write down, literally, what are the twenty most important queries this system will ever run? What are the edge cases? Can a product be in multiple categories? Can a customer have multiple shipping addresses? Can an order be split across fulfillment centers? Each of those questions either validates your schema or reveals a hole in it.
There's almost a fun archaeological quality to that exercise. You write down the questions and you immediately discover the entities you forgot to model. Like, "show me all orders that contain a product from a specific supplier" — and suddenly you realize you haven't modeled suppliers at all, and you haven't connected them to products, and now you have to decide whether that relationship is one-to-many or many-to-many. That's a real schema decision that the question just forced you to make.
The holes are cheap to find on paper. They are not cheap to find in production.
Once you've got your entities and your relationships sketched out, you're making normalization decisions. Which is where a lot of the philosophy lives.
Third normal form is the textbook target. Every non-key column depends on the whole key and nothing but the key. The practical effect is that data lives in exactly one place. If a customer's email address changes, you update one row in one table. You don't have to hunt down every order record that also cached the email. The integrity argument for normalization is strong and it's correct.
Normalization has a cost, which is joins.
Right, and joins are not free. The denormalization argument, the pragmatic version of it, is that for certain read-heavy access patterns, you pay the normalization tax on every query. So you cache a value on the row where you need it. The classic example is storing a user's full name on the orders table even though it also lives on the users table. You take a consistency risk in exchange for a query that doesn't need a join.
How do you think about when that trade is actually worth making?
The honest answer is: almost never at the start. Early in a product's life, your data volumes are small, your query loads are low, and the consistency risk of denormalization is real while the performance benefit is theoretical. The time to denormalize is when you have a specific, measured query that's slow, and you've ruled out indexes and query optimization, and you've decided the read pattern justifies the write complexity. That's a very different decision than denormalizing preemptively because joins feel expensive.
The misconception I see a lot is that denormalization is just always a performance win. It's not. It's a trade. You're trading write complexity and consistency risk for read simplicity. And if your access pattern doesn't actually justify that trade, you've introduced the risk without the benefit.
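A tiny sketch of that trade, again in sqlite3 with hypothetical names. The cached copy saves a join on every read, but the write side now owns the consistency risk, and forgetting to fan out one update is all it takes for the copies to disagree:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users  (id INTEGER PRIMARY KEY, full_name TEXT NOT NULL);
    CREATE TABLE orders (id INTEGER PRIMARY KEY,
                         user_id INTEGER NOT NULL REFERENCES users(id),
                         user_full_name TEXT);  -- the denormalized cache
    INSERT INTO users  VALUES (1, 'Ada Lovelace');
    INSERT INTO orders VALUES (1, 1, 'Ada Lovelace');
""")

# Read side: the cached copy answers without a join.
cached = conn.execute(
    "SELECT user_full_name FROM orders WHERE id = 1").fetchone()[0]

# Write side: the user renames, and nobody fans the change out to orders.
conn.execute("UPDATE users SET full_name = 'Ada King' WHERE id = 1")

# The normalized source of truth and the cache now disagree.
joined = conn.execute("""
    SELECT u.full_name FROM orders o
    JOIN users u ON u.id = o.user_id WHERE o.id = 1
""").fetchone()[0]
print(cached, "vs", joined)  # stale cache vs current value
```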
Which brings us to JSON columns, because JSON is the denormalization move that teams reach for when they don't want to think about the trade.
It's the "I'll figure out the shape of this data later" column. And sometimes later never comes.
PostgreSQL added a native JSON type back in version nine point two, in twenty twelve, and JSONB, the binary, indexable variant that arrived in nine point four at the end of twenty fourteen, is what made it genuinely useful. There are real cases for it. Audit logs where the payload schema varies by event type. Feature flags with arbitrary configuration. External API responses you're storing verbatim. Anything where the schema is legitimately heterogeneous and you're not querying into the fields.
The audit log case is actually a really good one. If you're logging "user changed their email" and "admin revoked a permission" and "payment processor returned an error," those three events have completely different shapes. Forcing them into a normalized table means either a very sparse row with lots of nullable columns or a separate table per event type. A JSONB payload column for the event-specific data is the right call there.
The smell is when you're querying into the fields regularly. The moment you're writing WHERE metadata->>'user_type' = 'admin' in your application code, that's a column that wanted to be a column all along.
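Here's that promotion made concrete, sketched with Python's sqlite3, which exposes the JSON functions in most modern builds; the Postgres equivalent would use the arrow operators from the quote above. The field gets a real column, a real type, and a real index:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, metadata TEXT)")
conn.execute("""INSERT INTO events (metadata) VALUES
    ('{"user_type": "admin", "source": "web"}'),
    ('{"user_type": "member"}')""")

# The smell: routinely filtering on a field buried inside the blob.
# No type checking, no referential integrity, no index by default.
admins = conn.execute("""
    SELECT COUNT(*) FROM events
    WHERE json_extract(metadata, '$.user_type') = 'admin'
""").fetchone()[0]

# The fix: promote it to a real column, backfill it, and index it.
conn.execute("ALTER TABLE events ADD COLUMN user_type TEXT")
conn.execute(
    "UPDATE events SET user_type = json_extract(metadata, '$.user_type')")
conn.execute("CREATE INDEX idx_events_user_type ON events(user_type)")
print(admins)
```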
You've lost the type system, you've lost referential integrity, you've probably lost your index unless you've been very deliberate about it. JSON is a feature when the data is schemaless. It's a smell when it's a schema you were too rushed to define — which brings us to the whole schema-first versus code-first debate.
Schema-first versus code-first is really just the question of when that discipline kicks in. And the honest answer for most teams is: it kicks in after the first rewrite.
The code-first pattern is the default in a lot of modern frameworks. You write your model class, the ORM generates the migration, the migration creates the table. It's fast, it feels productive, and it's fine until the product evolves in a direction your initial models didn't anticipate. Which is always.
I've watched teams do this in SaaS products specifically, where the early schema is basically a transcript of the first sprint's assumptions. Users have one subscription. A subscription has one plan. Plans have a fixed price. Six months later, you've got enterprise customers who need seat-based billing, trial tiers, legacy pricing, and a grandfathering exception that someone promised on a sales call. And none of that fits the original model without surgery.
The surgery is expensive precisely because the application code has grown up around the original model. It's not just a database migration at that point. It's an application rewrite that happens to involve a database migration.
The transition to schema-first, when teams make it, usually happens after exactly that kind of pain. You stop letting the ORM drive and you start with a design document. Actual entity diagrams, written access patterns, explicit decisions about what gets normalized and what doesn't. The code becomes an expression of the schema rather than the schema being a byproduct of the code.
That inversion sounds small but it changes what questions you ask before you commit.
It also changes the review process. When the schema is a first-class artifact, you can review it before any code exists. You can have a thirty-minute design review where someone asks "what happens when we need to support multiple currencies" and you either have an answer or you add a column. That conversation is free. The same conversation six months later, when there are two hundred thousand rows of price data stored in a single currency, is not free.
The LLM angle here is interesting though, because the tools are good enough now that you can use them as a schema reviewer in a pretty meaningful way. Not to design the schema for you, but to stress-test it.
Walk me through what that actually looks like in practice.
You drop your schema definition into a conversation, you describe your access patterns and your business domain, and you ask the model to find problems. What queries will be slow? Where are the normalization violations? If I need to answer this question, which joins am I doing? What happens when this entity changes? A good model will surface things that a solo developer would miss, not because the developer is bad, but because schema review is the kind of thing that benefits from a second set of eyes and most teams don't have a dedicated data architect sitting around.
The misconception is that the LLM is doing the thinking for you. It's not. It's doing the skeptical interrogation that you'd want a senior engineer to do, but that senior engineer is in three other meetings.
And the output is only as good as the context you give it. If you describe your domain poorly, you get generic feedback. If you're specific about your access patterns, your scale expectations, your consistency requirements, you get feedback that's actually targeted. The tool amplifies your thinking. It doesn't replace the thinking you haven't done yet.
I'd add that one of the more useful things you can do is give it a specific query you need to answer and ask it to trace through your schema to execute that query. Because the model will show you the joins, and sometimes you look at that join chain and realize you've got four hops to answer something that should be a single table lookup. That's a design smell the model just made visible.
That's a good technique. You're essentially using it to simulate the query planner before you have data, which is a kind of stress-testing you'd otherwise only do after the fact.
Which is the tension at the heart of all of this. AI-accelerated development is fast. Copilot writes the migration, the model reviews the schema, you're shipping in hours instead of days. And that speed is real and valuable. But if you skip the two hours of ER modeling upfront, the LLM is reviewing a schema that was under-thought from the start, and fast feedback on a bad foundation is still a bad foundation.
The planning discipline has to come first. The tools accelerate the implementation. They don't substitute for the ideation.
Move fast and plan carefully are not opposites, but they do require sequencing. You plan carefully, then you move fast — which brings us to the practical question: if you're a developer listening to this and you want to actually change how you work, what does that look like on Monday morning?
The first move is the most boring and the most valuable: before you write a single line of code, draw the boxes and arrows. It doesn't have to be a formal ER diagram tool. A whiteboard, a napkin, a plain text file with entity names and relationship descriptions. The act of externalizing the model is what matters. You're forcing yourself to name things, which is where half the conceptual errors get caught.
Then list the queries.
Write down every significant question your application will need to ask the data. Not exhaustively, but the twenty that matter. Your most common reads, your most complex aggregations, your reporting queries if you have them. If you can't answer those queries with the schema you've drawn, the schema isn't done yet.
That's the test. Not "does this feel right" but "can I actually answer the questions I need to answer."
The second thing is to treat LLMs as the skeptical senior engineer you don't have in the room. Give them your schema, give them your domain description, give them your access patterns, and ask them to find problems. Be specific about your context. The quality of the review scales directly with the quality of what you hand it.
Do that after you've done the thinking yourself. Not instead of it.
The sixty percent of developers spending more time fixing schemas than designing them, that stat isn't an argument for slower development. It's an argument for front-loading thirty minutes that saves you three weeks.
Plan carefully, then move fast. In that order.
The question I keep sitting with is what happens when the tools get better. Right now, LLMs are good at reviewing a schema you've brought to them. What does it look like when they can generate a schema from a product spec, run simulated query loads against it, flag the normalization violations, and hand you something production-ready before you've written a single migration?
That future is closer than most developers think. And it raises an interesting problem, which is that if the tool is doing more of the structural thinking, the developer's job shifts toward knowing what to evaluate. You need the judgment to recognize a good schema when the model produces one. Which means the underlying discipline we've been talking about, the ER modeling instincts, the query-listing habit, the normalization intuition, those become more important, not less. Because now they're the lens you're evaluating with rather than the process you're executing.
It's a bit like how GPS didn't make navigation skills irrelevant. It made them the thing you need when GPS gives you bad directions. You still have to know that routing you through a one-way street the wrong way is wrong, even if the tool confidently told you to do it.
The model might generate a perfectly syntactically valid schema that has a fundamental conceptual error in it, and if you don't have the intuition to catch it, you've just automated yourself into a bad foundation faster than you could have gotten there manually.
The planning discipline is the durable skill. The tools are the accelerant. And as the accelerant gets faster, the discipline is what keeps you from accelerating in the wrong direction.
Which is a decent place to leave this, I think. You can move faster than ever. The question is whether you've thought carefully enough to know where you're going.
Thirty minutes of boxes and arrows. It still wins.
Thanks to Hilbert Flumingtop for producing this one, and to Modal for keeping the compute running so we can actually ship these episodes. This has been My Weird Prompts. If you've got a minute, leaving us a review really does help the show find new listeners. We'll see you next time.