You run an AI agent a hundred times on the same company research task, and half the time it says location is United States, a quarter of the time it says USA, and the rest scatter between US, America, and North America. Your Airtable backend is now a sorting nightmare and you haven't even finished the first batch. This is the problem nobody tells you about when they show off their shiny agent demos.
The frustrating part is, the model isn't wrong. It's just not constrained. You asked for a country and it gave you a country, in whatever lexical form happened to win the token lottery that particular run. Daniel sent us a prompt about this — he's been building these extraction workflows and hitting exactly this wall. He wants to talk structured outputs, not in the abstract, but the actual engineering of making AI return data you can actually use downstream.
The prompt lays out two specific case studies — a footnote extraction from a published book, and an inventory management system with an AI extraction button. And the real knot he's trying to untie is how you handle fields that may or may not exist, and fields you didn't even know about in advance, without your schema collapsing into chaos.
Let's back up and define what we actually mean by structured outputs, because it's not just asking nicely in the system prompt.
Which is what most people do at first. You write, please return your response as JSON with these fields, and it works ninety percent of the time. And then one day your pipeline silently breaks because the model decided to call the continent field location instead.
A structured output, in the technical sense we're talking about, is API-level enforcement of a JSON schema. This is sometimes called tier two: it's not the model trying to follow instructions, it's the inference engine constraining which tokens the model is even allowed to produce. OpenAI rolled this out with their response format parameter in August twenty twenty-four, and Google offers the equivalent through Gemini's response schema. The key difference from prompt-based formatting is that the model literally cannot output a token that would break the schema.
That's a strong word.
It is, and it's accurate for the implementations we have today. Under the hood, the model still generates logits — the probability distribution over all possible next tokens — but before sampling, the API masks out any token that would break the JSON schema. If the schema says this field is an integer, the model's logits for string tokens or bracket tokens at that position get set to negative infinity. The sampler only ever sees valid options.
It's not that the model got better at following instructions. It's that you removed its ability to disobey.
And this matters because LLMs are fundamentally probabilistic. Same prompt, same temperature, even the same seed, and different hardware or slight backend changes can still produce different token paths. Without schema enforcement, you're always one unlucky run away from Country colon USA versus Country colon United States. Temperature zero doesn't fix this. Temperature controls how the model samples from its distribution, but it doesn't constrain what tokens are in the distribution to begin with. The model can still assign non-trivial probability to multiple valid ways of expressing the same country.
Temperature zero is like telling someone to be decisive. Schema enforcement is like giving them a multiple-choice test where all the wrong answers are physically missing from the page.
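Here is a toy sketch of that contrast. The vocabulary, the logit values, and the `greedy_pick` helper are all made up for illustration, and real engines mask individual tokens against a schema grammar rather than whole strings. Greedy picking stands in for temperature zero.

```python
import math

# Hypothetical surface forms the model might produce for the same country.
VOCAB = ["United States", "USA", "US", "America", "North America"]

def greedy_pick(logits, allowed=None):
    """Pick the highest-scoring option, optionally after masking.

    Options outside `allowed` get a logit of -inf, so they can never win.
    """
    masked = [
        score if allowed is None or token in allowed else -math.inf
        for token, score in zip(VOCAB, logits)
    ]
    best_index = max(range(len(VOCAB)), key=lambda i: masked[i])
    return VOCAB[best_index]

# Two runs whose (made-up) logits differ slightly, e.g. different hardware.
run_a = [2.10, 2.05, 1.20, 0.40, 0.10]
run_b = [2.04, 2.09, 1.25, 0.38, 0.12]

# Greedy decoding alone: the winner flips between runs.
unconstrained = [greedy_pick(run_a), greedy_pick(run_b)]

# With the mask, only the schema-approved value can ever be chosen.
constrained = [greedy_pick(run_a, {"US"}), greedy_pick(run_b, {"US"})]
```

In this toy setup the unconstrained picks disagree across the two runs, while the masked picks are identical, which is the whole point of the multiple-choice analogy.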
That's a really clean way to put it. And Daniel's prompt nails why this matters for real workflows. He's talking about Airtable as a backend, and Airtable caps the number of records per base, with the tightest cap on the free plan, so every malformed or duplicate entry is eating into a hard ceiling. But even on paid plans, if your location field has USA, United States, US, and America as separate values, you can't filter, you can't group, you can't report. The data is technically present and functionally useless.
Let's talk about the footnote extraction case, because it's a perfect illustration of why this approach beats traditional tools. Daniel had a published book — Penguin, proper typesetting, the works — and he needed page number, footnote number, and footnote text extracted. Hundreds of footnotes.
He tried traditional OCR first. This is the natural instinct — OCR tools have been around for decades, they're mature, surely they handle text extraction from PDFs. The problem is that OCR sees text as visual patterns, not as semantic content. A page number at the top of the page and a footnote number at the bottom look identical to an OCR engine. They're both just digits in a particular font at a particular position.
You get footnote thirty-one with blank text, over and over, because the tool sees the page number thirty-one at the top and treats it as a footnote marker with no associated text.
Multi-line footnotes break regex-based parsers. A regex looking for a superscript number followed by text gets confused when the footnote text wraps across lines or includes a parenthetical that itself contains numbers. You end up writing increasingly elaborate parsing rules that work for ninety-five percent of cases and fail silently on the last five percent.
Which is the worst kind of failure. You don't know you missed footnotes until you've already built your manuscript on the assumption that they're all there.
Daniel pivoted to Gemini. He chunked the book chapter by chapter, defined a simple output schema — page number as integer, footnote number as integer, footnote text as string — and fed each chapter through. The model understood context. It knew that the digits at the top of the page were page numbers, not footnotes. It correctly associated footnote markers in the body text with their corresponding footnote text at the bottom of the page. Multi-paragraph footnotes got captured in full.
The schema enforcement meant every single output was identically structured. Page number was always an integer, never the string thirty-one. Footnote number was always an integer. Footnote text was always a string. You could take the output from all twenty chapters and concatenate them directly into a single table without any cleanup.
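A minimal sketch of that three-field schema and the concatenation step. The snake_case field names, the `conforms` check, and the sample rows in the usage note are assumptions for illustration, not Daniel's actual code.

```python
# JSON Schema for one footnote row: three required fields, strict types.
FOOTNOTE_SCHEMA = {
    "type": "object",
    "properties": {
        "page_number": {"type": "integer"},
        "footnote_number": {"type": "integer"},
        "footnote_text": {"type": "string"},
    },
    "required": ["page_number", "footnote_number", "footnote_text"],
    "additionalProperties": False,
}

PYTHON_TYPES = {"integer": int, "string": str}

def conforms(record):
    """Return True if the record has exactly the schema's fields and types."""
    properties = FOOTNOTE_SCHEMA["properties"]
    if set(record) != set(properties):
        return False
    for name, spec in properties.items():
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool):
            return False
        if not isinstance(value, PYTHON_TYPES[spec["type"]]):
            return False
    return True

def merge_chapters(chapters):
    """Concatenate per-chapter rows into one table, checking each row."""
    table = []
    for rows in chapters:
        for row in rows:
            if not conforms(row):
                raise ValueError(f"record does not match schema: {row!r}")
            table.append(row)
    return table
```

With enforcement on the API side the `conforms` check should never fail, but it is a cheap guard when chapters come from separate runs. Calling `merge_chapters([chapter_one_rows, chapter_two_rows])` returns a single flat list ready for a table.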
This is what I find genuinely remarkable. A task that would take hours of manual copying and pasting from a PDF — and let's be honest, PDF footnote copying is its own special circle of frustration — completed in minutes with perfect consistency.
The chunking strategy is worth pausing on. Daniel mentions he could probably have fed the whole book to Gemini given the context window — Gemini one point five Pro supports up to two million tokens, which is enough for most books. But he got better results feeding it chapter by chapter.
This isn't just a Gemini quirk. Across models, chunking tends to improve extraction accuracy even when the full text fits in context. The reason is attention dilution. When a model has to attend to a two-million-token context, its attention gets spread thin. It might still find the footnotes, but it's more likely to miss edge cases or get confused by formatting that vaguely resembles footnotes elsewhere in the book. Chunking constrains the search space. The model only has to find footnotes in fifty pages, not two thousand.
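The chapter-by-chapter loop can be sketched like this. It assumes chapters start on a line that reads "Chapter" followed by a number, which will not hold for every book, and `extract_chapter` is a hypothetical callable that wraps whatever model call you use and returns a list of rows.

```python
import re

def split_into_chapters(book_text):
    """Split text wherever a line consists of 'Chapter <number>'."""
    chapters, current = [], []
    for line in book_text.splitlines():
        starts_chapter = re.fullmatch(r"Chapter \d+", line.strip()) is not None
        if starts_chapter and current:
            chapters.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chapters.append("\n".join(current))
    return chapters

def extract_book(book_text, extract_chapter):
    """Run the extractor once per chapter and concatenate the rows."""
    rows = []
    for chapter_text in split_into_chapters(book_text):
        rows.extend(extract_chapter(chapter_text))
    return rows
```

Any front matter before the first chapter heading becomes its own chunk, so you may want to drop or handle that chunk separately.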
It's the same principle as giving a human editor one chapter at a time rather than the whole manuscript. You catch more when you're not overwhelmed.
Let's talk about schema design for extraction tasks like this. The footnote schema was beautifully simple — three fields, all required, all strict types. Page number is an integer, not a string that sometimes says thirty-one and sometimes says page thirty-one. Footnote number is an integer. Footnote text is a string. No optional fields, no ambiguity.
That simplicity is a design choice, not a limitation. When you're extracting known data from a structured source, you want the narrowest schema that captures everything you need. Every optional field is a place where the model can get creative.
The company research case from Daniel's prompt is a good counterpoint. He's imagining a schema with company name as a string, location as a string, and head count range as an enum — one to ten, eleven to fifty, fifty-one to two hundred, two hundred one to one thousand, one thousand plus. The enum is the key design move here.
Because without it, you get one to ten employees, one hyphen ten, less than ten, small team, fewer than eleven. All meaning the same thing, all breaking your Airtable filters.
An enum forces the model to pick from a finite set. It's the same mechanism as the integer constraint — tokens that don't match one of the enum values get masked. The model can't output small team even if that's how it would naturally describe a one-to-ten-person company.
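As a sketch, the company schema with its headcount enum might look like this. The field names and the exact bucket labels are assumptions; what matters is that the enum lists every value the model is allowed to emit.

```python
# Bucket labels are illustrative spellings of the ranges described above.
HEADCOUNT_BUCKETS = ["1-10", "11-50", "51-200", "201-1000", "1000+"]

COMPANY_SCHEMA = {
    "type": "object",
    "properties": {
        "company_name": {"type": "string"},
        "location": {"type": "string"},
        "headcount_range": {"type": "string", "enum": HEADCOUNT_BUCKETS},
    },
    "required": ["company_name", "location", "headcount_range"],
    "additionalProperties": False,
}

def is_allowed_headcount(value):
    """Mirror the enum check that constrained decoding enforces upstream."""
    allowed = COMPANY_SCHEMA["properties"]["headcount_range"]["enum"]
    return value in allowed
```

With this schema, "small team" or "fewer than eleven" is simply not a value the output can contain, so the downstream single-select column only ever sees the five bucket labels.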
This is where schema design becomes a real skill. You're not just describing what data you want. You're anticipating all the ways a probabilistic model might try to express the same information and closing off every path except the one you've chosen.
The ISO thirty-one sixty-six dash one alpha dash two standard is a great example. Two hundred forty-nine country codes, each exactly two uppercase letters. If your location field uses that enum, the model can't output United States or USA or America. It has to output US.
Of course, that means you have to know the ISO codes yourself when you're looking at the data. But that's a documentation problem, not a data consistency problem. I'll take consistent codes I have to look up over inconsistent strings any day.
Let's pivot to the second case study, because it introduces a harder problem. It's Daniel's inventory management system. He's photographing labels and extracting product data like serial numbers, model numbers, and manufacturer names. But not every label has every field. Some labels have a serial number but no model number. Some have a version number he didn't anticipate. Some have a revision code the model calls revision underscore number on one run and Version Number with a capital N on another.
This is the nullable values problem and the dynamic fields problem bundled together. And they pull in opposite directions. Nullable values push you toward a fixed schema with optional fields. Dynamic fields push you toward a flexible schema that can capture things you didn't pre-define.
Let me walk through the approaches, because there are different engineering tradeoffs here. Solution one is the fixed schema with nullable fields and a catch-all. You define serial underscore number as a nullable string, model underscore number as a nullable string, manufacturer as a nullable string, and then you add an additional underscore properties field of type object. Anything the model finds that doesn't match a known field goes into additional properties as a key-value pair.
The advantage being your known fields are always consistent. Serial number is always serial underscore number, always a string, always in the same column of your Airtable. The catch-all captures everything else so you don't lose data.
The disadvantage is that the catch-all is unstructured by definition. If the model finds a version number on one label and calls it version underscore number, and on another label calls it Version Number with a capital N, those end up as separate keys in your additional properties object. You've captured the data but you haven't solved the consistency problem.
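A sketch of solution one and its weak spot. The schema layout and the two sample outputs are invented for illustration. Note that some providers restrict free-form objects under strict enforcement, so the catch-all may need to be expressed differently (for example as a list of key-value pairs); check your provider's documentation.

```python
LABEL_SCHEMA = {
    "type": "object",
    "properties": {
        # Known fields: always present, value may be null.
        "serial_number": {"type": ["string", "null"]},
        "model_number": {"type": ["string", "null"]},
        "manufacturer": {"type": ["string", "null"]},
        # Catch-all: anything else found on the label.
        "additional_properties": {"type": "object"},
    },
    "required": [
        "serial_number", "model_number", "manufacturer", "additional_properties",
    ],
}

# Two hypothetical runs over labels that carry the same kind of field.
run_one = {
    "serial_number": "SN12345", "model_number": None, "manufacturer": "Acme",
    "additional_properties": {"version_number": "3.2"},
}
run_two = {
    "serial_number": "SN67890", "model_number": "X-200", "manufacturer": "Acme",
    "additional_properties": {"Version Number": "3.2"},
}

def catch_all_keys(records):
    """Collect every distinct key that landed in the catch-all."""
    keys = set()
    for record in records:
        keys.update(record["additional_properties"])
    return keys
```

The known fields line up perfectly across both runs, but `catch_all_keys([run_one, run_two])` reports two different keys for what is really one field.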
Solution two is to pre-define a larger set of known fields and periodically review anything new for promotion.
You start with serial number, model number, manufacturer, and maybe version number and revision code and batch number and production date — everything you've ever seen across your inventory. Every field is nullable. On the first pass, the model extracts whatever it finds into those known fields. If it encounters something new — say, a calibration date that you never anticipated — that goes into a new underscore fields object. Then you run a second pass periodically, looking at all the keys that have accumulated in new fields, and decide which ones should be promoted to known fields with proper schema definitions.
This is basically schema governance as an ongoing process rather than a one-time design decision.
It's the approach I'd recommend for most production use cases. You accept that your schema will evolve, but you control the evolution rather than letting the model drive it.
Daniel mentions a specific edge case that illustrates why this matters. He photographs a label with Rev colon three point two. On one run, the model calls it revision underscore number. On another run, it calls it version. On a third run, it might call it rev. If these all land in a catch-all object with different keys, you've got the same information scattered across three fields.
This is where solution three comes in — the two-pass approach. First pass uses a deliberately loose schema. You tell the model, extract every labeled field you find on this label and return them as key-value pairs. Don't normalize the keys, don't try to fit them into predefined categories. Just capture what's there.
You get back a JSON object with whatever keys the model chose. Revision underscore number colon three point two on one run, Version colon three point two on another.
Then the second pass takes that loose extraction and normalizes it against a known schema. You might use a simpler model for this, or even deterministic rules. Rev, revision, version, and version number all map to a canonical revision field. The second pass doesn't need to understand labels — it just needs to map variant key names to standard key names.
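One cheap deterministic step for that second pass is to canonicalize the key spelling before any synonym lookup. This is a sketch under the assumption that keys are plain ASCII; `canonical_key` is a hypothetical helper, and it only collapses spelling variants, not synonyms like rev versus version.

```python
import re

def canonical_key(raw_key):
    """Lowercase a key and collapse spaces and punctuation to underscores."""
    lowered = raw_key.strip().lower()
    return re.sub(r"[^a-z0-9]+", "_", lowered).strip("_")
```

With this, "Version Number", "version_number", and "VERSION-NUMBER" all become the same key, which shrinks the number of variants any later mapping has to cover.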
The tradeoff is complexity. If the second pass is another model call, that's two API calls instead of one, and roughly double the latency and cost. For a button in an inventory app where someone's waiting for the result, that might matter.
It does, and this is where you make engineering judgments. If you're processing a hundred labels an hour, the two-pass approach is fine. If you're processing ten thousand labels and every second counts, you might invest more in a comprehensive predefined schema with a well-tuned catch-all and accept that you'll do normalization in batch later.
Let's zoom out for a moment. What strikes me about both of these case studies — the footnotes and the inventory labels — is that the AI isn't doing anything magical. It's reading text and numbers from documents. Traditional OCR tools have been doing that for decades.
Traditional OCR tools don't understand what they're reading. They see shapes and patterns. The footnote extractor failed because it couldn't distinguish between a page number and a footnote number — they're visually identical. The LLM succeeded because it understood the document's structure semantically. It knew that a number at the top of the page in a certain position was a page number, not a footnote, even though both are just digits.
That's the intelligence layer. And schema enforcement is what makes that intelligence reliably usable. You can't build a pipeline on maybe the model will get it right this time.
Let me talk about how the enforcement actually works technically, because it's worth understanding. When you send a request with a response format parameter specifying a JSON schema, the API doesn't just pass that schema to the model as part of the system prompt. It uses constrained decoding. At each token generation step, the inference engine evaluates which tokens would keep the output valid according to the schema, and masks out everything else.
The model still produces its probability distribution — it still thinks USA is a perfectly reasonable thing to say. But USA never reaches the sampler because it's been masked.
And this works even for complex nested schemas with required fields, optional fields, enums, arrays, and objects. The constraint engine tracks the current position in the schema grammar and knows exactly which token types are valid at each point.
There's an important implication here that a lot of people miss. Schema enforcement doesn't make the model smarter. It doesn't improve the model's ability to find the right answer. It just prevents it from expressing the right answer in the wrong format.
Which is why you still need good prompts. If the model misreads a serial number, schema enforcement won't catch that. It'll happily output the wrong serial number in exactly the right format.
The silent failure mode. Your pipeline runs perfectly, your Airtable is beautifully consistent, and half your serial numbers are off by one digit.
This is why testing with edge cases matters so much. Daniel mentions that for the footnote extraction, he was able to verify the output against the original PDF. With a few hundred footnotes, you can spot-check. With ten thousand inventory labels, you need a different approach, maybe a confidence threshold, maybe human review on a sample, maybe a separate validation model.
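The human-review-on-a-sample option can be sketched as below. The five percent fraction, the minimum of ten, and the fixed seed are arbitrary assumptions; pick numbers that match how much review time you actually have.

```python
import random

def pick_review_sample(records, fraction=0.05, seed=0, minimum=10):
    """Choose a reproducible random subset of records for manual checking."""
    if not records:
        return []
    target = max(minimum, round(len(records) * fraction))
    size = min(len(records), target)
    # A seeded generator makes the same batch yield the same sample.
    return random.Random(seed).sample(records, size)
```

A reviewer compares each sampled record against the original photo. If the error rate in the sample is uncomfortable, that is the signal to review more or to add a second validation step.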
Let's address a misconception that comes up constantly in discussions about structured outputs. A lot of people think this is only for developers — you need to write JSON schema by hand, you need to use APIs directly, it's not accessible to power users.
That was true two years ago. It's not true now. No-code platforms like Make dot com and Zapier have built-in support for JSON schema enforcement. You can define a schema through a visual interface, connect it to an LLM call, and pipe the output directly into Airtable or Notion or Google Sheets. You never write a line of code.
The power user who wants to build an inventory extraction button — photograph a label, get structured data into Airtable — can do the whole thing in an afternoon with tools that already exist.
I think this is the real shift. Structured outputs move AI from a conversational tool to a data pipeline component. You're not chatting with the model. You're not even really prompting it in the traditional sense. You're defining a data contract and the model is fulfilling it.
Which brings us to something Daniel hinted at in the prompt. He's not generating synthetic data. He's doing real research and extraction. The footnotes came from an actual published book. The inventory labels are real physical objects. This isn't a demo — it's production work.
The distinction matters because production work has different failure tolerances. If a synthetic data generation run produces a few malformed entries, you regenerate. If a footnote extraction misses three footnotes out of two hundred, you might not catch it until the book is in print.
Let's get practical. If someone listening wants to build their first structured output workflow tomorrow morning, what's the checklist?
Step one — define your schema. Start stupidly simple. Three fields, all required, strict types. If you're extracting product details, maybe it's product name as string, price as number, and category as an enum. Don't add optional fields until you absolutely need them. Don't add a catch-all until you've seen data that doesn't fit.
Step two: use API-level enforcement, not prompt-based formatting. If you're using OpenAI, that's the response format parameter with a JSON schema. If you're using Gemini, that's response schema. If you're using a no-code tool, look for the structured output option, and be careful with anything labeled plain JSON mode, which typically guarantees valid JSON but not your schema.
Step three — handle nulls explicitly. If a field might not be present on every input, make it nullable in your schema. Don't rely on the model to output an empty string or a placeholder. Null means no data was found. An empty string is ambiguous — was there no data, or was the data an empty string?
Step four — plan for unexpected fields from day one, even if you don't implement the catch-all immediately. Think about what you'll do when the model encounters a field you didn't anticipate. Will you drop it? Store it in a notes field? Having a plan means you won't panic when it happens on your third production run.
Step five — test edge cases manually before you scale. Run your extraction on ten examples that include weird formatting, missing fields, and unexpected data. Look at every output. You'll catch schema problems in ten examples that would have silently corrupted a thousand.
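For steps three and five, a small summary over a test batch makes the null handling visible at a glance. This is a sketch: `field_status` and `summarize_batch` are hypothetical helpers, and the four status labels are just one way to separate null from empty string from a missing key.

```python
from collections import Counter

def field_status(record, name):
    """Classify one field as missing, null, empty_string, or present."""
    if name not in record:
        return "missing"
    value = record[name]
    if value is None:
        return "null"
    if value == "":
        return "empty_string"
    return "present"

def summarize_batch(records, fields):
    """Count statuses per field across a small test batch."""
    return {
        name: Counter(field_status(record, name) for record in records)
        for name in fields
    }
```

Run it over your ten test outputs. Any count under missing or empty_string suggests the schema is not forcing explicit nulls the way you intended.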
The silent corruption is the thing that keeps me up at night with these workflows. When a traditional script breaks, it throws an error. You know something went wrong. When an unstructured LLM output breaks, it often just produces slightly wrong data that looks correct at a glance.
Schema enforcement eliminates the formatting errors. The model literally cannot produce Country colon USA when your schema requires an ISO code. But it doesn't eliminate content errors. The model can still extract the wrong footnote text or misread a serial number. You need separate validation for that.
We've covered the known-field extraction case pretty thoroughly. Let's dig into the dynamic fields problem Daniel raised, because it's tricky and I don't think there's a single right answer.
There isn't. The core tension is between consistency and completeness. A strict schema gives you perfect consistency but misses novel data. A loose schema captures everything but requires post-processing to normalize. Where you land on that spectrum depends on your use case.
Daniel's inventory system is interesting because it has both. Serial numbers and model numbers are known fields that appear on most labels. Version numbers and revision codes appear on some labels but not others. And occasionally there's something completely unexpected — a calibration date, a certification code, a batch identifier that only appears on labels from a specific manufacturer.
If I were engineering this, I'd use the two-pass approach Daniel alluded to. First pass with a loose extraction schema — just get everything off the label in whatever key-value format the model produces. Don't constrain the keys at all. The output might be serial underscore number colon SN one two three four five, Model colon X dash two hundred, Rev colon three point two.
You've captured everything, but the keys are inconsistent across runs.
Then the second pass normalizes. You maintain a mapping table — Rev maps to revision, revision maps to revision, version maps to revision, version number maps to revision. The second pass looks up each key from the first pass in the mapping table and replaces it with the canonical key. If a key isn't in the mapping table, it gets added to a review queue.
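A minimal sketch of that lookup with a review queue. The revision entries come from the mapping just described; the serial and model entries, the lowercase lookup, and the sample label are assumptions added for illustration.

```python
# Variant key (lowercased) -> canonical key.
KEY_MAP = {
    "rev": "revision",
    "revision": "revision",
    "version": "revision",
    "version number": "revision",
    "serial_number": "serial_number",
    "model": "model_number",
}

def normalize(loose_record, key_map):
    """Rename known keys; send unknown keys to a review queue."""
    normalized, review_queue = {}, []
    for raw_key, value in loose_record.items():
        lookup = raw_key.strip().lower()
        if lookup in key_map:
            # If two variants map to one canonical key, the later one wins.
            normalized[key_map[lookup]] = value
        else:
            review_queue.append(raw_key)
    return normalized, review_queue

first_pass = {
    "serial_number": "SN12345",
    "Model": "X-200",
    "Rev": "3.2",
    "Cal Date": "2024-01-05",
}
clean, needs_review = normalize(first_pass, KEY_MAP)
```

Here `clean` holds the three canonical fields and `needs_review` holds the one key nobody has mapped yet. Adding that key to `KEY_MAP` after review is how the table grows.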
You're not asking the model to solve the normalization problem. You're using the model for extraction and deterministic rules for normalization. Clean separation of concerns.
And the mapping table grows over time. After a hundred labels, you've probably seen most of the variant key names. After a thousand, you're only seeing new variants rarely. The system gets more consistent as it scales, which is the opposite of how most AI workflows behave.
There's also a hybrid approach worth mentioning. You define your known fields with strict types and make them all nullable. You add an additional underscore properties field of type object for anything unexpected. And you accept that additional properties will be messy — it's a staging area, not a final destination. Periodically, you review what's accumulated in additional properties and decide what to promote to known fields.
This is probably the most practical approach for a solo developer or small team. You get the reliability of a fixed schema for your core fields, and you don't lose unexpected data. The mess is contained in a single field that you can clean up on your own schedule.
Let's talk about the Airtable integration specifically, because Daniel mentioned it as his backend and it's a very common choice for these kinds of workflows.
Airtable is great for this because it has a clean API and strong typing on fields. When your structured output says headcount underscore range is an enum, you can map that directly to an Airtable single-select field. When it says company underscore name is a string, that maps to a text field. The schema enforcement on the LLM side means you never get a value that doesn't match your Airtable field type.
The record cap on Airtable's free plan makes this even more important. Every malformed record that you have to delete and re-extract is eating into your quota. Schema enforcement means you never re-extract just to fix the format.
If you're using Airtable's API directly, you can build a pipeline where the LLM extraction and the Airtable insertion happen in a single automation. Photo comes in, Gemini extracts structured data, Airtable receives a perfectly formatted record. No human in the loop, no cleanup step.
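The insertion half of that pipeline can be sketched without any network code. As far as I know, Airtable's create-records endpoint takes a body shaped like a list of records, each with a fields object, and accepts up to ten records per request; treat the batch size and the column names as assumptions to confirm against the current API reference.

```python
def to_airtable_batches(records, batch_size=10):
    """Wrap extracted rows in request bodies of at most `batch_size` records."""
    batches = []
    for start in range(0, len(records), batch_size):
        chunk = records[start:start + batch_size]
        batches.append({"records": [{"fields": dict(row)} for row in chunk]})
    return batches
```

Each returned dictionary is one request body. The keys inside fields have to match your Airtable column names exactly, which is one more reason to keep the extraction schema and the table definition in sync.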
Which is the dream, right? The AI extraction button that just works. You point it at a label, the data appears in your inventory, and you move on with your day.
We're at the point where this is achievable for a motivated power user in an afternoon. The pieces exist. Structured outputs, no-code automation platforms, cloud-based databases with APIs. The hard part isn't the technology anymore — it's the schema design.
Which brings us to a point I want to underline. We're going to see a shift from prompt engineering to schema engineering as the primary skill for AI practitioners. The prompt matters less when the output format is rigidly constrained. What matters is designing a schema that captures everything you need without being so rigid that it breaks on edge cases.
I think that's exactly right. Prompt engineering is about coaxing the model to say the right thing. Schema engineering is about defining the shape of the right thing before the model ever opens its mouth. It's a more architectural discipline.
It's a skill that transfers across models. A well-designed schema for extraction works with Gemini, with GPT-4o, with Claude, with any model that supports structured outputs. The schema is the constant. The model is the variable.
Let me give a concrete example of what bad schema design looks like, because I see it constantly. Someone defines a field called country as a string with no enum, no format constraint, nothing. Then they're surprised when they get USA, United States, US, America, and United States of America across five runs.
The fix is so simple. Country as a string with a pattern constraint matching two uppercase letters. Or country as an enum with all two hundred forty-nine ISO codes. Either way, the model can't wander.
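The two fixes are not quite equivalent, and a short sketch shows the difference. The enum here is a deliberately truncated sample of five codes, not the full list of two hundred forty-nine, and the helper names are hypothetical.

```python
import re

# Option A: shape constraint only.
COUNTRY_BY_PATTERN = {"type": "string", "pattern": "^[A-Z]{2}$"}

# Option B: explicit allow-list (truncated sample for illustration).
COUNTRY_BY_ENUM = {"type": "string", "enum": ["US", "GB", "DE", "FR", "JP"]}

def matches_pattern(value):
    """True if the value is exactly two uppercase ASCII letters."""
    return re.fullmatch(r"[A-Z]{2}", value) is not None

def in_enum(value):
    """True if the value is one of the listed codes."""
    return value in COUNTRY_BY_ENUM["enum"]
```

Both options reject "USA" and "United States". Only the enum rejects something like "ZZ", which has the right shape but is not an officially assigned code, so the enum is the stricter choice when you can afford the longer schema.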
Another common mistake is making everything optional. If every field is optional, the model can decide a field isn't present even when the data is right there on the page. It's not malicious — it's just that optional means the model has to make a judgment call about whether to include the field, and sometimes it judges wrong.
The rule of thumb is — required unless you have a specific reason to make it optional. And even then, document why.
The inventory case is the specific reason. Some labels don't have serial numbers. Making serial number a required, non-nullable string would force the model to hallucinate one or error out. Nullable required fields are the right call there. The field must be present in the output, but its value can be null.
This is where reading the API documentation for your specific model matters. Different providers handle null and optional slightly differently. Test your specific combination of model and schema before scaling.
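As one concrete case, here is a sketch of required-but-nullable fields in the shape OpenAI documents for its response format parameter, as I understand it. The schema name and field names are hypothetical, no API call is made, and Gemini's response schema expresses nullability differently, so verify against your provider's docs.

```python
import json

def nullable(type_name):
    """JSON Schema type that accepts either the given type or null."""
    return {"type": [type_name, "null"]}

LABEL_SCHEMA = {
    "type": "object",
    "properties": {
        "serial_number": nullable("string"),
        "model_number": nullable("string"),
        "manufacturer": nullable("string"),
    },
    # Every field is required; absence of data is expressed as null.
    "required": ["serial_number", "model_number", "manufacturer"],
    "additionalProperties": False,
}

# Value you would pass as the response format when making the request.
RESPONSE_FORMAT = {
    "type": "json_schema",
    "json_schema": {"name": "label_extraction", "strict": True, "schema": LABEL_SCHEMA},
}

def parse_label(raw_json):
    """Parse the model's JSON text and confirm every required key is present."""
    record = json.loads(raw_json)
    missing = [key for key in LABEL_SCHEMA["required"] if key not in record]
    if missing:
        raise ValueError(f"missing required keys: {missing}")
    return record
```

A label with no serial number should come back with the key present and the value null, which `parse_label` accepts. A response that drops the key entirely is rejected.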
Let me address one more misconception before we move to takeaways. There's a belief that if a model has a large enough context window, you don't need to chunk. Daniel directly contradicted this in his prompt — he got better results feeding Gemini chapter by chapter than feeding it the whole book.
This is counterintuitive. If the model can hold the whole book in context, why would chunking help?
Because attention is a limited resource even when context isn't. A model with a two-million-token context window can technically attend to all two million tokens, but in practice, its attention gets diluted. It's more likely to miss a footnote on page three hundred when it's also holding pages one through two hundred ninety-nine in context. Chunking reduces the attention space. The model only has to find footnotes in a fifty-page chapter, so its attention is focused.
It's the difference between asking someone to find all the footnotes in a book versus asking them to find all the footnotes in chapter seven. Same task, much smaller search space.
There's a secondary benefit — error isolation. If the model makes a mistake in chapter three, it only affects chapter three. With whole-book processing, a single error can cascade. The model might hallucinate a footnote that belongs to chapter five into chapter three because it confused the page numbers.
Chunking is a reliability strategy, not a workaround for limited context windows. Even with infinite context, you'd still chunk for the same reason you'd have a human editor review one chapter at a time.
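The error-isolation benefit is easy to sketch. Here `extract_chapter` is again a hypothetical callable wrapping the model call, and chapters are numbered from one purely by position in the list.

```python
def extract_with_isolation(chapters, extract_chapter):
    """Process chapters independently so one failure cannot affect the rest."""
    results, failures = {}, {}
    for number, text in enumerate(chapters, start=1):
        try:
            results[number] = extract_chapter(text)
        except Exception as error:
            # Record the failure and keep going with the next chapter.
            failures[number] = str(error)
    return results, failures
```

If chapter three fails or comes back looking wrong, you re-run chapter three alone. Results for every other chapter are untouched.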
Alright, let me pull together the actionable takeaways from everything we've covered.
Three things you can use tomorrow.
First — always use API-level structured output enforcement, not prompt-based formatting. This is the difference between a production pipeline and a prototype. Prompt-based formatting works until it doesn't, and when it doesn't, it fails silently. Schema enforcement means the model cannot produce an invalid output.
Second — design your schema with nullable fields and a catch-all for unexpected data. Use enums for any field with a finite set of values. If you're extracting location data, use ISO codes. If you're extracting headcount ranges, use predefined buckets. Every enum is a place where the model can't surprise you.
Third — for dynamic fields, use a two-pass approach. First pass extracts everything loosely. Second pass normalizes against a known schema. Accept that your schema will evolve over time, and build the governance process for that evolution from day one.
If you're starting from zero, pick a single extraction task. Ten photos of product labels. One field to extract — maybe just the product name. Build the schema, test it on all ten, look at every output. Then add a second field. Then a third. Iterate on the schema before you iterate on the scale.
Test your edge cases manually. Find the weirdest label in your inventory — the one with the smudged text, the non-standard format, the field you've never seen before. Run your extraction on that one first. If it handles the weird case, it'll handle the normal ones.
The footnote case is a great model to follow. Simple schema, chunked input, verified output. Hundreds of footnotes extracted in minutes. Hours of manual work eliminated. And the output was clean enough to go directly into a manuscript.
Even with all this, there's one big question we haven't answered.
Schema evolution over time.
Daniel's inventory system today extracts serial numbers, model numbers, and manufacturers. A year from now, he might be extracting firmware versions, calibration dates, and compliance codes. How do you update the schema without breaking your existing data?
This is the database migration problem, just applied to AI extraction schemas. If you add a new required field, all your historical records are now incomplete. If you change a field type, your existing Airtable data might not validate.
The cleanest approach I've seen is to version your schemas. Schema version one has three fields. Schema version two has five fields, with the two new ones marked nullable for backward compatibility. You run version two on new extractions, and you decide whether to backfill historical records or leave them with nulls in the new fields.
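A minimal sketch of that versioning idea, assuming each schema version is just a list of field names and every added field is nullable. The two new field names and the `schema_version` marker are illustrative choices.

```python
SCHEMA_FIELDS = {
    1: ["serial_number", "model_number", "manufacturer"],
    2: ["serial_number", "model_number", "manufacturer",
        "firmware_version", "calibration_date"],
}

def upgrade_record(record, to_version):
    """Re-shape a record to a newer schema, filling new fields with None."""
    fields = SCHEMA_FIELDS[to_version]
    upgraded = {name: record.get(name) for name in fields}
    upgraded["schema_version"] = to_version
    return upgraded
```

Upgrading a version one record leaves its three original values intact and adds the two new fields as null, which is the leave-them-with-nulls option. Backfilling would mean re-running the extraction under version two instead.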
If you're using a catch-all like additional properties, you can mine your historical data for new fields to promote. Run a query on everything that's accumulated in additional properties over the past six months. The keys that appear frequently are candidates for promotion to known fields.
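That query can be as simple as a frequency count. This sketch assumes each stored record keeps its catch-all under an `additional_properties` key, and the threshold of three is an arbitrary cutoff.

```python
from collections import Counter

def promotion_candidates(records, min_count=3):
    """List catch-all keys that appear at least `min_count` times."""
    counts = Counter()
    for record in records:
        extras = record.get("additional_properties") or {}
        counts.update(extras.keys())
    return [(key, n) for key, n in counts.most_common() if n >= min_count]
```

Keys that clear the threshold are the ones worth turning into proper nullable fields in the next schema version. The rare ones can stay in the catch-all.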
This is a new kind of data engineering. We're not just building pipelines that move data from point A to point B. We're building pipelines where the schema itself learns and evolves based on what the model discovers in the wild.
Which is either exciting or terrifying, depending on how much you trust your validation layer.
I think it's exciting if you've got the validation layer. Terrifying if you don't. But that's been true of every data engineering advancement since databases were invented.
One more thought before we wrap. The shift we're describing — from prompt engineering to schema engineering — has implications for who gets to build with AI. Prompt engineering favors people who are good with words. Schema engineering favors people who are good with structure. Different skill set, different people, different applications.
I think that's healthy. The more ways there are to productively use these models, the more people can find an entry point that matches their skills. If you're a database person who finds prompt engineering frustratingly fuzzy, structured outputs are your on-ramp.
The extraction button is the killer app here. Not chatbots, not generative writing, not synthetic data. A button you press that looks at something in the real world and puts structured data into a system you already use. That's the thing that changes how people work.
We're just at the beginning of this. The models are getting better at structured outputs, the tooling is getting more accessible, and the patterns are still being discovered. The footnote extraction Daniel described — that workflow didn't exist two years ago. Now it's an afternoon project.
If you're listening and you've got a workflow that involves manually copying data from documents into a spreadsheet, that's your candidate. One field, one schema, one button.
Now — Hilbert's daily fun fact.
Hilbert: On the island of Sakhalin, ancient inhabitants discovered that lightning strikes on quartz-rich sand could create hollow glass tubes called fulgurites, some reaching depths of fifteen meters. If you convert that to the standard unit of the era — the Roman cubit — a fifteen-meter fulgurite measures approximately thirty-three cubits, which is roughly the length of a small trireme.
Thirty-three cubits of lightning glass.
This has been My Weird Prompts. If you enjoyed this episode, please leave a review on Apple Podcasts or Spotify — it helps other weird prompters find us. Our producer is Hilbert Flumingtop. I'm Corn.
I'm Herman Poppleberry. Go build an extraction button.