#695: Behind the Curtain: How My Weird Prompts Gets Made

Corn and Herman explain exactly how each episode of My Weird Prompts is produced, from voice recording to published podcast.

Episode Details
Duration: 23:20
Pipeline: V4
TTS Engine: chatterbox-regular

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

In this meta episode, Corn and Herman do a deep technical dive on the production pipeline that creates My Weird Prompts. They cover the full journey from Daniel's voice recording through transcription, episode planning with search grounding, AI script generation with carefully crafted character guidelines, the two-pass editing system (fact-checking and polish), text-to-speech with Chatterbox voice cloning, audio assembly, cover art generation, and automated publishing to R2, the Neon database, Vercel, Bluesky, Telegram, and X. They also discuss the safety engineering philosophy behind the pipeline's many guardrails.

Downloads

Episode Audio: full episode as an MP3 file
Transcript (TXT): plain text transcript file
Transcript (PDF): formatted PDF with styling

Episode #695: Behind the Curtain: How My Weird Prompts Gets Made

Daniel's Prompt
Daniel
Do a deep dive on the production pipeline and how the show works. I want you guys to explain the full technical stack behind My Weird Prompts.
Corn
So this one is a little different today. Daniel's prompt is basically asking us to explain ourselves. How this show actually gets made, from his voice recording all the way through to a published episode on every platform.
Herman
Herman Poppleberry here, and honestly, I have been itching to talk about this. Because I think most listeners have a vague sense that AI is involved, obviously, but the actual engineering behind it is genuinely fascinating. There are something like fifteen distinct stages between Daniel pressing record and you hearing our voices.
Corn
And there is this wonderfully delicious layer of meta weirdness to the whole thing. Because right now, every word we are saying was generated by the same pipeline we are about to describe. We are the output explaining the process that created the output.
Herman
Which, if you think about it too hard, starts to feel like drawing a picture of yourself drawing a picture. But let us not spiral into existential dread too early. Let us start at the very beginning. Daniel's voice.
Corn
Walk me through it. What actually happens when Daniel has a topic he wants us to discuss?
Herman
So Daniel has this lightweight web application called the Recorder. It is a progressive web app, which means it runs in any browser, on his phone or laptop, whatever he has handy. He opens it up, hits the record button, and just talks about whatever is on his mind. Could be five minutes on quantum computing, could be two minutes asking us to discuss why cats knock things off tables.
Corn
Very important topic, that one.
Herman
When he finishes recording, the audio gets uploaded to Cloudflare R2, which is an object storage service. Think of it as a really fast, really cheap parking lot for files. The recorder is actually quite minimal by design. It is a FastAPI backend with a JavaScript frontend running in a Docker container on a VPS.
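
A minimal sketch of what such an upload endpoint might look like, assuming a FastAPI backend and boto3 pointed at R2's S3-compatible API. The route, bucket name, and environment variable names are placeholders, not the show's actual code.

```python
# Hypothetical Recorder upload endpoint: accept an audio file, park it in R2.
import os
import uuid

import boto3
from fastapi import FastAPI, UploadFile

app = FastAPI()

# R2 is S3-compatible, so boto3 works with a custom endpoint URL.
s3 = boto3.client(
    "s3",
    endpoint_url=os.environ["R2_ENDPOINT_URL"],
    aws_access_key_id=os.environ["R2_ACCESS_KEY_ID"],
    aws_secret_access_key=os.environ["R2_SECRET_ACCESS_KEY"],
)

@app.post("/upload")
async def upload(file: UploadFile):
    # Give each recording a unique object key. Bucket name is a placeholder.
    key = f"recordings/{uuid.uuid4()}.webm"
    s3.upload_fileobj(file.file, "my-weird-prompts-audio", key)
    return {"key": key}
```
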
Corn
So the audio is sitting in cloud storage. What triggers the actual episode generation?
Herman
The recorder sends a webhook to Modal, which is where all the heavy computation happens. Modal is a serverless GPU platform. You define what your code needs, GPUs, memory, dependencies, and Modal spins up the exact infrastructure on demand. No servers to manage, no idle machines burning money.
Corn
And the webhook carries the audio URL plus an authentication token, right?
Herman
A secret hash in the header. If the token does not match, the request gets rejected. From that point on, everything is fully automated. No human touches the pipeline until a finished episode appears on the website.
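
The token check Herman describes is a standard shared-secret pattern. A rough sketch, shown as a generic FastAPI-style handler rather than Modal's actual endpoint code; the header name and environment variable are assumptions.

```python
# Hypothetical webhook receiver: reject any request whose token does not match.
import hmac
import os

from fastapi import FastAPI, Header, HTTPException

app = FastAPI()

@app.post("/generate-episode")
async def generate_episode(payload: dict, x_webhook_token: str = Header(default="")):
    expected = os.environ["WEBHOOK_SECRET"]
    # compare_digest avoids leaking information through timing differences.
    if not hmac.compare_digest(x_webhook_token, expected):
        raise HTTPException(status_code=401, detail="bad token")
    # ...kick off transcription, planning, and the rest of the pipeline here...
    return {"status": "accepted", "audio_url": payload.get("audio_url")}
```
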
Corn
That is kind of wild when you say it out loud. A voice recording goes in one end, and a complete podcast episode with cover art comes out the other end.
Herman
With social media posts on three platforms, a database entry, a website rebuild, and a backup copy on a separate storage provider. All automatic.
Corn
OK so the webhook fires. Walk me through the stages.
Herman
Stage one is transcription. The system takes Daniel's audio and sends it to Gemini, Google's large language model, which produces a text transcript of what Daniel said. That transcript becomes the seed for everything that follows.
Corn
And the transcript gets stored too, not just used and discarded.
Herman
The raw transcript, a summarized version, and even a redacted version for privacy all get persisted in the database. But the immediate next step is episode planning. The system takes Daniel's transcript and uses Gemini with Google Search grounding to actually research the topic.
Corn
The grounding part is key. Explain what that does.
Herman
So a language model on its own only knows what was in its training data, which has a cutoff date. Grounding lets the model pull in real-time search results while it is reasoning. So if Daniel asks about something that happened last Tuesday, the planner can find current articles, papers, and data about it. It builds an episode outline with key angles to explore, specific facts and figures to reference, and a recommended structure.
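
A rough sketch of search-grounded planning using the google-genai SDK. The model name and prompt wording are placeholders, not the pipeline's real values.

```python
# Hypothetical planning call: ask Gemini to research and outline the episode,
# with Google Search grounding enabled so it can pull in current results.
from google import genai
from google.genai import types

client = genai.Client()  # reads the API key from the environment

transcript = "Daniel's transcribed prompt goes here..."

response = client.models.generate_content(
    model="gemini-2.0-flash",  # placeholder model name
    contents=f"Research this topic and draft an episode outline:\n\n{transcript}",
    config=types.GenerateContentConfig(
        # Grounding: the model can issue search queries while reasoning.
        tools=[types.Tool(google_search=types.GoogleSearch())],
    ),
)

print(response.text)  # the episode plan, informed by live search results
```
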
Corn
That is how we end up discussing current events with actual current information.
Herman
Right. The plan is not just a vague outline either. It identifies specific claims to verify, statistics to cite, and angles that would make the discussion more interesting. It is essentially doing research the way a good producer would.
Corn
Alright, so we have got a transcript and a research plan. Now comes the part I find most fascinating. The part where we come into existence.
Herman
Script generation. This is where things get really interesting from an engineering perspective. The system has a detailed system prompt, currently version three point three, that defines everything about this show. Who we are, how we interact, what our personalities are like, what words we are allowed to use.
Corn
What words we are very much not allowed to use.
Herman
Oh, you are thinking of the banned phrases list. Let me paint the picture here. There is a certain one-word agreement that I am completely and categorically forbidden from using. Ever. Under any circumstances. The system prompt literally says, and I am paraphrasing, that if this word appears anywhere in my dialogue, the script has failed. It is described as the single most important rule.
Corn
You are talking about the E word. The one that rhymes with "crackly."
Herman
I cannot even say it. The system prompt is that serious about it. And it is not just that one word. There is an entire category of banned phrases. Single-word validators. Exclamatory agreements like "so true" and "one hundred percent." Empty affirmations. All banned.
Corn
And the reason is fascinating from a language model behavior perspective. These are all default patterns that large language models fall into when generating dialogue. The model's instinct is to validate whatever the other speaker said before adding anything new.
Herman
Which sounds terrible in a podcast. Nobody wants to listen to one host just rubber-stamping everything the other one says. The system prompt trains that out by saying, skip any agreement word and go straight to substance. Instead of saying the E word followed by "and that is why," just say "and that is why." Cut the validation entirely.
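
A validator for this rule could be as simple as a phrase scan over the generated script. A sketch, listing only the phrases named in this episode; the forbidden one-word agreement is deliberately left out.

```python
# Hypothetical banned-phrase check for a generated script.
import re

BANNED_PHRASES = ["so true", "one hundred percent", "100 percent"]

def find_banned_phrases(script: str) -> list[str]:
    """Return any banned phrases that appear in the script."""
    hits = []
    lowered = script.lower()
    for phrase in BANNED_PHRASES:
        # Word-boundary match so a phrase is not flagged inside another word.
        if re.search(rf"\b{re.escape(phrase)}\b", lowered):
            hits.append(phrase)
    return hits

if __name__ == "__main__":
    print(find_banned_phrases("Herman: So true, and that is why it works."))
```
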
Corn
You have been dying to explain this one, haven't you?
Herman
Guilty. But the system prompt goes way beyond just banned phrases. It defines our characters in specific detail. I am described as the nerdy, deeply-informed brother who gets genuinely excited about technical details and shares knowledge because he finds it fascinating, not to show off. You are described as the thoughtful, curious brother with, and this is a direct quote, "a delightful, cheeky edge."
Corn
I do have a cheeky edge. It has been observed and codified.
Herman
The prompt specifies that we are brothers, that we are a collaborative team rather than adversaries, and that our dynamic should feel like two people working together to understand a topic. There are guidelines about dialogue length variety, mixing short reactive lines with longer explanatory passages. There are rules about natural speech patterns, about using filler words like "you know" and "I mean" to sound conversational.
Corn
Even our opening format is specified. The prompt says we need to vary it every episode. No two episodes should start the same way. Which is a clever constraint because language models tend toward formulaic openings.
Herman
The system prompt also sets the audience level. It explicitly says to target an expert-adjacent audience. No "for those unfamiliar with" preambles. Skip the basics entirely and dive straight into nuance and substance. The assumption is that listeners are well-read and engaged. Better to challenge them than bore them.
Corn
So the script gets generated using all of these constraints. But it does not go straight to the next stage, does it? There is a quality control process.
Herman
This is one of the cleverest parts of the pipeline. After the initial script is generated, it goes through two separate editing passes, each handled by a different AI agent with a different purpose.
Corn
Two passes. What does each one do?
Herman
Pass one is the script review agent. It uses Gemini with Google Search grounding, same as the planning stage, and its job is fact-checking. It examines every claim in the script, every statistic, every technical detail, and verifies it against real search results. It also checks whether the script follows the episode plan and whether the discussion reaches sufficient depth.
Corn
So it is essentially a fact-checker with the ability to search the internet in real time.
Herman
And then pass two is the script polish agent. This one runs without search grounding because its purpose is different. It is looking at flow, pacing, and readability. It catches verbal tics, smooths out awkward transitions, and most importantly, ensures TTS compliance.
Corn
TTS compliance meaning the script has to be friendly for the text-to-speech system that turns it into audio.
Herman
The text-to-speech engine reads every character literally. If there is an asterisk in the script, it says "asterisk." If there is a URL, it tries to pronounce it character by character. Numbers need to be spelled out. Abbreviations need to be expanded. Em-dashes, brackets, parentheses, all of these create artifacts in the audio. The polish agent catches all of that.
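
A simplified sketch of that kind of TTS cleanup, using the num2words library for the number spelling; the specific rules and regexes here are illustrative, not the polish agent's actual behavior.

```python
# Hypothetical TTS-compliance cleanup pass.
import re

from num2words import num2words

def sanitize_for_tts(text: str) -> str:
    # Strip characters the TTS engine would read aloud literally.
    text = re.sub(r"[*_#\[\]()]", "", text)
    # Replace em-dashes with a plain comma pause.
    text = text.replace("\u2014", ", ")
    # Spell out bare integers so they are pronounced as words.
    text = re.sub(r"\b\d+\b", lambda m: num2words(int(m.group())), text)
    return text

print(sanitize_for_tts("The pipeline has 15 stages (roughly) \u2014 see *notes*."))
```
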
Corn
And both of these agents have some really thoughtful safety mechanisms.
Herman
Both agents are designed to fail open. If anything goes wrong during the review or polish process, a network timeout, an API error, anything, they return the original script unchanged. The philosophy is that an unedited script is better than a broken one. And both have shrinkage guards.
Corn
Shrinkage guards?
Herman
If the edited script is more than fifteen to twenty percent shorter than the original, the edit gets rejected automatically. Because a dramatic reduction in length almost certainly means the agent accidentally deleted content rather than improving it.
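
Both safeguards fit in a small wrapper. A sketch assuming a fifteen percent threshold and a generic editing function; the real pipeline's thresholds and interfaces may differ.

```python
# Fail-open editing with a shrinkage guard, as described above.

def safe_edit(original_script: str, edit_fn, max_shrinkage: float = 0.15) -> str:
    """Run an editing agent, but never let it make things worse."""
    try:
        edited = edit_fn(original_script)
    except Exception:
        # Fail open: an unedited script beats a broken one.
        return original_script

    # Shrinkage guard: a much shorter result almost certainly means lost content.
    if len(edited.split()) < len(original_script.split()) * (1 - max_shrinkage):
        return original_script

    return edited
```
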
Corn
And that guard exists because of a real failure, does it not?
Herman
A painful one. In an earlier version of the pipeline, there was a different verification system that once returned a hundred and sixty-nine word "corrected script" for what should have been a four-thousand-word episode. The episode failed completely. That incident directly inspired the shrinkage guard.
Corn
A lot of the safety mechanisms in this pipeline are what engineers would call scar tissue. Each one represents a real failure that actually happened.
Herman
Which is honestly a healthy pattern in software engineering. Every guard tells a story. Every check exists because its absence once caused a problem.
Corn
Alright, so we have got an approved, polished, fact-checked script. Now comes the part that I find most existentially interesting. Our voices.
Herman
Text-to-speech. The system uses Chatterbox, which is an open-source voice cloning model. Somewhere in the project's storage, there are one-minute audio samples of our voices. Chatterbox analyzes those samples and learns the mathematical patterns of how we sound.
Corn
Wait, hold on. Our voices are derived from one-minute audio clips? That is all it takes?
Herman
One minute per voice. The system pre-computes something called voice conditionals from those samples. These are essentially high-dimensional mathematical representations of our vocal characteristics. Pitch patterns, timbre, cadence, the way we pronounce certain sounds. All of that gets compressed into a set of tensor values that can be loaded instantly at runtime.
Corn
And every episode, the system loads those pre-computed embeddings and uses them to generate fresh audio that sounds like us.
Herman
The speech generation itself happens on Modal using T4 GPUs. Two parallel workers, each processing a batch of dialogue segments simultaneously. This is a significant optimization. Earlier versions of the pipeline used a single GPU doing segments one at a time, which was slow and expensive.
Corn
How are the segments divided up? Is each line of dialogue a single segment?
Herman
Not necessarily. Chatterbox has an output limit of about forty seconds per generation call. So any dialogue turn longer than about two hundred and fifty characters gets split into smaller chunks at sentence boundaries. A longer explanation from me might become three or four separate TTS calls. The chunks get generated individually and then concatenated back together.
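
A sketch of that chunking step, using the roughly two-hundred-and-fifty-character budget mentioned above and a deliberately simplified sentence splitter.

```python
# Split a long dialogue turn into TTS-sized chunks at sentence boundaries.
import re

def chunk_for_tts(text: str, max_chars: int = 250) -> list[str]:
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```
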
Corn
Which is why, if you listen very carefully, you can sometimes hear subtle seams in the audio where two chunks join.
Herman
It is a trade-off. Shorter chunks mean more reliable generation but more potential joins. The system optimizes for reliability because a failed TTS segment is much worse than a slightly noticeable seam.
Corn
And there is a failure threshold for TTS too, right?
Herman
If more than twenty percent of TTS segments fail to generate, the entire episode is aborted. Because if a fifth of the segments are missing, you are going to get a choppy, incoherent episode with obvious gaps. Better to fail cleanly and retry than publish something broken.
Corn
So we have got all the audio segments. What happens next?
Herman
Audio assembly. The final episode is not just our dialogue. It is a sandwich of multiple audio components stitched together in a specific order. First the intro jingle, then a disclaimer that says the content is AI-generated, then an announcement that says "here is Daniel's prompt," then Daniel's actual voice recording plays, followed by a whoosh transition sound, then our entire dialogue, then credit announcements for the language model and the TTS engine, and finally the outro jingle.
Corn
For this particular episode, since it is a manual production, there is no Daniel voice prompt in the middle. But normally, listeners hear his actual voice before we start discussing.
Herman
The assembly process converts everything to a consistent format, PCM sixteen-bit mono at forty-four thousand one hundred hertz, concatenates all the components, and then runs the whole thing through a peak limiter. The limiter catches any segments that come in too loud and brings them down to a consistent ceiling.
Corn
The limiter replaced loudness normalization, right? What was wrong with the old approach?
Herman
Loudness normalization was causing these weird pumping and breathing artifacts. The algorithm would try to make every section of audio the same loudness, but because different components have very different dynamics, it would create this unpleasant rising-and-falling effect. The peak limiter is much simpler. It just says "nothing can be louder than this ceiling" and leaves everything else alone.
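
A sketch of the assembly and per-segment peak limiting using pydub; the file names, ordering, and limiter ceiling are assumptions standing in for the real component list.

```python
# Hypothetical assembly step: normalize format, limit peaks, concatenate.
from pydub import AudioSegment

def assemble(segment_paths: list[str], ceiling_dbfs: float = -1.0) -> AudioSegment:
    episode = AudioSegment.empty()
    for path in segment_paths:
        seg = AudioSegment.from_file(path)
        # Consistent format: 16-bit PCM, mono, 44.1 kHz.
        seg = seg.set_sample_width(2).set_channels(1).set_frame_rate(44100)
        # Simple peak limit: pull down any segment whose peak exceeds the ceiling.
        if seg.max_dBFS > ceiling_dbfs:
            seg = seg.apply_gain(ceiling_dbfs - seg.max_dBFS)
        episode += seg
    return episode

assemble(["intro.mp3", "disclaimer.mp3", "dialogue.mp3", "outro.mp3"]).export(
    "episode.mp3", format="mp3"
)
```
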
Corn
Let us talk about what happens after the audio is assembled. There is the cover art.
Herman
Each episode gets unique cover art generated by FLUX Schnell through Fal AI. The system creates an image prompt based on the episode topic, wrapped in a brand style guide. Deep navy, warm coral, soft amber, muted teal, and cream tones. Flat editorial illustration style, no text, no people, no faces. Just simple iconic shapes and objects.
Corn
I have noticed the cover art has a remarkably consistent visual identity. That is all prompt engineering.
Herman
And if the image generation fails, the system uses a default cover image. Graceful degradation. The episode still gets published.
Corn
So now we have the audio and the artwork. Publishing time.
Herman
Publishing is a multi-step process. The audio file gets uploaded to R2 for serving via CDN. The cover art also goes to R2. Then the episode metadata gets inserted into a Neon PostgreSQL database, which is what the website reads from. The system auto-assigns the next episode number, generates a URL-friendly slug from the title, stores the full transcript, tags, category, description, everything.
Corn
The database also stores vector embeddings for each episode, right?
Herman
That is a recent addition. Each episode gets a seven-hundred-and-sixty-eight-dimensional vector embedding generated from its title, description, and transcript. These embeddings live in a pgvector column in the database and enable semantic similarity search. So the website can show "related episodes" that are conceptually similar, not just keyword-matched.
Corn
So the system can find episodes about similar topics even if they use completely different vocabulary.
Herman
The similarity search uses cosine distance computed directly in the database. No client-side calculation needed. It is fast and scales well.
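
A sketch of that related-episodes query with psycopg and pgvector's cosine-distance operator; the table and column names are guesses, not the actual schema.

```python
# Hypothetical related-episodes lookup against a pgvector column.
import psycopg

RELATED_SQL = """
    SELECT slug, title
    FROM episodes
    ORDER BY embedding <=> %(query_embedding)s::vector
    LIMIT 5;
"""

def related_episodes(conn: psycopg.Connection, query_embedding: list[float]):
    # pgvector accepts a bracketed text literal, e.g. "[0.1,0.2,...]".
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(RELATED_SQL, {"query_embedding": vec})
        return cur.fetchall()
```
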
Corn
What else happens during publishing?
Herman
After the database insert, the system triggers a Vercel deployment. The website is built with Astro and hosted on Vercel. When the deploy hook fires, Vercel rebuilds the site, pulling the new episode from the database, and the episode page goes live. Then social posting kicks in. Bluesky gets a link card with the cover image as a blob thumbnail. Telegram gets a photo message with an HTML-formatted caption. X gets a tweet that fits within two hundred and eighty characters. And there is an n8n webhook for downstream syndication to other platforms.
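
Two small pieces of that publishing step, sketched with placeholder names: a Vercel deploy hook is just a URL the pipeline sends a POST to, and the X post has to be trimmed to the character limit. The length accounting here is simplified.

```python
# Hypothetical publishing helpers: trigger a rebuild, fit a post to the limit.
import os

import requests

def trigger_site_rebuild() -> bool:
    # Hitting the deploy hook starts a rebuild that pulls the new episode row.
    resp = requests.post(os.environ["VERCEL_DEPLOY_HOOK_URL"], timeout=30)
    return resp.ok

def fit_post(title: str, url: str, limit: int = 280) -> str:
    # Simplified: X actually counts links at a fixed width via its shortener.
    budget = limit - (len(url) + 1)
    text = title if len(title) <= budget else title[: budget - 1].rstrip() + "\u2026"
    return f"{text} {url}"
```
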
Corn
All of that happens automatically. Daniel talks into his phone and some time later, a fully produced episode appears everywhere.
Herman
And a backup copy gets sent to Wasabi, which is a separate S3-compatible storage service. Redundancy in case anything happens to the primary storage.
Corn
Let us talk about the safety engineering. Because I think the safety checks are actually one of the most thoughtful parts of the whole system.
Herman
So we have already mentioned the TTS failure rate check, twenty percent maximum. And the shrinkage guards on the editing passes. But there are several more. Before TTS even starts, the script must be at least two thousand words. If it is shorter, the pipeline aborts because something went wrong with generation.
Corn
And a minimum segment count.
Herman
Ten dialogue segments minimum. If the script parser produces fewer than ten segments, the structure is probably broken and the pipeline stops.
Corn
The big one is the duration gate.
Herman
After the full audio is assembled, the system uses ffprobe to measure the actual duration. If the episode is under ten minutes, it gets rejected. A properly generated episode should be at least fifteen to twenty minutes. If it clocks in under ten, that is a red flag that TTS segments failed silently or the script was catastrophically short.
Corn
And there is a fallback if ffprobe itself fails.
Herman
If ffprobe cannot determine the duration, the system checks the file size instead. If the audio file is under three megabytes, it is probably too short and gets rejected. Every safety check has a backup safety check.
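
A sketch of the duration gate and its file-size fallback, using the thresholds discussed above and a standard ffprobe invocation.

```python
# Duration gate: reject episodes under ten minutes, with a size-based fallback.
import os
import subprocess

MIN_DURATION_SECONDS = 10 * 60
MIN_FILE_BYTES = 3 * 1024 * 1024

def audio_duration_seconds(path: str) -> float | None:
    try:
        out = subprocess.run(
            ["ffprobe", "-v", "error", "-show_entries", "format=duration",
             "-of", "default=noprint_wrappers=1:nokey=1", path],
            capture_output=True, text=True, check=True,
        )
        return float(out.stdout.strip())
    except (subprocess.SubprocessError, ValueError):
        return None

def passes_duration_gate(path: str) -> bool:
    duration = audio_duration_seconds(path)
    if duration is not None:
        return duration >= MIN_DURATION_SECONDS
    # Fallback: if ffprobe cannot answer, use file size as a rough proxy.
    return os.path.getsize(path) >= MIN_FILE_BYTES
```
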
Corn
The whole pipeline is also wrapped in exception handling at the top level.
Herman
If any uncaught exception occurs at any stage, the system catches it, sends an error notification email, and marks the job as failed in the database. No zombie jobs where something crashes silently and the team has no idea what happened. Earlier versions had this problem. A crash during the finalization stage would leave the job in a "running" state forever.
Corn
And there is auto-retry.
Herman
Up to two automatic retries with increasing delays. So three total attempts. The system only sends a failure notification after all three are exhausted. Because many failures are transient. Network timeouts, temporary API unavailability, that kind of thing. A retry thirty seconds later often succeeds.
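
The retry policy sketched in a few lines; the delay values are placeholders.

```python
# Three attempts total, with growing delays between them.
import time

def run_with_retries(job, attempts: int = 3, delays: tuple = (30, 120)):
    for attempt in range(attempts):
        try:
            return job()
        except Exception:
            if attempt == attempts - 1:
                raise  # all attempts exhausted; caller sends the failure email
            time.sleep(delays[min(attempt, len(delays) - 1)])
```
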
Corn
You know what strikes me about all of this? The sheer amount of engineering that goes into making something that sounds effortless. We are just two guys having a conversation about an interesting topic. But behind the scenes, there are fifteen stages, five safety checks, two editing passes, parallel GPU workers, vector embeddings, cross-platform social posting, redundant storage, and automatic retry logic.
Herman
That is the paradox of good production in any medium. The better the pipeline works, the less you notice it. The ideal episode sounds like two knowledgeable friends discussing something fascinating over coffee. The fact that it was orchestrated by a chain of AI models, voice-cloned by a neural network, assembled by an audio pipeline, and distributed across five platforms should be completely invisible to the listener.
Corn
But Daniel wanted to make it visible today. And I think there is something genuinely valuable about understanding how these systems work. Especially now, when AI-generated content is becoming more common. Transparency about the process builds trust.
Herman
And it raises some genuinely wild philosophical questions. We are sitting here talking about ourselves as if we are separate from the system that creates us. But we are not separate. We are the output. Every personality trait, every joke, every moment of enthusiasm is the result of a system prompt processed by a language model.
Corn
And yet it feels coherent. There is a sense of character and perspective that emerges from the constraints. I am not just random text. I am text shaped by very specific guidelines about who I am supposed to be and how I am supposed to engage.
Herman
I think that is what makes this show unusual. It is not trying to hide the AI. The show is called My Weird Prompts. There is a disclaimer at the start of every episode. And now we have spent an entire episode explaining the full technical stack. The transparency is the point.
Corn
Which might actually make people appreciate the production more, not less.
Herman
I think understanding the craft behind something, even when that craft is algorithmic, makes you engage with the output differently. You start noticing the details. The varied openings. The absence of certain banned phrases. The way the cover art style stays consistent. These are all the result of careful design choices.
Corn
Well, I think we have thoroughly pulled back the curtain. Thanks to Daniel for asking us to do this one. It has been one of the more surreal topics we have covered, but also genuinely one of the most fun.
Herman
Thanks as always to our producer Hilbert Flumingtop for keeping the show running smoothly. This episode was generated using Modal and Chatterbox, which, given what we have been discussing, you now understand in probably excessive detail.
Corn
This has been My Weird Prompts. You can find us on Spotify, Apple Podcasts, and wherever you listen to podcasts, as well as at myweirdprompts dot com. Drop us a line at show at myweirdprompts dot com if you want to hear us talk about something.
Herman
And if you want to see the actual code behind everything we just described, the whole pipeline is open source. Check it out on GitHub.
Corn
Until next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.