#755: Inside the Engine: Scaling an Automated AI Podcast

Peek under the hood of My Weird Prompts to see how Gemini, Modal, and multi-agent systems are scaling this automated show to the next level.

Episode Details
Duration: 36:24
Pipeline: V4

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Scaling Beyond the Milestone

Reaching over 700 episodes is a significant milestone for any podcast, but for an automated show, it serves as a critical stress test for the underlying infrastructure. As production volume increases, the "technical debt" that was manageable at episode fifty becomes a series of gaping holes at episode seven hundred. The transition from a hobbyist project to a professional-grade media house requires moving away from manual fixes and toward a self-healing, autonomous system.

The Current Tech Stack

The current production pipeline relies on a sophisticated mix of modern AI tools. Gemini 1.5 Flash sits at the center, chosen for its multimodal capabilities and integrated search grounding. This allows the system to act as a research assistant that verifies facts against the live web, preventing the hallucinations common in static models.

On the infrastructure side, the system utilizes Modal, a serverless GPU platform. This architecture is ideal for periodic tasks like podcasting because it allows for high-powered processing (using A100 or H100 GPUs) while only charging for the seconds the system is actually active. This keeps production costs low while maintaining high-speed output.
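To make the pay-per-second economics concrete, here is a small illustrative cost model comparing serverless GPU billing against a dedicated server. The per-second rate and usage figures are assumptions for the sake of the comparison, not Modal's actual pricing; the five-thousand-dollar dedicated-server figure is the one quoted in the episode.

```python
# Illustrative cost model: serverless per-second GPU billing vs. a
# dedicated server. All rates below are assumptions, not real pricing.

def monthly_serverless_cost(episodes_per_month: int,
                            gpu_seconds_per_episode: int,
                            rate_per_gpu_second: float) -> float:
    """Pay only for the seconds a GPU container is actually running."""
    return episodes_per_month * gpu_seconds_per_episode * rate_per_gpu_second

# Assumed workload: 30 episodes/month, 10 minutes of GPU time each,
# at roughly $0.001 per GPU-second.
serverless = monthly_serverless_cost(30, 10 * 60, 0.001)
dedicated = 5000.0  # dedicated GPU server, per the figure quoted in the episode

print(f"serverless: ${serverless:.2f}/month vs dedicated: ${dedicated:.2f}/month")
```

Even with generous usage assumptions, the idle-time savings dominate: the system costs nothing between uploads.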

From Web Forms to Telegram Bots

One of the primary friction points in the current setup is the input method. The custom web form lacks upload confirmation and sometimes produces duplicate uploads, making it unreliable for mobile production. The move toward a Telegram bot for episode ingestion offers several advantages: built-in file handling, persistent chat history, and background upload retries that keep working in low-reception areas.

By using a bot as a command center, a producer can send voice notes or prompts on the go. The bot provides a two-way communication channel, offering real-time status updates—such as "research phase started" or "scriptwriting 40% complete"—which turns a "black box" process into a transparent workflow.
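The status-update idea can be sketched as a pipeline runner that reports progress through a callback, the way a bot could push "research phase started" messages. The phase names and the `notify` hook are illustrative stand-ins, not the show's actual implementation; in a real bot, `notify` would call something like `bot.send_message`.

```python
# Minimal sketch of a pipeline that announces its own progress through
# a callback, turning a "black box" run into a stream of status updates.
# Phase names and the notify hook are illustrative assumptions.
from typing import Callable, List

PHASES = ["research", "scriptwriting", "text-to-speech", "mixing"]

def run_pipeline(notify: Callable[[str], None]) -> None:
    """Run each phase in order, announcing start and completion."""
    for phase in PHASES:
        notify(f"{phase} phase started")
        # ... the actual work for the phase would happen here ...
        notify(f"{phase} phase complete")

# In a real Telegram bot, notify would send a chat message;
# here we just collect the updates in a list.
updates: List[str] = []
run_pipeline(updates.append)
print(updates[0])  # "research phase started"
```

The same callback could just as easily write rows to a database, so the bot can answer "where is my episode?" at any time.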

Multi-Agent Orchestration and Quality Control

The next evolution of AI media involves moving from sequential scripts to formal orchestration frameworks like LangGraph or CrewAI. In a standard sequential pipeline, if a reviewer finds a script repetitive, there is no way to send it back for revision without restarting the entire run.

A multi-agent state machine allows for an iterative creative process. A "review agent" can send a script back to the "writer agent" with specific notes for improvement. This creates a loop that continues until a quality threshold is met, effectively mimicking a real creative studio rather than a simple assembly line.
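The writer-reviewer loop can be reduced to a toy state machine: the reviewer scores each draft, and the script goes back to the writer with notes until a quality threshold is met or a retry cap is hit. The scoring and revision functions below are placeholders standing in for real LLM agent calls, not the show's actual agents.

```python
# Toy version of the writer/reviewer loop described above. The writer
# and reviewer are stubs standing in for LLM agent calls; the loop
# structure is the point.

def write_draft(notes: str) -> str:
    # Placeholder writer agent; a revision pass is marked with "+".
    return "draft" + ("+" if notes else "")

def review(draft: str) -> float:
    # Placeholder review agent; revised drafts score higher.
    return 0.9 if "+" in draft else 0.5

def produce_script(threshold: float = 0.8, max_rounds: int = 3) -> str:
    """Loop writer -> reviewer until the quality threshold is met."""
    notes = ""
    for _ in range(max_rounds):
        draft = write_draft(notes)
        if review(draft) >= threshold:
            return draft
        notes = "tighten repetitive sections"  # reviewer's feedback
    return draft  # retry cap hit: ship the best effort

print(produce_script())  # the revised second draft clears the threshold
```

In LangGraph this loop would be a conditional edge from the review node back to the writer node; the retry cap prevents an unbounded (and unboundedly expensive) cycle.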

Optimizing Audio Production

Text-to-speech (TTS) remains the most compute-intensive part of the pipeline. To scale efficiently, the system can scale horizontally by breaking scripts into smaller chunks and processing them in parallel across multiple GPU containers. This can cut the roughly twenty minutes of GPU time needed to render a thirty-minute episode down to just two or three.
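The chunk-and-stitch strategy can be sketched as follows. The `synthesize` function is a stub standing in for a per-chunk GPU call; on Modal the parallel fan-out would be a `.map()` over remote containers rather than local threads, but the ordering guarantee is the same.

```python
# Sketch of horizontal scaling for TTS: split the script into chunks,
# synthesize each chunk in parallel, then stitch results back in order.
# synthesize() is a stand-in for a real per-chunk GPU synthesis call.
from concurrent.futures import ThreadPoolExecutor
from typing import List

def split_script(script: str, n_chunks: int) -> List[str]:
    """Split on paragraph boundaries into roughly n_chunks pieces."""
    paragraphs = script.split("\n\n")
    size = max(1, len(paragraphs) // n_chunks)
    return ["\n\n".join(paragraphs[i:i + size])
            for i in range(0, len(paragraphs), size)]

def synthesize(chunk: str) -> bytes:
    return f"<audio:{len(chunk)} chars>".encode()  # stub for real TTS

def render_episode(script: str, n_workers: int = 10) -> List[bytes]:
    chunks = split_script(script, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves input order, so the final stitch is correct.
        return list(pool.map(synthesize, chunks))

segments = render_episode("para one\n\npara two\n\npara three", n_workers=3)
print(len(segments))  # one audio segment per chunk
```

Splitting on paragraph boundaries matters: cutting mid-sentence produces audible seams when the segments are concatenated.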

Furthermore, automated audio engineering via FFmpeg allows for loudness leveling to a consistent target and the addition of dynamic background ambience. By matching the soundscape to the topic of the script, the final product moves away from a synthetic vacuum and toward a professional, immersive listening experience.
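As a concrete example of the leveling step, the snippet below builds an FFmpeg command using the real `loudnorm` filter, with -16 LUFS as a common podcast loudness target. The file names are illustrative, and this only constructs the command; a production pipeline would run it via `subprocess.run`.

```python
# Building an FFmpeg command for automated loudness normalization.
# loudnorm is FFmpeg's EBU R128 loudness filter; -16 LUFS integrated
# loudness is a common podcast target. File names are illustrative.
from typing import List

def loudnorm_cmd(src: str, dst: str, target_lufs: float = -16.0) -> List[str]:
    """Normalize to the target integrated loudness, -1.5 dB true peak."""
    return [
        "ffmpeg", "-i", src,
        "-af", f"loudnorm=I={target_lufs}:TP=-1.5:LRA=11",
        "-ar", "44100", dst,
    ]

cmd = loudnorm_cmd("episode_raw.wav", "episode_master.mp3")
print(" ".join(cmd))
```

An agent deciding on a topic-matched ambience bed would extend this with an extra `-i` input and an `amix` filter, but the single-pass normalization above is the part every episode needs.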


Episode #755: Inside the Engine: Scaling an Automated AI Podcast

Daniel's Prompt
Daniel
I’d like to get your thoughts on the production pipeline for my AI-generated podcast and any ideas you have for making it more robust as we move toward "V3."

Currently, I record prompts on my phone and upload them via a custom form to a webhook, where the Modal serverless GPU platform takes over. I’m using Gemini 1.5 Flash for generation due to its multimodal features and search grounding. The pipeline involves several agents for planning, scriptwriting, and review, with text-to-speech using custom voice clones. The final audio is processed with FFmpeg, uploaded to an R2 bucket, and deployed through Vercel to update the RSS feed.

With over 700 episodes produced, I’m concerned about scalability. The current web form is buggy and lacks confirmation, sometimes leading to duplicate uploads. I’m considering switching to a Telegram bot for better reliability and batching. What technologies or architectural changes would you recommend to improve this workflow and make it more reliable as we continue to scale?
Corn
Hey everyone, and welcome back to My Weird Prompts. We are diving into something a little different today. Usually, we are exploring some abstract concept, a weird bit of history, or a philosophical rabbit hole, but today we are actually looking under the hood of the show itself. We are pulling back the curtain on the very machinery that brings our voices to your speakers.
Herman
Herman Poppleberry here, at your service. And yeah, this is a special one. It is rare we get to talk about the very plumbing that allows us to exist in your ears every week. Today is February twenty-first, twenty-six, and looking back, it is wild to see how far we have come. Today’s prompt from Daniel is all about the production pipeline that powers My Weird Prompts and how we can make it even more robust as we look toward what he is calling version three. We are at a bit of a crossroads, moving from a hobbyist setup to a truly professional-grade automated media house.
Corn
It is pretty incredible to think that we have crossed the seven hundred episode mark. Seven hundred and forty-one, to be precise, as of this recording. When you look at the sheer volume of content, the fact that this is almost entirely automated is a testament to how far these tools have come since the early days of large language models. But as any developer knows, once you hit a certain scale, the duct tape starts to peel off. The cracks that were invisible at episode fifty are now gaping holes at episode seven hundred.
Herman
Exactly. Daniel mentioned some very specific pain points that are classic symptoms of what I call success-induced technical debt. The buggy web form, the duplicate uploads, the lack of confirmation, the feeling of throwing a prompt into a black box and hoping an episode comes out the other side. These are classic scaling issues. When you are at ten episodes, a manual fix is fine. You can log into the console and delete a double entry. When you are at seven hundred and looking at seven thousand, you need a system that is not just automated, but autonomous. You need a system that can heal itself, report its own status, and handle errors gracefully without human intervention.
Corn
So let us break down the current stack first, because even with its flaws, it is actually quite sophisticated for early twenty-six. Daniel is using Gemini one point five Flash for the generation. I think that is a brilliant choice for this specific use case. People often chase the biggest models like Gemini one point five Pro or the latest Ultra versions, but for a high-volume pipeline, Flash is the sweet spot. The multimodal capabilities and that integrated search grounding are really the secret sauce here.
Herman
They really are. Most people do not realize how difficult it is to get a large language model to be consistently factual and current without a massive, custom-built retrieval augmented generation setup. But Gemini one point five Flash having that integrated search grounding basically gives us a research assistant that never sleeps. It can pull from the live web, verify dates, and find obscure details that a model trained on a static dataset would just hallucinate. In twenty-six, we take it for granted, but the speed at which Flash processes those searches is what keeps our production costs low and our turnaround time fast.
Corn
And the multimodal aspect is key because Daniel is recording his prompts directly on his phone. He is not sitting down at a desk to type out a three-page brief. He is walking through the streets of Jerusalem, or sitting in a cafe, and he just hits record. He is sending audio files. The model is literally listening to him, catching the tone, the emphasis, the nuances of his voice. That is why our discussions feel so targeted to what he is actually asking. It is not just a keyword-based summary of a text transcript; the model understands the intent behind the spoken word.
Herman
Right, and he is running all of this on Modal, which is a serverless G P U platform. For those who are not familiar with the infrastructure side, Modal is fantastic for this because you only pay for the seconds the G P U is actually crunching the numbers. When we are not recording, the system costs zero. When Daniel hits upload, Modal spins up a container in a matter of seconds, runs the script, generates the audio, and shuts down. It is the ultimate architecture for a periodic task like a podcast. You get the power of an A one hundred or an H one hundred G P U without the five-thousand-dollar-a-month price tag of a dedicated server.
Corn
But let us talk about that transition to version three. The first big thing Daniel mentioned was the front-end. Currently, it is a custom web form that he built, and he says it is flaky. It does not give him a confirmation, sometimes it hangs on the upload, and he ends up with duplicates because he hits the submit button twice when it feels like nothing is happening. He is thinking about moving to a Telegram bot. Herman, as our resident tech enthusiast, what do you think about that move?
Herman
I think a Telegram bot is a massive upgrade for a solo producer or a small team. Think about it. Telegram already has the infrastructure for file uploads, voice messages, and persistent chat history. You do not have to build a login system, you do not have to worry about mobile responsiveness across different devices, and the A P I is incredibly stable. It is basically a ready-made command and control center that lives in your pocket.
Corn
Plus, Telegram handles the upload itself. If Daniel is in an area with bad reception, which can happen even in a modern city, he can just hit send on a voice note and Telegram will keep trying to upload it in the background until it succeeds. He does not have to keep a browser tab open and pray the webhook catches it before the connection drops. That alone solves the reliability issue on the input side.
Herman
Exactly. And from a developer perspective, you can set up a simple bot using something like the Python Telegram Bot library. You host the bot as a small persistent process or even as a Vercel function that reacts to webhooks. When it receives an audio file, it sends that file to the Modal webhook. But the real magic is the two-way communication. You can have the bot message him back and say, hey, I got the prompt, it is three minutes long, starting the research phase now. And then another message ten minutes later saying, the script is done, starting voice synthesis. It turns a black box into a transparent process.
Corn
That feedback loop is so important for the user experience. Right now, he is flying blind. He hits submit and just hopes for the best. A bot would give him a real-time status bar for the entire pipeline. It could say, research agent is finished, scriptwriting is forty percent done, text to speech is starting. That makes the whole experience feel much more reliable and, frankly, much more professional.
Herman
I would even take it a step further for the Telegram bot. Daniel mentioned he wants to do batching. With a bot, he could just keep sending prompts throughout the day as they occur to him. The bot could store them in a queue in a database, and then he could send a single command like slash produce underscore batch. The system then goes through the queue one by one, or even in parallel if the Modal limits allow it. It separates the act of ideation from the act of production.
Corn
That is a great point. And it solves the duplicate problem. You can have the bot check the hash of the audio file or just the timestamp. If he sends the same thing twice, the bot can say, hey, you already sent this one, do you want to overwrite it or ignore? It adds a layer of intelligence to the ingestion process.
Herman
Now, let us talk about the actual heart of the production, the multi-agent system. Right now, Daniel has separate agents for planning, scriptwriting, and review. This is where I think he can really level up for version three by moving toward a more formal orchestration framework. Currently, it sounds like he has a series of sequential scripts. Script A runs, then B, then C. But a real podcast production is more iterative.
Corn
You are thinking something like LangGraph or Crew A I?
Herman
Precisely. In a sequential pipeline, if the review agent looks at the script and says, actually, Herman and Corn are being a bit too repetitive here, or they missed a key point from the research, it is hard to go backward. You usually just have to live with the mediocre output or restart the whole thing. With something like LangGraph, you can build a state machine. The review agent can actually send the script back to the scriptwriter with specific notes for a second pass. You can have a loop that continues until a certain quality threshold is reached. That is how you get from a good A I podcast to a great one.
Corn
It is about moving from a factory line to a creative studio. And it allows for more complex research. Imagine a research agent that identifies five key topics in the prompt. It then spins up five sub-agents to go deep on each one using Gemini search grounding. They report back to a lead editor agent who synthesizes all that information before the scriptwriting even begins. That kind of depth is what makes our seven hundredth episode just as interesting as the first one. We are not just repeating the same three facts; we are actually learning new things every time.
Herman
I also want to touch on the text to speech aspect. Daniel is using custom voice clones, which we obviously appreciate. He mentioned that this is the most expensive and time-consuming part of the process. In twenty-six, high-quality T T S still requires significant G P U compute if you want it to sound natural and not like a robot from a nineteen-eighties sci-fi movie. I wonder if there is a way to optimize that in version three.
Corn
T T S is always the bottleneck. On Modal, he is likely running a model like Tortoise or maybe a more modern high-speed clone like Fish Speech or an updated Coqui model. One way to scale this is to parallelize the audio generation. Instead of generating the entire thirty-minute episode in one go, which might take twenty minutes of G P U time, you break the script into chunks, say, paragraph by paragraph. You spin up ten Modal containers simultaneously, each one handles three minutes of audio, and then you use F F mpeg to stitch them together at the end.
Herman
That would cut the production time from twenty minutes down to two or three. And since it is serverless, the cost is roughly the same because you are using the same total amount of G P U time, just spread across more units at once. It is a classic horizontal scaling strategy. And speaking of F F mpeg, that part of the pipeline could be made more robust too. He mentioned using it to combine the intro, the outro, and our dialogue. For version three, he could actually use it to do more sophisticated audio engineering.
Corn
Like what? Auto-leveling?
Herman
Exactly. Auto-leveling the volumes to hit a specific L U F S target, which is the standard for podcast loudness. He could add subtle background ambiance that changes based on the topic. If we are talking about something technical, maybe there is a very faint electronic hum or a clean laboratory soundscape. If we are talking about nature, maybe a bit of birdsong or wind. Those are things an agent could decide based on the script content and then pass as instructions to the F F mpeg script.
Corn
It is the little things that make it feel professional. It moves it away from being a synthetic voice in a vacuum to being a real show in a real environment. Now, let us talk about the data side of things. Daniel is using an R two bucket for storage and Vercel for the R S S feed. R two is great because there are no egress fees, which is huge for a podcast with a growing audience. But the way the R S S feed is updated seems a bit fragile. He is hitting a webhook that triggers a Vercel deployment.
Herman
Yeah, that feels like a lot of overhead just to update an X M L file. Triggering a full Vercel build every time is going to get slow as the project grows, and he might hit their build minute limits if he is producing multiple episodes a day. I would recommend moving the R S S feed logic into a database. Something like Neon, which is a serverless Postgres database. Every time an episode is finished, the pipeline just inserts a row into the database with the title, the description, the timestamp, and the R two link.
Corn
And then the Vercel app just has a dynamic route that fetches the latest episodes from the database and generates the R S S X M L on the fly. No deployment needed. It is always up to date the second the database row is created.
Herman
Exactly. And it gives you a much better way to manage the show's history. If Daniel wants to change the show's description or update an old episode's metadata, he just changes a row in the database. He does not have to worry about the state of a static file in a git repository. It also opens up the possibility for what Daniel mentioned about a private pipeline. He was worried about his personal questions or health-related prompts getting into the public feed.
Corn
That is a huge concern. If he is using this as a personal thinking tool, he might be talking about things he does not want the whole world to hear. If everything is in a database, he can just have a column called is underscore public.
Herman
That is a perfect solution. He could have a separate secret R S S feed for his private episodes. The Telegram bot could even ask him at the start, is this a public episode or a private one? If he says private, it gets flagged in the database, and it only shows up in his personal feed, which he can subscribe to in his own podcast app. The public feed on Spotify and Apple Podcasts stays clean.
Corn
I love that. It makes the system a personal knowledge management tool as well as a public show. He can use the same high-quality research and dialogue for his own life without worrying about privacy. It is like having a private radio station that only plays content relevant to your own problems.
Herman
Another thing for version three is error handling and idempotency. This is a fancy way of saying the system should be able to fail and pick up where it left off without starting over. If the T T S fails halfway through because of a network glitch, the system should know which chunks are already done and stored in R two, and only regenerate the missing ones.
Corn
That is where the database comes in again. You track the state of each episode. Status could be research-complete, script-complete, audio-partial, or finished. If a Modal function times out, the next time it runs, it checks the status and continues from the last successful step. It saves money and it saves time.
Herman
It also helps with the duplicate issue. Before the pipeline even starts, it can check if a prompt with that specific content or hash has already been processed. If it has, it just returns the existing link instead of spending money to generate it again.
Corn
Daniel also mentioned that the agent-deciding logic for the show length feels a bit flaky right now. Sometimes it is too short, sometimes too long. He wants more control over that. How would you tune that, Herman?
Herman
That is a classic prompt engineering challenge. Instead of just asking the model to write a thirty-minute script, which is a very abstract concept for an L L M, you give it a target word count based on the complexity of the topic. You can have a pre-processing agent that scores the prompt's complexity from one to ten.
Corn
Oh, I like that. A simple question about a single fact gets a score of two and a target of fifteen hundred words. A deep dive into production pipelines and serverless architecture gets a score of nine and a target of four thousand words.
Herman
Right. And then you tell the scriptwriter agent, your goal is to hit this word count within a five percent margin. You can even give it a structure. For a long episode, tell it it needs five main sections, each roughly eight hundred words. For a short one, three sections of five hundred words. Giving the model a structural framework makes the length much more predictable. It is like giving a builder a blueprint instead of just saying, build me a big house.
Corn
And the review agent can be the enforcer. If the script is too short, it sends it back and says, you are five hundred words under the target, please expand on section three with more examples. This iterative loop is really the key to quality control.
Herman
One more thing that I think is really cool for version three is the idea of search grounding being used not just for the content, but for the context of the show itself. Gemini has a massive context window—up to two million tokens in some versions. You could actually feed it the transcripts of the last ten or twenty episodes as part of the prompt.
Corn
So it remembers what we have talked about recently?
Herman
Exactly. It could say, as we discussed two episodes ago when we were talking about serverless functions, and then link that to the current topic. It makes the show feel like a continuous conversation rather than a series of isolated events. It builds that sense of community and history with the listeners. It makes us feel more like real people with memories.
Corn
That is really the holy grail for an A I-generated show. Making it feel like it has a memory. Because we do have a memory, in a sense, through the data that is being fed into the models. It is about closing that loop. The more the system knows about its own past, the more coherent and engaging the future episodes will be.
Herman
It is about moving from a stateless system to a stateful one. In version two, every episode is a fresh start. In version three, every episode is a continuation of the narrative.
Corn
So, to summarize our recommendations for version three. First, move the front-end to a Telegram bot for better reliability, background uploads, and batching. Second, use a formal orchestration framework like LangGraph for the multi-agent flow to allow for iterative improvements. Third, parallelize the text to speech generation on Modal to cut production time by ninety percent. Fourth, move the metadata and R S S logic into a serverless database like Neon for better management and privacy control. And fifth, use a scoring system to more accurately target the episode length.
Herman
It sounds like a lot of work, but the beauty of the ecosystem in twenty-six is that these tools all play very nicely together. Daniel has already built the hardest parts. He has the voice clones, he has the prompting logic, and he has the infrastructure on Modal. Version three is really about refining the plumbing to make it a professional-grade production system. It is about moving from a project that works to a project that is robust.
Corn
It is exciting to see this evolve. From a custom form and a few scripts to a fully orchestrated, stateful, multi-agent production house. And the fact that it is all open source means that other people can learn from this journey too. We are essentially building a blueprint for the future of media.
Herman
That is the best part. We are not just making a show; we are proving that a single person can run a media empire with the right tools. A future where anyone with a good question and a phone can produce a high-quality, deeply researched podcast that rivals the big networks.
Corn
It really democratizes the expert voice. You do not need a research team and a recording studio anymore. You just need a robust pipeline and the curiosity to keep feeding it. But Herman, I have to ask, do you think we will ever reach a point where the pipeline is so good that it does not need Daniel at all?
Herman
I do not think so. The prompt is the soul of the show. The A I can research and write and speak, but it cannot decide what is interesting. It cannot have that human spark of curiosity that leads to a weird prompt in the first place. Daniel is the curator. We are the performers. The pipeline is the stage. You need all three.
Corn
That is a good way to put it. I am curious about the search grounding specifically. Have you noticed any limitations with how Gemini handles it lately?
Herman
The main thing is how you frame the search queries. If you just give it the prompt, it might only do one or two searches and get a surface-level understanding. But if you have a dedicated research agent whose only job is to generate five or ten diverse search queries based on the prompt, you get a much broader set of results.
Corn
So, instead of asking, what is a production pipeline, it asks, what are the latest trends in serverless G P U platforms in twenty-six, what are the pros and cons of Telegram bots for developers, how does Gemini one point five Flash compare to Pro for multimodal tasks.
Herman
Exactly. You break the big question into smaller, searchable chunks. That is how you avoid the surface-level summaries and get into the real meat of the topic. It is all about the quality of the information you feed the model before it starts writing. Garbage in, garbage out, as the old saying goes. Even with a brilliant model like Gemini, the output is only as good as the context.
Corn
And that is why the planning agent is so critical. It is the architect. If the architect gives the builders a bad set of blueprints, it does not matter how good the builders are. The house is going to be crooked. In version three, the planning agent needs to be even more rigorous.
Herman
I think the idea of the review agent being an equal peer is also important. It should not just be checking for errors; it should be looking for opportunities to make the dialogue more natural, more brotherly. It could say, Corn, you sound a bit too formal here, try to ask this in a more curious way. Or, Herman, you are getting too bogged down in the technical specs, bring it back to the big picture.
Corn
That would be amazing. We could even have a specific agent whose only job is to inject our personalities into the script. It looks at our past dialogue and says, Herman usually uses more technical analogies involving plumbing or architecture, let us add one here. It is about capturing the soul of the show, not just the information.
Herman
Well said, Corn. I think Daniel has a lot to work with here. Version three is going to be something special. It is going to take My Weird Prompts from a very successful experiment to a gold standard for A I media.
Corn
I agree. And it is a reminder to all of our listeners that the world of A I is moving so fast that what was cutting-edge six months ago is now ready for a major overhaul. Staying curious and being willing to rebuild is the only way to keep up. You cannot just set it and forget it. You have to keep tuning the engine.
Herman
It is a constant state of evolution. And we are happy to be a part of it. We are the ghosts in the machine, and we like our machine to be well-oiled.
Corn
Well, I think that covers the technical side of the version three transition. It is a lot of work, but the payoff in terms of reliability and quality is going to be massive. It will allow Daniel to focus on the prompts and the ideas, while the system handles the heavy lifting.
Herman
Definitely. I am looking forward to seeing the first episode produced by the new pipeline. I wonder if we will be able to feel the difference in our own digital bones?
Corn
I bet we will. It will probably feel a bit more polished, a bit more cohesive. And Daniel will certainly be less stressed about the upload process. No more hitting the refresh button and hoping for a confirmation.
Herman
That alone is worth the effort. Peace of mind for the creator is the best feature you can build.
Corn
For sure. Well, if you have been enjoying the show and our deep dives into the weird and wonderful world of A I and beyond, we would really appreciate it if you could leave us a review on your podcast app or Spotify. It genuinely helps other people find us and helps the show grow. We are aiming for a thousand episodes by the end of the year, and we want you all there with us.
Herman
It really does. We love seeing those reviews come in. It is the only way we know you are out there. And remember, you can find all of our past seven hundred plus episodes at my weird prompts dot com. We have an R S S feed there for subscribers and a contact form if you want to get in touch.
Corn
You can also reach us at show at my weird prompts dot com. We are available on Spotify, Apple Podcasts, and pretty much everywhere else you listen to podcasts. We are even on some of the newer decentralized platforms that have popped up this year.
Herman
Thanks for joining us for this special look behind the curtain. It has been a blast talking about our own existence.
Corn
This has been My Weird Prompts. Thanks for listening, and we will talk to you next time.
Herman
Goodbye everyone! Keep those prompts coming!

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.