#2510: Where Voice AI Actually Works (Not Cold Calls)

Drive-thru accuracy, healthcare triage, and the design secret that makes people *want* to talk to a machine.

Episode Details
Episode ID
MWP-2668
Duration
28:00
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Voice AI is undergoing a quiet revolution, and it has almost nothing to do with the spam calls everyone hates. While the public conversation focuses on annoying cold callers, the real deployment landscape is far more interesting—and far more effective.

Drive-Thru: Faster, More Accurate, and Less Annoying

One of the most advanced use cases is fast-food ordering. Major chains are now reporting order accuracy rates above 95% with voice AI, which in some locations actually beats human performance. The key isn't just speech recognition; it's the ability to handle complex modifications like "no pickles, extra sauce on the side, swap the bun" with fewer errors than a distracted human worker. Latency has also dropped dramatically, with the best systems responding in under 300 milliseconds.
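The modification-handling step can be sketched as structured parsing of the transcribed utterance. This is a minimal, hypothetical example — the item names and modifier keywords are invented for illustration, and real systems use learned models rather than string prefixes:

```python
# Minimal sketch of turning transcribed drive-thru speech into a
# structured order. All item and modifier names are hypothetical.

def parse_order(utterance: str) -> dict:
    """Extract an item plus add/remove/swap modifiers from one utterance."""
    order = {"item": None, "remove": [], "add": [], "swap": []}
    text = utterance.lower()

    for item in ("burger", "chicken sandwich", "fries"):
        if item in text:
            order["item"] = item
            break

    # Each clause like "no pickles" or "extra sauce" becomes a modifier.
    for clause in text.split(","):
        clause = clause.strip()
        if clause.startswith("no "):
            order["remove"].append(clause[3:])
        elif clause.startswith("extra "):
            order["add"].append(clause[6:])
        elif clause.startswith("swap "):
            order["swap"].append(clause[5:])
    return order

order = parse_order("Burger, no pickles, extra sauce on the side, swap the bun")
```

The point of the structure is downstream accuracy: once modifiers are explicit fields rather than free text, the system can read the order back for confirmation and the kitchen display never sees an ambiguous sentence.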

The customer experience, however, hinges on subtle design choices. Early systems felt stiff and robotic. Newer versions layer in conversational markers—phrases like "let me make sure I got that right" delivered with natural prosody. When these markers are present, customer satisfaction scores jump by roughly 20 points. The lesson is clear: people don't mind talking to a machine if the machine makes the interaction feel human-shaped.

Healthcare Triage: More Thorough Than a Human

Several major health systems are using voice AI for initial patient intake and triage. The AI runs through a structured clinical protocol, collecting data more consistently than a rushed human triage nurse might. One study found that AI triage captured 40% more relevant symptoms simply because it never forgot to ask follow-up questions.

These systems are not making clinical decisions. They operate on a "human in the loop" architecture, collecting structured data and presenting it to a clinician who makes the final call. The most effective design pattern is "opt-in automation," where the system introduces itself as an automated assistant and explicitly offers the option to wait for a human. When people choose to engage, satisfaction is high; when they feel forced, it plummets.
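The opt-in pattern can be sketched as a small flow: introduce the assistant, offer the human option up front, and hand structured answers to a clinician. Everything here — the prompts, the protocol questions, the wait-time figure — is a hypothetical illustration of the pattern, not any real health system's protocol:

```python
# Sketch of the "opt-in automation" pattern for triage intake.
# Prompts, protocol questions, and wait-time numbers are hypothetical.

PROTOCOL = [
    "What symptoms are you experiencing?",
    "When did they start?",
    "Are you taking any medications?",
]

def triage_call(answers, wants_human=False, human_wait_minutes=12):
    """Introduce the assistant, offer a human, then collect structured data.

    Returns a record for a clinician -- the AI never makes the clinical call.
    """
    intro = (
        "This is an automated intake assistant. "
        f"You can wait about {human_wait_minutes} minutes for a person, "
        "or answer a few questions with me now."
    )
    if wants_human:
        return {"intro": intro, "route": "human_queue", "responses": {}}

    responses = dict(zip(PROTOCOL, answers))  # structured, question-by-question
    return {"intro": intro, "route": "clinician_review", "responses": responses}

record = triage_call(["chest tightness", "two hours ago", "none"])
```

Note that both branches end at a human: the caller either waits in the queue or has their structured answers routed to clinician review.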

Accessibility and Language Learning: Low Friction, High Impact

Voice AI is transformative for populations that struggle with traditional interfaces. For the elderly, proactive voice companions can initiate conversations, offer medication reminders, and reduce self-reported loneliness. For the visually impaired, modern screen readers can describe images and navigate complex interfaces.

In language learning, the breakthrough is pronunciation-aware feedback. Newer models can give nuanced advice on accent and intonation, such as "your vowel sound was a little flat—try rounding your lips more." The always-available, non-judgmental nature of these tutors is a major advantage, as users are often more willing to make mistakes with a machine than a human teacher.

The Core Design Principle: Agency vs. Interface

The distinction between a voice interface and a voice agent is crucial. An interface understands your words and routes you somewhere. An agent can take action within backend systems—processing returns, issuing refunds, or scheduling pickups without transferring you. This solves the "state persistence problem" where users have to repeat their information to every new person or system.
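State persistence can be sketched as a context object that travels with the conversation instead of being re-collected at every handoff. The field names and handoff target here are hypothetical, a minimal illustration of the pattern:

```python
# Sketch of conversation-state persistence: context follows the caller
# across steps and handoffs instead of being asked for again.

from dataclasses import dataclass, field

@dataclass
class ConversationState:
    account_id: str
    facts: dict = field(default_factory=dict)       # what the caller already said
    attempted: list = field(default_factory=list)   # what has been tried so far

    def remember(self, key, value):
        self.facts[key] = value

    def handoff(self, target: str) -> dict:
        """Package the full context for the next agent or human -- no repeats."""
        return {"to": target, "account_id": self.account_id,
                "facts": dict(self.facts), "attempted": list(self.attempted)}

state = ConversationState(account_id="A-1042")
state.remember("issue", "refund for damaged item")
state.attempted.append("order lookup")
packet = state.handoff("returns_specialist")
```

Whoever receives the packet — human or agent — starts with the account number, the stated issue, and the failed steps already in hand, which is exactly what legacy call-center transfers throw away.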

In industrial settings, voice agents are highly domain-specific. Technicians repairing equipment can pull up schematics and diagnostics hands-free. These systems are less conversational and more command-driven, prioritizing safety and productivity over chit-chat. The underlying thread across all successful deployments is the same: reduce social friction, respect user agency, and actually get things done.
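The command-driven style can be sketched as a small fixed grammar mapped to lookups — no dialogue management, just fast intent matching. The bolt IDs, torque values, and patterns are invented for illustration:

```python
# Sketch of a command-driven (not conversational) field-service interface:
# a fixed grammar mapped to data lookups. All specs and patterns hypothetical.

import re

TORQUE_SPECS = {"17B": "85 Nm"}  # bolt id -> spec, invented values

COMMANDS = [
    (re.compile(r"torque (?:specification|spec) for bolt (\w+)", re.I),
     lambda m: TORQUE_SPECS.get(m.group(1).upper(), "unknown bolt")),
]

def handle(utterance: str) -> str:
    """Match a terse spoken command against the grammar; no chit-chat."""
    for pattern, action in COMMANDS:
        m = pattern.search(utterance)
        if m:
            return action(m)
    return "command not recognized"

answer = handle("torque specification for bolt 17B")
```

A closed grammar like this trades flexibility for speed and predictability, which is the right trade when the user is thirty feet up a wind turbine and needs the answer in under a second.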


#2510: Where Voice AI Actually Works (Not Cold Calls)

Corn
Daniel sent us this one — he's asking what people are actually doing with voice AI agents beyond the obvious annoyance of automated cold calling. He wants the real-world deployments. Accessibility, healthcare triage, language tutors, drive-thru ordering, customer support that doesn't make you want to throw your phone, industrial field service, mental health check-ins, voice interfaces for kids and non-literate users, in-car assistants, enterprise meeting note-takers, and the weird edges — grief tech, voice journaling coaches, even AI dungeon masters. And the deeper question: where is this stuff genuinely landing, where is it still underwhelming, and what's the design difference between a voice agent people tolerate and one they actually choose to use?
Herman
Oh, this is a great prompt. And by the way — DeepSeek V four Pro is writing our script today. So if anything sounds unusually articulate, that's why.
Corn
I was going to say you sound more coherent than usual.
Herman
But seriously, this topic hits at something I've been tracking closely. The cold calling stuff gets all the attention because it's annoying and visible, but the real deployment landscape is way more interesting and way less covered.
Corn
Of course you have. Alright, where do you want to start?
Herman
Let's start with the category that's probably furthest along in terms of real, measurable results — drive-thru and restaurant ordering. There was a piece just last month in QSR Pro about AI drive-thru accuracy benchmarks. Several major chains are now reporting order accuracy rates above ninety-five percent with voice AI, which is actually beating human performance in some locations.
Corn
Wait, it's beating humans at order accuracy?
Herman
Yeah, and it makes sense. Human order-takers get distracted, mishear things in a noisy kitchen, forget to repeat the order back. The AI doesn't get distracted. The QSR Pro piece specifically called out that the best systems are now handling complex modifications — "no pickles, extra sauce on the side, swap the bun" — with fewer errors than the average human worker. And the latency has dropped dramatically. Early systems had that painful two-second pause before responding. Now the good ones respond in under three hundred milliseconds.
Corn
That's basically real-time. But I'm curious about the customer experience side. I've used some of these, and there's a certain stiffness to them. Even when the order is correct, it doesn't feel right.
Herman
That's exactly the design difference question Daniel's getting at. The early drive-thru AIs were basically speech-to-text feeding into a command line. The newer ones layer in conversational markers — little phrases like "let me make sure I got that right" or "anything else for you today?" delivered with natural prosody. The QSR data shows customer satisfaction scores jump about twenty points when those markers are present versus absent.
Corn
It's not just about accuracy, it's about the illusion of being heard by something that understands you.
Herman
And that word "illusion" is doing a lot of work here, but the effect is real. People don't mind talking to a machine if the machine makes the interaction feel human-shaped.
Corn
What about outside of fast food? You mentioned healthcare triage.
Herman
Healthcare is where things get consequential. Several major health systems now use voice AI for initial patient intake and triage. A patient calls with symptoms, the voice agent runs through a structured clinical protocol — not unlike what a nurse would do — and routes the case to the appropriate level of care. What's interesting is that the protocols are often more thorough than what a rushed human triage nurse might cover, because the AI never skips a question to save time.
Corn
Triage is high-stakes. If the AI misses something, someone could end up in serious trouble. How are they handling liability and oversight?
Herman
The deployments I've seen all use "human in the loop" architecture. The AI doesn't make clinical decisions — it collects structured data and presents it to a clinician who makes the call. The value is in the data collection quality and consistency. One study from a large Midwest health system found that AI triage intake captured forty percent more relevant symptoms than the previous phone intake process, simply because it never forgot to ask the follow-up questions.
Corn
Forty percent is huge. But I can also imagine the patient who just wants to talk to a human and gets frustrated.
Herman
That's the tolerance-versus-choice thing again. The design pattern that seems to work best is what some researchers call "opt-in automation." The system introduces itself as an automated assistant, explicitly offers the option to wait for a human, and gives an estimated wait time. When people choose to engage with the AI, satisfaction is high. When they feel forced into it, satisfaction plummets. It's not a technology problem at that point, it's a consent and agency problem.
Corn
Nobody likes being ambushed by a robot.
Herman
That principle shows up across almost every successful deployment. Let me jump to accessibility, because this is the category where I think the impact is most profound and most under-discussed. Voice AI is transformative for visually impaired users, for elderly people who struggle with smartphone touch interfaces, and for non-literate populations.
Corn
I was just thinking about the elderly use case. In my — let's call it my leaf medicine practice —
Herman
Oh, here we go.
Corn
— I deal with a lot of older folks. And half of them can't navigate a smartphone app to save their lives. Too many buttons, text too small, confusing menus. But they can talk. Everyone can talk.
Herman
There's a company called ElliQ that's been getting a lot of attention for exactly this. It's a voice-first companion device designed for older adults. It does medication reminders, wellness check-ins, can call family members, tells jokes, plays music. But the key design insight is that it initiates conversation rather than just waiting to be asked something. It'll say things like "good morning, how did you sleep?" or "I noticed you haven't called your daughter in a while, want me to dial her?"
Corn
— proactive rather than reactive. Most voice assistants just sit there waiting for a wake word.
Herman
Right, and for isolated elderly people, that proactivity is the whole value proposition. There's published data showing significant reductions in self-reported loneliness scores among users. Now, you can have a whole philosophical debate about whether an AI companion is a real solution to elder isolation or a band-aid on a social failure, but as a practical matter, the users report real benefit.
Corn
I think both things can be true. It can be a band-aid on a social failure and still be better than the alternative of no contact at all.
Herman
And the accessibility piece extends beyond companionship. Screen readers powered by modern voice AI are dramatically better than the old robotic text-to-speech. They can describe images, summarize web pages, navigate complex interfaces. For visually impaired users, this isn't a convenience, it's access to the same internet everyone else uses.
Corn
What about the language learning side? Daniel mentioned tutors.
Herman
Language learning is one of those areas where voice AI seems almost too perfect, but the early implementations were pretty limited. The breakthrough has been in "pronunciation-aware feedback." Older systems could tell you if you said the right words in the right order, but couldn't give nuanced feedback on your accent or intonation. The newer models can say things like "your vowel sound on that word was a little flat — try rounding your lips more." That's getting close to what a human tutor does.
Corn
I can see that being useful for languages with sounds that don't exist in your native language. English speakers trying to learn Mandarin tones, for example.
Herman
The always-available aspect matters a lot. You can practice at two in the morning if you want. The AI doesn't get tired of repeating the same phrase fifty times. There's no embarrassment factor — people are often more willing to stumble and make mistakes with a machine than with a human tutor.
Corn
The embarrassment factor is underrated. I've seen people freeze up trying to speak a new language in front of a human teacher, but they'll chatter away to an app.
Herman
That's a design principle that shows up across multiple categories. Voice agents that reduce social friction — the fear of being judged, the impatience of a human on the other end — those tend to get adopted. Mental health check-ins are another example. There are now voice-based systems for cognitive behavioral therapy exercises, mood tracking, and crisis triage. People disclose more to an AI than they do to a human therapist in some studies, precisely because there's no social judgment.
Corn
That's both promising and a little unsettling.
Herman
The ethical questions are significant. What happens when someone expresses suicidal ideation? How does the system escalate? What data is stored and who has access? The best implementations I've seen have very clear guardrails — the AI explicitly says "I'm not a therapist, I'm a support tool, and here's how to reach a human if you need one." But not all implementations are the best implementations.
Corn
Let's talk about customer support, because that's the one most people have actually encountered. Daniel said "customer support that doesn't suck." Does that exist?
Herman
It's starting to. The Home Depot deployment is a good case study. They rolled out Google Cloud's Gemini for customer experience earlier this year. They're reporting customer support resolution four times faster than their previous system. The key design choice was making the voice agent capable of actually doing things — checking order status, initiating returns, looking up inventory at specific stores — rather than just being a smarter phone tree that eventually routes you to a human anyway.
Corn
It's not just understanding what you're saying, it's having agency within the backend systems.
Herman
That's the distinction between a voice interface and a voice agent. An interface can understand your words and maybe route you somewhere. An agent can take action. The Home Depot system can process a return, issue a refund, schedule a pickup — all within the voice interaction. No hold music, no transfers, no repeating your order number three times.
Corn
I had to call my bank last week and I gave my account number to the automated system, then gave it again to the first human, then gave it a third time to the specialist they transferred me to. I nearly lost my mind.
Herman
That's the state persistence problem, and it's one of the things good voice agents solve by design. The context follows the conversation. The agent knows what you've already said, what's been tried, what worked and what didn't. That sounds basic, but it's been shockingly hard to implement in legacy call center infrastructure.
Corn
What about the industrial and field service side? I imagine hands-free is a big deal there.
Herman
Think about a technician repairing a piece of equipment, both hands occupied, needs to pull up a schematic or a diagnostic procedure. Voice is the natural interface. There are deployments in aerospace maintenance, oil and gas field service, manufacturing lines — anywhere that stopping work to type on a screen is either dangerous or productivity-killing. The voice agents in those contexts tend to be highly domain-specific. They're not trying to answer general knowledge questions or tell jokes. They know the equipment, the repair procedures, the safety protocols.
Corn
I assume those are less conversational and more command-driven.
Herman
Right, and that's a design choice that makes sense for the context. When you're thirty feet up on a wind turbine, you don't want chit-chat. You want "torque specification for bolt seventeen B" and you want the answer in under a second. The tolerance-versus-choice framework shifts completely. In that context, people choose the voice agent because it's the only safe option, and they tolerate its limitations because the alternative is climbing down, removing gloves, finding a laptop.
Corn
Let's get to the weird edges. Daniel mentioned grief tech and voice journaling and AI dungeon masters. I need to know about the dungeon masters.
Herman
This one is delightful. There's a growing community of people using voice AI as dungeon masters for tabletop role-playing games. The AI generates the narrative, voices the non-player characters, adjudicates rules, responds to player choices — all through voice. And the interesting thing is that voice makes it feel much more like sitting around a table with a human DM than text-based AI roleplaying does.
Corn
There are groups of people sitting around a table, talking to an AI that's narrating their adventure, and it actually works?
Herman
It works surprisingly well for certain styles of play. The AI is good at improvising within established fantasy tropes, it never gets tired or cranky, and it doesn't play favorites among the players. Where it falls down is long-term narrative coherence — keeping track of plot threads across multiple sessions, remembering that the mysterious stranger from session three is actually the villain's brother. Human dungeon masters still win on deep, multi-session storytelling.
Corn
For a one-shot or a casual game, I can see the appeal. Nobody has to prep, nobody has to be the DM.
Herman
And that's a pattern that shows up across voice AI deployments — the technology often works best for bounded, well-defined tasks rather than open-ended, long-running interactions. The drive-thru order is a thirty-second interaction with a clear beginning, middle, and end. The triage call is maybe five minutes. The field service query is a single question. When the scope is clear, voice AI can excel.
Corn
The grief tech thing? That sounds heavy.
Herman
It is heavy. There are companies now building voice agents that can simulate conversation with deceased loved ones, based on voice recordings and text messages and other data they left behind. The idea is to give grieving people a way to hear their loved one's voice again, to have a conversation, to say things they didn't get to say.
Corn
I have complicated feelings about that.
Herman
I think anyone would. The companies in this space emphasize that it's not meant to replace the grieving process or create the illusion that the person is still alive. The framing is more like an interactive memorial — a way to engage with memories rather than a simulation of ongoing presence. But the ethical territory is extremely fraught. What happens if the AI says something the real person never would have said? What happens if someone becomes dependent on these conversations and can't move forward?
Corn
That's the thing that bothers me. Grief is supposed to be a process with an endpoint. Not that you stop missing the person, but you integrate the loss and keep living. A voice agent that lets you keep having conversations seems like it could short-circuit that process.
Herman
The counter-argument I've heard is that we already have static memorials — photos, videos, voicemails — and this is just a more interactive version of those. But I think the interactivity is precisely what makes it different. A photo doesn't talk back. A voicemail doesn't ask how your day was.
Corn
What about voice journaling? That seems less ethically complicated.
Herman
There are several apps now that let you journal by talking rather than typing. The AI transcribes, summarizes, can ask follow-up questions to help you go deeper, can identify patterns over time — "you've mentioned feeling anxious about work in four of your last five entries, want to talk about that?" Users report that speaking their thoughts feels more natural and cathartic than writing them down.
Corn
I can believe that. Writing forces you to organize your thoughts. Talking lets them spill out however they come.
Herman
That raw, unorganized quality is actually valuable for journaling. You capture things you might edit out if you were writing. The AI's job is to help you find the structure afterward, not to constrain the expression upfront.
Corn
Alright, let's zoom out. You've been tracking this space closely. Where would you say the technology is landing versus still underwhelming?
Herman
Where it's landing: structured, bounded interactions in specific domains. Drive-thru ordering, triage intake, field service queries, accessibility tools for visually impaired users, language pronunciation practice. These are all areas where the problem is well-defined, the success criteria are clear, and the AI doesn't need to maintain context across weeks or months.
Herman
Where it's still underwhelming: open-ended conversational agents that try to be general-purpose companions. The technology still struggles with long-term memory, with understanding nuanced emotional context, with knowing when to talk and when to shut up. The elderly companion devices like ElliQ are the closest to working, and even those are carefully scoped — they're not trying to be a full conversational partner, they're doing specific check-ins and reminders with some light social interaction layered on top.
Corn
The pattern is: narrow scope, clear task, short interaction — works well. Broad scope, fuzzy task, ongoing relationship — still not there.
Herman
That's the pattern. And I think it's important to be honest about that, because the hype cycle around voice AI has been intense. Every few years someone declares that voice is the future of everything and keyboards are dead. Keyboards are not dead. Voice is great for some things and terrible for others. Typing an email in an open-plan office by voice is a nightmare for everyone around you. Dictating a text message while driving is useful.
Corn
The interface has to match the situation.
Herman
The social norms around voice interfaces are still evolving. We're in this awkward phase where using voice assistants in public feels weird to a lot of people, but using them in private is increasingly normal. The design challenge is building agents that are socially aware — that know when to be quiet, when to speak softly, when to defer to a screen instead.
Corn
Speaking of screens, where do in-car assistants fit into this? That seems like a natural voice environment.
Herman
Cars are almost the ideal voice environment. You're alone or with family, not in public, your hands and eyes are occupied, the tasks are well-defined — navigation, music, messages, climate control. The newer in-car systems are getting quite good, especially the ones that integrate deeply with the vehicle's systems. Being able to say "I'm cold" and have the car adjust the temperature, or "find me a gas station with a clean restroom on the route" and have it actually do that — that's useful.
Corn
"Clean restroom" is doing a lot of work in that query.
Herman
The important thing is that the AI can handle multi-part, contextual requests. "Find a gas station, make sure it's on the route, and I want one with good coffee if possible." That kind of natural language query with multiple constraints is where modern voice agents shine compared to the old command-based systems where you had to say the exact right phrase in the exact right order.
Corn
What about enterprise internal tools? Daniel mentioned meeting note-takers and IT helpdesk.
Herman
Meeting note-takers are becoming useful. The best ones now don't just transcribe — they identify action items, flag decisions, summarize key points, and can answer follow-up questions like "what did Sarah say about the budget timeline?" The voice agent sits in the meeting, listens, and becomes a searchable knowledge base. The design challenge there is privacy and consent. Everyone in the meeting needs to know the AI is listening and needs to be comfortable with that.
Corn
I've been in meetings where someone has one of those note-takers running and it changes the dynamic. People are more careful about what they say. Fewer offhand comments, less brainstorming out loud.
Herman
That's the surveillance effect, and it's a real concern. The countermeasure some companies are adopting is transparency and control — the AI explicitly states it's recording, everyone can see what's being captured, and there's an easy way to delete sections after the fact. But the cultural adjustment is ongoing.
Corn
IT helpdesk seems like a natural fit. "My printer isn't working" is practically a meme at this point. Can an AI actually fix that?
Herman
For the common cases, yes. Password resets, printer troubleshooting, software updates, VPN configuration — these are well-documented problems with known solutions. A voice agent can walk a user through the steps, check whether each step worked, and escalate to a human if the script isn't solving it. The win isn't that the AI is smarter than a human IT person, it's that the AI is available instantly at three in the morning when the human IT person is asleep.
Corn
It doesn't sigh audibly when you admit you didn't try turning it off and on again.
Herman
That's the judgment-free thing again. It keeps showing up as a design advantage. People will admit confusion or ignorance to an AI that they'd hide from a human to avoid looking stupid.
Corn
Let's hit one more category before we wrap. Daniel mentioned voice-first interfaces for kids and non-literate users. That seems like a whole different design challenge.
Herman
It is, and it's one of the most important areas of development. For kids who can't read yet, voice is the natural interface. There are educational apps where kids can ask questions about the world and get age-appropriate answers, storytelling apps where they can interact with characters, language development tools. The design constraints are strict — content safety, no advertising, no data collection, clear boundaries around what the AI will and won't discuss.
Corn
For non-literate adults?
Herman
This is where voice AI could have massive impact in the developing world. There are hundreds of millions of adults globally who can't read or write but who have access to mobile phones. Voice interfaces let them access banking services, agricultural information, health advice, government services — all the things that have moved to text-based apps and websites. Organizations in India and several African countries are deploying voice-based agricultural extension services where farmers can call in and ask about crop diseases, weather forecasts, market prices.
Corn
That's a use case that actually matters. It's not about convenience, it's about access to information that affects livelihoods.
Herman
The design principles are different from the consumer voice assistants we're used to. These systems need to work in multiple languages and dialects, need to handle low-bandwidth connections, need to function on basic feature phones, not just smartphones. The interaction design is often more structured — less open-ended conversation, more guided dialogue — because the users may not have a mental model of what an AI assistant can and can't do.
Corn
You have to teach people how to interact with it while they're interacting with it.
Herman
Onboarding is part of the experience, not a separate step. Some of these systems use "progressive disclosure" — they start with very simple interactions and gradually introduce more complex capabilities as the user demonstrates comfort with the basics.
Corn
Alright, I want to come back to Daniel's core question. What's the design difference between a voice agent people tolerate and one they actually choose to use?
Herman
Based on everything we've discussed, I'd say there are four principles. First, people choose agents that give them control — the ability to opt in, to switch to a human, to set boundaries. Agents that ambush you or trap you in a phone tree get tolerated at best.
Herman
Second, competence within a clear scope. The agents people choose don't pretend to do everything. They do a specific thing well and they're upfront about their limitations. The drive-thru agent takes your order. The triage agent collects your symptoms. They don't try to be your friend or your therapist on the side.
Herman
Third, conversational naturalness that matches the context. In a drive-thru, naturalness means friendly and efficient. In a field service setting, it means terse and precise. In a mental health check-in, it means warm and patient. The tone has to fit the situation, and getting it wrong is jarring.
Herman
Fourth, memory and state persistence. People choose agents that remember what they said thirty seconds ago, that don't make them repeat information, that maintain context across the interaction. This sounds basic, but it's the single biggest frustration point with bad voice systems.
Corn
That's a solid framework. And I think it explains why some deployments succeed while others flop. It's rarely about the underlying speech recognition accuracy anymore — that's gotten good enough across the board. It's about the interaction design layered on top.
Herman
The technology has reached a point where the limiting factor is not whether the AI can understand you, but whether the experience of talking to it feels worth your time. And that's a design problem, not an engineering problem.

And now: Hilbert's daily fun fact.

The national animal of Scotland is the unicorn. It has been since the twelve hundreds, when it was adopted as a symbol of purity and power in Scottish heraldry.
Corn
What should listeners actually do with all of this? If someone's interested in voice AI beyond the hype, where should they look?
Herman
I'd say three things. First, if you're building or evaluating voice AI for your organization, focus on a narrow, well-defined use case where success is measurable. Don't try to build a general-purpose assistant. Find the specific interaction that happens a thousand times a day and make that interaction better.
Herman
Second, if you're a user, try out some of the accessibility-focused voice tools even if you don't need them. Understanding how voice interfaces work for visually impaired users or non-literate populations will give you a much better intuition for what good design looks like than just using a smart speaker to set timers.
Herman
Third, pay attention to the agency question. When you encounter a voice agent in the wild, notice whether it gives you an escape route, whether it's clear about what it can and can't do, whether it respects your time. The good ones will feel almost invisible. The bad ones will make you want to scream "representative" into the phone.
Corn
I've done that.
Herman
We all have. The fact that "representative" is probably the most-spoken word to voice agents is a pretty good indicator of where the industry has been. The question is whether the next generation of deployments can change that.
Corn
I think the accessibility and developing-world use cases give me the most hope. Those are places where voice isn't just a convenience layer on top of a screen — it's the only interface that makes sense. And when you design for the hardest cases, you often end up with something better for everyone.
Herman
That's a good place to land. The technology is maturing. It's not going to replace keyboards or screens, but it's finding its natural niches, and some of those niches matter a lot more than ordering pizza by voice.
Corn
Thanks to our producer Hilbert Flumingtop. This has been My Weird Prompts. You can find every episode at myweirdprompts.
Herman
If you enjoyed this one, leave us a review wherever you listen. It helps more than you'd think.
Corn
See you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.