#2821: How to Build a Local Intercom with Zigbee and Snapcast

Three engineering problems in a trench coat. Make Zigbee sirens, Snapcast speakers, and push-to-talk audio actually work together.

Episode Details
Episode ID: MWP-2990
Duration: 34:21
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

This episode tackles a deceptively complex smart home problem: building a local intercom system from existing Zigbee sirens and Snapcast speakers. The Zigbee siren part is the most solved — using MQTT to publish melody commands directly to Zigbee2MQTT exposes tone selection for doorbell, help paging, and red alerts, bypassing limited frontend controls. The limitation is twelve predefined tones with no custom audio uploads, so you get dings and chimes but not actual speech.

The Snapcast audio routing is where things get sticky. Music Assistant locks the audio pipe as a single writer, so piping TTS or microphone audio into the same stream causes garbled audio. Three approaches emerge: using separate cloud-connected smart speakers for announcements (simple but sacrifices local control), configuring a second Snapcast stream for announcements with client-side volume ducking (fiddly with noticeable delay), or doing a hard stream switch via Snapcast's JSON-RPC API (clean audio but risks losing your place in music playback).

For microphone capture, Icecast introduces too much latency for intercom use — several seconds of buffer designed for internet radio, not real-time conversation. The recommended approach is a walkie-talkie-style system using Home Assistant's companion app: press a button, record a short clip, release, and the audio file plays through Snapcast to the target speaker. This sidesteps streaming complexity entirely while using existing infrastructure, with acceptable one-to-two-second latency from release to playback.


Transcript

Corn
Daniel sent us this one — he wants to build a home intercom system using gear he already has, Zigbee sirens and Snapcast speakers, and he's hitting the classic wall where the parts work in isolation but the moment you try to make them play nice together, everything gets weird. The core question is, how do you pipe push-to-talk audio or TTS announcements through Snapcast without Music Assistant locking the stream, and can you make the intercom supersede whatever's already playing so it actually interrupts?
Herman
This is a fantastic prompt because it's basically three separate engineering problems wearing a trench coat pretending to be one question. And the trench coat is on fire. You've got the siren side, which mostly works. You've got the Snapcast audio routing side, which mostly doesn't. And then you've got the microphone capture and streaming side, which is where things get genuinely interesting.
Corn
The trench coat is on fire. That's the Snapcast experience in six words.
Herman
It really is. And I want to be fair here — Snapcast itself is not necessarily the culprit. The architecture is sound. You've got a server that reads from a pipe or stream and distributes synchronized audio to multiple clients. The problem is that when you layer Home Assistant, Music Assistant, PulseAudio or PipeWire, and then whatever USB audio gadget is at the end of the chain, you've created a stack with about seven potential failure points between you and the speaker. Every single one of them can fall over and leave you debugging at eleven PM.
Corn
Which is exactly when you'd be debugging an intercom system, because that's when you'd actually need it.
Herman
So let's pull the trench coat off and tackle these one at a time. The Zigbee siren part is actually the most solved. The prompt mentions using MQTT to send specific melody commands directly to the sirens, bypassing the limited ZHA or Zigbee2MQTT frontend controls. That's exactly right. If you're using Zigbee2MQTT, these devices expose a melody select attribute, and you can publish to the MQTT topic directly with a payload that selects tone four for doorbell, tone seven for help button, tone twelve for red alert. That works reliably because it's a single command to a single device over a protocol designed for exactly that.
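For concreteness, here is what that direct publish can look like. A minimal sketch, assuming a Mosquitto broker at homeassistant.local, a siren with the Zigbee2MQTT friendly name siren_living_room, and a device that exposes melody, volume, duration, and alarm attributes; attribute names and tone numbering vary by manufacturer, so check the device page in Zigbee2MQTT first.

```python
# Publish a melody command straight to the siren's Zigbee2MQTT /set topic.
import json
import paho.mqtt.publish as publish

MELODIES = {"doorbell": 4, "help": 7, "red_alert": 12}  # per the mapping above

def page(siren: str, tone: str, broker: str = "homeassistant.local") -> None:
    payload = {
        "melody": MELODIES[tone],  # which of the firmware tones to play
        "volume": "medium",        # low/medium/high on most of these sirens
        "duration": 5,             # seconds; the device stops on its own
        "alarm": True,             # start playback immediately
    }
    publish.single(f"zigbee2mqtt/{siren}/set", json.dumps(payload), hostname=broker)

page("siren_living_room", "doorbell")
```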
Corn
It's already running in the apartment. Six sirens, one per room, each doing multiple jobs depending on which melody gets triggered. That's elegant.
Herman
The limitation, of course, is that you've got twelve predefined tones and you can't upload custom audio. So you're stuck with whatever the manufacturer burned into firmware. That bell-like tone he mentioned is fine for a help paging alert, but it's not speech. You can't say, Ezra needs a diaper change in the living room. You just get ding-dong and then the recipient has to interpret context.
Corn
Which brings us to the speakers and the actual intercom part. This is where Snapcast enters, and where things get sticky.
Herman
So the vision is, you press a button on your phone, you speak, and your voice comes out of a specific speaker or group of speakers in the house. Maybe it interrupts whatever Music Assistant is currently playing. That sounds simple. It is not simple.
Corn
Walk me through why.
Herman
Snapcast works by having a server that reads from a named pipe or an audio stream. Music Assistant connects to that pipe and sends its audio in. The Snapcast server then distributes that stream to all connected clients. The fundamental problem is that a pipe, in the Unix sense, does nothing to coordinate multiple writers. If Music Assistant has the pipe open and is writing to it, and a TTS engine or a microphone stream writes to the same pipe at the same time, the byte streams interleave, and you get garbled audio at best and nothing at worst. It's not designed for multiplexing.
Corn
The pipe is a single-occupancy vehicle and Music Assistant is already in the driver's seat.
Herman
And Music Assistant is territorial. It takes ownership of the stream. The prompt mentions exactly this — it locks the stream down. So the question becomes, how do you get a second audio source to take priority without breaking the whole setup?
Corn
Ideally, you want it to duck or pause whatever's playing, deliver the announcement, then resume. Like how a navigation app handles a phone call.
Herman
That's the gold standard, yes. There are a few approaches here, and they range from simple and limited to complex and flexible. I'll lay them out.
Herman
Option one, the simplest, is to not use Snapcast for announcements at all. Keep the sirens for alert-style paging and use a completely separate audio path for TTS. If you've got smart speakers — like a Google Nest Mini or an Echo Dot — you can send TTS to those directly through Home Assistant's notify service. Those devices handle their own audio mixing and ducking natively. You say, notify dot living room speaker, message is dinner is ready, and it just works. But that assumes you have those devices, and it doesn't give you push-to-talk intercom functionality.
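For reference, that cloud-speaker path really is a single service call. A rough sketch over Home Assistant's REST API, assuming the google_translate TTS integration is set up and a media player entity named media_player.living_room_speaker exists; the host, token, and entity id are all placeholders.

```python
# One TTS service call through Home Assistant's REST API.
import requests

HA = "http://homeassistant.local:8123"
TOKEN = "YOUR_LONG_LIVED_ACCESS_TOKEN"  # create one under your HA user profile

def announce(message: str, player: str = "media_player.living_room_speaker") -> None:
    requests.post(
        f"{HA}/api/services/tts/google_translate_say",
        headers={"Authorization": f"Bearer {TOKEN}"},
        json={"entity_id": player, "message": message},
        timeout=10,
    ).raise_for_status()

announce("Dinner is ready")
```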
Corn
It also means you're routing through a cloud service, which for someone who's clearly invested in local control with Zigbee and Snapcast, probably isn't the preferred architecture.
Herman
Option two is the Snapcast meta-stream approach. Snapcast actually supports multiple streams now. You can configure a second stream on the server — call it announcements or intercom — and have clients subscribe to it alongside the main music stream. Each client can be set to play from multiple streams, and Snapcast will mix them. The trick is that you then need a way to send audio to that second stream only when there's an announcement, and to have the client prioritize it.
Corn
Does Snapcast handle the priority mixing, or is that on you?
Herman
That's on you. Snapcast will mix both streams at equal volume by default. You'd need to script something on the client side — or use Snapcast's volume control per stream — to duck the music stream when the announcement stream is active. It's doable but fiddly. You'd basically have a Home Assistant automation that, when an announcement triggers, sets the music stream volume on the target client to twenty percent, plays the TTS or mic audio on the announcement stream, waits for it to finish, then restores the music volume.
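The duck-and-restore sequence can be driven straight from Snapcast's JSON-RPC control port, which listens on TCP 1705 by default. A sketch, assuming a client id of kitchen-pi and some caller-supplied routine that plays the announcement; a real version would read the current volume from Server.GetStatus and restore that instead of a hardcoded value.

```python
# Duck-and-restore over Snapcast's JSON-RPC control port (TCP 1705 by default).
import json
import socket

def snapcast_rpc(method: str, params: dict, host: str = "homeassistant.local") -> dict:
    """Send one JSON-RPC request over Snapcast's TCP control socket."""
    req = {"id": 1, "jsonrpc": "2.0", "method": method, "params": params}
    with socket.create_connection((host, 1705), timeout=5) as sock:
        sock.sendall((json.dumps(req) + "\r\n").encode())
        return json.loads(sock.makefile().readline())

def duck_and_announce(client_id: str, play_announcement) -> None:
    # Drop the client to 20 percent while the announcement plays.
    snapcast_rpc("Client.SetVolume",
                 {"id": client_id, "volume": {"percent": 20, "muted": False}})
    play_announcement()  # caller-supplied; blocks until the clip finishes
    # A real version would restore the volume read from Server.GetStatus.
    snapcast_rpc("Client.SetVolume",
                 {"id": client_id, "volume": {"percent": 100, "muted": False}})

duck_and_announce("kitchen-pi", lambda: None)
```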
Corn
That's a lot of moving pieces for what should feel like a single button press.
Herman
And the delay between the trigger and the volume change can be noticeable. You might get the first half-second of your announcement stepped on by full-volume music.
Corn
What's option three?
Herman
Option three is the one I think is actually the right call for this setup, and it's what I'd build if I were doing this from scratch. You bypass the Music Assistant pipe entirely for announcements and use Snapcast's API directly. Snapcast has a JSON-RPC API that lets you control which stream a client is listening to, set volumes, mute, and so on. You can write a script — or use a Home Assistant integration like the Snapcast add-on — to switch a client from the music stream to an announcement stream on the fly.
Corn
You're not mixing. You're switching.
Herman
You're doing a hard cutover. The flow would be, announcement triggers, the script tells the target Snapcast client to disconnect from the music stream and connect to the announcement stream, the TTS or microphone audio plays, and when it's done, the script switches the client back to the music stream. No mixing, no ducking, no garbled audio. The downside is that the music doesn't resume from where it left off unless Music Assistant is buffering independently, which it may or may not do depending on the source.
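The cutover itself is one RPC call. Note that in Snapcast's API the active stream belongs to the group, not the individual client, so the method is Group.SetStream; the group and stream ids below are placeholders you would look up with Server.GetStatus. This reuses snapcast_rpc() from the previous sketch.

```python
# Hard cutover, reusing snapcast_rpc() from the ducking sketch above.
def switch_stream(group_id: str, stream_id: str) -> None:
    snapcast_rpc("Group.SetStream", {"id": group_id, "stream_id": stream_id})

switch_stream("GROUP_ID", "announce")  # cut over to the announcement stream
switch_stream("GROUP_ID", "music")     # and back when the message is done
```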
Corn
That's a meaningful downside. If you're listening to a podcast and someone pages you, you come back and you've lost your place.
Herman
It depends on the source. If Music Assistant is playing a local file or a radio stream, it'll keep playing in the background and you just missed thirty seconds. If it's something stateful, yeah, you might lose your place. Trade-offs everywhere.
Corn
What about the microphone side? The prompt mentions push-to-talk and Icecast. Is Icecast actually needed here?
Herman
This is where I want to be careful because I've gone down this exact rabbit hole. Icecast is a streaming server — it takes an audio source and makes it available as an HTTP stream that clients can connect to. The thinking is, you capture audio from your phone's microphone, send it to an Icecast server running somewhere on your network, and then point a Snapcast stream at that Icecast URL. I've seen it work. People have built this exact setup.
Corn
I hear a but coming.
Herman
The but is latency. Icecast introduces a buffer, typically several seconds, because it's designed for internet radio where consistency matters more than real-time delivery. For an intercom, you want sub-second latency. If you say, hey can you come here, and it arrives four seconds later, the moment has passed. You've already walked to the other room yourself.
Corn
The intercom equivalent of a text message arriving after you've already had the conversation in person.
Herman
So Icecast is probably the wrong tool for this. What you want instead is something much closer to the metal. If you're using a phone as the microphone, you can use a web-based approach: Home Assistant can serve a custom dashboard with a button that, when pressed, uses the browser's getUserMedia API to capture from the phone's mic and stream it via WebSocket directly to Home Assistant, which then pipes it to Snapcast.
Corn
That sounds like custom development territory.
Herman
It is, but there are existing projects. There's a Home Assistant custom component called Assist Microphone that does push-to-talk through the browser, though it's primarily designed for voice assistant commands, not intercom streaming. You could adapt it. Alternatively, if you want a dedicated hardware intercom station, an ESP32 with an I2S microphone is shockingly capable. You can program it to connect to your Wi-Fi, capture audio when a button is pressed, and stream raw audio over UDP or WebSocket to a server that feeds it into Snapcast.
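The server side of that ESP32 idea can be surprisingly small. A sketch, assuming the push-to-talk stream is configured as a Snapcast pipe source at /tmp/ptt_fifo with a 16000:16:1 sample format and the ESP32 firmware sends raw PCM datagrams already in that format; the port number is arbitrary.

```python
# Receive raw PCM over UDP and feed it into a Snapcast pipe source.
import socket

FIFO = "/tmp/ptt_fifo"  # source = pipe:///tmp/ptt_fifo?name=ptt&sampleformat=16000:16:1
PORT = 5005             # arbitrary; must match the port the ESP32 firmware targets

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", PORT))

# Opening the FIFO for writing blocks until snapserver has the read end open.
with open(FIFO, "wb", buffering=0) as pipe:
    while True:
        packet, _addr = sock.recvfrom(4096)  # one chunk of raw PCM per datagram
        pipe.write(packet)                   # snapserver picks it up immediately
```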
Corn
At that point you've basically built a baby monitor with extra steps.
Herman
You've built a baby monitor that can also announce dinner. But here's the thing — there's an even simpler approach that sidesteps the whole microphone streaming problem entirely if you're willing to accept a slight UX compromise. Home Assistant's companion app on Android and iOS can already record audio and send it as a file. You could build an intercom that works like a walkie-talkie app. Press the button, record a short clip, release, and the clip gets sent to the target speaker via Snapcast as a one-shot audio file. It's not real-time full-duplex, but it's push-to-talk in the truest sense, and it avoids all the streaming complexity.
Corn
That's actually quite elegant. Record, send, play. No persistent streams to manage, no Icecast, no WebSocket audio wrangling. It's the WhatsApp voice note of intercoms.
Herman
You can build it entirely within Home Assistant using existing integrations. The flow would be, you create a script that's exposed as a button on your dashboard. Pressing it triggers audio recording via the companion app's media source. When recording stops, the file is saved locally. An automation picks it up and sends it to the target speaker using the Snapcast or media player play service. Total latency from release to playback maybe one to two seconds, depending on network and processing.
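One way to wire the voice-note flow, sketched here as a folder watcher rather than a Home Assistant automation: the companion app saves recordings into a shared directory, and each new clip gets decoded by ffmpeg straight into the announcement stream's pipe. The inbox path, the .m4a extension, and the pipe's 48000:16:2 sample format are all assumptions.

```python
# Watch a shared folder for new voice notes and decode each one into the
# announcement stream's pipe with ffmpeg.
import subprocess
import time
from pathlib import Path

INBOX = Path("/share/intercom")  # where the companion-app automation saves clips
FIFO = "/tmp/announce_fifo"      # the Snapcast announcement pipe source

seen = set(INBOX.glob("*"))
while True:
    for clip in sorted(INBOX.glob("*.m4a")):
        if clip in seen:
            continue
        seen.add(clip)
        # Decode whatever the phone recorded into the raw PCM the pipe expects.
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-i", str(clip),
             "-f", "s16le", "-ar", "48000", "-ac", "2", "-y", FIFO],
            check=True,
        )
    time.sleep(0.5)
```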
Corn
That's well within acceptable intercom range. You press, you speak, you release, a moment later your voice is in the other room.
Herman
And the beauty is it uses the same Snapcast infrastructure you already have. No second audio pipeline. No custom firmware.
Corn
The architecture would be, sirens for alert paging, which is already working and reliable, and then a voice note style intercom over the speakers for actual speech. Two tiers of communication, each using the right tool for the job.
Herman
That's the core of it. But I want to loop back to the stream-switching problem, because even with the voice note approach, you still need to handle the case where Music Assistant is playing and you want the intercom message to actually be heard.
Corn
Right, the interruption problem doesn't go away just because the audio is a file instead of a live stream.
Herman
And this is where I think the most practical solution is the Snapcast stream-switching script I mentioned. You set up two streams on the Snapcast server. Stream one is the music stream, fed by Music Assistant. Stream two is the announcement stream, which is normally silent. When an intercom message comes in, a Home Assistant automation fires that does three things. First, it tells the target Snapcast client to switch to the announcement stream. Second, it plays the audio file on that stream. Third, when playback finishes, it switches the client back to the music stream.
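Put together, the three-step automation might look like this. A sketch, with the group id, the music and announce stream names, and the announcement pipe's 48000:16:2 format carried over as assumptions from the earlier sketches.

```python
# The three-step interruption end to end: switch the target group to the
# announcement stream, play the clip into the announcement pipe, switch back.
import json
import socket
import subprocess
import time

def snapcast_rpc(method: str, params: dict, host: str = "homeassistant.local") -> dict:
    """One JSON-RPC request over Snapcast's TCP control port (default 1705)."""
    req = {"id": 1, "jsonrpc": "2.0", "method": method, "params": params}
    with socket.create_connection((host, 1705), timeout=5) as sock:
        sock.sendall((json.dumps(req) + "\r\n").encode())
        return json.loads(sock.makefile().readline())

def interrupt_with(clip: str, group_id: str) -> None:
    snapcast_rpc("Group.SetStream", {"id": group_id, "stream_id": "announce"})
    try:
        # ffmpeg blocks until the whole clip has been written into the pipe.
        subprocess.run(
            ["ffmpeg", "-loglevel", "error", "-i", clip,
             "-f", "s16le", "-ar", "48000", "-ac", "2", "-y",
             "/tmp/announce_fifo"],
            check=True,
        )
        time.sleep(1.5)  # let the client-side buffer drain before cutting back
    finally:
        snapcast_rpc("Group.SetStream", {"id": group_id, "stream_id": "music"})
```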
Corn
The switching latency?
Herman
In my experience, Snapcast client stream switching is near-instant. Under a hundred milliseconds. The user hears a brief silence, then the message. It's not seamless like ducking, but it's clean.
Corn
Music Assistant keeps doing its thing on the music stream, blissfully unaware that nobody's listening for a few seconds.
Herman
It's a workaround, not an elegant solution, but it's reliable. And reliability is what matters when you're trying to call for help with a baby in your arms.
Corn
Let's talk about the TTS side for a moment. The prompt mentions sending TTS messages like dinner is ready or someone's at the door. Home Assistant's built-in TTS is actually quite good now. Piper runs locally, sounds natural enough, and supports multiple voices. You could have different voices for different types of announcements.
Herman
Piper is a genuine breakthrough for local TTS. It runs on a Raspberry Pi, the voices are surprisingly good — the medium-quality models are about fifty megabytes and generate speech faster than real-time on a Pi four. You can have a calm voice for general announcements and a more urgent one for doorbell alerts. And because it's local, there's no cloud dependency and no latency spike when your internet is flaky.
Corn
Which, if you're using this for anything time-sensitive like a doorbell notification, matters a lot. You don't want the doorbell to ring, then fifteen seconds later a TTS announcement plays because it had to round-trip to some cloud service.
Herman
So the TTS pipeline would be, trigger fires, Home Assistant runs the TTS service with your chosen message, it generates an audio file locally, and that file gets pushed to the Snapcast announcement stream. Total time from trigger to audio starting, maybe half a second.
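The TTS leg is then just Piper in front of the same interruption flow. A sketch, assuming a downloaded Piper voice model at /data/voices/en_US-lessac-medium.onnx and reusing interrupt_with() from the previous sketch.

```python
# Generate speech locally with Piper, then hand it to the interruption flow.
import subprocess
import tempfile

def say(text: str, group_id: str) -> None:
    wav = tempfile.NamedTemporaryFile(suffix=".wav", delete=False).name
    subprocess.run(
        ["piper", "--model", "/data/voices/en_US-lessac-medium.onnx",
         "--output_file", wav],
        input=text.encode(),  # Piper reads the text to speak from stdin
        check=True,
    )
    interrupt_with(wav, group_id)  # switch, play, switch back (defined above)

say("Someone is at the door", "GROUP_ID")
```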
Corn
That's fast enough that someone at the door wouldn't have already left.
Herman
Or broken in, depending on the neighborhood.
Corn
Let's hope not. So to pull all this together into a coherent system design, what are we actually recommending?
Herman
I'd structure it in three layers. Layer one is the siren network, which is already built and working. Keep that for high-priority alerts — red alerts, fire, maybe the help button if you want something that cuts through regardless of what the speakers are doing. The sirens are independent of the audio stack, so they're your failsafe.
Herman
Layer two is the voice intercom. I'd build this as a push-to-talk voice note system using the Home Assistant companion app. A dashboard with buttons for each room or zone. Press the living room button, record your message, release, and it routes to the living room speaker. Behind the scenes, the Snapcast stream-switching automation handles the interruption. This gives you spoken communication without real-time streaming complexity.
Herman
Layer three is automated TTS announcements. Doorbell triggers, washer finished, dinner timer, whatever. These use the same Snapcast announcement stream but are generated by Piper TTS rather than recorded by a human. Same interruption behavior. Same routing logic.
Corn
Layers two and three share infrastructure. The only difference is whether the audio comes from a microphone recording or from a TTS engine.
Herman
And both benefit from the stream-switching script, so you're not maintaining parallel interruption logic.
Corn
Now, the prompt also asks about assigning priority so that Music Assistant is lower priority and the paging system supersedes it. Is there a way to implement actual priority levels, or is it all-or-nothing?
Herman
With the stream-switching approach, it's effectively all-or-nothing for the target client. But you can add logic in Home Assistant to decide whether to interrupt. For example, you might decide that a dinner announcement doesn't interrupt if music is playing above a certain volume, but a doorbell announcement always interrupts. You'd write conditions into the automation. If music is playing and announcement priority is low, skip. If announcement priority is high, interrupt regardless.
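Reduced to code, that decision is a couple of conditionals. In practice this would live in the automation's conditions rather than a script; the priority labels are assumptions.

```python
# Priority as a pair of conditionals; in Home Assistant this lives in the
# automation's conditions, not in Python.
def should_interrupt(priority: str, music_playing: bool) -> bool:
    if priority == "high":  # doorbell, help button: always cuts through
        return True
    if priority == "low" and music_playing:
        return False        # low-priority messages wait for quiet
    return True
```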
Corn
The priority system lives in Home Assistant's automation logic, not in Snapcast itself.
Herman
Snapcast doesn't have a native concept of stream priority. It just plays whatever streams the client is subscribed to. The intelligence has to live one layer up.
Corn
Which is actually very Home Assistant. The platform's whole philosophy is that the automation engine is the brain, and everything else is just peripherals.
Herman
That's why this approach works. Home Assistant is the conductor. Snapcast is just the orchestra. The conductor decides who plays when.
Corn
Let's talk about what breaks. You mentioned earlier that the Snapcast plus Music Assistant stack has multiple failure points. If someone builds this intercom system, what's going to fail first?
Herman
The USB audio chain. If you're using a Raspberry Pi with a USB speaker or a USB sound card, the combination of the Linux kernel's USB audio driver, PulseAudio or PipeWire, and Snapcast's ALSA integration creates a fragile stack. USB audio devices can disappear from the bus and reappear with a different device ID. PulseAudio can decide to suspend the sink due to inactivity and then not wake up cleanly. PipeWire is better about this but introduces its own quirks.
Corn
When it fails, does it fail gracefully or catastrophically?
Herman
Catastrophically, in my experience. Snapcast will show the client as connected, but no audio comes out. Or the audio will be a stuttering mess. Or it'll work fine for three days and then spontaneously stop. The fix is usually to restart the Snapcast client service, which you can automate in Home Assistant with a watchdog, but that's a band-aid.
Corn
Is there a more robust hardware choice?
Herman
If you're building this from scratch, I'd avoid USB audio entirely. Use a HAT DAC on the Raspberry Pi — something like the HiFiBerry or the IQaudio DAC. These connect via the GPIO header and appear as an I2S device, not USB. They don't disappear. They don't change IDs. They're always there. The audio quality is better too.
Corn
For existing setups where someone's already got USB speakers wired in?
Herman
Then you mitigate. Assign a fixed device name using a udev rule based on the USB device's serial number, so it always gets the same ALSA device ID regardless of what order things enumerate. Disable PulseAudio's suspend-on-idle. And set up a Home Assistant automation that periodically checks if the Snapcast client is still producing audio and restarts it if not.
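The watchdog half of that mitigation might look like the following sketch, assuming the udev rule pinned the sound card's ALSA id to snapcast_out and the client runs as a systemd service named snapclient; both names are placeholders.

```python
# Restart the Snapcast client if the pinned ALSA device or the service is gone.
import subprocess
import time

def alsa_device_present(name: str = "snapcast_out") -> bool:
    # `aplay -L` lists ALSA PCM names; the udev-pinned card id should appear.
    out = subprocess.run(["aplay", "-L"], capture_output=True, text=True)
    return name in out.stdout

def snapclient_running() -> bool:
    return subprocess.run(
        ["systemctl", "is-active", "--quiet", "snapclient"]
    ).returncode == 0

while True:
    if not (alsa_device_present() and snapclient_running()):
        subprocess.run(["systemctl", "restart", "snapclient"])
    time.sleep(60)
```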
Corn
That's the kind of thing you set up once, forget about for six months, and then thank your past self for when it silently saves you.
Herman
The other failure point worth mentioning is the network. Snapcast is sensitive to Wi-Fi jitter. If you're running Snapcast clients over Wi-Fi, especially on a crowded two point four gigahertz band, you'll get dropouts. For an intercom system, a dropout means a garbled message or a missed announcement.
Corn
Wire everything that can be wired.
Herman
At minimum, wire the Snapcast server. If the clients are on Wi-Fi, use five gigahertz if possible, and don't put them in the far corner of the house behind three walls and a refrigerator.
Corn
The refrigerator is the unsung villain of home wireless.
Herman
It really is. A full refrigerator is basically a Faraday cage full of soup.
Corn
Alright, let's zoom out for a moment. The prompt mentions that this might be taking things too far, but that it would be cool anyway. Is this actually useful, or are we designing a Rube Goldberg machine for a problem that doesn't really exist?
Herman
That's a fair question. I think the siren-based paging is useful. It's simple, it's reliable, and it solves a real problem — you're in one room, you need someone in another room, you press a button and they hear a chime. That's a solved problem and it works.
Corn
The voice intercom part?
Herman
The voice intercom is more of a quality-of-life upgrade. The chime tells someone you need them. The voice message tells them why. That's useful when the why matters — like, bring the diaper bag, or can you grab the package from the door. But is it essential? It's a nice-to-have.
Corn
The kind of thing that, once you have it, you use it constantly, but you survived fine without it.
Herman
It's the heated steering wheel of home automation. Nobody needs it, but everyone who has it loves it.
Corn
The TTS announcements?
Herman
Those are the most broadly useful part of this, I think. Automated announcements for doorbells, timers, washer cycles — those reduce cognitive load. You don't have to check if the washer is done. The house tells you. That's valuable in a way that a manual push-to-talk intercom isn't.
Corn
The priority order for building this would be, sirens first, already done. TTS announcements second, because they're automated and don't require human initiation. Voice intercom third, because it's cool but adds complexity for a less frequent use case.
Herman
That's exactly how I'd sequence it. And I'd build and stabilize each layer before adding the next. The mistake people make with these projects is trying to do everything at once and then debugging six interacting problems simultaneously.
Corn
Like adopting a feral cat.
Herman
I'm not sure I follow the analogy, but I agree with the sentiment.
Corn
Let's talk about the microphone hardware question a bit more. The prompt mentions using a phone or desktop as the microphone. Are there dedicated hardware options that make more sense for a permanent intercom station?
Herman
If you want a fixed intercom station — like a panel on the wall in the kitchen — there are a few approaches. The ESP32 route I mentioned is the most flexible. You can get an ESP32 dev board, an I2S MEMS microphone, a button, and a small speaker for about fifteen dollars in parts. Program it to connect to Wi-Fi, capture audio when the button is pressed, and stream it to a small HTTP server or MQTT topic that Home Assistant picks up. There are open source firmware projects that do most of this already.
Corn
Fifteen dollars is absurdly cheap for what amounts to a custom intercom endpoint.
Herman
It's the miracle of the ESP32. Dual-core processor, Wi-Fi, Bluetooth, I2S support, all for like four dollars. The microphone is another three dollars. The rest is a button, a case, and some wiring. The total bill of materials is less than what a single commercial smart speaker costs.
Corn
You're not dependent on a cloud service or a manufacturer's continued support.
Herman
The firmware is open source. If the manufacturer of a commercial intercom system discontinues the product or shuts down their cloud, you've got e-waste. With an ESP32 running ESPHome or a custom firmware, the device works as long as your Wi-Fi works.
Corn
There's a broader point here about home automation philosophy. The prompt mentions mixing and matching what you have, and that's really the Home Assistant value proposition. You're not locked into an ecosystem. You can use Zigbee sirens from one manufacturer, ESP32 microphones you built yourself, a Raspberry Pi running Snapcast, and an old Android tablet as a control panel, and they all work together because Home Assistant is the universal translator.
Herman
That's why people put up with the complexity. The alternative is a commercial system that's simpler to set up but locks you into their hardware, their cloud, their pricing, and their timeline for feature development. With the Home Assistant approach, you trade setup time for long-term control.
Corn
The setup time is the tax you pay for sovereignty.
Herman
That's a very Corn way to put it. You're paying upfront in configuration hours to avoid paying indefinitely in subscription fees and platform risk.
Corn
To wrap this into something actionable, if someone listening wants to build this, what's the shopping list and the build order?
Herman
If you don't already have them, a set of Zigbee sirens — the ones Daniel has are probably the Heiman or Neo Coolcam models, widely available, about twenty to thirty dollars each. A Raspberry Pi four or five running Home Assistant with the Snapcast add-on. Speakers for each room you want to cover — these can be anything from a USB speaker to a proper amplifier and passive speakers, depending on your budget and quality standards. And optionally, ESP32 boards with I2S microphones if you want dedicated intercom stations.
Herman
Step one, get the sirens working with Zigbee2MQTT and MQTT publish scripts for the different melodies. Step two, set up Snapcast with your speakers and get Music Assistant playing reliably. Step three, configure a second Snapcast stream for announcements and write the stream-switching automation in Home Assistant. Step four, set up Piper TTS and create automations for your key announcements — doorbell, timers, whatever. Step five, build the push-to-talk voice intercom using the companion app or dedicated ESP32 stations. Test each layer for at least a week before adding the next.
Corn
If Snapcast continues to be, as the prompt puts it, insanely buggy?
Herman
Then you have a few options. You can switch to a different multi-room audio solution — there's Logitech Media Server with the Squeezelite clients, which is older but rock solid. There's Roon if you want to spend money. There's AirPlay if you're in the Apple ecosystem. Or you can simplify and just use a single speaker connected directly to Home Assistant's media player, no multi-room at all, and rely on the sirens for room-specific paging.
Corn
The sirens plus a single smart speaker actually covers most of the use cases. Sirens for paging, speaker for TTS announcements. You lose per-room voice intercom, but you keep the most valuable parts.
Herman
It's dramatically simpler. Sometimes the best system design is the one that deletes the most complexity.
Corn
Which is a good principle for any home automation project. Before you add something, ask whether you can solve the problem with what's already there.
Herman
Or whether the problem actually needs solving. Half of home automation is resisting the urge to automate things that are fine as they are.
Corn
The other half is debugging USB audio at midnight.
Herman
That's the half we're in right now.
Herman
Now, Hilbert's daily fun fact.

Hilbert
During the Tang dynasty, officials discovered that a new paper-based tax receipt system meant to reduce corruption instead created an elaborate black market where clerks sold pre-stamped receipts to merchants, because the stamps were harder to forge than the handwritten records they replaced, making the fraudulent receipts more valuable than the genuine ones.
Corn
The anti-corruption measure spawned a premium corruption product.
Herman
That's the most Tang dynasty thing I've ever heard.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you enjoyed this, leave us a review wherever you get your podcasts — it helps. We'll be back with another prompt soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.