Picture a forty-five minute all-hands recording. Six people, one conference room mic, the HVAC system doing its thing in the background. Someone tries to run it through a diarization pipeline and what comes back is four speakers instead of six, two of them mislabeled, and a fifteen-minute stretch where the system just... collapses into one speaker because two people were talking over each other. That recording is now useless for anything downstream. No searchable transcript, no per-speaker analytics, nothing.
That scenario is not exotic. That is Tuesday. The demand for accurate speaker identification has exploded because everything downstream depends on it. Meeting summarization, call center analytics, courtroom transcription, podcast indexing, clinical documentation. The moment you need to know not just what was said but who said it, diarization is the load-bearing wall.
The cost of getting it wrong is not just an inconvenience. In a call center context, for example, if your diarization is misattributing agent speech to the customer, your compliance review is now flagging the wrong person. You could be pulling a perfectly good agent for retraining based on something they never said. The downstream consequences scale with how seriously the organization is relying on that output.
Right, and the failure is silent. The pipeline does not throw an error. It hands you a transcript that looks plausible, and you have to already know what to look for to catch that it is wrong.
Daniel sent us this one, and it goes deep. He's asking how PyAnnote actually works under the hood, the full pipeline from segmentation through embedding extraction through clustering. He wants a real comparison with the other tools in this space, NeMo, WhisperX, Kaldi, end-to-end neural approaches like EEND. And then the harder question: if you have a saved voice library, embeddings from prior identification runs, can you build a system that diarizes unknown audio and then maps those detected speaker clusters onto known identities via nearest-neighbor lookup? What does that pipeline look like, enrollment, cosine similarity thresholds, handling speakers you have never seen before, domain mismatch when the microphone or codec changes? How robust is this in practice and where does it fall apart?
That last part is the question I have been wanting to dig into properly for a while.
By the way, today's episode is powered by Claude Sonnet four point six, doing the heavy lifting on the script side.
Alright, let's get into it.
The basic framing first, because it matters for everything that follows. Diarization is not transcription, it is not identification, it is purely the task of partitioning an audio stream by speaker. Who spoke, when. That is the whole job. The reason it is hard is that speech is not clean. Speakers interrupt each other, rooms have acoustics, recording conditions vary wildly, and two people's voices can sit in overlapping frequency ranges.
The thing that makes it genuinely difficult at a signal level is that you are trying to solve two problems simultaneously. You need to detect where speech is happening at all, that is your voice activity detection layer, and then within that speech you need to decide whether consecutive segments belong to the same speaker or a different one. Those two error sources compound. Miss a boundary in the first step and your clustering in the second step inherits that mistake.
It is a bit like trying to sort mail while someone keeps sliding new envelopes into the pile. Every misread at the front of the process shuffles everything that comes after it.
The metric that captures this compounding is diarization error rate, DER. It is the standard benchmark and it penalizes three things simultaneously: missed speech, false alarm speech, and speaker confusion. A system can have low missed speech and still have a terrible DER if it is constantly confusing speakers. Those failure modes are independent and they require different fixes.
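A toy sketch of that metric, not how real scorers like pyannote.metrics implement it, but enough to show the three error terms. It assumes frame labels are already aligned and uses `None` for non-speech:

```python
# Toy frame-level DER: compare reference and hypothesis speaker labels
# per frame. Real scorers work on timelines and find an optimal speaker
# mapping first; labels here are assumed pre-aligned for illustration.

def der(reference, hypothesis):
    missed = false_alarm = confusion = 0
    total_speech = sum(1 for r in reference if r is not None)
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None and hyp is None:
            missed += 1          # speech the system failed to detect
        elif ref is None and hyp is not None:
            false_alarm += 1     # non-speech labeled as speech
        elif ref is not None and ref != hyp:
            confusion += 1       # speech attributed to the wrong speaker
    return (missed + false_alarm + confusion) / total_speech

ref = ["A", "A", "A", None, "B", "B", "B", "B"]
hyp = ["A", "A", None, None, "B", "A", "B", "B"]
print(der(ref, hyp))  # (1 missed + 0 false alarm + 1 confused) / 7 speech frames
```

A system with zero missed speech can still score badly here purely through the confusion term, which is the point made above.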
Which is where PyAnnote comes in. It was developed by Hervé Bredin and colleagues, with the pyannote.audio toolkit paper landing at ICASSP in twenty twenty, and it became the go-to open-source framework partly because it modularizes the pipeline cleanly. You get distinct components for segmentation, for embedding extraction, for clustering, and you can swap pieces in and out.
That modularity is underrated. A lot of earlier systems were monolithic, you took the whole thing or nothing. PyAnnote lets you, say, keep its segmentation model but plug in a different embedding extractor if your domain calls for it. SpeechBrain plays nicely with it for exactly that reason.
The pipeline has distinct stages, each with its own failure surface. That is the architecture we are about to pull apart, starting with the first stage: segmentation.
PyAnnote runs a neural segmentation model that does two things at once: it finds speech versus non-speech boundaries, and it detects speaker change points within the speech regions. The model is an LSTM-based or, more recently, a transformer-based architecture trained to output a frame-level probability for each speaker being active. So at roughly every twenty-millisecond frame you get a score for whether speaker A is talking, speaker B is talking, both, or neither.
That overlap detection is baked in from the start, not bolted on after.
Right, it is not a separate pass. Which matters because a lot of older systems treated overlap as noise and just threw those frames away. PyAnnote at least tries to model it explicitly.
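A minimal sketch of what turning those frame-level activity scores into segments looks like, with invented probabilities and a hypothetical 0.5 threshold; pyannote's real post-processing also applies minimum-duration smoothing. Note that frames where two speakers both clear the threshold yield overlapping segments instead of being thrown away:

```python
# Threshold per-frame, per-speaker activity scores into
# (start_frame, end_frame, speaker) runs; multiply by the frame step
# (~20 ms) to get times. Overlap survives as overlapping segments.

def frames_to_segments(scores, threshold=0.5):
    segments = []
    for spk, track in scores.items():
        start = None
        for i, p in enumerate(track + [0.0]):  # sentinel closes open runs
            active = p >= threshold
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments.append((start, i, spk))
                start = None
    return sorted(segments)

scores = {
    "spk0": [0.9, 0.9, 0.8, 0.2, 0.1, 0.1],
    "spk1": [0.1, 0.2, 0.7, 0.9, 0.9, 0.3],
}
print(frames_to_segments(scores))  # [(0, 3, 'spk0'), (2, 5, 'spk1')]
```

The two speakers overlap on frame two, and both segments keep it.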
How much of real conversational audio is actually overlapping speech? Because intuitively it feels like a lot, but I am curious what the research says.
More than you would expect in natural conversation. Studies on spontaneous speech corpora put it somewhere between ten and fifteen percent of total speech duration in multi-party meetings. In a two-person interview it is lower, maybe five percent, but in a roundtable or a panel discussion it climbs fast. So if your system is discarding those frames entirely, you are throwing away a meaningful chunk of your data and introducing systematic gaps in your timeline.
Once you have those segments, what are you extracting from them?
This is where the embedding models come in. For each speech segment you run a speaker encoder that compresses the acoustic content of that segment into a fixed-length vector, typically two hundred fifty-six or five hundred twelve dimensions. The classic approach was x-vectors, which use a time-delay neural network trained on speaker classification. You take the segment, pool the frame-level features into a segment-level representation, and you get your vector. The problem with x-vectors in noisy conditions is that they are sensitive to the channel characteristics, the microphone, the room, the codec. They pick up a lot of noise in that embedding.
If the meeting room has bad acoustics, the x-vector is partly encoding the room.
You are not getting a clean representation of the speaker's voice. You are getting speaker plus environment, and those two things are entangled in the embedding. Which means if that same speaker shows up in a different room, the embeddings for the same person can look like two different people to the model.
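The pooling step mentioned above, collapsing frame-level features into one segment-level vector, is commonly done as mean-plus-standard-deviation "statistics pooling" in x-vector-style systems. A sketch with random stand-in features; a real x-vector also passes this through further dense layers:

```python
import numpy as np

# Statistics pooling: frame-level features (T frames x D dims) collapse
# into one fixed-length segment vector by concatenating the per-dimension
# mean and standard deviation across time.

def statistics_pooling(frames):
    mean = frames.mean(axis=0)
    std = frames.std(axis=0)
    return np.concatenate([mean, std])  # shape (2 * D,)

rng = np.random.default_rng(0)
segment = rng.normal(size=(150, 64))   # ~3 s of 20 ms frames, 64-dim
embedding = statistics_pooling(segment)
print(embedding.shape)  # (128,)
```

Whatever the frames encode, room included, ends up in that vector, which is exactly the entanglement problem being described.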
ECAPA-TDNN, which stands for Emphasized Channel Attention, Propagation and Aggregation in a Time Delay Neural Network, addresses that. There was a twenty twenty-three study showing ECAPA-TDNN outperforms x-vectors meaningfully in noisy environments, specifically because the channel attention mechanism lets the model weight which frequency bands are actually speaker-discriminative versus which are dominated by noise. You get a cleaner embedding.
WavLM also shows up in this space.
WavLM and similar self-supervised models are interesting because they were not trained on speaker classification directly. They learn general speech representations and then you fine-tune a speaker head on top. The advantage is that the base representations tend to be more robust across domains because they have seen so much varied audio during pretraining. The tradeoff is compute. WavLM-based embeddings are significantly heavier to extract than ECAPA-TDNN.
How much heavier are we talking? Is this a ten percent overhead or are we talking about an order of magnitude?
Closer to the order of magnitude end, depending on the model size and your hardware. On a GPU it is manageable, but if you are trying to run this on a CPU in a latency-sensitive pipeline, WavLM can become a real bottleneck. ECAPA-TDNN is the more practical default for most production setups. WavLM is the thing you reach for when domain robustness is the primary concern and you have the compute budget to support it.
After extraction, you have a pile of embeddings, one per segment.
Spectral clustering is what PyAnnote uses by default. You build an affinity matrix from pairwise cosine similarities between all your segment embeddings, then you run a spectral decomposition to find the natural groupings. It handles non-convex clusters better than simple agglomerative methods. The catch is that you need to specify or estimate the number of speakers, and spectral clustering is not cheap at scale.
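The speaker-count estimation side can be sketched with the common eigengap heuristic: build the cosine affinity matrix, form a normalized graph Laplacian, and look for the largest gap among its smallest eigenvalues. This is a hedged illustration with synthetic, well-separated embeddings, not pyannote's actual implementation:

```python
import numpy as np

# Eigengap heuristic for estimating the number of speakers from
# segment embeddings via the spectrum of the normalized Laplacian.

def estimate_num_speakers(X, max_speakers=8):
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    A = np.clip(X @ X.T, 0.0, 1.0)                    # cosine affinity
    d = A.sum(axis=1)
    L = np.eye(len(A)) - A / np.sqrt(np.outer(d, d))  # normalized Laplacian
    eigvals = np.sort(np.linalg.eigvalsh(L))[:max_speakers]
    gaps = np.diff(eigvals)
    return int(np.argmax(gaps)) + 1

rng = np.random.default_rng(1)
centers = np.eye(3, 16) * 5.0    # three orthogonal "voices"
X = np.vstack([c + rng.normal(scale=0.1, size=(20, 16)) for c in centers])
print(estimate_num_speakers(X))  # expect 3
```

On real embeddings the gap is rarely this crisp, which is exactly why the speaker-count estimate is a known weak point.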
NeMo takes a different path here.
NeMo's diarization module, TitaNet embeddings specifically, pairs with agglomerative hierarchical clustering. It tends to be faster and it integrates very cleanly with their ASR pipeline, which is why people reach for it when they want transcription and diarization in one shot. On a clean conference call recording they perform comparably. Put both of them on a noisy multi-speaker scenario and PyAnnote with ECAPA-TDNN tends to hold up better at the segmentation stage because of that explicit overlap modeling.
Then there is the end-to-end camp entirely, EEND.
EEND, end-to-end neural diarization, collapses the whole pipeline into a single model. No separate VAD, no separate embeddings, no clustering step. You feed in audio and the model outputs speaker activity directly. The appeal is obvious, no compounding errors across stages. The limitation is that EEND struggles when the number of speakers exceeds what it was trained on, and it is notoriously data-hungry. Kaldi sits at the other end, it is the classical toolkit, GMM-based or PLDA-based scoring, very interpretable, still used in telephony and broadcast where you have controlled conditions and need something auditable.
Auditable is the key word there. In a legal context, for instance, you cannot just hand a judge a neural network's output and say trust us. Kaldi's scoring is transparent enough that you can explain why the system made a particular speaker assignment. That interpretability has real value even if the accuracy ceiling is lower.
The choice of tool is really a function of your noise floor, how much you trust your speaker count estimate going in, and how much you need to be able to explain your outputs after the fact.
That's the foundation. But it also leads to the harder question: what happens when you want to go beyond clustering to actual identification? You've got your speaker clusters labeled as speaker one, speaker two, speaker three, but what if you already know who those people are? What if you have a voice library?
Which is the thing Daniel is really asking about. You have a database of stored embeddings from prior recordings where you already know the identity. Can you close the loop?
You can, and the architecture for doing it is not that exotic. The enrollment step is where you build your library. For each known speaker you take high-quality reference audio, clean conditions, minimal overlap, and you extract multiple embeddings across different utterances. You average them, or sometimes you store the full set and take a centroid, and that becomes the speaker's representation in your vector database. AssemblyAI published a solid walkthrough of this with d-vectors specifically, and the key point they make is that enrollment quality is the ceiling. If your reference audio is noisy or short, nothing downstream saves you.
How short is too short?
Rough rule of thumb is you want at least ten to fifteen seconds of clean speech per enrollment speaker to get a stable embedding. Below that you are averaging over too few frames and the centroid drifts. Above about sixty seconds you get diminishing returns, the embedding stabilizes.
It is worth being deliberate about what utterances you enroll from. If you pull enrollment audio from a single long monologue, you are capturing one speaking style, one register. If you pull from several separate sessions you are averaging over natural variation, different energy levels, different emotional states, and your centroid ends up more representative of the speaker as a whole rather than just how they sounded on one particular Tuesday.
That is exactly right. Voices are not static. The same person sounds different when they are tired, when they are excited, when they are on a phone call versus in person. A centroid built from diverse utterances is more robust to that natural variation than one built from a single clean recording, even if that single recording is technically higher quality.
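A minimal enrollment sketch along those lines, with synthetic vectors standing in for real encoder output; the dictionary-as-library layout is an assumption for illustration, not a standard API:

```python
import numpy as np

# Enrollment: average length-normalized embeddings from several
# utterances (ideally across sessions) into one centroid per speaker.
# Normalizing before and after averaging keeps the cosine geometry sane.

def enroll(utterance_embeddings):
    X = np.asarray(utterance_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroid = X.mean(axis=0)
    return centroid / np.linalg.norm(centroid)

rng = np.random.default_rng(2)
true_voice = rng.normal(size=256)
# five utterances = the same voice plus session-to-session variation
utts = [true_voice + rng.normal(scale=0.3, size=256) for _ in range(5)]
library = {"daniel": enroll(utts)}
print(library["daniel"].shape)  # (256,)
```

The more diverse the utterances averaged here, the more the centroid captures the speaker rather than one session.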
You enroll your known speakers. Then new audio comes in, gets diarized, and you have clusters. Each cluster gets a probe embedding.
Cosine similarity against every entry in your voice library. You take the probe embedding, compute cosine similarity to each stored speaker centroid, take the top match. If that similarity score clears a threshold, you assign that identity. If it does not, you flag the cluster as unknown.
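Spelled out as code, with an invented two-speaker library and a hypothetical threshold of 0.7:

```python
import numpy as np

# Map a cluster's probe embedding onto a known identity via cosine
# similarity against stored centroids; below-threshold best matches
# fall through to "unknown" instead of forcing an assignment.

def identify(probe, library, threshold=0.7):
    probe = probe / np.linalg.norm(probe)
    best_name, best_score = None, -1.0
    for name, centroid in library.items():
        score = float(probe @ centroid)   # centroids assumed unit-norm
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return best_name, best_score
    return "unknown", best_score

rng = np.random.default_rng(3)
a, b = rng.normal(size=64), rng.normal(size=64)
library = {"alice": a / np.linalg.norm(a), "bob": b / np.linalg.norm(b)}

print(identify(a + rng.normal(scale=0.2, size=64), library)[0])  # alice
```

An unrelated probe vector lands well below the threshold against both centroids and comes back as unknown, which is the open-set behavior you want.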
The threshold question is where this gets tricky.
It is the hardest tuning problem in the whole pipeline. There was an INTERSPEECH paper proposing adaptive thresholding, where instead of a fixed cutoff you maintain per-speaker score distributions and adjust dynamically. The intuition is that some voices are more distinctive than others. A speaker with a very unique acoustic profile might reliably hit cosine similarities of zero point nine or above against their own enrollment embeddings. A speaker who sounds more average might peak at zero point seventy-five even on a clean match. A fixed threshold of, say, zero point seventy-eight treats those two cases identically, which is wrong.
The threshold is not a number, it is a function of the speaker.
In practice most production systems start with a fixed threshold somewhere between zero point seven and zero point eight and then tune from there using held-out data. PLDA scoring is the more principled alternative, it models the within-speaker and between-speaker score distributions explicitly, but it requires enough enrollment data per speaker to estimate those distributions.
What about domain mismatch? This is where I would expect things to fall apart fast.
It is the biggest practical problem. Your enrollment audio was recorded on a studio condenser microphone. The test audio is a phone call compressed through an AMR codec at twelve kilobits per second. The embedding model has never seen that codec. The resulting embeddings for the same speaker can be far enough apart in the embedding space that cosine similarity drops below your threshold entirely, and you get a false reject. The speaker is in your library and you miss them.
If the threshold is too permissive you get the opposite problem.
Two different speakers who happen to sound similar in that degraded channel both clear the threshold and get assigned the same identity. There was an IEEE paper on domain adaptation for diarization in noisy environments that looked at this specifically. The recommended mitigation is augmenting your enrollment audio with simulated domain conditions, add codec artifacts, add room impulse responses, add background noise, extract embeddings from those augmented versions and include them in the centroid. ECAPA-TDNN handles this better than x-vectors partly because of that channel attention, but it does not eliminate the problem.
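One piece of that augmentation recipe, mixing noise at a controlled signal-to-noise ratio, is easy to sketch; the codec artifacts and room impulse responses would layer on top of the same idea:

```python
import numpy as np

# Mix noise into clean enrollment audio at a target SNR (in dB) by
# scaling the noise so that 10 * log10(P_signal / P_noise) hits target.

def mix_at_snr(clean, noise, snr_db):
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)  # 1 s tone
noise = rng.normal(size=16000)
noisy = mix_at_snr(clean, noise, snr_db=10)
# embeddings extracted from `noisy` variants would then be folded into
# the enrollment centroid alongside the clean ones
```

Sweeping `snr_db` across the range you expect in production gives you a family of augmented enrollments rather than one clean-room reference.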
There is a fun analogy here actually. It is like trying to recognize a friend's face through frosted glass. You know what they look like in good lighting, but the degraded signal introduces enough ambiguity that your brain starts second-guessing itself. And if two people have similar enough features, the frosted glass can make them look identical even though they are clearly distinct in normal conditions. The augmentation approach is essentially training yourself to recognize people through frosted glass by practicing with frosted glass.
That is a good way to put it. And the practical upshot is that if you know your deployment environment in advance, you should be collecting enrollment audio in conditions that match that environment, not just in the cleanest conditions available. A library enrolled on a studio mic is not the right library for a call center deployment.
What about the scalability side? If the voice library grows to thousands of speakers, nearest-neighbor lookup gets expensive.
That is where approximate nearest-neighbor indexes come in. FAISS is the standard choice. At a few hundred speakers you can do exact cosine search in milliseconds. At tens of thousands you switch to an approximate index, IVF or HNSW, and you get sub-millisecond queries with a small accuracy tradeoff. Pinecone published benchmarks on this and the accuracy degradation at reasonable index sizes is typically under two percent, which is acceptable for most use cases.
For something like this podcast, though, the library is small. A handful of known voices.
Which is actually the sweet spot for this approach. Small library, clean enrollment audio, controlled recording conditions. You would expect identification accuracy in the high nineties on clean test audio. The moment you introduce a noisy environment, short utterances under three seconds, or speakers with similar vocal profiles, you drop fast. Low signal-to-noise ratio is probably the single biggest robustness killer because the embedding model is encoding noise as part of the speaker signature.
Overlapping speech compounds that. A segment where two people are talking simultaneously produces an embedding that is a blend of both speakers. Your nearest-neighbor lookup is going to find the closest centroid to that blend, and that is not necessarily either of them.
Hybrid approaches help there. Use EEND specifically for overlap detection, flag those segments, exclude them from the identity matching step, and let the non-overlapping segments carry the identification. It adds pipeline complexity but it is the honest engineering answer to that failure mode.
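The exclusion step itself is simple; assuming each segment carries an overlap flag from the detector, a sketch with an invented `(cluster_id, is_overlap, embedding)` layout:

```python
from collections import defaultdict

# Drop overlap-flagged segments before identity matching and average
# the surviving embeddings into one probe per cluster, so blended
# two-speaker embeddings never poison the lookup.

def probes_per_cluster(segments):
    clean = defaultdict(list)
    for cluster_id, is_overlap, emb in segments:
        if not is_overlap:
            clean[cluster_id].append(emb)
    return {
        cid: [sum(vals) / len(embs) for vals in zip(*embs)]
        for cid, embs in clean.items()
    }

segments = [
    ("spk0", False, [1.0, 0.0]),
    ("spk0", True,  [0.5, 0.5]),   # overlap: excluded from the probe
    ("spk0", False, [0.8, 0.2]),
    ("spk1", False, [0.0, 1.0]),
]
print(probes_per_cluster(segments))
```

A cluster left with no clean segments at all would need its own fallback, which is part of the added pipeline complexity being traded for.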
Given that approach, what would the practical starting point look like for someone building this right now, say an AI or automation engineer?
PyAnnote is the honest first choice for the diarization layer if you are on Python. The pipeline is pip-installable, the models are on Hugging Face, and the modularity means you can swap in ECAPA-TDNN embeddings without rewriting the whole thing. Start there, get diarization working on your own audio, and measure your diarization error rate before you bolt on identity matching. A lot of people skip that step and then cannot tell whether their misidentifications are coming from bad clustering or bad lookup.
Measure before you add complexity.
For the voice library side, SpeechBrain and Resemblyzer are both reasonable starting points for embedding extraction if you want something lighter than a full PyAnnote install. Resemblyzer in particular is straightforward for enrollment, you feed it utterances, you get d-vectors back, you store them. FAISS handles the lookup. The whole stack is open source and you can have a prototype running in an afternoon.
There is a useful sanity check you can build in early. Before you connect the diarization output to the identity lookup, run the lookup in isolation on your enrollment audio itself. Enroll a speaker, then immediately query against their own enrollment utterances and verify you are getting similarities in the range you expect. If that is already failing, the problem is in your embedding extraction, not in your clustering. It isolates the failure mode before the pipeline gets complicated.
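That sanity check in code, reusing the same cosine logic; the 0.8 floor is a hypothetical expectation to tune against your own embedding model, not a universal constant:

```python
import numpy as np

# Sanity check: after enrolling, query the centroid against the very
# utterances that built it. Self-similarities should sit comfortably
# high; if they do not, the embedding extraction is the problem,
# not the clustering.

def self_check(utterance_embeddings, floor=0.8):
    X = np.asarray(utterance_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    centroid = X.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    sims = X @ centroid
    return bool(sims.min() >= floor), sims

rng = np.random.default_rng(5)
voice = rng.normal(size=192)
utts = [voice + rng.normal(scale=0.2, size=192) for _ in range(6)]
ok, sims = self_check(utts)
print(ok)  # expect True for consistent enrollment audio
```

Running this before wiring diarization into the lookup isolates the failure mode exactly as described.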
That is good engineering discipline. Validate each component independently before you chain them together. The integrated system will have failure modes that are hard to attribute if you have never confirmed the pieces work in isolation.
The enrollment quality point keeps coming back. It is load-bearing.
It really is. If I had to give one piece of practical advice it would be this: treat your enrollment audio like a first impression that never changes. Record in the quietest conditions you can manage, at least fifteen seconds per speaker, ideally several separate utterances across different sessions so you are averaging over natural variation in their voice. The centroid you compute from that is your reference point for every future lookup. Garbage in, garbage out applies here more directly than almost anywhere else in the pipeline.
For listeners who want to probe the failure modes before deploying anything, what would you suggest?
Test on audio that is worse than you expect to see in production. Introduce codec degradation deliberately, compress your test audio, add room noise, try speakers with similar pitch ranges. If your system holds up there, you have something. If it collapses, you know exactly which part of the pipeline to harden before it matters.
The robustness is earned, not assumed.
That is the accurate summary of basically everything we have covered today.
Where this goes next is the part I keep turning over. Right now you are building a pipeline that knows voices it has already seen. The interesting pressure is toward a system that learns voices continuously, updates its own library in real time, and starts doing things like flagging when a known speaker sounds unusually stressed or sick because the embedding has drifted from the enrolled centroid.
That is already happening at the research level. The question is whether the infrastructure catches up. Personalized AI assistants are the obvious destination. If your assistant can reliably distinguish you from your spouse from a guest in the room, without a wake word, just from continuous ambient diarization, the interaction model changes completely. It stops being a device you address and starts being something that understands conversational context the way a person in the room would.
Which raises questions that are not purely engineering questions.
No, they are not. Passive continuous diarization in a home environment is a different category of thing than diarizing a meeting recording after the fact. The capability and the consent model are not obviously aligned.
The gap between those two things is not just a legal question. It is a question about what people reasonably expect when they have a device in their home. The meeting recording case has an implicit social contract: everyone in that room knows they are being recorded. Continuous ambient diarization does not have that contract built in. Guests do not know they are being enrolled. Children do not consent. The system is building a voice library whether or not anyone agreed to be in it.
Which is why the governance layer is not a nice-to-have that gets bolted on after the product ships. It has to be part of the design from the start, because retrofitting consent mechanisms onto a system that was built without them is hard. The data is already collected, the embeddings are already stored, and unwinding that is not straightforward.
That is probably where the most interesting work will happen over the next few years. Not in the embedding architecture, which is largely a solved problem at this point, but in the governance layer around when and how identity matching is allowed to run.
I think that is right. The technical ceiling is high enough that the binding constraint has shifted.
Good place to leave it. Thanks to Hilbert Flumingtop for producing, and to Modal for keeping the compute running so we can actually ship these episodes. This has been My Weird Prompts. If you have been enjoying the show, a review on Spotify goes a long way. We will see you next time.