#2582: What Your Browser Does to Mic Audio Before It Reaches Your Server

getUserMedia returns audio, but not raw audio. Here's what browsers actually do to your mic feed before it hits your server.

Episode Details
Episode ID
MWP-2740
Published
Duration
31:13
Pipeline
V5
TTS Engine
chatterbox-regular
Script Writing Agent
deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

When you record audio in a browser, you're not getting the raw microphone feed. You're getting whatever the browser decides to give you after running it through its own audio pipeline — and that pipeline varies dramatically across Chrome, Firefox, and Safari.

The Two-API Trap

The browser exposes two separate APIs for audio capture. getUserMedia grabs the stream from the microphone, and MediaRecorder encodes that stream into a container format. Most developers call both with no configuration and move on. But the defaults are designed for video calls from eight years ago, not for feeding neural networks.

Call getUserMedia with no constraints, and you get whatever the browser thinks is appropriate. Chrome defaults to 48 kHz on desktop but can drop to 16 kHz or even 8 kHz on mobile. Firefox behaves similarly. And crucially, the constraints API is a request, not a command — if your user's cheap Android phone can't deliver 48 kHz audio, the browser won't tell you. It just gives you the closest thing it can.
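
You can at least inspect the outcome: a track's getSettings() reports the values the browser actually applied (which properties it reports also varies by browser). A minimal sketch, using top-level await:

```js
// Ask for specific settings, then check what the browser actually delivered.
// Nothing here is guaranteed to be honored; that's the point.
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 48000,
    channelCount: 1,
  },
});

const track = stream.getAudioTracks()[0];
console.log(track.getSettings());
// e.g. { sampleRate: 48000, channelCount: 1, echoCancellation: true, ... }
// On a device that can't comply, these values will differ from the request.
```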

The Codec Maze

Once the stream reaches MediaRecorder, the encoding pipeline diverges by browser. Chrome defaults to Opus in a WebM container at roughly 64 kbps (variable bitrate). Firefox does the same. Safari uses AAC in MP4. Three browsers, two distinct codec-and-container combinations, and potentially three different encoding pipelines feeding different frequency responses into your transcription model.

The variable bitrate nature of Opus adds another layer of inconsistency. When audio is silent, the encoder drops bitrate way down. When it's complex, it spikes up. Your transcription pipeline sees a stream that's constantly changing its encoding quality — and it has no way to know.

The Bitrate Sweet Spot

Counterintuitively, higher bitrates can hurt transcription accuracy. Research suggests the sweet spot for Opus-encoded speech in modern ASR models is between 32 and 64 kbps. Below 16 kbps, accuracy degrades as the codec discards frequency information. But above 96 kbps, accuracy sometimes regresses.

The leading theory: most speech-to-text models were trained on compressed audio (because storing raw PCM for millions of hours is prohibitively expensive). High-bitrate audio preserves high-frequency noise and room ambience that the compressed training data didn't contain. The model encounters an out-of-distribution signal and stumbles.

Destructive Defaults

By default, getUserMedia applies echo cancellation, noise suppression, and automatic gain control. These are useful for video calls but actively harmful for recording. Echo cancellation clips utterance beginnings. Noise suppression mangles sibilants and fricatives. Automatic gain control pumps the noise floor up and down, confusing voice activity detection.

You can disable all of these in your constraints — set echoCancellation: false, noiseSuppression: false, autoGainControl: false. But Safari on iOS has historically ignored some of these constraints entirely, with a WebKit bug that sat open for four years before Apple addressed it.

Practical Approaches

The simplest reliable setup: use getUserMedia with explicit constraints (48 kHz sample rate, all processing disabled), pipe through MediaRecorder with an explicit MIME type (audio/webm; codecs=opus) and a target bitrate of 32 kbps for mono voice. This puts you within the distribution most speech models were trained on.
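
A minimal sketch of that setup, treating both the MIME type and the bitrate as requests the browser may adjust rather than guarantees:

```js
const stream = await navigator.mediaDevices.getUserMedia({
  audio: {
    sampleRate: 48000,
    channelCount: 1,
    echoCancellation: false,
    noiseSuppression: false,
    autoGainControl: false,
  },
});

const mimeType = 'audio/webm; codecs=opus';
const recorder = new MediaRecorder(stream, {
  mimeType,
  audioBitsPerSecond: 32000, // a target, not a hard ceiling: Opus is VBR
});

const chunks = [];
recorder.ondataavailable = (e) => chunks.push(e.data);
recorder.onstop = () => {
  const blob = new Blob(chunks, { type: mimeType });
  // Upload the blob to the server here.
};
recorder.start();
```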

For more control, the Web Audio API gives direct access to PCM samples via AudioContext and AudioWorklet, allowing custom encoding or server-side processing. Libraries like RecordRTC abstract over browser inconsistencies and offer output options including uncompressed WAV (at the cost of larger files) or MP3 (at the cost of CPU usage).
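
A sketch of the main-thread half of that approach, assuming a worklet file pcm-tap-processor.js that registers a processor under the invented name 'pcm-tap' (a matching processor sketch appears later, in the transcript):

```js
// Tap raw Float32 PCM via an AudioWorklet instead of relying on MediaRecorder.
const ctx = new AudioContext({ sampleRate: 48000 });
await ctx.audioWorklet.addModule('pcm-tap-processor.js'); // hypothetical file

const source = ctx.createMediaStreamSource(stream); // stream from getUserMedia
const tap = new AudioWorkletNode(ctx, 'pcm-tap');
tap.port.onmessage = (e) => {
  // e.data is a Float32Array of raw samples: encode client-side or ship to a server.
};
source.connect(tap);
```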

The nuclear option — compiling libsoundio or PortAudio to WebAssembly to bypass the browser's audio stack entirely — exists but requires a dedicated audio engineering team.


#2582: What Your Browser Does to Mic Audio Before It Reaches Your Server

Corn
Daniel sent us this one — he's been deep in the weeds building a voice-based productivity app, and he keeps coming back to the same question. When you're recording audio in the browser, what are you actually getting? He's been through the dictation rabbit hole, discovered some surprising things about bitrate — apparently cranking it too high can actually hurt transcription accuracy — and now he's looking at the browser pass-through itself. The convenience of getUserMedia is great, but once that mic feed hits Chrome or Firefox, what's happening to it before it gets to your server? Are we getting the raw microphone or something that's already been mangled by the browser's audio pipeline? And what frameworks and tools are out there for doing this more deliberately?
Herman
Oh, this is such a good question. And it's one of those things where the convenience of the web platform has sort of papered over a genuinely messy technical reality. Most developers just copy-paste the getUserMedia snippet from MDN, wire up a MediaRecorder, and never think about it again. But what's actually happening under the hood varies wildly.
Corn
Before we get into the weeds — quick note. Today's episode is powered by DeepSeek V four Pro, writing our script. So if anything sounds unusually coherent, that's why.
Herman
I was going to say, that explains the unusually tight structure. But seriously, let's dig into this. The first thing to understand is that there are really two separate APIs at play here. You've got getUserMedia, which captures the raw audio stream from the microphone. Then you've got MediaRecorder, which encodes that stream into a container format. And the defaults on both of these are... let's call them conservative.
Corn
Conservative meaning bad?
Herman
Conservative meaning designed for eight years ago when everyone was worried about bandwidth and nobody was thinking about feeding this into a neural network for transcription. So if you just call getUserMedia with no constraints, what do you actually get?
Corn
I'm guessing not studio quality.
Herman
The MDN documentation is pretty clear on this but also frustratingly vague in practice. By default, getUserMedia returns audio at whatever the browser decides is appropriate. Chrome typically defaults to a sample rate of 48 kilohertz on desktop, but it can drop to 16 kilohertz or even 8 kilohertz on mobile depending on the device and connection. The audio is typically mono, which is fine for voice. But here's the thing — you can specify constraints. You can say, hey, I want 48 kilohertz sample rate, I want echo cancellation disabled, I want noise suppression turned off. And the browser will try to honor that.
Corn
"try to honor" is doing a lot of work there.
Herman
The constraints API is a request, not a command. If you're on a cheap Android phone with a terrible mic, the browser cannot magically give you 48 kilohertz audio. It'll give you the closest thing it can. And the real kicker is that most developers never set any constraints at all. They just take the default stream and run with it. So one user might be sending you pristine audio and another might be sending you something that sounds like a potato, and your transcription pipeline has no idea which is which.
Corn
That explains half of the support tickets Daniel's probably dealing with. "Your app works great for me" versus "it can't understand a word I say."
Herman
And the other half is what happens after you've got the stream. That's where MediaRecorder comes in. MediaRecorder takes your raw audio stream and encodes it using a codec. And here's where it gets really interesting — the default codec and bitrate vary by browser. Chrome uses the Opus codec by default for audio-only recordings when you specify the webm container. Firefox also uses Opus. Safari uses MP4 with AAC. Three different browsers, potentially three different encoding pipelines.
Corn
Opus at what bitrate?
Herman
This is the part that surprised me when I dug into it. The MediaRecorder spec doesn't mandate a specific default bitrate. Chrome's implementation has historically defaulted to around 64 kilobits per second for mono audio. Firefox is similar. But — and this is crucial — these are variable bitrate encoders. Opus is designed to adapt. If the audio is silent, it drops the bitrate way down. If there's complex audio, it spikes up. So you're not even getting a consistent stream.
Corn
Which connects back to what Daniel mentioned about bitrate and transcription accuracy. He said sometimes too high a bitrate gets worse results. That sounds counterintuitive.
Herman
It does, but I think I know what's happening there. There was a really interesting paper from some researchers at Carnegie Mellon a couple of years ago that looked at this directly. They found that for modern speech-to-text models, the sweet spot for Opus-encoded speech is somewhere between 32 and 64 kilobits per second. Below 16 kilobits per second, accuracy starts to degrade noticeably because the codec is discarding frequency information that the model uses to distinguish phonemes. But above about 96 kilobits per second, you hit diminishing returns, and in some cases you actually see slight accuracy regressions.
Corn
Why would more data make it worse?
Herman
The hypothesis — and I should say this isn't definitively proven, it's an active area of research — is that at very high bitrates, the encoder preserves more high-frequency noise and room ambience that would otherwise be discarded. Speech-to-text models trained on compressed audio actually learn to filter that stuff out implicitly. When you feed them near-lossless audio, the noise floor is higher and the model hasn't seen that distribution during training. So it stumbles.
Corn
It's not that high bitrate is inherently worse. It's that the models were trained on the compressed stuff, and the high bitrate audio is out of distribution.
Herman
That's exactly the working theory. And it makes sense when you think about how these training datasets are constructed. Most of the large speech corpora — LibriSpeech, Common Voice, even the proprietary ones — they're stored as compressed audio because storing raw PCM for millions of hours is prohibitively expensive. So the models learn on 16-bit FLAC or Opus at moderate bitrates, and that becomes their "normal."
Corn
Daniel's instinct is right. You want to control the bitrate. But that brings us back to the browser. Can you actually control the bitrate in MediaRecorder?
Herman
Yes and no. The MediaRecorder constructor accepts an options object where you can specify the MIME type and, in some browsers, the bitrate. So you can say something like, audio/webm; codecs=opus, and then set bitsPerSecond to 32000. Chrome supports this. Firefox supports this. Safari, as usual, is the odd one out — it has limited support for specifying bitrate on MediaRecorder.
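
For illustration, the cross-browser reality Herman describes is usually handled with MediaRecorder.isTypeSupported; the candidate list here is illustrative, not exhaustive:

```js
// Pick the first container/codec combination this browser can actually record.
const candidates = [
  'audio/webm; codecs=opus', // Chrome, Firefox
  'audio/mp4',               // Safari (AAC)
];
const mimeType = candidates.find((t) => MediaRecorder.isTypeSupported(t));

const recorder = new MediaRecorder(stream, { // stream from getUserMedia
  mimeType,
  audioBitsPerSecond: 32000, // may be partially or fully ignored, as noted
});
```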
Corn
Even when you set it, is it honored?
Herman
The browser will try to hit your target bitrate, but Opus is inherently variable bitrate, so it's more of a target than a hard ceiling. And different browser versions have had bugs where the bitrate parameter was partially or fully ignored. There was a Chrome bug in early twenty twenty-four where setting bitsPerSecond on MediaRecorder had no effect for certain MIME types. So you're at the mercy of the browser's implementation quality.
Corn
This is exactly the kind of thing that drives developers nuts. You think you're controlling the pipeline, but you're really just making polite suggestions to a black box.
Herman
It gets worse when you factor in audio processing. By default, getUserMedia applies echo cancellation, noise suppression, and automatic gain control. These are useful for video calls — nobody wants to hear their own voice echoing back at them — but for transcription, they're destructive. Echo cancellation can clip the beginning of utterances. Noise suppression can mangle sibilants and fricatives — the "s" and "f" sounds that carry a lot of phonemic information. Automatic gain control can pump the noise floor up and down in ways that confuse voice activity detection.
Corn
The default audio pipeline is optimized for real-time communication, not for recording. And Daniel's trying to use it for recording.
Herman
The good news is you can turn all of that off. In your getUserMedia constraints, you set echoCancellation to false, noiseSuppression to false, and autoGainControl to false. The bad news is that not all browsers respect all of these. Chrome is pretty good about it. Firefox is decent. Safari on iOS has historically ignored some of these constraints entirely because Apple has strong opinions about audio processing.
Corn
Of course they do.
Herman
There was a WebKit bug report about this that sat open for something like four years before they addressed it. The argument from Apple's side was that echo cancellation is necessary for the user experience on mobile devices. Which, for a video call, sure. But for a recording app, it's actively harmful.
Corn
What's the practical answer here? If Daniel wants to build a browser-based recording app that gives him consistent, controllable audio, what does he actually do?
Herman
There are a few approaches, and they range from simple to nuclear option. Let me walk through them.
Herman
The simplest approach is to use getUserMedia with explicit constraints, disable all audio processing, specify your preferred sample rate — I'd recommend 48 kilohertz for voice, it's what most modern ASR models expect — and then pipe that through MediaRecorder with an explicit MIME type and bitrate. For Opus in a webm container, I'd target 32 kilobits per second for mono voice. That's the sweet spot based on the research I've seen. You get excellent intelligibility, the file sizes are manageable, and you're within the distribution that most speech models were trained on.
Corn
If you need something more reliable than that?
Herman
The next step up is to use the Web Audio API instead of, or in addition to, MediaRecorder. Web Audio gives you direct access to the PCM samples. You can use an AudioContext, create a MediaStreamSource from your getUserMedia stream, and then use a ScriptProcessorNode or AudioWorklet to grab the raw float samples. From there, you can encode them however you want — you could write your own Opus encoder in WebAssembly, or you could send the raw PCM to a server and encode it there.
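
A minimal sketch of the worklet-side file such a setup loads; it pairs with the main-thread sketch earlier in the article, and the 'pcm-tap' name is invented:

```js
// pcm-tap-processor.js: forwards each 128-sample block of mono PCM to the main thread.
class PcmTapProcessor extends AudioWorkletProcessor {
  process(inputs) {
    const channel = inputs[0][0]; // first input, first channel
    if (channel) {
      // Copy before posting: the engine reuses this buffer between calls.
      this.port.postMessage(channel.slice(0));
    }
    return true; // keep the node alive
  }
}
registerProcessor('pcm-tap', PcmTapProcessor);
```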
Corn
That sounds like a lot of work.
Herman
It is, but it gives you complete control. There are libraries that make this easier. RecordRTC is probably the most popular — it's been around for years, it abstracts over a lot of the browser inconsistencies, and it lets you specify recording parameters in a cleaner way. There's also a newer library called MicRecorder that's built on top of the MediaRecorder API but adds better error handling and format options.
Corn
I've seen RecordRTC mentioned in a lot of projects. Does it actually solve the consistency problem?
Herman
It handles the cross-browser quirks reasonably well, and it gives you options for output format — you can get WAV, which is uncompressed PCM, or Opus in webm, or MP3 via a JavaScript encoder. The trade-off is that getting WAV means much larger files, and getting MP3 means running an encoder in the browser which uses CPU. But for a productivity app where the user is recording voice notes, the files are short, so neither of those is a dealbreaker.
Corn
What about the nuclear option you mentioned?
Herman
The nuclear option is to bypass the browser's audio stack entirely. There are a couple of ways to do this. One is to use WebAssembly to compile a full audio processing pipeline — something like libsoundio or PortAudio compiled to WASM, talking directly to the audio hardware. This is technically possible but wildly complex and not something I'd recommend unless you have a dedicated audio engineering team.
Corn
That does sound nuclear.
Herman
The more practical nuclear option is to recognize that browser-based recording has fundamental limitations and to offer a companion native app or a progressive web app that uses more capable APIs. On desktop, you can use Electron with native Node modules for audio capture. On mobile, you can use the platform's native recording APIs. But that defeats the convenience argument that Daniel was asking about.
Corn
He specifically said he wants to avoid making users record files offline. The whole point is the convenience of in-browser recording.
Herman
I think that's the right instinct for most applications. The convenience of browser-based recording is enormous. You send someone a link, they click one button, and they're recording. No install, no permissions dance beyond the initial microphone prompt. The question is whether you can make that convenient path also produce consistent, high-quality audio.
Corn
From what you're saying, the answer is mostly yes, with caveats.
Herman
Mostly yes, with significant caveats. You need to be explicit about your constraints. You need to test across browsers. You need to handle the failure modes gracefully. And you need to understand that you're never going to get bit-identical output across different browsers and devices. But you can get close enough that a modern speech-to-text model won't care.
Corn
What about the frameworks side of Daniel's question? He asked about different frameworks and tools. We've talked about getUserMedia and MediaRecorder and RecordRTC. What else is out there?
Herman
There's a whole ecosystem. On the lighter-weight end, you've got things like React-Mic and React-Media-Recorder if you're in the React ecosystem — they wrap the browser APIs in a component interface. Vue has similar packages. For Angular, there's ngx-audio-recorder. These are thin wrappers, but they save you from writing the boilerplate yourself.
Herman
If you want a more opinionated recording experience, there's a library called OpusRecorder that specifically targets Opus encoding in the browser. It uses a WebAssembly build of the reference Opus encoder, so you're not relying on the browser's MediaRecorder implementation at all. You get the raw PCM from getUserMedia, feed it to the WASM encoder, and get consistent Opus packets out. The downside is that the WASM binary adds about 300 kilobytes to your bundle, and encoding uses some CPU. But for short voice recordings, that's negligible.
Corn
300 kilobytes seems like a rounding error for most web apps these days.
Herman
It really is. And the consistency gain is significant. You know exactly what encoder version you're using, exactly what parameters it's running with, and the output is identical regardless of whether the user is on Chrome, Firefox, or Safari. That's a huge win for debugging transcription quality issues.
Corn
I'm thinking about this from Daniel's perspective, building a productivity app. He wants voice notes to just work. The user taps a button, speaks, and the transcription is accurate. Every layer of unpredictability in the audio pipeline is a potential support headache.
Herman
It's not just about support. It's about the user's perception of the product. If voice recognition works perfectly nine times out of ten, that tenth failure is what people remember. It erodes trust. So controlling the audio pipeline isn't just a technical concern — it's a product quality concern.
Corn
You mentioned earlier that the Web Audio API gives you access to raw PCM samples. Is there a world where you skip MediaRecorder entirely and just stream the raw audio to your server?
Herman
That's actually what a lot of the real-time transcription services do. You open a WebSocket, you capture audio via getUserMedia, you use an AudioWorklet to get the raw samples, and you stream them directly to the server. The server can then feed them into the ASR model in real time, or buffer them and transcribe when the recording ends. Deepgram's browser SDK works this way. So does AssemblyAI's. And OpenAI's real-time API uses a very similar pattern with WebRTC.
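
A rough sketch of that streaming pattern, reusing the AudioWorkletNode tap from the earlier PCM sketch; the endpoint is a placeholder, not any provider's real API:

```js
const ws = new WebSocket('wss://example.com/transcribe'); // placeholder endpoint
ws.binaryType = 'arraybuffer';

// `tap` is the AudioWorkletNode from the earlier PCM sketch.
tap.port.onmessage = (e) => {
  if (ws.readyState === WebSocket.OPEN) {
    // Float32 PCM; many ASR backends expect 16-bit integers instead,
    // in which case you'd convert before sending.
    ws.send(e.data.buffer);
  }
};
```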
Corn
The big players have all moved away from MediaRecorder for their production SDKs?
Herman
For real-time use cases, yes. MediaRecorder is fundamentally a file-oriented API. It produces chunks of encoded data that are meant to be assembled into a file. For real-time streaming, you want raw audio or a streaming-friendly codec, and you want to control the chunking yourself. MediaRecorder's dataavailable events fire at intervals that aren't necessarily aligned with utterance boundaries, so you end up having to do your own buffering and segmentation anyway.
Corn
That's a good distinction. If Daniel's app is doing real-time or near-real-time transcription, MediaRecorder might not even be the right tool, regardless of the quality concerns.
Herman
But if he's doing a record-then-transcribe workflow — user records a note, hits stop, and then gets the transcription — MediaRecorder is perfectly fine. You just need to be deliberate about the configuration.
Corn
Let's talk about the deterministic pipeline idea Daniel mentioned. He said using pipelines deterministically lets you control for part of the factors that affect accuracy. What does a truly deterministic browser recording pipeline look like?
Herman
A deterministic pipeline would mean that given the same acoustic input, you get the same digital output every time, regardless of browser, operating system, or device. And I'm going to be honest — you cannot achieve that with browser APIs alone. There are too many variables. The analog-to-digital converter on the device. The operating system's audio stack. The browser's audio processing. The encoder implementation.
Corn
Deterministic is an aspiration, not a reality.
Herman
It's a direction, not a destination. You can get closer by controlling the things you can control. Use the OpusRecorder approach with a WASM encoder so the encoding step is deterministic. Disable all browser audio processing so you're not getting unpredictable filtering. Set explicit sample rate constraints. And then accept that the analog front end — the microphone and ADC — is going to vary.
Corn
The microphone variation alone is probably a bigger factor than any of the digital pipeline stuff.
Herman
The difference between a decent headset mic and a laptop's built-in microphone array is enormous. Way bigger than the difference between 32 and 64 kilobit Opus. But that's an argument for controlling the things you can control, not for giving up entirely.
Corn
There's also the question of what happens when the user has multiple microphones. On a desktop, you might have a webcam mic, a headset mic, and a Yeti on a boom arm. Which one does the browser pick?
Herman
By default, the browser picks whatever the operating system says is the default communications device. Users can change this in the browser's site settings, but most never do. You can enumerate devices using enumerateDevices and let the user pick, which is what professional recording apps do. But for a productivity app, that's probably overkill. Most users just want it to work with whatever mic they're using.
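
A minimal sketch of that enumeration; note that device labels stay empty until the user has granted microphone permission:

```js
const devices = await navigator.mediaDevices.enumerateDevices();
const mics = devices.filter((d) => d.kind === 'audioinput');
mics.forEach((d) => console.log(d.label, d.deviceId));

// Request a specific device once the user picks one:
const stream = await navigator.mediaDevices.getUserMedia({
  audio: { deviceId: { exact: mics[0].deviceId } },
});
```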
Corn
The default is usually fine for voice.
Herman
The exception is when the default is a webcam mic that's three feet away from the user's face and picks up every keystroke and chair squeak in the room. But again, that's an acoustic problem, not a software problem.
Corn
Let's circle back to something Daniel said in his prompt. He mentioned that the browser passes through the user's microphone feed, and developers are never quite sure what kind of audio stream they're capturing. I think that's the core anxiety here. The browser feels like a black box.
Herman
It does, and the documentation doesn't help. The MDN pages for getUserMedia and MediaRecorder are accurate but they don't tell the story of what's actually happening to your audio. They tell you what the API surface looks like, not what the processing pipeline looks like. And the processing pipeline varies by browser, by operating system, and sometimes by hardware.
Corn
Part of the answer is just education. Developers need to know that the defaults exist and what they do.
Herman
The defaults exist for video conferencing. Echo cancellation, noise suppression, automatic gain control — these are all designed to make a Zoom call sound decent. They are not designed to produce clean audio for machine processing. And the moment you understand that, a lot of the weird behavior starts to make sense.
Corn
It's almost like the browser needs a "recording mode" that's distinct from "communication mode." A mode where all the processing is disabled and you get the rawest possible feed.
Herman
There's been discussion in the WebRTC working group about exactly that. The idea of an "audio raw" constraint or a "processing disabled" mode. But it's tricky because some of the processing happens at the operating system level, below the browser. On Windows, the audio stack has its own processing pipeline. On macOS, Core Audio does sample rate conversion and mixing. The browser can't always bypass that.
Corn
Even if the browser wanted to give you raw audio, the OS might not let it.
Herman
On iOS, for example, all audio goes through Apple's audio processing. There's no way around it. The best you can do is set AVAudioSession mode to measurement, which reduces processing but doesn't eliminate it. And web apps don't even have access to that API.
Herman
But it's also worth keeping in perspective. For voice transcription, these processing steps are usually not the limiting factor. A modern ASR model trained on diverse data can handle a wide range of audio qualities. The problems arise when the processing is inconsistent — when one recording has echo cancellation and the next one doesn't, and the model can't adapt because it doesn't know what to expect.
Corn
Which brings us back to determinism. Even if you can't eliminate the processing, you want it to be consistent.
Herman
Consistency is more important than absolute quality. A model can learn to compensate for a consistent coloration of the audio. It can't compensate for random variation.
Corn
What about the codec question specifically? Daniel mentioned knowing the codec as part of controlling the pipeline. Opus versus PCM versus AAC — does it actually matter for transcription?
Herman
It matters, but not as much as you might think. There was a study from Google's speech team a few years ago that compared ASR accuracy across different codecs and bitrates. They found that modern Opus at 32 kilobits per second is essentially transparent for speech recognition — the accuracy difference compared to lossless PCM was within the margin of error. AAC at similar bitrates was slightly worse but still very usable. The real drop-off happens with older codecs like AMR narrowband, which was designed for telephony and discards most of the frequency range above 4 kilohertz.
Corn
For practical purposes, Opus at 32 kilobits is fine, PCM is ideal but large, and AAC is okay if you're in the Apple ecosystem.
Herman
That's a fair summary. The one caveat I'd add is that some ASR systems are sensitive to the container format as well as the codec. If you're sending audio to an API, they might expect WAV or FLAC or raw PCM, and sending Opus in a webm container will get rejected even though the audio quality is fine. So you need to know what your transcription backend expects.
Corn
If you're running your own model, you can accept whatever you want.
Herman
If you're using Whisper locally, for example, it uses ffmpeg under the hood and can handle pretty much any format. The codec question becomes less about compatibility and more about bandwidth and storage.
Corn
Daniel's app is a productivity tool. Bandwidth probably isn't the bottleneck.
Herman
But if he's got users on metered connections or in areas with poor connectivity, it's worth thinking about. A one-minute voice note at 32 kilobit Opus is about 240 kilobytes. The same note as 16-bit 48 kilohertz mono PCM is about 5.8 megabytes. That's a 24x difference. For a single note it doesn't matter. For someone recording dozens of notes a day and syncing over cellular, it adds up.
Corn
What about the tools Daniel might not have heard of? You mentioned OpusRecorder and RecordRTC. Anything else worth flagging?
Herman
There's a library called AudioMotion that's primarily for visualization but has a really clean API for accessing frequency data from the Web Audio API. If Daniel wants to do any client-side audio analysis — like detecting whether the user is actually speaking or just generating silence — that's a good option. There's also hark.js, which is a tiny library for speech detection using the Web Audio API. It's not speech recognition, it's just detecting whether someone is talking, which is useful for trimming silence from recordings.
Corn
Voice activity detection on the client side?
Herman
And doing VAD in the browser means you can avoid sending seconds of silence to your transcription service, which saves bandwidth and API costs. Hark.js is pretty basic — it's essentially an energy threshold detector — but for a quiet environment, it works well. For more sophisticated VAD, you'd want something like Silero VAD running in ONNX in the browser, which is more accurate but adds complexity.
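
A minimal hark sketch, assuming the getUserMedia stream from earlier; the threshold is a guess you would tune per environment:

```js
import hark from 'hark';

const speechEvents = hark(stream, { threshold: -65, interval: 100 });
speechEvents.on('speaking', () => console.log('user started talking'));
speechEvents.on('stopped_speaking', () => console.log('back to silence'));
```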
Corn
Silero VAD in the browser — that's a neural voice activity detector running in WebAssembly?
Herman
Or WebGPU now, actually. ONNX Runtime has WebGPU support, so you can run these models with hardware acceleration. It's still bleeding edge but it's very cool. For a productivity app, it's probably overkill unless silence trimming is a critical feature.
Corn
Probably overkill for voice notes. But it's good to know the option exists.
Herman
The ecosystem has really matured in the last couple of years. When I first started paying attention to browser audio, the options were basically Flash — remember Flash? — or a Java applet. Now you've got WebAssembly codecs, neural VAD, real-time streaming to ASR services, all running in a web page with no install.
Corn
Yet the fundamental problem Daniel's asking about hasn't changed. The convenience of the browser comes with a layer of abstraction that obscures what's actually happening to the audio.
Herman
And I think the answer to his question — can you use browser pass-through more deliberately without making users record offline — is yes, but it requires more work than just plugging in getUserMedia and calling it a day. You need to be explicit about constraints. You need to understand what processing the browser is doing by default and disable what you can. You need to choose your encoding pipeline deliberately, whether that's MediaRecorder with explicit settings or a WASM encoder for deterministic output. And you need to test across browsers and devices to understand the variation you can't eliminate.
Corn
It's almost like the browser gives you a "record" button for free, but a good recording experience costs extra engineering effort.
Herman
The free tier is good enough for demos and prototypes. For a production app where transcription accuracy matters, you need to invest in the audio pipeline.
Corn
Daniel's been through enough of this to know that. The fact that he's asking about bitrate and codecs and deterministic pipelines suggests he's already past the prototype stage and hitting real-world quality issues.
Herman
Which is the right time to be asking these questions. Too many developers ship with the defaults and then scramble to fix audio quality problems after users start complaining. By then, you've already lost trust.
Corn
The prescription is: getUserMedia with explicit constraints disabling processing, a deliberate choice of encoder, and testing across the browsers your users actually use.
Herman
If you're running a transcription service, you should be logging audio quality metrics. Things like signal-to-noise ratio, clipping percentage, sample rate. If you see a user consistently sending 8 kilohertz audio, you know something's wrong with their setup and you can reach out proactively. That's a level of sophistication most apps don't bother with, but for a productivity tool where voice is the primary input, it's worth it.
Corn
Is there a library that does that kind of monitoring?
Herman
Not a turnkey one that I know of. You'd probably build it yourself using the Web Audio API's analyser node. It gives you frequency data and time-domain data, and from that you can derive basic quality metrics. It's not hard, it's just a matter of knowing what to look for.
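
A sketch of what that monitoring might look like; the clipping threshold and one-second interval are illustrative choices, not established metrics:

```js
const ctx = new AudioContext();
const source = ctx.createMediaStreamSource(stream); // stream from getUserMedia
const analyser = ctx.createAnalyser();
analyser.fftSize = 2048;
source.connect(analyser);

const buf = new Float32Array(analyser.fftSize);
setInterval(() => {
  analyser.getFloatTimeDomainData(buf);
  let sumSquares = 0;
  let clipped = 0;
  for (const s of buf) {
    sumSquares += s * s;
    if (Math.abs(s) > 0.99) clipped++;
  }
  const rmsDb = 10 * Math.log10(sumSquares / buf.length); // signal level in dBFS
  const clipPct = (100 * clipped) / buf.length;
  // Attach rmsDb and clipPct to the recording upload for server-side logging.
}, 1000);
```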
Corn
Alright, I think we've covered the landscape. Let me try to summarize what Daniel should take away from this. One — the browser's default audio pipeline is optimized for video calls, not recording. Disable echo cancellation, noise suppression, and auto gain control. Two — be explicit about sample rate and bitrate. Target 48 kilohertz and 32 kilobits per second Opus for a good balance of quality and file size. Three — consider using a WebAssembly Opus encoder if you need deterministic output across browsers. Four — test on real devices, especially mobile, because that's where the defaults diverge the most. Five — monitor your audio quality in production so you catch problems before users report them.
Herman
That's a solid summary. The only thing I'd add is that if he's doing real-time transcription, he should look at streaming raw PCM over WebSockets rather than using MediaRecorder at all. The major ASR providers all have browser SDKs that do this, and there are open-source examples if he wants to roll his own.
Corn
And I think the meta-point here is that browser-based recording is totally viable for production use. You don't need to force users into a native app to get good audio. You just need to understand the pipeline and control what you can.
Herman
The browser is a capable audio platform. It's just not optimized for recording by default. A little configuration goes a long way.
Corn
Now: Hilbert's daily fun fact.

Hilbert: The national animal of Scotland is the unicorn. It has been since the twelve hundreds, when it appeared on the Scottish royal coat of arms. Scotland is the only country whose national animal is a mythological creature.
Corn
...right.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop, and thanks to Daniel for the prompt that sent us down this audio engineering rabbit hole. If you enjoyed this episode, head over to myweirdprompts.com for the full archive. We'll be back soon with more.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.