#1085: The Tokenization Lie: How AI Actually Processes Media

Think 1,000 tokens equals 750 words? For audio and video, that rule is a lie. Discover the hidden math behind multimodal AI.

Episode Details
Published
Duration
30:25
Pipeline
V5
TTS Engine
chatterbox-regular
LLM

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The "750 words per 1,000 tokens" rule has long been the standard for budgeting and benchmarking in the generative AI era. However, as the industry moves from text-only models to native multimodal systems like Gemini and GPT-4o, this metric is becoming obsolete. When dealing with audio, images, or video, the mathematical reality of how data is consumed changes fundamentally.

From Digital Sandwiches to Native Ingestion

In the earlier stages of AI development, processing media required a "digital sandwich" approach. For example, an audio file would first pass through an automatic speech recognition (ASR) system to be transcribed into text, which was then fed into the language model. This method was inefficient and resulted in a significant loss of information, such as tone, emotion, and background context.

Modern models have moved toward native ingestion. Instead of translating media into text, they process the raw signals directly. This allows the model to "hear" the nuances of a voice or "see" the temporal flow of a video, but it comes with a steep computational cost.

The Mechanics of Vector Quantization

Because transformers are sequence-processing engines, they cannot handle continuous signals like sound waves or video frames in their raw form. They require discrete units, or tokens. This conversion is achieved through a process called Vector Quantization (VQ).

Imagine a color wheel representing every possible shade of blue. Vector Quantization identifies a specific set of "anchors" or shades and assigns each a number in a codebook. When the model encounters a specific frequency in an audio file or a pixel pattern in an image, it maps that data to the closest number in its codebook. This transforms an infinite flow of information into a discrete sequence of tokens that the model can understand.
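A minimal sketch of that codebook lookup, with invented two-dimensional anchors (real codebooks hold thousands of learned, high-dimensional entries):

```python
import math

# Hypothetical codebook: each entry is a small "anchor" vector.
# The values here are invented purely for illustration.
codebook = {
    0: (0.1, 0.9),   # e.g. a low-frequency audio pattern
    1: (0.8, 0.2),   # e.g. a high-frequency audio pattern
    2: (0.5, 0.5),   # e.g. a mixed pattern
}

def quantize(vector):
    """Map a continuous feature vector to the id of the nearest codebook entry."""
    return min(codebook, key=lambda i: math.dist(codebook[i], vector))

# A continuous signal becomes a discrete token sequence.
signal = [(0.15, 0.85), (0.78, 0.25), (0.52, 0.48)]
tokens = [quantize(v) for v in signal]
# tokens == [0, 1, 2]
```

The key property is lossy snapping: two slightly different inputs can map to the same token, which is exactly the trade-off a codebook makes between expressiveness and sequence length.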

The Multimodal Tokenization Tax

The most significant impact of native ingestion is the "tokenization tax." While a minute of speech might only contain 150 words, it can translate into thousands of tokens when processed as a raw audio signal. Some models compress audio into 20-millisecond chunks, resulting in roughly 50 tokens per second.
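Under the figures above (one token per 20-millisecond chunk, 150 spoken words per minute, and the ~750-words-per-1,000-tokens text rule), the gap works out to roughly a factor of fifteen:

```python
# Back-of-envelope comparison: one minute of speech as text vs. raw audio.
# Rates are the approximate figures cited above, not any specific model's.
SECONDS = 60
AUDIO_TOKENS_PER_SECOND = 50           # one token per 20 ms chunk
WORDS_PER_MINUTE_SPEECH = 150
TOKENS_PER_WORD_TEXT = 1 / 0.75        # ~1,000 tokens per 750 words

audio_tokens = SECONDS * AUDIO_TOKENS_PER_SECOND                      # 3000
text_tokens = round(WORDS_PER_MINUTE_SPEECH * TOKENS_PER_WORD_TEXT)   # 200

print(audio_tokens / text_tokens)  # 15.0
```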

This density means that media files consume the context window—the model's short-term memory—much faster than text. A few minutes of high-resolution video can easily consume hundreds of thousands of tokens. This explains why context windows have expanded to millions of tokens; it isn't necessarily for longer books, but to accommodate the massive data requirements of video and audio.

Unified Latent Spaces

A breakthrough in current AI architecture is the ability to interleave different types of data in a single sequence. In a unified latent space, the model is trained to recognize that the spoken word "apple," the written word "apple," and a picture of an apple all represent the same concept.

By aligning these different modalities into the same mathematical neighborhood, the model can reason across them simultaneously. When a user uploads a video and asks a question, the model isn't performing a search or a lookup; it is literally "watching" the frames as part of the prompt sequence. As efficiency improves—such as the 2026 updates to temporal compression—the goal is to reduce the redundant data while maintaining the rich context that makes multimodal AI so powerful.
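As a toy illustration of that alignment, with hand-picked three-dimensional vectors standing in for learned embeddings (real latent spaces have thousands of dimensions):

```python
import math

# Invented embeddings: the same concept in different modalities should
# point in roughly the same direction; unrelated concepts should not.
embeddings = {
    "text:apple":  (0.90, 0.10, 0.20),
    "audio:apple": (0.85, 0.15, 0.25),
    "image:apple": (0.88, 0.05, 0.30),
    "text:engine": (0.10, 0.90, 0.40),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means identical direction in the latent space."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Aligned modalities land in the same "neighborhood" of the space:
assert cosine(embeddings["text:apple"], embeddings["audio:apple"]) > \
       cosine(embeddings["text:apple"], embeddings["text:engine"])
```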


Episode #1085: The Tokenization Lie: How AI Actually Processes Media

Daniel's Prompt
Daniel
Custom topic: We hear quite a bit about tokens in the context of text inputs to AI models: this many words = approx this many tokens. However, when it comes to processing multimodal inputs like audio, images, video
Corn
If you have ever spent five minutes looking at an A P I pricing page for a large language model, you have probably seen that familiar rule of thumb. You know the one I am talking about. It is the idea that one thousand tokens is roughly equal to seven hundred fifty words. It is the foundational metric of the generative era. It is how we budget, how we benchmark, and how we understand the scale of the data we are feeding into these digital brains. But here is the thing that has been bothering me lately, and it is something our housemate Daniel brought up in a prompt he sent over this morning. That rule of thumb is a total lie the second you step outside the world of plain text.
Herman
It really is. And it is not just a little bit off. It is fundamentally a different mathematical universe. Herman Poppleberry here, by the way. And Corn, you are absolutely right. When we talk about audio, images, or video, we are not just counting words anymore. We are dealing with continuous signals. Daniel was asking specifically about why the pricing and the mechanics of multimodal inputs feel so much like a black box compared to text. And he is hitting on the most important architectural shift in A I that has happened over the last year or two. We have moved from the digital sandwich approach, which we talked about way back in episode nine hundred ninety two, to these native multimodal models like Gemini and G P T four o. But for the average developer or even a power user, the way a minute of audio or a high resolution image actually gets turned into tokens is completely opaque.
Corn
It feels like we are flying blind. I mean, if I upload a ten minute audio file of a meeting, am I consuming ten thousand tokens? A hundred thousand? Is it based on the file size, the duration, or the complexity of the sound? And more importantly, what is actually happening under the hood? Daniel wanted to know if we need unique tokenizers for every single file type or if the model just sees everything as a giant stream of numbers. So, today we are going to crack open that black box. We are going to look at the mechanics of multimodal tokenization, the move toward unified latent spaces, and why your context window is probably disappearing a lot faster than you realize when you start throwing media at it.
Herman
I love this topic because it forces us to look at the difference between a discrete signal and a continuous one. Text is easy because humans already did the hard work of discretizing it for us. We have letters, we have words, we have spaces. But a sound wave or a frame of video is a continuous flow of information. There are no natural boundaries. So, how does a transformer, which is fundamentally a sequence processing engine, handle something that does not have clear steps?
Corn
That is the perfect place to start. Let us bridge that gap. In the old days, if I sent an audio file to an A I, there was a pre processing step. It would go through an automatic speech recognition system, get turned into a text transcript, and then that text would be tokenized. But that is not what is happening anymore, right? We are talking about native ingestion.
Herman
We have moved away from that two step process because you lose so much information in the middle. If you just transcribe audio to text, you lose the tone, the prosody, the background noise, the emotion. To solve this, researchers had to figure out how to tokenize the signal itself. To answer Daniel’s first question about whether we need specific tokenizers for every type of input, the answer is a bit of a yes and no. Technically, you do need an encoder that is trained to understand the specific structure of that data. You cannot just feed raw binary data from a J P E G into a text tokenizer and expect it to work. The text tokenizer is looking for patterns in characters. An image encoder is looking for spatial relationships, edges, and textures.
Corn
So, it is not about the file extension. It is not like there is a separate tokenizer for a W A V file versus an M P three file. It is about the underlying modality.
Herman
Right. Once the data is decoded into its raw form, like a sequence of audio samples or a grid of pixels, it goes through a process called Vector Quantization, or V Q. This is the secret sauce. Imagine you have a color wheel with every possible shade of blue. That is a continuous signal. Vector Quantization is like saying, I am going to pick one hundred specific shades of blue and give each one a number. Every time I see a shade on the wheel, I will just map it to the closest number in my codebook. Now, instead of an infinite range of blues, I have a discrete set of tokens.
Corn
That makes a lot of sense. So, the codebook is essentially the vocabulary for that modality. But that brings up an efficiency problem. If I am sampling audio at, say, sixteen kilohertz, that is sixteen thousand data points every single second. Even if I am smart about how I group them, that sounds like a massive amount of tokens compared to the few words a human can speak in a second.
Herman
You have hit on the exact reason why audio and video are so much more expensive and computationally heavy. In a typical text conversation, one thousand tokens might cover several paragraphs of deep thought. In audio, depending on the model’s architecture, one thousand tokens might only cover a few seconds of sound. For instance, some of the newer multimodal encoders take a chunk of audio, maybe twenty milliseconds long, and compress it into a single token. But even at that rate, you are looking at fifty tokens per second. A one minute clip suddenly becomes three thousand tokens. Compare that to a transcript of that same minute, which might only be one hundred fifty words or two hundred tokens.
Corn
So, we are talking about a factor of fifteen or twenty in terms of sequence length. This really changes how we think about the K V cache, which we discussed in episode one thousand eighty one. If I am filling up my context window with these dense audio tokens, I am going to hit that memory wall much faster than I would with text.
Herman
Much faster. And this is where the transparency issue comes in. When you use an A P I like Gemini, they often charge you per second of audio rather than per token. They do that to make it easier for the user, but it hides the technical reality. Behind the scenes, that one second of audio is being expanded into a huge sequence of vectors that the transformer has to attend to. This is why you see limitations on how much video or audio you can upload. It is not just about storage; it is about the quadratic cost of attention. Every extra audio token makes the attention computation quadratically more expensive to run.
Corn
I want to dig into the idea of the unified latent space. Daniel asked if multiple tokenizers function simultaneously when you send audio and text in the same call. If I am using the Gemini A P I and I send a prompt that says, listen to this clip and tell me what the person is feeling, how does the model actually look at both of those things at once? Are they two separate streams that get merged later, or are they interleaved?
Herman
In the most modern architectures, they are interleaved. This is the big breakthrough of models like G P T four o and the latest iterations of Gemini. They use what is called a multimodal prefix or an interleaved sequence. Imagine a single long line of tokens. The first ten might be the text of your prompt. The next five hundred might be the tokens representing the audio clip. The model’s attention mechanism treats them all as part of the same sequence. It can look at a text token and an audio token at the same time and calculate the relationship between them.
Corn
But for that to work, the audio tokens and the text tokens have to exist in the same mathematical space, right? If the text tokenizer outputs a number representing the word apple, and the audio tokenizer outputs a number representing a specific frequency, those numbers mean nothing to each other unless they have been aligned.
Herman
That is the alignment layer. During training, the model is shown millions of examples of audio paired with text descriptions. It learns that the vector for the spoken word apple and the vector for the written word apple should point in roughly the same direction in this high dimensional space. This is why we call it a latent space. It is a hidden, underlying map of concepts that is independent of how the concept was delivered. Whether I see a picture of a dog, hear a bark, or read the word D O G, the model’s internal representation should land in the same neighborhood of that map.
Corn
That is fascinating. It suggests that the tokenizer itself is almost just a translator at the border. Once you are inside the country of the model, everyone speaks the same language of vectors. But that leads to Daniel’s third question, which I think is a huge point of confusion for a lot of people. When we upload these files, are they entering an ephemeral R A G pipeline, or are they being ingested like text? For anyone who missed our earlier deep dives, R A G stands for Retrieval Augmented Generation. It is usually where you store a bunch of documents in a database and the A I goes and looks them up. People seem to think that because audio and video are big files, the A I must be doing some kind of lookup.
Herman
And that is a very common misconception. For most of these high end multimodal A P Is, it is not R A G. It is native ingestion. When you upload that video to Gemini, it is actually putting those frames directly into the context window. It is essentially reading the entire file as part of the prompt. This is why the context windows have grown so large. When Gemini announced a one million or even two million token window, a lot of text users were like, why would I ever need that many tokens? But the answer is video. A few minutes of high quality video at several frames per second, where each frame is subdivided into patches that are tokenized individually, can easily eat up hundreds of thousands of tokens.
Corn
So, it is not searching a database of your video; it is literally watching the video as it processes your question. That is a massive difference in terms of how the model can reason about the data. If it were R A G, it might only find the specific parts of the video that seem relevant to your keywords. But with native ingestion, it can understand the temporal flow, the subtle changes over time, and the connection between a sound in the beginning and an action at the end.
Herman
But it also means you are paying for every single one of those tokens in every turn of the conversation. If you are in a chat session and you upload a video, and then you ask five follow up questions, you are re processing that video sequence every single time unless the system is using some very clever caching. This is where the tokenization tax really starts to hurt. We talked about this in episode six hundred sixty six regarding language barriers, but there is a similar tax for multimodal data. If the tokenizer is inefficient, you are essentially wasting money on redundant data.
Corn
Let us talk about that efficiency. How do we measure it? With text, we can look at the compression ratio. If a tokenizer can represent a complex idea in fewer tokens, it is more efficient. Is there a similar benchmark for audio or video?
Herman
It is much harder to define because it depends on the sampling rate and the resolution. But here is a concrete example. In January of twenty twenty six, there was a significant update to the Gemini A P I regarding how it handles multimodal inputs. They introduced a more aggressive form of temporal compression for video. Before that, the model might have been looking at every single frame, or maybe two frames per second, regardless of what was happening in the video. The update allowed the encoder to identify redundant frames. If nothing is moving in the shot for three seconds, it can represent that entire block with fewer tokens. It is almost like a smart video codec, but instead of optimizing for human eyes, it is optimizing for the transformer’s attention.
Corn
That is a huge leap forward. It reminds me of how we talked about audio engineering as prompt engineering in episode five hundred ninety eight. If I am a developer and I want to save money, I should probably be thinking about the signal density I am sending. If I send a high definition, sixty frames per second video of a person sitting still and talking, I am wasting a massive amount of compute. I could probably downsample that to five frames per second and seven twenty p, and the model would still understand the content perfectly.
Herman
You absolutely should. In fact, input normalization is going to be one of the most important skills for A I engineers in the next year. Most people just throw the raw file at the A P I and hope for the best. But if you understand that the tokenizer is going to chop that image into sixteen by sixteen pixel patches, you can resize your images to be multiples of sixteen to avoid weird artifacts or wasted padding tokens. For audio, if the model’s internal encoder is optimized for sixteen kilohertz, sending it a forty eight kilohertz professional studio recording is not giving you better results. It is just creating more work for the pre processor and potentially leading to more tokens than you need.
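Herman's patch-alignment point is easy to check numerically. This sketch assumes the encoder pads each dimension up to a multiple of the patch size; real encoders differ in the details, but the off-by-one-pixel penalty is the same idea:

```python
import math

def patch_tokens(width, height, patch=16):
    """Tokens for one image if the encoder pads each dimension up to a
    multiple of `patch`. Padding behaviour is assumed for illustration."""
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return cols * rows

# A 513x513 image pads up to 528x528: one extra pixel per side costs a
# whole extra row and column of patches.
assert patch_tokens(512, 512) == 32 * 32   # 1024 tokens
assert patch_tokens(513, 513) == 33 * 33   # 1089 tokens
```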
Corn
This brings us back to the opacity of pricing. If the A P I is charging me per second, but the underlying cost is based on tokens, and the number of tokens can change based on how the model compresses the signal, then the price per second is really just a simplified average. It feels like the industry is trying to hide the complexity from us, but in doing so, they are preventing us from being efficient.
Herman
It is a classic trade off. They want it to be user friendly. They want you to think, okay, it costs one cent per minute of audio. That is easy to put in a spreadsheet. But if you are building an enterprise application that processes millions of minutes, you need to know if you can get that down to half a cent by optimizing your bitrates. Right now, most providers do not give you that level of granularity. We are in this weird middle ground where the technology is multimodal, but the business model is still stuck in the text era.
Corn
I want to go back to Daniel’s question about whether we need unique tokenizers for every file type. You said yes and no, but let us look at the future. Do you think we will ever get to a point where there is a truly universal tokenizer? Something that does not care if it is looking at a pixel, a sound wave, or a sensor reading from an industrial machine?
Herman
There is a lot of research into what we call token free or continuous latent models. The idea is to skip the discrete tokenization step entirely and just map the raw signal directly into a continuous vector space. If we can do that, we eliminate the codebook. We eliminate the need to decide ahead of time which three hundred shades of blue are the most important. The model would just see the raw gradients. The problem is that transformers are currently designed to work with discrete sequences. To move to a truly universal, continuous input, we might need a fundamental change in the architecture, something like the state space models or Mamba architectures that people are getting excited about.
Corn
That would be a massive shift. It would mean the end of the token as the primary unit of A I. But until then, we are stuck with this hidden economy of multimodal tokens. Herman, you mentioned the January update for Gemini. Have you seen similar transparency in other models? For example, does G P T four o give any indication of how it is patchifying images?
Herman
They have started to. In their documentation, they explain that an image is often broken down into fifty token chunks or something similar, depending on the detail level you select. But even then, it is an approximation. There is this hidden layer of logic that decides how many tokens to allocate based on the complexity of the image. If you have a very detailed architectural blueprint, the model might decide it needs more tokens to capture all the lines and text than it would for a picture of a clear blue sky. This is what I call the cognitive load of the input. It is not just about the size of the file; it is about the density of the information.
Corn
That is such a crucial point. Cognitive load. It makes me think about how we as humans process information. If I show you a blank white wall, you do not need to spend much mental energy to describe it. But if I show you a busy street in Jerusalem, your brain is working overtime to categorize all the people, the cars, the stone textures. A I is finally starting to work the same way. The problem is that our current billing systems are based on the white wall and the busy street costing the exact same amount because they are both one image.
Herman
And that is why I think we are going to see a push for a multimodal token standard. We need a way for developers to say, I am sending you this many bits of information, and I expect it to cost this much. Right now, you might send the same image twice and get slightly different results because of how the cloud provider is balancing their compute load or which version of the encoder is active that day. It is very difficult to build a predictable business on top of that.
Corn
So, what can people actually do today? If you are a developer or a curious user like Daniel, and you want to be smart about this, where do you start?
Herman
Step one is to stop thinking in terms of file size and start thinking in terms of signal density. For images, look at the patch size of the model you are using. If it is sixteen by sixteen, or thirty two by thirty two, make sure your images are scaled appropriately. For audio, find out the native sampling rate of the encoder. Usually, sixteen kilohertz or twenty four kilohertz is the sweet spot. Anything above that is likely being discarded or downsampled anyway, so you might as well do it yourself and save the bandwidth.
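Herman's audio rule of thumb reduces to a one-liner, under the assumption that the encoder's native rate is sixteen kilohertz:

```python
def normalize_audio_rate(source_hz, encoder_hz=16_000):
    """Pick the sample rate to upload: anything above the encoder's native
    rate (assumed here to be 16 kHz) is discarded anyway, so downsample
    first and save the bandwidth. Never upsample."""
    return min(source_hz, encoder_hz)

assert normalize_audio_rate(48_000) == 16_000   # studio recording: downsample
assert normalize_audio_rate(8_000) == 8_000     # phone audio: leave as-is
```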
Corn
And for video?
Herman
Video is all about the frame rate. Most A I models do not need thirty frames per second to understand what is happening. For most tasks, like summarizing a meeting or identifying an object, one or two frames per second is plenty. If you can reduce your video from thirty frames to two, you have just cut your token consumption by ninety percent without losing much semantic meaning. This is the kind of input normalization that separates the hobbyists from the pros right now.
Corn
That is a great practical takeaway. I also think we need to be very careful with long conversations involving media. If you are using a chat interface and you have uploaded a video, remember that every time you ask a new question, you might be paying for that entire video again. It is often better to start a new thread if your next question does not actually require the context of that heavy media file.
Herman
That is a pro tip right there. The context window is a hungry beast. Every time you hit enter, you are feeding it the entire history of the chat. If that history includes a ten minute audio file, you are paying a massive tokenization tax on every single message.
Corn
It is funny, we started this show years ago talking about simple text prompts, and now we are talking about managing multi gigabyte streams of sensor data and video frames. The weird prompts are getting a lot more complex, but the core challenge remains the same. We are trying to figure out the most efficient way to communicate with these systems.
Herman
It really is the same journey. We are just moving from vocabulary to signal. And I think Daniel’s question highlights how much we still have to learn about the bridge between the physical world of waves and light and the digital world of vectors and tokens. We are building a map of the world, one patch and one millisecond at a time.
Corn
Well, I think we have covered a lot of ground here. We have looked at why the token to word rule fails for multimodal, how Vector Quantization creates a codebook for sound and light, and why native ingestion is replacing the old R A G pipelines for media. It is a fascinating time to be looking at this, especially with the rapid updates we are seeing in early twenty twenty six.
Herman
It really is. And I hope this gives Daniel and our listeners a bit more clarity when they are looking at those A P I dashboards. Don’t let the simple pricing fool you. There is a lot of complex engineering happening in those black boxes.
Corn
Definitely. Let's really lean into that complexity for a second, Herman. If we are talking about a forty five hundred word deep dive, we need to address the actual sequence length math for the listeners who are building these systems. If I have a context window of one million tokens, like we see in Gemini one point five Pro, and I am feeding it a high resolution video, how many minutes are we actually talking about before the model starts to lose its mind?
Herman
That is the million dollar question. Or, given A P I costs, maybe the ten thousand dollar question. Let's do the math. If you are using a model that samples video at one frame per second, and each frame is subdivided into a grid of sixteen by sixteen patches, that is two hundred fifty six tokens per frame. At one frame per second, a ten minute video is six hundred seconds, which equals one hundred fifty three thousand six hundred tokens. That sounds manageable for a million token window, right? But what if you need more detail? What if you are sampling at ten frames per second to catch fast motion? Suddenly, that same ten minute video is one point five million tokens. You have just blown past the context window of almost every model on the market.
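Herman's arithmetic checks out, and it can be sketched directly (the 256-tokens-per-frame figure assumes the 16x16 patch grid he describes):

```python
CONTEXT_WINDOW = 1_000_000   # e.g. a one-million-token model
TOKENS_PER_FRAME = 256       # a 16x16 grid of patches per frame (assumed)

def video_tokens(seconds, fps):
    """Rough token count for a clip at a given sampling rate."""
    return int(seconds * fps * TOKENS_PER_FRAME)

ten_minutes = 600  # seconds

assert video_tokens(ten_minutes, 1) == 153_600       # fits comfortably
assert video_tokens(ten_minutes, 10) == 1_536_000    # blows past the window
assert video_tokens(ten_minutes, 10) > CONTEXT_WINDOW
```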
Corn
And that is just the video. That doesn't include the audio track or the text prompt you are sending along with it. This is why we see these models sometimes hallucinating or "forgetting" the beginning of a video. It is not necessarily that the model is bad; it is that the token density of the video has pushed the most relevant information out of the active attention mechanism.
Herman
And this brings us back to the "Tokenization Tax" from episode six hundred sixty six. In that episode, we talked about how certain languages like Telugu or Burmese are tokenized very inefficiently compared to English, meaning speakers of those languages pay more for the same A I performance. We are seeing a "Multimodal Tax" now. If you don't know how to normalize your inputs, you are paying a massive premium for data that the model doesn't even need to do its job.
Corn
I want to go deeper into the "Digital Sandwich" versus "Native Multimodal" distinction. In episode nine hundred ninety two, we talked about how early voice assistants were basically three separate models taped together. You had the A S R for speech to text, the L L M for reasoning, and the T T S for text to speech. In that world, tokenization was simple because everything was converted to text before the "brain" ever saw it. But in a native model like G P T four o, the "brain" is seeing the audio tokens directly. Herman, does that mean the model actually "hears" the frequency, or is it still just seeing a number that represents a frequency?
Herman
It is seeing a number, but that number is part of a vector that captures the relationship between that frequency and all the others around it. It is like the difference between reading the sheet music for a song and actually hearing the vibration of the strings. The native model can "feel" the vibration in the data. This is why native multimodal models are so much better at detecting sarcasm or emotion in a voice. They aren't just reading the words "I am fine"; they are seeing the token for a high pitched, strained frequency that contradicts the words.
Corn
That is a powerful shift. But it also means the tokenizer has to be much more sophisticated. If the codebook for the audio tokenizer is too small, the model becomes "tone deaf." It might not have a token for a specific inflection, so it just maps it to the closest thing it knows, and suddenly the sarcasm is gone.
Herman
Precisely. This is why the development of these codebooks is such a closely guarded secret. Google, Open A I, and Anthropic are all competing to create the most efficient and expressive codebooks. They want to represent the maximum amount of human experience with the minimum number of tokens. It is the ultimate compression challenge.
Corn
So, when Daniel asks if we need unique tokenizers for every file type, the answer is that we need unique "encoders" that can map those files into a shared "latent space." Once they are in that space, they are all just tokens. But the journey from a .mp4 to a token is where the magic—and the cost—happens.
Herman
And that journey is getting more efficient. The January twenty twenty six update I mentioned earlier for Gemini is a great example. They started using what is called "Dynamic Patching." Instead of dividing every image into a fixed grid, the model looks at the image first and decides where the information is. If you have a picture of a person standing in front of a plain white wall, the model might only use a few tokens for the wall and hundreds of tokens for the person's face. It is a more "human" way of looking at things. We don't give equal attention to every square inch of our field of vision.
Corn
That feels like the beginning of the end for the "token" as a fixed unit of cost. If the number of tokens for an image can change based on what is in the image, then the A P I providers are going to have a hard time explaining their bills to users. "Why did this photo of my cat cost twice as much as the photo of my dog?" "Well, your cat has more complex fur patterns, sir."
Herman
It sounds ridiculous, but that is exactly where we are headed. We are moving from a "per word" economy to a "per unit of complexity" economy. And for developers, that means we need new tools. We need "token simulators" that can tell us how much a file will cost before we hit the "send" button.
Corn
I think that is a great place to wrap up the technical deep dive. We have covered the shift from the digital sandwich to native ingestion, the mechanics of Vector Quantization, the reality of the context window wall, and the future of dynamic, complexity based tokenization.
Herman
It has been a journey. And I think it really highlights that we are still in the "Wild West" of multimodal A I. The rules are being written in real time.
Corn
Definitely. And hey, if you are finding these deep dives helpful, we would really appreciate it if you could leave us a review on your favorite podcast app. Whether you are on Spotify or Apple Podcasts, those reviews really help other people find the show and join the conversation.
Herman
They really do. We love seeing the feedback and the new questions that come in. It keeps us on our toes.
Corn
You can find all our past episodes, including the ones we mentioned today like episode nine hundred ninety two on voice A I and episode one thousand eighty one on the K V cache, over at myweirdprompts.com. There is a search bar there so you can dig through our entire archive of over a thousand episodes.
Herman
Thanks for joining us again. It is always a pleasure to dive into the weeds with you, Corn.
Corn
Same here, Herman. Until next time, this has been My Weird Prompts.
Herman
Take care, everyone.
Corn
I was just thinking, Herman, about that codebook analogy you used for the colors. If we have a codebook for everything, does that mean the A I’s world is essentially just a very large, but finite, collection of symbols? Like, is there any room for true novelty that hasn't been tokenized?
Herman
That is a deep philosophical question to end on. In a discrete system, yes, everything is a combination of existing symbols. But the number of possible combinations in a high dimensional space is so vast that it might as well be infinite. It is like the alphabet. We only have twenty six letters, but we have not run out of new things to say yet.
Corn
That is a fair point. Though I still think a sloth’s perspective on time might need its own specialized tokenizer. It is a much slower frequency than what most models are trained for.
Herman
We will have to look into that for episode two thousand. A sloth specific encoder.
Corn
I will start working on the training data. It might take a while.
Herman
I would expect nothing less.
Corn
Alright, we should probably head out. Daniel is probably wondering where his audio prompt went.
Herman
He is probably already working on the next one.
Corn
Most likely. Thanks for listening, everyone. We will catch you in the next one.
Herman
Bye for now.
Corn
One last thing, for those of you interested in the technical specifics of the Gemini update I mentioned, they have a developer blog post from mid January that goes into the temporal compression algorithms. It is a bit of a dense read, but if you are building video applications, it is essential.
Herman
Good call. It explains the difference between their fixed rate sampling and the new dynamic sampling. It is a game changer for cost optimization.
Corn
Alright, now we are really going. Thanks again.
Herman
See you.
Corn
I wonder if we should have mentioned the geopolitical side of this. I mean, the fact that the leading multimodal models are almost entirely coming out of American companies like Google and Open A I is a huge deal for digital sovereignty.
Herman
It is. And the infrastructure required to run these multimodal encoders is so massive that it creates a natural moat. Not many countries or companies can afford the compute to tokenize the world in real time.
Corn
It is a new kind of power. The power to define the codebook for reality.
Herman
That is a heavy thought. Maybe a topic for next week.
Corn
Maybe so. Alright, for real this time, thanks for listening.
Herman
Goodbye.
Corn
And don't forget to check out the website at myweirdprompts.com for the full transcript and links to the research papers we discussed.
Herman
We will have everything linked there.
Corn
Great. Talk soon.
Herman
Talk soon.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.