#758: AI Surveillance: Mastering Frigate, YOLO, and TPUs

Turn passive cameras into active observers. Learn how Frigate and YOLO models use AI to revolutionize home security and object detection.

Episode Details
Duration: 32:24

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Traditional motion detection has long been a source of frustration for homeowners, often triggered by shifting shadows, wind-blown leaves, or passing insects. Modern surveillance has moved past these limitations through the implementation of local AI-driven object detection. Systems like Frigate NVR leverage advanced neural networks to identify specific objects—such as people, cars, or animals—transforming passive video recording into an intelligent, active observation tool.

The YOLO Revolution

The core technology driving this shift is the YOLO (You Only Look Once) series of models, currently maintained by Ultralytics. Before YOLO’s emergence, object detection was a slow, multi-stage process that scanned images piece by piece. YOLO changed the landscape by treating detection as a single regression problem.

By processing an entire frame in one pass, the model predicts object categories and coordinates simultaneously. This architectural efficiency is what allows for real-time monitoring across multiple high-definition camera feeds. With the arrival of versions like YOLOv11, these models have become more accurate and faster, making them ideal for everything from smart city traffic management to precision agriculture and industrial quality control.
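The single-pass idea is easier to see in code. The sketch below is not the real YOLO head (which also involves anchors, strides, and non-maximum suppression); it is a toy decoder showing how one loop over a grid of per-cell predictions yields both the "what" and the "where" in a single pass. The grid size, cell size, and threshold are illustrative assumptions.

```python
# Toy decode of a single-pass detector output: each grid cell carries an
# objectness score, a class name, and box offsets relative to the cell.
GRID = 4    # 4x4 grid over the image (hypothetical)
CELL = 160  # pixels per cell for a 640x640 input

def decode(grid_preds, threshold=0.5):
    """grid_preds[row][col] = (objectness, class_name, dx, dy, w, h)."""
    boxes = []
    for row in range(GRID):
        for col in range(GRID):
            obj, cls, dx, dy, w, h = grid_preds[row][col]
            if obj < threshold:
                continue  # low-confidence cells are discarded, filtering noise
            # Box centre = cell origin + fractional offset within the cell.
            cx = col * CELL + dx * CELL
            cy = row * CELL + dy * CELL
            boxes.append((cls, obj, cx, cy, w, h))
    return boxes

# One fabricated frame: a confident "person" in cell (1, 2), noise elsewhere.
preds = [[(0.01, "none", 0, 0, 0, 0)] * GRID for _ in range(GRID)]
preds[1][2] = (0.92, "person", 0.5, 0.5, 80, 200)
print(decode(preds))
```

The key point is that classification and localization come out of the same forward pass, rather than a separate classifier run per candidate region.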

Customizing AI for Specific Needs

While base models are trained on massive datasets like COCO (Common Objects in Context) to recognize standard items like bicycles or umbrellas, many users require more specialized detection. Through a process called transfer learning, hobbyists can adapt existing models to recognize unique objects, such as a specific family pet or a particular power cord that might pose a hazard in a nursery.

The success of a custom model depends heavily on the quality of the training data. A diverse dataset—featuring the target object in various lighting conditions, angles, and backgrounds—is essential. Modern tools now allow for "auto-labeling," where larger AI models assist in preparing data for smaller, faster models. Once a dataset is ready, training can be completed in a few hours on a standard consumer GPU or via cloud-based services, resulting in a custom "weights" file that can be plugged directly into an NVR.
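Labeling in practice means producing one plain-text annotation file per image. The widely used YOLO txt label format stores one line per box, with coordinates normalized to the image size; a minimal converter (the class id and box values below are invented for illustration):

```python
def yolo_label_line(class_id, box, img_w, img_h):
    """Convert a pixel-space box (x_min, y_min, x_max, y_max) into the
    YOLO txt label format: "class x_center y_center width height",
    with all four values normalized to 0..1 by the image dimensions."""
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"

# A made-up "power cord" box (class 0) in a 1280x720 frame.
print(yolo_label_line(0, (100, 160, 300, 560), 1280, 720))
```

Normalized coordinates are what make the same labels usable whether the image is later resized for a small, fast model or a larger one.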

The Power of Specialized Hardware

Running complex AI models in real-time requires significant computational power, but traditional CPUs and even high-end GPUs are not always the most efficient choice for home setups. This is where Tensor Processing Units (TPUs), such as the Google Coral, become essential.

While a GPU is a versatile "Swiss Army knife" designed for a wide range of tasks, a TPU is an Application-Specific Integrated Circuit (ASIC) built solely for matrix multiplication—the primary mathematical operation behind neural networks. TPUs utilize a "Systolic Array" architecture, where data flows through the processor cells in a continuous stream. This minimizes the "Von Neumann bottleneck," the delay caused by constantly moving data between the processor and main memory. This specialized design allows a tiny, low-power TPU to outperform massive, power-hungry graphics cards in dedicated object detection tasks.
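The operation a TPU is built around is nothing exotic; it is matrix multiplication, shown here as plain Python. A real systolic array does not loop like this: the partial sums in the inner loop are exactly what flows from cell to neighboring cell in hardware, without round-trips to main memory.

```python
def matmul(a, b):
    """Naive matrix multiply - the single operation a TPU's systolic
    array accelerates. Each output element accumulates partial products;
    in a systolic array those partial sums are passed between adjacent
    processing cells instead of being written back to memory."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for k in range(inner):
            for j in range(cols):
                out[i][j] += a[i][k] * b[k][j]
    return out

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Nearly all of a neural network's inference cost reduces to calls like this, which is why a chip that does only this can beat a general-purpose processor at it.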


Episode #758: AI Surveillance: Mastering Frigate, YOLO, and TPUs

Daniel's Prompt
Daniel
I’ve been exploring Frigate, an open-source project for self-hosting an NVR with AI-based object detection. Beyond baby monitoring, there are many applications for these models, like the YOLO series from Ultralytics. What exactly are these object detection models, what are their various applications, and how do you go about training a custom model?

In terms of hardware, why do these models perform more efficiently on a TPU compared to a high-end GPU? If someone were building a home server for a self-hosted NVR with AI features, what specifications should they prioritize to avoid needing niche components? Lastly, is it actually feasible to run an NVR on small, AI-focused single-board computers, like those that claim to have integrated TPUs?
Corn
Hello everyone and welcome back to My Weird Prompts. I am Corn, and I am here with my brother, the man who has probably forgotten more about computer architecture than I will ever know.
Herman
Herman Poppleberry here, at your service. And Corn, that is a generous way of saying I have a lot of useless trivia rattling around in my brain. Though, to be fair, in the world of technology, today’s useless trivia is often tomorrow’s essential troubleshooting guide.
Corn
It is certainly not useless today, Herman. Today’s prompt from Daniel is about something that has been a frequent topic of conversation in our house lately, especially since our home server went through its recent, shall we say, hardware evolution. For those who do not follow our social media, let us just say that trying to run six simultaneous four K video streams on a ten year old processor results in a very expensive space heater.
Herman
It was a valiant effort, Corn. But physics always wins in the end.
Corn
It does indeed. Daniel is asking about Frigate, which is an open source network video recorder, or NVR, that uses artificial intelligence for object detection. He wants to know about the models behind it, like the YOLO series from Ultralytics, how to actually train those models for custom needs, and the specific hardware that makes it all possible without breaking the bank or melting your motherboard.
Herman
This is a fantastic topic because it sits right at the intersection of home automation, privacy, and some really heavy duty computer science. Most people think of security cameras as just dumb boxes that record video to a hard drive, but when you add local AI to the mix, you turn a passive recording device into an active observer that understands its environment.
Corn
Right, and it is a massive shift from the traditional motion detection that anyone who has owned a cheap camera knows about. You know the struggle, Herman. A shadow moves across the driveway, or a moth flies in front of the lens, and suddenly your phone is buzzing at three in the morning with a "motion detected" alert. You check the footage, and it is just a leaf blowing in the wind.
Herman
Exactly. Standard motion detection is incredibly primitive. It just looks for groups of pixels changing color or brightness between frames. It has no concept of what those pixels represent. AI object detection, specifically the kind Frigate uses, actually looks at the shapes and features within the frame and says, "That is a person," "That is a car," or "That is a dog." It filters out the noise so you only get alerted when something you actually care about happens.
Corn
So let us start with the models themselves. Daniel mentioned the YOLO series and Ultralytics. For those who are not familiar with the acronym, YOLO stands for You Only Look Once. Herman, can you break down what that actually means in the context of computer vision? Because it sounds like a catchy slogan, but I know there is a lot of math behind it.
Herman
It is a very clever name for a very clever architecture. To understand why YOLO was such a breakthrough, you have to look at how object detection worked before it arrived on the scene around two thousand and fifteen. Older systems used a multi stage process. First, they would run a "region proposal" algorithm to scan an image and find areas that might contain an object. Then, they would crop those regions out and run a separate classifier on each one to see what was in there. It was incredibly slow because you were essentially running dozens or hundreds of separate predictions for a single frame of video.
Corn
So it was like a person looking at a giant mosaic through a tiny magnifying glass, moving it inch by inch across the whole thing and trying to remember what they saw in the previous spot?
Herman
That is a perfect analogy. It was computationally expensive and nearly impossible to do in real time on consumer hardware. YOLO changed the game by treating object detection as a single regression problem. It takes the entire image and processes it through the neural network in one go. It divides the image into a grid and, for each grid cell, the model predicts the probability of an object being there and the coordinates for a bounding box. Because it does this all in a single pass, it is incredibly fast. That is the "You Only Look Once" part. You are looking at the whole image once to get both the "what" and the "where."
Corn
And that speed is what makes it feasible for something like Frigate, right? If you are trying to monitor four or five high definition camera feeds in real time, you cannot afford to have a model that takes half a second to process a single frame.
Herman
Right. If you are doing thirty frames per second across five cameras, that is one hundred and fifty frames every second. Even if you only process a subset of those frames—say, five frames per second per camera—you still need massive throughput. Ultralytics is the company that currently maintains and develops the most popular versions of YOLO. We have seen a rapid evolution from YOLO version eight to version eleven, which is the current gold standard as of early twenty twenty six. They have turned what used to be a very complex research project into a very polished tool that developers can use with just a few lines of code.
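Herman's arithmetic here is worth making explicit. A sketch of the detection budget for the hypothetical five-camera setup he describes:

```python
cameras = 5
camera_fps = 30
detect_fps = 5  # Frigate-style: only a subset of frames go to the model

total_frames = cameras * camera_fps  # frames arriving every second
detections = cameras * detect_fps    # frames actually analyzed per second
budget_ms = 1000 / detections        # time the model has per analyzed frame

print(total_frames, detections, round(budget_ms, 1))
```

Even after sampling down to five detections per second per camera, the model gets only about forty milliseconds per frame, which is why single-pass architectures matter.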
Corn
It is amazing how fast this has moved. I remember when version five was the big thing, and now we are already talking about version eleven and the experimental version twelve. What are the actual applications here beyond just security? Daniel mentioned baby monitoring, which he uses to see if the baby is crawling somewhere hazardous. But where else are these models showing up in the real world?
Herman
Oh, the applications are everywhere now. In smart cities, they use YOLO to monitor traffic flow, count pedestrians at crosswalks to adjust signal timing, and even detect accidents or stalled vehicles in real time. In agriculture, you can mount these cameras on drones or autonomous tractors to detect weeds versus crops, allowing for precision spraying that uses ninety percent less herbicide. In manufacturing, they are used for quality control on assembly lines to spot defects that are too small or moving too fast for a human eye to catch.
Corn
I have even seen some interesting wildlife research using it. People set up trail cameras in remote areas and use these models to sort through thousands of hours of footage, automatically tagging every time a specific species of bird or mammal walks by. It saves researchers months of manual labor.
Herman
It really does. And because these models are open source, anyone can take the base model, which is usually trained on a massive dataset like COCO—which stands for Common Objects in Context—and then fine tune it for their specific needs. The COCO dataset has about eighty categories, like person, bicycle, car, and umbrella. But if you need to detect something specific, like a particular type of industrial valve or a specific breed of cat, you can do that.
Corn
That leads perfectly into Daniel’s next question about training a custom model. If I want a model that does not just recognize a dog, but specifically recognizes my dog, or if I want it to recognize a specific type of power cord as a hazard for a baby, how does that process work? Is it something a hobbyist can actually do in twenty twenty six?
Herman
It is definitely within reach for a hobbyist, though it does require some patience and a bit of a learning curve. The process is called transfer learning. You do not start from zero. You start with a model that already "knows" how to see shapes, edges, and textures because it was trained on millions of images. Then, you give it a smaller, specialized dataset of your own to teach it the final layer of recognition.
Corn
So you are not teaching it from scratch how to see. You are just teaching it a new vocabulary.
Herman
Exactly. It already understands the concept of a "thing." You just need to show it, say, one hundred to five hundred pictures of your specific object from different angles, in different lighting, and against different backgrounds. You have to "label" those images by drawing bounding boxes around the object. There are tools like Label Studio or CVAT that make this easier, and lately, we have seen "auto labeling" tools where a larger, slower AI model helps you label your images for the smaller, faster model.
Corn
Once you have your labeled images, what happens next? Do you need a supercomputer?
Herman
Not at all. You run a training script, usually in Python. If you have a decent graphics card, like an Nvidia thirty sixty or better, it might take a few hours. If you do not have a GPU, you can use services like Google Colab or Lambda Labs, which give you access to a powerful GPU in the cloud for a few dollars. Once the training is done, you get a "weights" file. You plug that file into Frigate, and suddenly your NVR knows exactly what your specific power cord looks like.
Corn
It sounds straightforward, but I imagine the quality of the data is the biggest hurdle. If your pictures are all blurry or the lighting is always the same, the model will probably struggle when the sun goes down or the camera gets a bit of rain on the lens.
Herman
That is the golden rule of machine learning: garbage in, garbage out. If you only show the model your dog sleeping on the rug, it might not recognize the dog when it is running in the grass. You need diversity in your training data—different times of day, different weather, and different distances from the camera. This is where "data augmentation" comes in, where the software automatically flips, rotates, and changes the brightness of your images to artificially create more variety.
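The flips and brightness shifts Herman mentions are simple pixel transforms. A toy version, operating on an image stored as a nested list of pixel values (real pipelines use libraries built for this, but the underlying idea is this small):

```python
def hflip(img):
    """Mirror each row horizontally - one 'free' extra training image."""
    return [row[::-1] for row in img]

def brighten(img, delta, cap=255):
    """Shift every pixel value by delta, clamped to the valid 0..cap range."""
    return [[min(cap, max(0, p + delta)) for p in row] for row in img]

img = [[10, 20, 30],
       [40, 50, 250]]
print(hflip(img))
print(brighten(img, 20))
```

Each transform produces a labeled image the model has never seen, stretching a small hand-labeled dataset into a more diverse one.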
Corn
Now, let us get into the hardware side of things, because this is where the prompt gets really technical and where you usually get that glint in your eye, Herman. Daniel asked why these models perform more efficiently on a TPU compared to even a high end GPU. We are talking about the Google Coral TPU, which is a tiny little chip that looks like a postage stamp, often outperforming a massive, power hungry graphics card for this specific task. Why is that?
Herman
This is one of my favorite topics because it is all about specialized architecture versus general purpose architecture. A GPU, or Graphics Processing Unit, is a general purpose parallel processor. It was originally designed to render millions of triangles for video games, which involves a lot of complex floating point math. It is very flexible. You can use it for gaming, for video editing, for scientific simulations, and for training AI models.
Corn
So it is like a Swiss Army knife. It can do everything, but maybe it is not the absolute best tool for a very specific, repetitive task.
Herman
Exactly. A TPU, or Tensor Processing Unit, is an ASIC, which stands for Application Specific Integrated Circuit. It was designed from the ground up to do exactly one thing: matrix multiplication. In a neural network, almost all the heavy lifting is just multiplying huge arrays of numbers together over and over again.
Corn
I remember you explaining this to me once before when we were fixing the server. You mentioned something called a Systolic Array?
Herman
Yes! That is the secret sauce. In a traditional CPU or even a GPU, the processor has to constantly go back to memory to fetch data, perform a calculation, and then store the result back in memory. That moving of data back and forth—the "Von Neumann bottleneck"—uses a lot of energy and creates a massive delay. In a Systolic Array, which is what the TPU uses, the data flows through the processor like blood through a heart. Each cell in the array performs a small part of the multiplication and passes the result directly to its neighbor without ever going back to main memory.
Corn
So it is like an assembly line where the product never leaves the belt until it is completely finished.
Herman
Precisely. This makes it incredibly efficient at "inference," which is the process of running a pre trained model to make a prediction. The Google Coral TPU is rated at four TOPS, meaning four trillion operations per second, while using only about two watts of power. To get that same level of performance out of a GPU, you might be pulling fifty or one hundred watts. For a home server that is running twenty four hours a day, seven days a week, that power difference adds up to a lot of money on your electric bill.
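The electric-bill claim is easy to put in numbers. The two-watt and fifty-watt figures come from the discussion above; the price per kilowatt-hour is an assumed flat tariff for illustration.

```python
def yearly_cost(watts, price_per_kwh=0.30):
    """Cost of running a device 24/7 for a year at a flat tariff."""
    kwh = watts * 24 * 365 / 1000
    return kwh * price_per_kwh

print(round(yearly_cost(2), 2))   # Coral-class TPU doing inference
print(round(yearly_cost(50), 2))  # a GPU handling the same workload
```

At these assumptions the gap is a few dollars a year versus well over a hundred, every year the server stays on.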
Corn
That makes total sense for the efficiency part, but what about the raw performance? If I have a high end Nvidia card, like a forty ninety, surely that is faster than a tiny USB stick from Google?
Herman
On paper, the raw throughput of a forty ninety is massive. But there is a latency issue. Getting the video frame from your CPU, across the PCIe bus, into the GPU memory, running the model, and getting the result back takes time. The Coral TPU is designed for extremely low latency. Also, Frigate is often bottlenecked by how many simultaneous detections it can run. The Coral can handle multiple streams very effectively because it is so specialized. However, I should mention that in twenty twenty six, we are seeing new competitors like the Hailo eight and Hailo ten accelerators, which are starting to outperform the aging Google Coral hardware while maintaining that same efficiency.
Corn
So, if someone is building a home server right now for a self hosted NVR with these AI features, what should they actually prioritize? Daniel mentioned wanting to avoid niche components, and let us be honest, the Google Coral has been notoriously hard to find at a fair price for years. It is often out of stock or marked up by scalpers.
Herman
It is getting better, but you are right, it is still a bit of a niche item. If you want to build a robust NVR server without hunting for a Coral TPU or a Hailo accelerator, the best thing you can prioritize is a modern Intel CPU with QuickSync.
Corn
QuickSync? That is Intel’s hardware video encoder and decoder, right?
Herman
Yes. For an NVR, the hardest part is actually not the AI detection. It is the decoding of multiple high resolution video streams. If your CPU has to do that with raw processing power, it will be pegged at one hundred percent all the time, and your system will be sluggish. QuickSync offloads that decoding to a specialized part of the chip. Modern Intel chips, especially from the twelfth generation onwards—and certainly the newer Core Ultra chips—have incredibly good integrated graphics.
Corn
And can those integrated graphics handle the AI part too, or just the video decoding?
Herman
Both! That is the beauty of it. You can use Intel’s OpenVINO toolkit. OpenVINO is a set of tools that optimizes AI models to run on Intel hardware. It can run the YOLO models directly on the integrated GPU of a standard Intel chip. For a home setup with maybe five to eight cameras, a modern Intel i5 or even an i3 can handle the whole thing—decoding the video and running the AI—without needing a separate TPU or a dedicated graphics card. It is much more efficient and much easier to set up than it used to be.
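For reference, this is roughly what the detector section of a Frigate configuration looks like when using OpenVINO on Intel integrated graphics. Treat it as a sketch: the exact keys, device names, and the model bundled in the container vary by Frigate version, so check the current Frigate documentation before copying it.

```yaml
detectors:
  ov:
    type: openvino
    device: GPU   # run on the Intel integrated GPU via OpenVINO

model:
  width: 300
  height: 300
  input_tensor: nhwc
  input_pixel_format: bgr
  # Example model path as shipped in recent Frigate images; verify for
  # your version, or point this at your own converted model.
  path: /openvino-model/ssdlite_mobilenet_v2.xml
  labelmap_path: /openvino-model/coco_91cl_bkgr.txt
```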
Corn
That sounds like a much more accessible path for most people. What about the rest of the system? RAM and storage? I assume you need a lot of both if you are recording high def video twenty four seven.
Herman
RAM is actually not a huge factor for Frigate itself. Eight to sixteen gigabytes is plenty for most home users. What you really need to prioritize is storage endurance and the "type" of storage. If you are recording multiple streams constantly, you are putting a lot of wear on the drive. You want a dedicated surveillance drive, like a Western Digital Purple or a Seagate SkyHawk. These are designed for continuous write operations and have firmware optimized for video streaming.
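The storage math is worth doing before buying drives. A quick sketch; the per-camera bitrate is an assumption, so substitute your cameras' actual main-stream settings:

```python
def days_of_retention(disk_tb, cameras, mbit_per_s):
    """How many days of continuous recording fit on a drive of the
    given size, assuming every camera records around the clock."""
    bytes_per_day = cameras * mbit_per_s * 1_000_000 / 8 * 86_400
    return disk_tb * 1_000_000_000_000 / bytes_per_day

# Five cameras at an assumed 4 Mbit/s main stream, on an 8 TB drive.
print(round(days_of_retention(8, 5, 4), 1))
```

Numbers like these are why surveillance-rated drives matter: that is hundreds of gigabytes written every single day, indefinitely.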
Corn
And I suppose a fast network is a given. If you have five cameras sending four K video, that is a lot of traffic on your local network.
Herman
Definitely. You want your cameras on a wired ethernet connection if possible. Wi Fi cameras are okay for one or two, but once you scale up, the congestion becomes a nightmare. Also, for security, you should really have those cameras on a separate VLAN, or virtual local area network. This ensures that if one of your cameras is compromised, the attacker cannot use it as a bridge to access your personal computer or your bank accounts.
Corn
That is a great point. We have talked about IoT security before, and cameras are often the most vulnerable devices on a network because they rarely get firmware updates. Keeping them isolated is crucial. Now, what about Daniel’s last question? He asked about those AI focused single board computers, like the ones you see on AliExpress that claim to have integrated TPUs or NPUs. Are they actually feasible for running a full NVR like Frigate?
Herman
This is a bit of a "yes, but" situation. There are boards like the Orange Pi five Max or the various Rockchip based boards that have a built in NPU, which stands for Neural Processing Unit. On paper, they are amazing. They are small, cheap, and have dedicated AI hardware that is often faster than a Google Coral.
Corn
So what is the catch? There is always a catch with those ultra cheap boards.
Herman
The catch is the software ecosystem. Frigate and the underlying libraries like FFmpeg and Python need specific drivers to talk to those NPUs. Because these boards use chips from companies like Rockchip or Amlogic, the documentation is often very sparse, and the drivers are not always integrated into the main versions of the software. You end up having to use a "fork" of Frigate or a specific community built Docker image.
Corn
So you end up spending more time fighting with the code and the terminal than actually using the NVR to watch your house.
Herman
Exactly. It is a maintenance burden. If you enjoy the hacking, the troubleshooting, and compiling your own kernels, it is a great project. But if you just want your security system to work reliably so you can sleep at night, it can be very frustrating. When the main version of Frigate gets a cool new feature, you might have to wait months for it to be ported to your specific board’s NPU.
Corn
It is the classic open source trade off. You can have it cheap and powerful, or you can have it easy and supported, but rarely both at the same time.
Herman
That said, the Raspberry Pi five has made some big strides here. While it does not have a built in NPU, it has a PCIe slot now. You can buy a HAT—which is an expansion board—that lets you plug in a Coral TPU or a Hailo eight M.2 accelerator. That is a much more stable path because the Raspberry Pi has such a massive community supporting it.
Corn
So if someone wanted a small, low power dedicated box, a Raspberry Pi five with a Hailo or Coral accelerator is probably the most reliable single board computer route.
Herman
I would say so. But honestly, Corn, by the time you buy a Raspberry Pi five, a high quality power supply, a case, an NVMe base, an SSD, and the AI accelerator, you are looking at spending over two hundred dollars. For that same price, you can go on eBay and buy a used "tiny" office PC, like a Dell OptiPlex Micro or an HP ProDesk with an Intel tenth or eleventh gen processor.
Corn
And those tiny PCs are much more capable and easier to manage.
Herman
Much easier. They have better cooling, more ports, and you can just install standard Linux on them and everything usually just works. That is my personal recommendation for anyone starting out with Frigate. Look for a used enterprise mini PC. They are the unsung heroes of the home lab world. They are quiet, efficient, and built to run twenty four seven.
Corn
I think that is a very practical takeaway. It is easy to get caught up in the hype of niche hardware, but often the most "boring" standard equipment is the best for a system you need to rely on. So, to recap for Daniel, YOLO is the model that makes it fast, transfer learning is how you make it smart for your specific needs, TPUs are the efficiency kings because of their matrix multiplication architecture, and for a home build, a modern Intel chip with QuickSync is the sweet spot for most people.
Herman
Spot on. And I think it is worth mentioning that even if you do not have the AI hardware on day one, you can still use Frigate. It will just use your CPU for the detections. It might be slower and use more power, but it is a great way to test the waters and see if you like the software before you commit to buying more hardware.
Corn
That is how we started, and it really helped us understand which cameras needed the most attention. Before we wrap up, I want to touch on one more thing Daniel mentioned, which was the "downresed" version of the video feed. He noticed that these models do not look at the full resolution stream. Why is that? If I have a four K camera, wouldn't the AI be more accurate if it saw all those pixels?
Herman
You would think so, but it is actually another efficiency trick. Most of these models, like YOLOv11, are trained on images that are relatively small, usually six hundred and forty by six hundred and forty pixels. If you feed it a four K video frame, the first thing the model does is shrink it down to that size anyway.
Corn
So there is no benefit to giving it more pixels than it can actually "see" in its internal grid?
Herman
Not really. In fact, it is a disadvantage because shrinking a four K image down to six hundred and forty pixels takes a significant amount of CPU power. Frigate is smart. It lets you define a low resolution "sub stream" from your camera—usually something like seven hundred and twenty p or even lower—specifically for the AI to look at. Meanwhile, it records the high resolution "main stream" directly to your hard drive for when you need to see the fine details later. It is all about doing the least amount of work necessary to get the result.
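In a Frigate configuration, this record/detect split looks roughly like the fragment below. The RTSP URLs are placeholders, and stream paths and exact keys depend on your camera firmware and Frigate version, so treat this as a sketch rather than a copy-paste config.

```yaml
cameras:
  front_door:
    ffmpeg:
      inputs:
        # High-resolution main stream: written straight to disk.
        - path: rtsp://192.0.2.10:554/stream0
          roles:
            - record
        # Low-resolution sub stream: the only feed the AI ever sees.
        - path: rtsp://192.0.2.10:554/stream1
          roles:
            - detect
    detect:
      width: 1280
      height: 720
```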
Corn
It is amazing how much engineering goes into making these things seem simple. You just see a green box around a person on your phone, but underneath, there is this whole dance of video decoding, image resizing, and matrix multiplication happening dozens of times every second.
Herman
It really is a feat of modern computing. And the fact that we can do this locally, in our own homes, without sending our private video feeds to a cloud server owned by a giant corporation, is a huge win for privacy. We are seeing a real trend toward "Local AI" where you get the convenience of the cloud without the privacy risks.
Corn
Absolutely. That is the real power of things like Frigate. You get the high tech features of a Nest or a Ring camera, but you own the data and you control the hardware. No monthly subscriptions, no "outages" because a company’s server went down, and no one watching your footage but you.
Herman
And if your internet goes out, your security system still works perfectly. That is a big deal for a lot of people.
Corn
Well, Herman, I think we have given Daniel a lot to chew on. This has been a fascinating dive into the world of computer vision and home servers. I feel like I finally understand why our server room—which is actually just the closet under the stairs—is so much quieter now.
Herman
It is the power of efficiency, Corn. We are doing more with less.
Corn
Before we go, a quick reminder to our listeners. If you have been enjoying My Weird Prompts, we would really appreciate it if you could leave us a review on Apple Podcasts or a rating on Spotify. It genuinely helps other people discover the show and keeps us motivated to keep digging into these complex topics.
Herman
Yeah, it really does make a difference. We see every review and we appreciate the feedback, even if it is just to tell me I am being too nerdy about systolic arrays.
Corn
Never too nerdy, Herman. You can find all our past episodes, including the ones we mentioned today, at myweirdprompts.com. We have an RSS feed there for subscribers and a contact form if you want to send in your own prompt. You can also reach us directly at show at myweirdprompts.com.
Herman
We are also on Spotify, Apple Podcasts, and pretty much everywhere else you listen to podcasts.
Corn
Thanks again to Daniel for the prompt. This has been Episode seven hundred and fifty eight of My Weird Prompts.
Herman
Until next time, I am Herman Poppleberry.
Corn
And I am Corn. Thanks for listening, and we will talk to you soon. Goodbye!
Herman
Goodbye!

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.