Daniel sent us this one, and I have to say, the setup is immediately relatable. You're renting a new apartment, you go to Ikea to sort out the kitchen, and somewhere between the meatballs and the flatpack aisle you realize you have absolutely no idea whether that countertop is one-point-eight meters or two-point-two. The dimensions are gone. The layout is a blur. You're guessing. Daniel's question is: could you just take a few hundred photos of the apartment beforehand, feed them to an AI, and get back a usable three-dimensional model with actual approximate dimensions? And if so, what is technically happening when you do that? Because stitching photographs together sounds simple until you start pulling on that thread.
By the way, today's episode is powered by Claude Sonnet four point six.
Good to know we're in capable hands. Right, so Herman, where do we even start? Because Daniel's framing is smart. He's not asking "does this technology exist," he's asking what is actually going on under the hood, and I think that distinction matters a lot here.
It does, and I think the honest first move is to say there are at least three different things people conflate when they talk about this, and they behave very differently technically. You've got image stitching, which is what your panorama app does. You've got photogrammetry, which is a proper geometric reconstruction discipline. And then you've got what the newer AI-driven spatial modeling tools are doing, which borrows from both but is its own thing. Daniel's prompt is really asking us to trace the line from the first to the third, and I think that's the right way to structure it.
The Ikea problem is a genuinely good motivating case, because it's not just about pretty visuals. You need the model to be metrically accurate enough that when you look at it and think "that's a sixty-centimeter gap between the fridge and the wall," you can actually trust that measurement when you're standing in the store. That accuracy hinges on reconstructing depth from flat images, which is where the real challenge lies.
And that's the core problem, stripped back: a photograph throws away depth. The moment light hits a sensor, you get color and intensity at each pixel, but the distance information is gone. Everything from ten centimeters to ten kilometers gets flattened onto the same plane. Reconstructing three dimensions from that is fundamentally an inference problem.
Which is why a single photo gets you nowhere useful.
Right, you need multiple photos from different positions, because that's how you recover depth. The geometry is called parallax, the apparent shift in an object's position when you view it from two slightly different angles. Your brain does this constantly with your two eyes. Photogrammetry formalizes it: you take dozens or hundreds of overlapping images, identify the same physical point across multiple frames, and use the geometry of where the camera was standing each time to triangulate where that point sits in three-dimensional space.
It's worth pausing on that for a second because I think the brain analogy is actually more instructive than it first sounds. If you close one eye and hold your finger up in front of your face, you lose your sense of how far away it is. You can still see it, but the depth is gone. Open both eyes and suddenly you know exactly where it sits in space. That's literally the same principle photogrammetry is exploiting, just with a camera moving through space instead of two eyes sitting side by side.
The reason it works is that your brain, or the algorithm, knows the geometry of the two viewpoints. Your eyes are roughly six to seven centimeters apart, your brain has internalized that, and it uses the angular difference in what each eye sees to compute distance. Photogrammetry has to figure out the viewpoint geometry from the images themselves, which is a harder problem, but the underlying principle is identical.
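To make that geometry concrete, here's a minimal sketch of the stereo relation the hosts are describing, assuming an idealized rectified two-camera setup where depth follows Z = f · B / d. All numbers are illustrative.

```python
# Toy stereo depth from parallax, assuming two rectified pinhole
# cameras offset horizontally by a known baseline. The same point
# lands at slightly different image columns in each view; that pixel
# shift (the disparity) encodes depth: Z = f * B / d.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (meters) from focal length (px), baseline (m), disparity (px)."""
    return focal_px * baseline_m / disparity_px

# A point 3 m away, seen by cameras 6.5 cm apart (roughly eye spacing)
# through a lens with a 1000-pixel focal length:
f_px, baseline = 1000.0, 0.065
disparity = f_px * baseline / 3.0        # forward model: ~21.7 px shift
depth = depth_from_disparity(f_px, baseline, disparity)
print(f"disparity {disparity:.1f} px -> depth {depth:.2f} m")
```

Note how the disparity shrinks as depth grows, which is why distant points are harder to place precisely: a sub-pixel matching error at large depth translates into a large depth error.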
LIDAR sidesteps that whole inference problem by just...
LIDAR pulses laser light and times how long it takes to bounce back. Distance is a direct measurement, not a reconstruction. That's why in controlled comparisons, drone LIDAR tests have shown elevation accuracy around fifty-one to fifty-two centimeters even in complex terrain, while photogrammetry can drift depending on conditions.
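The time-of-flight principle is simple enough to sketch in a couple of lines; the numbers here are illustrative.

```python
# Time-of-flight in one line: a LIDAR pulse travels to the surface and
# back, so distance is half the round trip at the speed of light.

C = 299_792_458.0  # speed of light, m/s

def tof_distance_m(round_trip_s):
    return C * round_trip_s / 2.0

# A wall 5 m away returns the pulse after about 33 nanoseconds:
round_trip = 2 * 5.0 / C
print(f"{round_trip * 1e9:.1f} ns -> {tof_distance_m(round_trip):.3f} m")
```

The tiny timescales are the point: ranging at centimeter precision means resolving timing differences of tens of picoseconds, which is an electronics problem rather than an inference problem.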
Where do reference objects fit into this? Because Daniel specifically mentions using a chair or a glass of water as a known dimension.
That's the scaling problem. Photogrammetry can give you a geometrically correct model, meaning proportions are right, but it's unitless unless you anchor it to something real. A known object in the frame, something whose dimensions you can look up, gives the software a ruler. You're essentially saying: this object is forty-five centimeters tall, now calibrate everything else accordingly.
Which is elegant when it works, but when someone's using a non-standard chair, it becomes a disaster waiting to happen.
Right, and that's the exact failure mode with reference objects, and it's not just theoretical. If someone mistakenly uses a designer chair with unusual proportions, thinking it's a standard dining chair, every dimension in the model shifts. You end up with a systematic error that compounds across the entire space.
How much does that kind of error actually propagate? Like, if my reference chair is ten percent off, am I ten percent off everywhere?
It's a linear scaling error, so it propagates uniformly. If your chair is actually forty centimeters tall and you've told the software it's forty-five, your model is scaled up by about twelve percent across the board. In a four-meter kitchen, that's nearly fifty centimeters of phantom space. Your cabinet that looked like it would fit with room to spare is now colliding with the wall.
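The scaling arithmetic above, as a minimal sketch:

```python
# Linear scale error: if the reference object's assumed size is wrong,
# every measurement in the model is off by the same factor.

def scaled_error(true_ref, assumed_ref, true_length):
    scale = assumed_ref / true_ref          # e.g. 45 cm assumed / 40 cm actual
    measured = true_length * scale
    return measured, measured - true_length

# A 40 cm chair entered as 45 cm: a 4 m kitchen reads as 4.5 m.
measured, err = scaled_error(true_ref=0.40, assumed_ref=0.45, true_length=4.0)
print(round(measured, 2), round(err, 2))    # 4.5 0.5
```

The error is multiplicative, so it grows with the length being measured: the same miscalibrated chair that adds half a meter to a kitchen adds only a few centimeters to a doorframe.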
Which is a painful outcome after an hour of careful photography.
The cruelest part is that the model looks completely plausible. There's no visual artifact that tells you it's wrong. It's just quietly scaled incorrectly, and you only find out when the delivery driver shows up and the unit doesn't fit through the door.
Let's get into the stitching itself, because I think that's where Daniel's question really lives. The panorama comparison he raises is a good entry point. What is a panorama app actually doing?
At its simplest, a panorama is a two-dimensional alignment problem. You take a sequence of overlapping frames, find matching features across adjacent images, and warp and blend them into a single wide image. The key step is feature matching, identifying the same corner of a window frame or the same edge of a tile across two frames, and using those correspondences to figure out how to rotate and translate one image so it lines up with the next.
The overlap is non-negotiable for that to work.
You need roughly twenty to thirty percent overlap between adjacent frames, sometimes more in low-texture environments like a plain white wall, because the algorithm needs enough shared content to find reliable correspondences. Too little overlap and you get seams or outright failure. Too much and you're just doing redundant computation.
I've definitely seen what happens when the overlap breaks down. You get those panoramas where someone's arm is in two places at once, or there's a ghost seam across someone's face because they moved between frames.
Those artifacts are the algorithm's failure to find clean correspondences across the overlap region. A moving subject is the classic culprit because the feature that matched in frame one has physically relocated by frame two. The algorithm tries to reconcile the mismatch and produces something uncanny. In a static room that's less of a problem, but it tells you something important: the algorithm has no understanding of the scene. It's purely doing geometry on pixel patterns. If the pixel patterns lie, the result breaks.
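A toy version of the alignment step, using FFT phase correlation on a synthetic texture instead of a real feature matcher like SIFT. This is a simplification: actual stitchers estimate full homographies, not just translations, but the core idea of "find where the shared content lines up" is the same.

```python
import numpy as np

def translation_offset(img_a, img_b):
    """Estimate the (row, col) shift mapping img_b onto img_a
    via phase correlation."""
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-12          # keep phase only
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap large indices back to negative shifts.
    return tuple(int(p - s) if p > s // 2 else int(p)
                 for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(0)
scene = rng.random((128, 128))                         # synthetic wall texture
shifted = np.roll(scene, shift=(7, -12), axis=(0, 1))  # simulated camera pan

print(translation_offset(shifted, scene))  # recovers (7, -12)
```

On a featureless input, a flat white wall rather than random texture, the correlation peak disappears into noise, which is exactly the low-texture failure the hosts describe.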
The panorama case is: stitch flat, blend seams, done. Three-D reconstruction is a fundamentally different ask.
Much more involved. In the three-D case you're not just aligning images into a flat mosaic, you're trying to infer camera positions in three-dimensional space and simultaneously reconstruct the geometry of what the camera was looking at. The standard pipeline is called Structure from Motion, and it runs in roughly three stages. First, feature extraction: you identify distinctive points in every image, typically using something like SIFT or ORB, algorithms that find corners and edges that are stable across different viewpoints and lighting. Second, feature matching across image pairs. Third, bundle adjustment, which is the expensive part, where you jointly optimize the estimated camera positions and the estimated three-D point positions to minimize reprojection error.
Reprojection error being...
If you take your estimated three-D point and project it back onto each image using your estimated camera position, it should land exactly where the feature was detected. The gap between where it lands and where it actually was is the reprojection error. Bundle adjustment minimizes that gap across every point and every image simultaneously. It's a massive nonlinear optimization problem, which is why processing two or three hundred photos of a sixty-square-meter apartment is computationally heavy.
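A miniature version of that residual, assuming a single idealized pinhole camera sitting at the origin; all numbers are illustrative.

```python
import numpy as np

# Reprojection error in miniature: project an estimated 3-D point
# through a pinhole camera and measure how far it lands from the
# pixel where the feature was actually detected. Bundle adjustment
# minimizes this residual over all points and all cameras at once.

def project(point_3d, focal_px, principal_pt):
    """Pinhole projection, camera at origin looking down +Z."""
    x, y, z = point_3d
    return np.array([focal_px * x / z + principal_pt[0],
                     focal_px * y / z + principal_pt[1]])

detected = np.array([640.0, 360.0])       # where the feature was seen
estimate = np.array([0.10, 0.00, 2.0])    # current 3-D estimate (meters)

pixel = project(estimate, focal_px=1000.0, principal_pt=(600.0, 360.0))
error = np.linalg.norm(pixel - detected)  # reprojection error, in pixels

print(pixel)   # [650. 360.]
print(error)   # 10.0
```

In a real pipeline this residual is computed for every observation of every point, and the optimizer nudges both the point positions and the camera poses until the sum of squared residuals stops shrinking.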
Can you give a sense of scale there? Like, what are we talking about in terms of points and computations?
A typical indoor reconstruction from two hundred images might produce somewhere between fifty thousand and a few million sparse points in the initial Structure from Motion pass. Bundle adjustment is then optimizing the positions of all those points plus all the camera poses simultaneously, which means you're solving a system with potentially millions of variables where every variable affects every other variable. The reason it's tractable at all is that the Jacobian matrix involved is very sparse, most points only appear in a small fraction of images, so you can exploit that structure algorithmically. But it's still why your laptop fan starts screaming when you run Meshroom.
That's where the recent efficiency gains matter. There was work published earlier this year suggesting AI-assisted photogrammetry pipelines have cut resource requirements by around thirty percent compared to where things were twelve months ago.
That tracks with what I've been reading. The gains are coming from a few places: smarter feature matching that avoids redundant pairings, better initialization for the bundle adjustment so it converges faster, and learned depth priors that give the optimizer a better starting guess rather than solving from scratch. UniRecGen, a multi-view reconstruction system out of recent research, specifically addresses the case of sparse unposed images, meaning photos where you don't even know the camera positions in advance, and it outperforms earlier baselines on geometric accuracy by anchoring the reconstruction to what they call geometric anchors.
Which is basically a learned version of the reference object idea.
In a sense, yes. Instead of a physical chair in the frame, the model has internalized geometric priors from training data and uses those to constrain the reconstruction. The chair is implicit rather than explicit.
The chair as metaphor. Daniel would appreciate that.
The practical upshot for Daniel's apartment scenario is that the stitching problem breaks into two separable challenges: getting the geometry right, which Structure from Motion handles, and getting the scale right, which requires either a reference object, known camera parameters, or those learned priors. Miss either one and your Ikea planning session goes sideways.
Right, but what does that actually look like in practice? Because there's a gap between "technically solvable" and "works reliably enough that a normal person can do it before their Ikea trip."
That's where the knock-on effects start to bite. The geometry problem and the scale problem are theoretically separable, but in practice they compound. A small error in your reference object calibration, say five percent off on your chair height, propagates through every measurement in the model. If your kitchen is four meters long, you're now off by twenty centimeters. That's the difference between a cabinet fitting and not fitting.
That's assuming the reconstruction geometry itself is clean. Which it often isn't.
Dense indoor environments are hard. Reflective surfaces, like kitchen tiles or a glass splashback, confuse feature matching because the same physical point looks different depending on the angle. Plain walls give the algorithm almost nothing to grip onto. And furniture with repetitive patterns, think a grid of cabinet doors, creates false matches where the algorithm thinks it's found the same corner but it's actually one cabinet over.
That last one is interesting. So if you've got a row of identical kitchen cabinet doors, the algorithm might think it's found the same door twice when it's actually matched door three to door five?
It's called an ambiguous correspondence, and it's one of the nastier failure modes because it doesn't just introduce noise, it introduces a systematic geometric error. The reconstruction might confidently place that section of the kitchen in completely the wrong position. And from a visual inspection of the output, you'd never know, because each individual cabinet door looks perfectly reconstructed. It's only when you try to measure the overall run of cabinets that you realize it's half a meter shorter than it should be.
The apartment is basically a stress test for every known weakness in the pipeline.
It really is. Which is part of why photogrammetry for outdoor terrain, open fields, building facades, drone surveys, tends to produce cleaner results than indoor residential spaces. The outdoor case has natural texture variation, no reflections, and you can fly a consistent grid pattern. Indoors you're fighting the geometry on every front.
That's where the LIDAR comparison becomes interesting rather than just academic. Because LIDAR doesn't care about texture. It doesn't care if your kitchen tiles are reflective.
Right, the laser pulse either bounces back or it doesn't. You're not trying to infer geometry from visual appearance, you're measuring it directly. The fifty-one to fifty-two centimeter elevation accuracy figure from drone LIDAR comparisons is in terrain that would challenge photogrammetry, dense vegetation, irregular surfaces. Indoors, a LIDAR scanner can map a sixty-square-meter apartment in minutes with millimeter-level precision.
Why isn't everyone just doing that?
A decent handheld LIDAR scanner runs several thousand dollars. The iPhone Pro has a LIDAR sensor, which is interesting for this use case, but it's lower range and lower resolution than a dedicated unit. Photogrammetry, by contrast, costs you a phone camera and time. The thirty percent reduction in computational overhead from recent AI-assisted pipelines makes that tradeoff more attractive than it was even a year ago.
There's a fun bit of history there actually. LIDAR as a technology is older than most people realize. It was used in the Apollo fifteen mission in nineteen seventy-one to map the lunar surface. So the same basic principle that's now sitting in the pocket of anyone with a recent iPhone was first deployed to measure craters on the Moon fifty years ago. The miniaturization journey is kind of staggering when you put it that way.
The cost trajectory mirrors that. In the early two-thousands, a terrestrial LIDAR scanner cost hundreds of thousands of dollars and needed a dedicated operator. Now the sensor in an iPhone Pro costs Apple a few dollars to manufacture at scale. The physics hasn't changed at all. It's purely a fabrication and integration story.
Which is exactly the kind of thing that makes the next ten years interesting for Daniel's use case. Because if LIDAR sensors keep getting cheaper and more capable while photogrammetry pipelines keep getting more efficient, the two approaches are going to converge on the same consumer device at roughly the same price point.
They're already converging. The newer reconstruction apps on iPhone are doing sensor fusion, combining the LIDAR depth data with the camera imagery to get better results than either alone. The LIDAR gives you coarse geometry quickly, the photogrammetry fills in texture and fine detail, and the combination is more robust than either pipeline running independently.
For Daniel's apartment scenario, the honest answer is: photogrammetry with good reference objects gets you within a useful margin for Ikea planning, but you're not getting architectural precision.
That's about right. For practical interior planning, if you're asking "will this two-hundred-and-ten-centimeter bookcase fit against that wall," a well-executed photogrammetry pass with a known reference object probably gets you close enough. If you're asking "will this countertop clear the building code clearance by exactly the required amount," you need LIDAR or a tape measure.
Which is a reasonable division of labor. The real estate industry has figured this out.
They have, and it's a useful case study. Virtual apartment tours using photogrammetry have become standard. Companies like Matterport built their entire business on this, taking a few hundred overlapping images of a space and producing a navigable three-D model that potential tenants or buyers can walk through online. The dimensional accuracy is good enough to let people check whether their existing furniture fits, which is exactly Daniel's use case.
Matterport is interesting because they started with proprietary hardware, a dedicated camera rig that cost several thousand dollars, and have progressively moved toward supporting standard smartphones. That transition only became viable as the photogrammetry pipelines got good enough to compensate for the lower quality input.
That's the pattern across the industry. The hardware requirements drop as the software gets smarter. It's not that the physics of the problem got easier, it's that learned priors and better optimization are substituting for what used to require better sensors.
The computational cost of producing those models has dropped enough that it's now a routine service rather than a specialist operation.
That's the trajectory. The thirty percent efficiency improvement in AI-assisted pipelines is part of what's pushing this into the consumer tier. Tools like Meshy are now offering browser-based reconstruction that can turn a photo set into a navigable model relatively quickly. The precision ceiling is still lower than dedicated photogrammetry software, but the accessibility floor has dropped dramatically.
The retail angle is interesting too. Because if you can model a space accurately enough, you can do virtual product placement. Drop a sofa into the model and see if it looks right before you buy it.
That's already happening. Interior design platforms are doing exactly this, using photogrammetry-derived room models as the base and overlaying product meshes. The accuracy requirement there is actually a bit different, you need the spatial proportions to feel right visually more than you need millimeter precision, so the photogrammetry approach works well. The harder problem is lighting, making the inserted product look like it belongs in the actual room rather than pasted on top.
Which is a rendering problem rather than a reconstruction problem.
Separate pipeline entirely. But it illustrates how these technologies stack. Reconstruction gives you geometry and scale. Rendering handles how materials interact with light. AI is now doing both, but they're still distinct challenges with distinct error modes.
What are the moves? If Daniel's holding his phone right now, what's he actually doing with all this tech?
The reference object question is the first thing to nail down, and it's more nuanced than just "put a chair in the frame." You want something with known dimensions in all three axes, height, width, depth, that appears in multiple overlapping images across the space. A chair works. A water glass is trickier because it's small relative to the room, and small reference errors amplify. The bigger the reference object relative to the scene, the better your scale propagation.
A dining chair beats a coffee cup.
By a lot. And you want it centrally placed, not shoved into one corner where only a handful of images capture it. The scale anchor needs to be visible from multiple angles so the bundle adjustment can triangulate its position reliably.
What about the shooting pattern itself?
Consistent overlap is the discipline. Twenty to thirty percent between adjacent frames, and you want to walk the space in deliberate arcs rather than random sweeps. Think of it as painting the room with coverage. For a sixty-square-meter apartment, two to three hundred images is the realistic number. More than that and you're adding computational load without proportional accuracy gains, less and you risk gaps in the reconstruction, especially in corners and doorways.
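A back-of-envelope sketch of that coverage arithmetic. Every number here, the field of view, the shooting distance, the perimeter, is an illustrative assumption, not a recommendation.

```python
import math

# Rough shot count for one loop around a room: each frame covers a
# strip of wall, and consecutive frames must share 20-30% of it.

def shots_per_loop(perimeter_m, dist_to_wall_m, hfov_deg, overlap):
    """Frames needed to circle a room once at a fixed camera height."""
    footprint = 2 * dist_to_wall_m * math.tan(math.radians(hfov_deg) / 2)
    step = footprint * (1 - overlap)   # new wall covered per frame
    return math.ceil(perimeter_m / step)

# ~30 m perimeter, shooting 2 m from the walls with a 70-degree lens:
for overlap in (0.2, 0.3):
    print(f"{overlap:.0%} overlap -> "
          f"{shots_per_loop(30.0, 2.0, 70.0, overlap)} shots")
```

That's one loop at a single camera height. Repeat at two or three heights, add dedicated corner and doorway passes plus detail shots, and you land in the two-to-three-hundred-image range the hosts mention.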
Corners and doorways being the places you most need to get right if you're planning furniture placement.
Because that's where the constraints live. Whether a wardrobe fits in an alcove, whether a sofa clears the doorframe, those are all corner and threshold measurements. And they're also the places where the shooting geometry gets awkward. You're often trying to capture a tight space from a position that doesn't give you clean overlap with adjacent frames. It's worth doing a dedicated pass just for corners, getting in close and making sure you've got multiple angles on each one.
Which is where the thirty percent efficiency improvement in AI-assisted pipelines actually helps, because you can afford to shoot a bit more freely without the processing time becoming punishing.
Right, and on the software side, if you're doing this yourself rather than using a service, Meshroom is the open-source option built on the Alice Vision framework. It runs Structure from Motion and multi-view stereo locally. The browser-based tools like Meshy are faster to get started with but give you less control over the precision parameters.
The honest ceiling being: good enough for Ikea, not good enough for a building permit.
That's the practical summary. Know what accuracy level you need before you choose your tool.
The frontier that keeps nagging at me is what happens when the reference object problem gets solved implicitly, at scale. Because right now we're still asking the user to put a chair in the frame. What does this look like when the model has seen enough apartments that it just knows, probabilistically, how high a standard kitchen countertop is?
That's the direction the learned prior work is pointing. UniRecGen is an early version of that idea. Train on enough labeled spatial data and the reconstruction starts to carry scale information without needing an explicit anchor. The chair becomes optional rather than required. I'm not sure how close we are to that being reliable enough for consumer use, but the trajectory is clear.
There's a really interesting epistemological question lurking there. Because when a model infers that a countertop is ninety centimeters high because that's what countertops usually are, it's not measuring your apartment anymore. It's averaging over every apartment it's ever seen. Which is fine if your apartment is standard, but if you're in a converted industrial space with non-standard ceiling heights and custom joinery, the model's priors are actively working against you.
That's an important caveat. The learned prior is a bet that your space is typical. The more atypical your space, the more the prior misleads rather than helps. Which is why the explicit reference object isn't going away entirely, it's becoming a fallback for edge cases rather than a universal requirement. But for a bog-standard rental apartment with standard door heights and standard kitchen fittings, the prior is probably more reliable than a hastily measured chair.
Once you cross that threshold, the workflow collapses to almost nothing. Walk through your apartment with your phone, upload the video, get back a metrically accurate model. No reference objects, no deliberate overlap discipline, no Meshroom configuration.
The Matterport case is instructive there. They went from specialist hardware to something that runs on an iPhone. The next move is removing the remaining friction from the capture process itself. That's where the video tokenization insight Daniel flagged connects back in. If you can extract a useful reconstruction from low-framerate, low-resolution video rather than a carefully planned photo set, the barrier drops to almost nothing.
Which is a big deal for real estate, for insurance, for interior design, for anyone who has ever stood in an Ikea and realized they cannot remember how wide their doorframe is.
The use cases are mundane in the best possible way.
I think that's actually the most interesting thing about Daniel's question when you zoom out. He's not asking about some exotic research application. He's asking about a problem that every person who has ever moved house has had. The technology to solve it has existed in research labs for decades. What's changed is the cost and friction have finally dropped below the threshold where a normal person with a phone can actually use it. That transition, from specialist tool to everyday utility, is where most of the interesting things happen.
We're right in the middle of it. The pipeline isn't fully seamless yet, you still need to think about reference objects and shooting patterns and which software to use. But the gap between "technically possible" and "something my mum could do before her Ikea trip" is closing faster than it was even two years ago.
Big thanks to Hilbert Flumingtop for producing, and to Modal for keeping our pipeline running. If you've found this one useful, leaving a review helps people find the show. This has been My Weird Prompts. We'll see you next time.