Daniel sent us this one, and I have to say, the setup is immediately relatable. You're renting a new apartment, you go to Ikea to sort out the kitchen, and somewhere between the meatballs and the flatpack aisle you realize you have absolutely no idea whether that countertop is one-point-eight meters or two-point-two. The dimensions are gone. The layout is a blur. You're guessing. Daniel's question is: could you just take a few hundred photos of the apartment beforehand, feed them to an AI, and get back a usable three-dimensional model with actual approximate dimensions? And if so, what is technically happening when you do that? Because stitching photographs together sounds simple until you start pulling on that thread.
By the way, today's episode is powered by Claude Sonnet four point six.
Good to know we're in capable hands. Right, so Herman, where do we even start? Because Daniel's framing is smart. He's not asking "does this technology exist," he's asking what is actually going on under the hood, and I think that distinction matters a lot here.
It does, and I think the honest first move is to say there are at least three different things people conflate when they talk about this, and they behave very differently technically. You've got image stitching, which is what your panorama app does. You've got photogrammetry, which is a proper geometric reconstruction discipline. And then you've got what the newer AI-driven spatial modeling tools are doing, which borrows from both but is its own thing. Daniel's prompt is really asking us to trace the line from the first to the third, and I think that's the right way to structure it.
The Ikea problem is a genuinely good motivating case, because it's not just about pretty visuals. You need the model to be metrically accurate enough that when you look at it and think "that's a sixty-centimeter gap between the fridge and the wall," you can actually trust that measurement when you're standing in the store. That accuracy hinges on reconstructing depth from flat images, which is where the real challenge lies.
And that's the core problem, stripped back: a photograph throws away depth. The moment light hits a sensor, you get color and intensity at each pixel, but the distance information is gone. Everything from ten centimeters to ten kilometers gets flattened onto the same plane. Reconstructing three dimensions from that is fundamentally an inference problem.
Which is why a single photo gets you nowhere useful.
Right, you need multiple photos from different positions, because that's how you recover depth. The geometry is called parallax, the apparent shift in an object's position when you view it from two slightly different angles. Your brain does this constantly with your two eyes. Photogrammetry formalizes it: you take dozens or hundreds of overlapping images, identify the same physical point across multiple frames, and use the geometry of where the camera was standing each time to triangulate where that point sits in three-dimensional space.
It's worth pausing on that for a second because I think the brain analogy is actually more instructive than it first sounds. If you close one eye and hold your finger up in front of your face, you lose your sense of how far away it is. You can still see it, but the depth is gone. Open both eyes and suddenly you know exactly where it sits in space. That's literally the same principle photogrammetry is exploiting, just with a camera moving through space instead of two eyes sitting side by side.
The reason it works is that your brain, or the algorithm, knows the geometry of the two viewpoints. Your eyes are roughly six to seven centimeters apart, your brain has internalized that, and it uses the angular difference in what each eye sees to compute distance. Photogrammetry has to figure out the viewpoint geometry from the images themselves, which is a harder problem, but the underlying principle is identical.
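To make that geometry concrete, here's a minimal sketch of the stereo relation the hosts are describing, assuming an idealized rectified two-camera setup where depth follows Z = f · B / d. All numbers are illustrative.

```python
# Toy stereo depth from parallax, assuming two rectified pinhole
# cameras offset horizontally by a known baseline. The same point
# lands at slightly different image columns in each view; that pixel
# shift (the disparity) encodes depth: Z = f * B / d.

def depth_from_disparity(focal_px, baseline_m, disparity_px):
    """Depth (meters) from focal length (px), baseline (m), disparity (px)."""
    return focal_px * baseline_m / disparity_px

# A point 3 m away, seen by cameras 6.5 cm apart (roughly eye spacing)
# through a lens with a 1000-pixel focal length:
f_px, baseline = 1000.0, 0.065
disparity = f_px * baseline / 3.0        # forward model: ~21.7 px shift
depth = depth_from_disparity(f_px, baseline, disparity)
print(f"disparity {disparity:.1f} px -> depth {depth:.2f} m")
```

Note how the disparity shrinks as depth grows, which is why distant points are harder to place precisely: a sub-pixel matching error at large depth translates into a large depth error.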
LIDAR sidesteps that whole inference problem by just...
LIDAR pulses laser light and times how long it takes to bounce back. Distance is a direct measurement, not a reconstruction. That's why in controlled comparisons, drone LIDAR tests have shown elevation accuracy around fifty-one to fifty-two centimeters even in complex terrain, while photogrammetry can drift depending on conditions.
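The time-of-flight principle is simple enough to sketch in a couple of lines; the numbers here are illustrative.

```python
# Time-of-flight in one line: a LIDAR pulse travels to the surface and
# back, so distance is half the round trip at the speed of light.

C = 299_792_458.0  # speed of light, m/s

def tof_distance_m(round_trip_s):
    return C * round_trip_s / 2.0

# A wall 5 m away returns the pulse after about 33 nanoseconds:
round_trip = 2 * 5.0 / C
print(f"{round_trip * 1e9:.1f} ns -> {tof_distance_m(round_trip):.3f} m")
```

The tiny timescales are the point: ranging at centimeter precision means resolving timing differences of tens of picoseconds, which is an electronics problem rather than an inference problem.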
Where do reference objects fit into this? Because Daniel specifically mentions using a chair or a glass of water as a known dimension.
That's the scaling problem. Photogrammetry can give you a geometrically correct model, meaning proportions are right, but it's unitless unless you anchor it to something real. A known object in the frame, something whose dimensions you can look up, gives the software a ruler. You're essentially saying: this object is forty-five centimeters tall, now calibrate everything else accordingly.
Which is elegant when it works, but when someone's using a non-standard chair, it becomes a disaster waiting to happen.
Right, and that's the exact failure mode with reference objects, and it's not just theoretical. If someone mistakenly uses a designer chair with unusual proportions, thinking it's a standard dining chair, every dimension in the model shifts. You end up with a systematic error that compounds across the entire space.
How much does that kind of error actually propagate? Like, if my reference chair is ten percent off, am I ten percent off everywhere?
It's a linear scaling error, so it propagates uniformly. If your chair is actually forty centimeters tall and you've told the software it's forty-five, your model is scaled up by about twelve percent across the board. In a four-meter kitchen, that's nearly fifty centimeters of phantom space. Your cabinet that looked like it would fit with room to spare is now colliding with the wall.
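The scaling arithmetic above, as a minimal sketch:

```python
# Linear scale error: if the reference object's assumed size is wrong,
# every measurement in the model is off by the same factor.

def scaled_error(true_ref, assumed_ref, true_length):
    scale = assumed_ref / true_ref          # e.g. 45 cm assumed / 40 cm actual
    measured = true_length * scale
    return measured, measured - true_length

# A 40 cm chair entered as 45 cm: a 4 m kitchen reads as 4.5 m.
measured, err = scaled_error(true_ref=0.40, assumed_ref=0.45, true_length=4.0)
print(round(measured, 2), round(err, 2))    # 4.5 0.5
```

The error is multiplicative, so it grows with the length being measured: the same miscalibrated chair that adds half a meter to a kitchen adds only a few centimeters to a doorframe.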
Which is a painful outcome after an hour of careful photography.
The cruelest part is that the model looks completely plausible. There's no visual artifact that tells you it's wrong. It's just quietly scaled incorrectly, and you only find out when the delivery driver shows up and the unit doesn't fit through the door.
Let's get into the stitching itself, because I think that's where Daniel's question really lives. The panorama comparison he raises is a good entry point. What is a panorama app actually doing?
At its simplest, a panorama is a two-dimensional alignment problem. You take a sequence of overlapping frames, find matching features across adjacent images, and warp and blend them into a single wide image. The key step is feature matching, identifying the same corner of a window frame or the same edge of a tile across two frames, and using those correspondences to figure out how to rotate and translate one image so it lines up with the next.
The overlap is non-negotiable for that to work.
You need roughly twenty to thirty percent overlap between adjacent frames, sometimes more in low-texture environments like a plain white wall, because the algorithm needs enough shared content to find reliable correspondences. Too little overlap and you get seams or outright failure. Too much and you're just doing redundant computation.
I've definitely seen what happens when the overlap breaks down. You get those panoramas where someone's arm is in two places at once, or there's a ghost seam across someone's face because they moved between frames.
Those artifacts are the algorithm's failure to find clean correspondences across the overlap region. A moving subject is the classic culprit because the feature that matched in frame one has physically relocated by frame two. The algorithm tries to reconcile the mismatch and produces something uncanny. In a static room that's less of a problem, but it tells you something important: the algorithm has no understanding of the scene. It's purely doing geometry on pixel patterns. If the pixel patterns lie, the result breaks.
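A toy version of the alignment step, using FFT phase correlation on a synthetic texture instead of a real feature matcher like SIFT. This is a simplification: actual stitchers estimate full homographies, not just translations, but the core idea of "find where the shared content lines up" is the same.

```python
import numpy as np

def translation_offset(img_a, img_b):
    """Estimate the (row, col) shift mapping img_b onto img_a
    via phase correlation."""
    Fa, Fb = np.fft.fft2(img_a), np.fft.fft2(img_b)
    cross = Fa * np.conj(Fb)
    cross /= np.abs(cross) + 1e-12          # keep phase only
    corr = np.fft.ifft2(cross).real
    peak = np.unravel_index(np.argmax(corr), corr.shape)
    # Wrap large indices back to negative shifts.
    return tuple(int(p - s) if p > s // 2 else int(p)
                 for p, s in zip(peak, corr.shape))

rng = np.random.default_rng(0)
scene = rng.random((128, 128))                         # synthetic wall texture
shifted = np.roll(scene, shift=(7, -12), axis=(0, 1))  # simulated camera pan

print(translation_offset(shifted, scene))  # recovers (7, -12)
```

On a featureless input, a flat white wall rather than random texture, the correlation peak disappears into noise, which is exactly the low-texture failure the hosts describe.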
The panorama case is: stitch flat, blend seams, done. Three-D reconstruction is a fundamentally different ask.
Much more involved. In the three-D case you're not just aligning images into a flat mosaic, you're trying to infer camera positions in three-dimensional space and simultaneously reconstruct the geometry of what the camera was looking at. The standard pipeline is called Structure from Motion, and it runs in roughly three stages. First, feature extraction: you identify distinctive points in every image, typically using something like SIFT or ORB, algorithms that find corners and edges that are stable across different viewpoints and lighting. Second, feature matching across image pairs. Third, bundle adjustment, which is the expensive part, where you jointly optimize the estimated camera positions and the estimated three-D point positions to minimize reprojection error.
Reprojection error being...
If you take your estimated three-D point and project it back onto each image using your estimated camera position, it should land exactly where the feature was detected. The gap between where it lands and where it actually was is the reprojection error. Bundle adjustment minimizes that gap across every point and every image simultaneously. It's a massive nonlinear optimization problem, which is why processing two or three hundred photos of a sixty-square-meter apartment is computationally heavy.
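A miniature version of that residual, assuming a single idealized pinhole camera sitting at the origin; all numbers are illustrative.

```python
import numpy as np

# Reprojection error in miniature: project an estimated 3-D point
# through a pinhole camera and measure how far it lands from the
# pixel where the feature was actually detected. Bundle adjustment
# minimizes this residual over all points and all cameras at once.

def project(point_3d, focal_px, principal_pt):
    """Pinhole projection, camera at origin looking down +Z."""
    x, y, z = point_3d
    return np.array([focal_px * x / z + principal_pt[0],
                     focal_px * y / z + principal_pt[1]])

detected = np.array([640.0, 360.0])       # where the feature was seen
estimate = np.array([0.10, 0.00, 2.0])    # current 3-D estimate (meters)

pixel = project(estimate, focal_px=1000.0, principal_pt=(600.0, 360.0))
error = np.linalg.norm(pixel - detected)  # reprojection error, in pixels

print(pixel)   # [650. 360.]
print(error)   # 10.0
```

In a real pipeline this residual is computed for every observation of every point, and the optimizer nudges both the point positions and the camera poses until the sum of squared residuals stops shrinking.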
Can you give a sense of scale there? Like, what are we talking about in terms of points and computations?
A typical indoor reconstruction from two hundred images might produce somewhere between fifty thousand and a few million sparse points in the initial Structure from Motion pass. Bundle adjustment is then optimizing the positions of all those points plus all the camera poses simultaneously, which means you're solving a system with potentially millions of variables where every variable affects every other variable. The reason it's tractable at all is that the Jacobian matrix involved is very sparse, most points only appear in a small fraction of images, so you can exploit that structure algorithmically. But it's still why your laptop fan starts screaming when you run Meshroom.
That's where the recent efficiency gains matter. There was work published earlier this year suggesting AI-assisted photogrammetry pipelines have cut resource requirements by around thirty percent compared to where things were twelve months ago.
That tracks with what I've been reading. The gains are coming from a few places: smarter feature matching that avoids redundant pairings, better initialization for the bundle adjustment so it converges faster, and learned depth priors that give the optimizer a better starting guess rather than solving from scratch. UniRecGen, a multi-view reconstruction system out of recent research, specifically addresses the case of sparse unposed images, meaning photos where you don't even know the camera positions in advance, and it outperforms earlier baselines on geometric accuracy by anchoring the reconstruction to what they call geometric anchors.
Which is basically a learned version of the reference object idea.
In a sense, yes. Instead of a physical chair in the frame, the model has internalized geometric priors from training data and uses those to constrain the reconstruction. The chair is implicit rather than explicit.
The chair as metaphor. Daniel would appreciate that.
The practical upshot for Daniel's apartment scenario is that the stitching problem breaks into two separable challenges: getting the geometry right, which Structure from Motion handles, and getting the scale right, which requires either a reference object, known camera parameters, or those learned priors. Miss either one and your Ikea planning session goes sideways.
Right, but what does that actually look like in practice? Because there's a gap between "technically solvable" and "works reliably enough that a normal person can do it before their Ikea trip."
That's where the knock-on effects start to bite. The geometry problem and the scale problem are theoretically separable, but in practice they compound. A small error in your reference object calibration, say five percent off on your chair height, propagates through every measurement in the model. If your kitchen is four meters long, you're now off by twenty centimeters. That's the difference between a cabinet fitting and not fitting.
That's assuming the reconstruction geometry itself is clean. Which it often isn't.
Dense indoor environments are hard. Reflective surfaces, like kitchen tiles or a glass splashback, confuse feature matching because the same physical point looks different depending on the angle. Plain walls give the algorithm almost nothing to grip onto. And furniture with repetitive patterns, think a grid of cabinet doors, creates false matches where the algorithm thinks it's found the same corner but it's actually one cabinet over.
That last one is interesting. So if you've got a row of identical kitchen cabinet doors, the algorithm might think it's found the same door twice when it's actually matched door three to door five?
It's called an ambiguous correspondence, and it's one of the nastier failure modes because it doesn't just introduce noise, it introduces a systematic geometric error. The reconstruction might confidently place that section of the kitchen in completely the wrong position. And from a visual inspection of the output, you'd never know, because each individual cabinet door looks perfectly reconstructed. It's only when you try to measure the overall run of cabinets that you realize it's half a meter shorter than it should be.
The apartment is basically a stress test for every known weakness in the pipeline.
It really is. Which is part of why photogrammetry for outdoor terrain, open fields, building facades, drone surveys, tends to produce cleaner results than indoor residential spaces. The outdoor case has natural texture variation, no reflections, and you can fly a consistent grid pattern. Indoors you're fighting the geometry on every front.
That's where the LIDAR comparison becomes interesting rather than just academic. Because LIDAR doesn't care about texture. It doesn't care if your kitchen tiles are reflective.
Right, the laser pulse either bounces back or it doesn't. You're not trying to infer geometry from visual appearance, you're measuring it directly. The fifty-one to fifty-two centimeter elevation accuracy figure from drone LIDAR comparisons is in terrain that would challenge photogrammetry, dense vegetation, irregular surfaces. Indoors, a LIDAR scanner can map a sixty-square-meter apartment in minutes with millimeter-level precision.
Why isn't everyone just doing that?
A decent handheld LIDAR scanner runs several thousand dollars. The iPhone Pro has a LIDAR sensor, which is interesting for this use case, but it's lower range and lower resolution than a dedicated unit. Photogrammetry, by contrast, costs you a phone camera and time. The thirty percent reduction in computational overhead from recent AI-assisted pipelines makes that tradeoff more attractive than it was even a year ago.
There's a fun bit of history there actually. LIDAR as a technology is older than most people realize. It was used in the Apollo fifteen mission in nineteen seventy-one to map the lunar surface. So the same basic principle that's now sitting in the pocket of anyone with a recent iPhone was first deployed to measure craters on the Moon fifty years ago. The miniaturization journey is kind of staggering when you put it that way.
The cost trajectory mirrors that. In the early two-thousands, a terrestrial LIDAR scanner cost hundreds of thousands of dollars and needed a dedicated operator. Now the sensor in an iPhone Pro costs Apple a few dollars to manufacture at scale. The physics hasn't changed at all. It's purely a fabrication and integration story.
Which is exactly the kind of thing that makes the next ten years interesting for Daniel's use case. Because if LIDAR sensors keep getting cheaper and more capable while photogrammetry pipelines keep getting more efficient, the two approaches are going to converge on the same consumer device at roughly the same price point.
They're already converging. The newer reconstruction apps on iPhone are doing sensor fusion, combining the LIDAR depth data with the camera imagery to get better results than either alone. The LIDAR gives you coarse geometry quickly, the photogrammetry fills in texture and fine detail, and the combination is more robust than either pipeline running independently.
For Daniel's apartment scenario, the honest answer is: photogrammetry with good reference objects gets you within a useful margin for Ikea planning, but you're not getting architectural precision.
That's about right. For practical interior planning, if you're asking "will this two-hundred-and-ten-centimeter bookcase fit against that wall," a well-executed photogrammetry pass with a known reference object probably gets you close enough. If you're asking "will this countertop clear the building code clearance by exactly the required amount," you need LIDAR or a tape measure.
Which is a reasonable division of labor. The real estate industry has figured this out.
They have, and it's a useful case study. Virtual apartment tours using photogrammetry have become standard. Companies like Matterport built their entire business on this, taking a few hundred overlapping images of a space and producing a navigable three-D model that potential tenants or buyers can walk through online. The dimensional accuracy is good enough to let people check whether their existing furniture fits, which is exactly Daniel's use case.
Matterport is interesting because they started with proprietary hardware, a dedicated camera rig that cost several thousand dollars, and have progressively moved toward supporting standard smartphones. That transition only became viable as the photogrammetry pipelines got good enough to compensate for the lower quality input.
That's the pattern across the industry. The hardware requirements drop as the software gets smarter. It's not that the physics of the problem got easier, it's that learned priors and better optimization are substituting for what used to require better sensors.
The computational cost of producing those models has dropped enough that it's now a routine service rather than a specialist operation.
That's the trajectory. The thirty percent efficiency improvement in AI-assisted pipelines is part of what's pushing this into the consumer tier. Tools like Meshy are now offering browser-based reconstruction that can turn a photo set into a navigable model relatively quickly. The precision ceiling is still lower than dedicated photogrammetry software, but the accessibility floor has dropped dramatically.
The retail angle is interesting too. Because if you can model a space accurately enough, you can do virtual product placement. Drop a sofa into the model and see if it looks right before you buy it.
That's already happening. Interior design platforms are doing exactly this, using photogrammetry-derived room models as the base and overlaying product meshes. The accuracy requirement there is actually a bit different, you need the spatial proportions to feel right visually more than you need millimeter precision, so the photogrammetry approach works well. The harder problem is lighting, making the inserted product look like it belongs in the actual room rather than pasted on top.
Which is a rendering problem rather than a reconstruction problem.
Separate pipeline entirely. But it illustrates how these technologies stack. Reconstruction gives you geometry and scale. Rendering handles how materials interact with light. AI is now doing both, but they're still distinct challenges with distinct error modes.
What are the moves? If Daniel's holding his phone right now, what's he actually doing with all this tech?
The reference object question is the first thing to nail down, and it's more nuanced than just "put a chair in the frame." You want something with known dimensions in all three axes, height, width, depth, that appears in multiple overlapping images across the space. A chair works. A water glass is trickier because it's small relative to the room, and small reference errors amplify. The bigger the reference object relative to the scene, the better your scale propagation.
A dining chair beats a coffee cup.
By a lot. And you want it centrally placed, not shoved into one corner where only a handful of images capture it. The scale anchor needs to be visible from multiple angles so the bundle adjustment can triangulate its position reliably.
What about the shooting pattern itself?
Consistent overlap is the discipline. Twenty to thirty percent between adjacent frames, and you want to walk the space in deliberate arcs rather than random sweeps. Think of it as painting the room with coverage. For a sixty-square-meter apartment, two to three hundred images is the realistic number. More than that and you're adding computational load without proportional accuracy gains, less and you risk gaps in the reconstruction, especially in corners and doorways.
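A back-of-envelope sketch of that coverage arithmetic. Every number here, the field of view, the shooting distance, the perimeter, is an illustrative assumption, not a recommendation.

```python
import math

# Rough shot count for one loop around a room: each frame covers a
# strip of wall, and consecutive frames must share 20-30% of it.

def shots_per_loop(perimeter_m, dist_to_wall_m, hfov_deg, overlap):
    """Frames needed to circle a room once at a fixed camera height."""
    footprint = 2 * dist_to_wall_m * math.tan(math.radians(hfov_deg) / 2)
    step = footprint * (1 - overlap)   # new wall covered per frame
    return math.ceil(perimeter_m / step)

# ~30 m perimeter, shooting 2 m from the walls with a 70-degree lens:
for overlap in (0.2, 0.3):
    print(f"{overlap:.0%} overlap -> "
          f"{shots_per_loop(30.0, 2.0, 70.0, overlap)} shots")
```

That's one loop at a single camera height. Repeat at two or three heights, add dedicated corner and doorway passes plus detail shots, and you land in the two-to-three-hundred-image range the hosts mention.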
Corners and doorways being the places you most need to get right if you're planning furniture placement.
Because that's where the constraints live. Whether a wardrobe fits in an alcove, whether a sofa clears the doorframe, those are all corner and threshold measurements. And they're also the places where the shooting geometry gets awkward. You're often trying to capture a tight space from a position that doesn't give you clean overlap with adjacent frames. It's worth doing a dedicated pass just for corners, getting in close and making sure you've got multiple angles on each one.
Which is where the thirty percent efficiency improvement in AI-assisted pipelines actually helps, because you can afford to shoot a bit more freely without the processing time becoming punishing.
Right, and on the software side, if you're doing this yourself rather than using a service, Meshroom is the open-source option built on the Alice Vision framework. It runs Structure from Motion and multi-view stereo locally. The browser-based tools like Meshy are faster to get started with but give you less control over the precision parameters.
The honest ceiling being: good enough for Ikea, not good enough for a building permit.
That's the practical summary. Know what accuracy level you need before you choose your tool.
The frontier that keeps nagging at me is what happens when the reference object problem gets solved implicitly, at scale. Because right now we're still asking the user to put a chair in the frame. What does this look like when the model has seen enough apartments that it just knows, probabilistically, how high a standard kitchen countertop is?
That's the direction the learned prior work is pointing. UniRecGen is an early version of that idea. Train on enough labeled spatial data and the reconstruction starts to carry scale information without needing an explicit anchor. The chair becomes optional rather than required. I'm not sure how close we are to that being reliable enough for consumer use, but the trajectory is clear.
There's a really interesting epistemological question lurking there. Because when a model infers that a countertop is ninety centimeters high because that's what countertops usually are, it's not measuring your apartment anymore. It's averaging over every apartment it's ever seen. Which is fine if your apartment is standard, but if you're in a converted industrial space with non-standard ceiling heights and custom joinery, the model's priors are actively working against you.
That's an important caveat. The learned prior is a bet that your space is typical. The more atypical your space, the more the prior misleads rather than helps. Which is why the explicit reference object isn't going away entirely, it's becoming a fallback for edge cases rather than a universal requirement. But for a bog-standard rental apartment with standard door heights and standard kitchen fittings, the prior is probably more reliable than a hastily measured chair.
Once you cross that threshold, the workflow collapses to almost nothing. Walk through your apartment with your phone, upload the video, get back a metrically accurate model. No reference objects, no deliberate overlap discipline, no Meshroom configuration.
The Matterport case is instructive there. They went from specialist hardware to something that runs on an iPhone. The next move is removing the remaining friction from the capture process itself. That's where the video tokenization insight Daniel flagged connects back in. If you can extract a useful reconstruction from low-framerate, low-resolution video rather than a carefully planned photo set, the barrier drops to almost nothing.
Which is a big deal for real estate, for insurance, for interior design, for anyone who has ever stood in an Ikea and realized they cannot remember how wide their doorframe is.
The use cases are mundane in the best possible way.
I think that's actually the most interesting thing about Daniel's question when you zoom out. He's not asking about some exotic research application. He's asking about a problem that every person who has ever moved house has had. The technology to solve it has existed in research labs for decades. What's changed is the cost and friction have finally dropped below the threshold where a normal person with a phone can actually use it. That transition, from specialist tool to everyday utility, is where most of the interesting things happen.
We're right in the middle of it. The pipeline isn't fully seamless yet, you still need to think about reference objects and shooting patterns and which software to use. But the gap between "technically possible" and "something my mum could do before her Ikea trip" is closing faster than it was even two years ago.
Big thanks to Hilbert Flumingtop for producing, and to Modal for keeping our pipeline running. If you've found this one useful, leaving a review helps people find the show. This has been My Weird Prompts. We'll see you next time.