Daniel sent us this one — he's been playing around with Hugging Face demos that take a single photograph and spit out an explorable three-dimensional world. He tried it with a nighttime shot of Jerusalem, a crane, some buildings, and it kind of got the general vibe but fell apart on the details. His real question is, if you actually put in the work, walk around your city, shoot four or five hundred images, and feed that into the best world generation model today, are you capturing the actual place or just the aesthetic? And beyond game development, which everyone defaults to, what does this actually unlock? He's thinking augmented reality, creative applications, the whole trajectory.
This is a fantastic prompt because it cuts straight through the hype to the thing that actually matters. And by the way, DeepSeek V four Pro is handling our script today, so let's see how it does with this one.
Alright, DeepSeek, don't let us down. So where do we even start with this? Because Daniel's already spotted the central tension: there's a chasm between "looks like Jerusalem" and "is Jerusalem."
And that chasm is where all the interesting engineering lives. Let me ground this for a second. What Daniel encountered on Hugging Face are mostly implementations of what people are calling three D world generation models, and the current state of the art as of early twenty twenty-six breaks into roughly three approaches. The first is what he tried, single-image to three D scene, and it's genuinely miraculous that this works at all. The second is multi-view reconstruction, where you feed in dozens or hundreds of images and the model builds a geometrically consistent representation. The third is what the game industry has been doing for years, procedural generation with an AI layer on top for textures, lighting, and detail synthesis. The single-shot stuff is the flashy demo, but multi-view is where the real mapping happens.
When Daniel asks whether feeding in four hundred images captures the world or just the aesthetic, the answer depends entirely on which of those approaches the model is using. If it's a proper multi-view reconstruction pipeline, you're getting geometry plus appearance. If it's a generative model that's been trained on internet-scale data and it's just using his photos as conditioning, you're getting something that looks plausible but might have invented a building that doesn't exist.
And this is the misconception I see everywhere. People assume that because the output is three-dimensional and navigable, the model has somehow reconstructed reality. But most of these single-shot systems are closer to a very sophisticated version of inpainting in three D. The model sees a crane, recognizes the semantic category "construction site," and then draws on everything it learned during training about what construction sites typically look like. It's not measuring depth from the single image with any real precision. It's hallucinating depth, hallucinating occluded geometry, hallucinating what's behind that building. The more images you provide, the more constraints you impose, and the less room there is for hallucination.
Which brings up an interesting question. At what point does the model stop hallucinating and start reconstructing? Is there a threshold where you've provided enough constraints that the output becomes a measurement rather than a guess?
That threshold is fuzzy, but researchers talk about it in terms of view coverage. If you've captured every surface in your scene from at least two or three angles with good overlap, modern structure-from-motion pipelines can give you sub-centimeter accuracy. The AI layer in world generation models sits on top of that. The geometric reconstruction is often still done with classical computer vision techniques, bundle adjustment, multi-view stereo, that stuff hasn't gone away. What the AI adds is the ability to fill in regions where you didn't capture enough data, synthesize consistent textures under varying lighting, and infer semantic information about what things are. So with four hundred images of Jerusalem, you'd get a very faithful geometric reconstruction of everything you photographed, plus AI-generated filler for the gaps. The question is how big the gaps are.
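To make the classical layer concrete, here's a minimal sketch of that structure-from-motion pipeline using COLMAP's Python bindings. It assumes the pycolmap package and placeholder paths, so treat it as an outline rather than a recipe, but it is the real feature-extract, match, and bundle-adjust loop we're describing.

```python
from pathlib import Path
import pycolmap

# Minimal structure-from-motion sketch with pycolmap (paths are placeholders).
# This is the classical geometry layer: detect features, match them across
# image pairs, then incrementally estimate camera poses and a sparse point
# cloud refined by bundle adjustment.
database = "jerusalem.db"   # feature and match storage
images = "photos/"          # the few hundred capture images
output = "sparse/"          # sparse reconstruction output
Path(output).mkdir(parents=True, exist_ok=True)

pycolmap.extract_features(database, images)   # detect local image features
pycolmap.match_exhaustive(database)           # match features across all pairs
maps = pycolmap.incremental_mapping(database, images, output)

# incremental_mapping returns one reconstruction per connected component,
# each holding camera poses and triangulated 3D points.
for idx, rec in maps.items():
    print(idx, rec.summary())
```

Everything after this, the hole-filling and texture synthesis, is where the learned models come in.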
Daniel's walking around with his phone, snapping buildings. He's not doing a photogrammetry survey with calibrated cameras and controlled lighting. The gaps could be substantial.
But here's what's changed in the last eighteen months. There are now models that can take casual phone footage and produce reconstructions that would have required a professional rig and a week of processing time just two or three years ago. NVIDIA's Instant NeRF was a big milestone, but the thing that really moved the needle was the emergence of feed-forward three D reconstruction models. Instead of optimizing a separate neural radiance field for every scene, which takes hours, these models are trained on massive datasets of three D scenes and can produce a reconstruction in a single forward pass. You feed in images, you get out a three D representation in seconds. The trade-off is that the geometric fidelity isn't quite as high as with per-scene optimization, but the speed difference is absurd, like going from eight hours to eight seconds.
You're saying the trajectory is toward real-time reconstruction from casual input. That changes the use case calculation quite a bit. If it takes eight hours of processing, this is a niche tool for professionals. If it takes eight seconds, suddenly you can imagine walking through a city and having your AR glasses build a world model as you go.
That's exactly where this is headed. And AR is the application Daniel mentioned that I think is actually underdiscussed. Everyone jumps to game development because that's the obvious market, but game studios have artists and level designers who are going to be very particular about their worlds. The real transformative application might be in what researchers are calling "world-scale AR," where you want a lightweight, continuously updated three D model of the real world that virtual content can be anchored to. Pokémon Go was the crude version of this, flat planes detected by a phone camera. The next generation is dense three D meshes with semantic labels, updated in real time, shared across users. That requires exactly the kind of technology Daniel is poking at.
Let's talk about the "capturing the aesthetic versus capturing the world" distinction, because I think that's the sharpest part of his question. If I take four hundred photos of Jerusalem and feed them into a world generation model, and then I walk through the resulting world, am I seeing Jerusalem or am I seeing a Jerusalem-like dream?
I love how you put that. And the honest answer is, it depends on what the model is optimizing for. There's a fundamental tension in these systems between photorealism and geometric accuracy. A model that's been trained primarily on synthetic data or game assets will produce beautiful, clean geometry, but it might smooth over the idiosyncratic details that make Jerusalem actually Jerusalem, the specific pattern of wear on a particular stone wall, the exact angle of a streetlamp that was installed slightly crooked in nineteen eighty-seven. Those details are what distinguish a place from a representation of a place.
The model is biased toward Platonic ideals of what things should look like.
If it's seen ten thousand churches in its training data, and you show it a blurry photo of a specific church in Jerusalem, it might reconstruct something that looks more like the average church than the actual church. This is the same problem large language models have with hallucination, it's just manifesting in three dimensions. The model has a prior, a strong expectation about what the world looks like, and when your input is sparse or ambiguous, the prior dominates. When your input is dense and high-quality, the data dominates.
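You can see that prior-versus-data dynamic in miniature with the standard precision-weighted fusion rule from estimation theory. The numbers in this toy sketch are invented, but it shows exactly why one blurry photo lets the prior win and four hundred sharp ones let the data win.

```python
# Toy illustration of prior-vs-data dominance: fuse a learned prior estimate
# (say, the typical depth of a church wall) with a measured estimate, each
# weighted by its precision (inverse variance). All numbers are invented.
def fuse(prior_mean, prior_var, data_mean, data_var):
    w_prior = 1.0 / prior_var
    w_data = 1.0 / data_var
    return (w_prior * prior_mean + w_data * data_mean) / (w_prior + w_data)

# One blurry photo: the measurement is noisy, so the prior dominates.
print(fuse(prior_mean=10.0, prior_var=0.5, data_mean=12.0, data_var=5.0))   # ~10.2
# Four hundred sharp photos: the measurement is precise, so the data dominates.
print(fuse(prior_mean=10.0, prior_var=0.5, data_mean=12.0, data_var=0.01))  # ~11.96
```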
With four hundred images, assuming reasonable quality, the data probably dominates for the surfaces you've captured, and the prior fills in the rest. The question is whether the filled-in parts feel jarringly different from the real parts.
That's exactly the current frontier of research. How do you make the transitions seamless? How do you ensure that the AI-generated content respects the local style? If you've captured the limestone walls of Jerusalem's Old City, and the model needs to fill in a missing section, does it know enough about Jerusalem limestone to get the color and texture right, or does it default to some generic stone texture from its training set?
I suspect the answer right now is somewhere in between. It'll get closer than a generic model would have two years ago, but a Jerusalem resident would notice the difference immediately.
And that's the gap between "good enough for a game" and "good enough for digital preservation" or "good enough for architectural planning." Different use cases have wildly different fidelity requirements. If you're building a game set in Jerusalem, you want the vibe, the feel, the general sense of place. If you're an archaeologist documenting the Old City for preservation, you need millimeter accuracy on every stone.
Let's talk about the use cases, because I think Daniel's right that nobody's jumping up and down about game companies being able to develop worlds more effectively, but there's a much broader set of applications that haven't been fully explored yet. You mentioned world-scale AR.
Real estate is an obvious one that's already happening. Virtual tours that aren't just three hundred sixty degree photos stitched together, but actual three D models you can walk through with accurate lighting and scale. The difference between a panorama and a reconstructed three D space is enormous for getting a genuine sense of a property. Insurance and damage assessment is another. After a natural disaster, you could fly a drone over a neighborhood, capture a few thousand images, and have a detailed three D model within hours for claims adjustment. Urban planning, where you want to see how a proposed new building would actually look in context, not just as a rendering pasted onto a photo, but as a three D model you can walk around and view from any angle.
Cultural heritage preservation is a big one. Documenting historical sites before they degrade further, creating digital archives that are more than just photographs.
That's where the fidelity question becomes existential. If you're preserving Notre Dame digitally, you don't want an AI that decides the gargoyles should look more like standard gargoyles. You want exact documentation of what was actually there. So for preservation applications, you'd probably want to minimize the generative component and maximize the reconstruction component. Use AI for alignment and hole-filling, but don't let it invent details.
That seems like a design principle that should be more explicit in these tools. A slider that goes from "creative" to "faithful," where creative mode lets the model hallucinate freely and faithful mode constrains it to only what's supported by the data.
Some of the research prototypes have exactly that. It's often implemented as a confidence threshold. The model outputs not just geometry and color, but an uncertainty estimate for every voxel or every vertex. You can then visualize the uncertainty, and in some systems, you can set a threshold below which the model refuses to generate and instead leaves the region transparent or marked as unknown. That's crucial for applications where being wrong is worse than being incomplete.
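A minimal sketch of what that faithful-mode threshold might look like, with invented array names and shapes, since every research prototype wires this up differently:

```python
import numpy as np

# Hypothetical faithful-mode filter: keep only the voxels the model is
# confident about, and mark everything else as unknown instead of letting
# the generative prior invent content there.
def faithful_mask(colors, uncertainty, threshold=0.2):
    """colors: (N, 3) array; uncertainty: (N,) array in [0, 1]."""
    confident = uncertainty < threshold
    out = np.full_like(colors, np.nan)    # NaN marks "refused to generate"
    out[confident] = colors[confident]
    return out, confident

colors = np.random.rand(1000, 3)       # stand-in for reconstructed voxel colors
uncertainty = np.random.rand(1000)     # stand-in for per-voxel uncertainty
kept, mask = faithful_mask(colors, uncertainty)
print(f"kept {mask.sum()} of {mask.size} voxels; the rest stay unknown")
```

Drag the threshold toward one and you get the creative end of the slider; toward zero and you get the faithful end.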
I want to circle back to something Daniel said that I think is really astute. He noted that world development didn't drop out of the sky, game developers have been doing it for some time, and the AI layer is a novelty on top of an existing process. I think that's both true and slightly underselling what's changed. A novelty on top of an existing process can still be transformative if it reduces the cost or time by an order of magnitude.
The analogy I'd draw is to what happened with image generation. Artists have been creating images for millennia. Digital art tools have been around for decades. But when diffusion models arrived, they didn't just make existing processes slightly faster, they enabled entirely new workflows and lowered the barrier to entry so dramatically that the number of people who could create compelling images exploded. World generation might follow a similar trajectory. Today, building a detailed three D environment for a game takes a team of artists months. If a single person can do it in a week by providing reference photos and letting the AI handle the tedious parts, that changes who can participate in world-building.
It changes the economics of entire industries. If you're an indie game studio with five people, you previously couldn't afford to build a richly detailed open world. Now maybe you can. That's not just a novelty, that's a structural shift in who gets to compete.
The thing I'm most excited about though, and this connects to what Daniel was saying about creative applications, is the possibility of personalized worlds. Imagine an AR application where you walk through your own neighborhood and the world generation model has rebuilt it in a different aesthetic. You're walking through the actual geometry of your streets, but rendered in the style of nineteen twenties Paris, or a cyberpunk future, or a Studio Ghibli film. The world is real, the geometry is real, but the appearance is transformed. That's something you couldn't do before at all. It requires the AI to understand both the geometric structure of the real world and the aesthetic principles of a particular style, and to merge them coherently.
That's compelling. And it's not just a gimmick, it's a new form of creative expression. You're layering imagination onto reality in a way that's spatially consistent. The street corner where you buy your morning coffee is still recognizable as that street corner, but it's been reimagined.
It's shareable. Multiple people with AR glasses could experience the same transformed world simultaneously. That's a social experience that doesn't really have a precedent.
Let's get into the technical weeds a bit, because I think understanding how this actually works helps answer Daniel's question about where the technology is at. You mentioned feed-forward reconstruction models. Can you unpack what's happening under the hood when Daniel uploads his single Jerusalem photo to one of these Hugging Face demos?
The single-image systems are almost certainly using a two-stage approach. The first stage is a model that takes the two D image and estimates a three D representation, usually something called a triplane or a set of feature grids. This model has been trained on millions of three D assets with associated renderings from multiple viewpoints. During training, it learns the mapping from a single two D view to the underlying three D structure. It's essentially learning the statistical regularities of three D geometry, that vertical lines in an image probably correspond to walls or poles, that a patch of blue at the top of the image is probably sky and should be placed far away, that the texture gradient on a road suggests a receding plane.
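Stripped to its core, the triplane lookup is a very simple trick. The names and shapes here are our own, and real systems interpolate bilinearly and feed the result through a small MLP for density and color, but the geometry of the idea is just this:

```python
import numpy as np

# Triplane feature lookup, minimal sketch: three axis-aligned feature planes,
# and a 3D point's feature is the sum of its three plane projections.
R, C = 64, 32                        # plane resolution, feature channels
planes = {
    "xy": np.random.randn(R, R, C),  # stand-ins for learned feature planes
    "xz": np.random.randn(R, R, C),
    "yz": np.random.randn(R, R, C),
}

def to_index(u):
    # Map a coordinate in [-1, 1] to a plane pixel. Nearest neighbor here;
    # real systems interpolate bilinearly.
    return int(np.clip((u + 1) / 2 * (R - 1), 0, R - 1))

def triplane_feature(point):
    x, y, z = (to_index(c) for c in point)
    return planes["xy"][x, y] + planes["xz"][x, z] + planes["yz"][y, z]

feat = triplane_feature(np.array([0.1, -0.4, 0.7]))
print(feat.shape)   # (32,) -- this vector would feed a small decoder MLP
```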
It's pattern matching in three dimensions, basically.
It's pattern matching, but the patterns are embedded in a learned latent space that captures three D structure. The second stage takes that estimated three D representation and renders it from whatever viewpoint the user wants to explore. The rendering is often done with a neural radiance field or a Gaussian splatting approach. Gaussian splatting has become really dominant in the last year or so because it's fast and produces high-quality results. The basic idea is that the scene is represented as millions of tiny three D Gaussian blobs, each with a position, a color, an opacity, and a covariance that determines its shape. You can render these very efficiently by projecting them onto the image plane and blending them together.
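To make that concrete, here's what a single splat and the front-to-back blending along one camera ray reduce to. The field names are our own rather than any particular library's, and a real renderer projects the three D covariances into screen space before blending, which we're skipping.

```python
from dataclasses import dataclass
import numpy as np

# Minimal sketch of the Gaussian splatting representation: each splat is a
# position, a color, an opacity, and a covariance that shapes the blob.
@dataclass
class Splat:
    position: np.ndarray    # (3,) world-space center
    color: np.ndarray       # (3,) RGB
    opacity: float          # alpha in [0, 1]
    covariance: np.ndarray  # (3, 3) shape and orientation of the blob

def composite(splats_front_to_back):
    """Standard front-to-back alpha blending along one camera ray."""
    color = np.zeros(3)
    transmittance = 1.0               # light that still gets through
    for s in splats_front_to_back:
        color += transmittance * s.opacity * s.color
        transmittance *= 1.0 - s.opacity
    return color

ray = [
    Splat(np.zeros(3), np.array([0.9, 0.8, 0.6]), 0.6, np.eye(3) * 0.01),
    Splat(np.ones(3), np.array([0.2, 0.3, 0.8]), 0.9, np.eye(3) * 0.02),
]
print(composite(ray))   # foreground mostly wins; some background leaks through
```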
When you feed in more images, you're providing more constraints on where those Gaussians should be and what color they should have. With one image, the model is guessing. With four hundred, it's measuring.
That's the intuition, but the implementation gets subtle. With four hundred images, you could do classical structure-from-motion to get camera poses and a sparse point cloud, then use that to initialize your Gaussians, and then optimize. The AI component becomes more about hole-filling and refinement than about the core reconstruction. The heavy lifting of geometric reconstruction is being done by algorithms that are decades old, just accelerated and augmented with learned components.
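Concretely, the seeding step might look like this, assuming pycolmap's Reconstruction loader; the attribute names follow the bindings as we understand them, so check against your installed version.

```python
import numpy as np
import pycolmap

# Seed Gaussians from the sparse cloud that structure-from-motion produced.
# Each triangulated point becomes one initial splat; optimization then
# renders these against the input photos and refines them by gradient
# descent, splitting and pruning splats as it goes.
rec = pycolmap.Reconstruction("sparse/0")   # output of the earlier pipeline

positions = np.array([p.xyz for p in rec.points3D.values()])
colors = np.array([p.color for p in rec.points3D.values()]) / 255.0
opacities = np.full(len(positions), 0.5)       # neutral starting alpha
scales = np.full((len(positions), 3), 0.01)    # small isotropic starting blobs

print(f"seeded {len(positions)} Gaussians from the sparse cloud")
```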
That's an important point that I think gets lost in the hype. The AI isn't replacing the entire pipeline. It's replacing specific components where learned approaches outperform hand-crafted algorithms, and those tend to be the components that require semantic understanding. Knowing that a particular blob of pixels is a window and should be flat and rectangular, that's a semantic inference. Knowing that two images were taken from slightly different angles and computing the camera positions from feature correspondences, that's geometry, and classical methods are still very good at it.
The best systems today are hybrids. They use classical computer vision for the parts that are well-understood mathematically, and learned models for the parts that require visual understanding. It's the same pattern we've seen in other domains, like robotics, where you use classical control theory for low-level motor control and learned models for high-level perception and planning.
If Daniel wants to actually build a faithful three D model of Jerusalem, the workflow probably looks something like, capture a few hundred images with good overlap, run them through a structure-from-motion pipeline to get camera poses, use multi-view stereo to get a dense point cloud, and then use a world generation model to fill in the gaps and produce a clean, renderable mesh or Gaussian splat. The AI is doing the cleanup and the inpainting, not the core reconstruction.
That's the current state of the art for high-fidelity reconstruction. But here's the thing, there's a parallel track of research that's trying to replace the entire pipeline with a single learned model, end-to-end. You feed in images, you get out a three D world. No explicit camera pose estimation, no separate stereo matching step. That's the holy grail, and it's not there yet for high-fidelity work, but it's improving fast. The advantage of the end-to-end approach is that it can learn to be robust to things that break classical pipelines, like reflective surfaces, thin structures, textureless walls. Classical stereo matching falls apart when there's nothing to match. A learned model can infer depth from monocular cues, from the semantic context, from the way lighting falls on a surface.
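The monocular-cue point is easy to try yourself. Hugging Face's transformers library ships a depth-estimation pipeline; Intel/dpt-large is one commonly used checkpoint, the filename is a placeholder, and the output is relative rather than metric depth, which is the hedge that matters.

```python
from transformers import pipeline
from PIL import Image

# Learned monocular depth: the model infers a plausible depth map from a
# single photo using semantic and shading cues, texture or no texture.
depth_estimator = pipeline("depth-estimation", model="Intel/dpt-large")

image = Image.open("jerusalem_night.jpg")   # placeholder filename
result = depth_estimator(image)

# result["depth"] is a PIL image of relative depth. These are plausible
# inferences, not measurements -- exactly the hallucination trade-off above.
result["depth"].save("jerusalem_depth.png")
```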
That's the difference between a system that works in ideal conditions and a system that works in the real world. Jerusalem has plenty of textureless stone walls. A classical pipeline would struggle. A learned model that's seen enough stone walls might do better.
But it might also hallucinate details that aren't there. The stone wall in your reconstruction might have a crack pattern that doesn't exist in reality. For a game, no one cares. For architectural preservation, that's a serious problem.
We keep coming back to this fidelity question. Let's try to give Daniel a concrete answer. If he walks around Jerusalem and captures four hundred high-quality images with good overlap, and feeds them into the best available system today, what does he get?
He gets a model that's geometrically accurate to within a few centimeters for the surfaces he captured well, with AI-generated filler for occluded regions and areas with poor coverage. The textures will be photorealistic where the input images were sharp and well-exposed, and will degrade in quality where the input was blurry or poorly lit. The AI will attempt to normalize the lighting so the scene looks consistent, but this lighting normalization can sometimes produce an uncanny flatness. The model will be explorable in real time on consumer hardware thanks to Gaussian splatting or a similar efficient representation. And critically, the model will be a static snapshot. It won't include moving objects, it won't update as the real world changes, and it won't understand that the crane in Daniel's photo was temporary.
The crane is a great example. A model trained on internet data knows that construction cranes exist and what they look like. But it doesn't know that this specific crane was erected in January twenty twenty-six and will be gone by June. It might incorporate the crane into its understanding of "what Jerusalem looks like" in a way that's misleading for long-term use.
Temporal understanding is the next frontier. Right now, these models capture a moment. They don't understand change over time. But you can imagine a future version where a system continuously updates its world model as new images come in, detecting changes, flagging temporary structures, learning the rhythm of a place. That's not science fiction, it's an active area of research called four D reconstruction or dynamic scene modeling.
Four D being three D plus time.
And that's where things get really interesting for AR applications. If your AR glasses are maintaining a live world model of your environment, they need to understand not just the static geometry but the dynamic elements, the cars moving, the people walking, the leaves rustling. That requires a fundamentally different approach than the static reconstruction we've been discussing.
Let's talk about the Hugging Face ecosystem specifically, because that's where Daniel encountered these demos. What's actually available there, and how does it compare to the research frontier?
Hugging Face has become this fascinating distribution mechanism for three D AI models. You see a lot of Gradio demos where you can upload an image and get back a three D model in a few seconds. The most popular ones are usually built on top of models like Stable Zero one two three, or various implementations of the LRM family, Large Reconstruction Models. These are the single-shot systems Daniel tried. They're impressive for what they are, but they're very limited compared to what you can do with a proper multi-view pipeline. The Hugging Face demos are great for understanding the concept, but they're not representative of what's possible if you invest real effort in capture.
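If you'd rather script against one of those demos than click through the web page, the gradio_client package does it. The space name and endpoint below are placeholders, since every demo defines its own signature; the space's "Use via API" tab shows the real one.

```python
from gradio_client import Client, handle_file

# Calling a hypothetical image-to-3D Gradio space programmatically.
client = Client("some-user/image-to-3d-demo")   # placeholder space name
result = client.predict(
    handle_file("jerusalem_night.jpg"),         # the input photograph
    api_name="/predict",                        # placeholder endpoint
)
print(result)   # typically a path to a downloadable mesh or splat file
```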
Daniel's experience of "it wasn't great" is exactly what you'd expect from a single-shot demo. The technology is capable of much more, but you need to move beyond the demo and into actual multi-view reconstruction workflows.
And those workflows are becoming more accessible. There are tools like Polycam, Luma AI, and Kiri Engine that package the complex pipeline into a mobile app. You walk around an object or a space, the app captures video, uploads it to the cloud, and you get back a three D model. The results are impressive for casual use. For a space like a city street, you'd want something more sophisticated, but the consumer tools are getting better every quarter.
I want to pull on a thread Daniel raised that we haven't fully addressed. He said, "if you were to feed that into the best-in-class world generation model today, would you be replicating the Jerusalem that you saw, or would you be creating something that looks very similar but doesn't really map on?" And I think the answer is, it depends on what you mean by "map on." Geometrically, with enough input data, it maps on quite well. Semantically, it's still limited. The model knows that a building is a building, but it doesn't know that this specific building is the Jerusalem International Convention Center. That semantic layer is separate, and it's something you'd need to add with additional annotation or by integrating with mapping data.
That's a crucial distinction. A world generation model gives you geometry and appearance. It doesn't give you meaning. It doesn't tell you that this street is Jaffa Road, that this building was built in the Ottoman period, that this cafe is where you had your first date. That semantic layer has to come from somewhere else. Some research groups are working on joint reconstruction and semantic labeling, where the model simultaneously builds the geometry and classifies every surface, but it's early days for that kind of integrated approach.
The world generation model gives you the stage, but not the script.
And that's actually fine for many applications. If you're building a game, you want the stage, and you'll write your own script. If you're doing AR navigation, you need the geometry for occlusion and placement, and you can get the semantic labels from a separate system. The division of labor works.
What about the compute requirements? Daniel mentioned walking around a city and taking four or five hundred images. That's a lot of data. Can a consumer-grade system handle that?
Today, you'd probably offload the heavy processing to the cloud. Training or optimizing a detailed Gaussian splat of a city block from hundreds of images takes a decent GPU, something with at least sixteen gigs of VRAM, and it can take anywhere from minutes to hours depending on the resolution and the size of the area. But inference, actually rendering and exploring the resulting model, is surprisingly lightweight. A well-optimized Gaussian splat can run at sixty frames per second on a modern phone. So the creation is expensive, but the consumption is cheap. That asymmetry is important for thinking about use cases. You do the heavy processing once, and then the resulting world can be experienced by millions of users on cheap hardware.
That's the model that makes AR worlds viable. A city government or a tourism board commissions a high-quality scan of a historic district, processes it in the cloud, and then visitors can explore it on their phones or AR glasses. The cost is amortized across thousands or millions of users.
You can imagine a future where this is crowdsourced. Thousands of tourists are already taking photos of the same landmarks every day. If those photos could be pooled and used to continuously update a shared world model, you'd have a living, evolving digital twin of the city that gets more detailed over time. There are privacy implications to work through, obviously, but the technical foundation is being laid right now.
Privacy is a whole other episode, but yes, the idea of a continuously updated digital twin built from tourist photos raises some questions. Let's stay focused on Daniel's question about where the technology is at. I think we've established that single-shot is a party trick, multi-view with AI enhancement is the real deal, and the gap between "looks like Jerusalem" and "is Jerusalem" narrows as you add more input data but never fully closes because the AI is always making some inferences.
I'd add that "never fully closes" might be too pessimistic. For surfaces you've photographed from multiple angles with good lighting, the reconstruction can be essentially perfect, down to the level of individual pebbles on the ground. The problem is that in any real-world capture of a city, there will always be surfaces you missed, the back of that billboard, the roof of that building, the alley you didn't walk down. The AI fills those gaps, and the quality of the fill depends on how typical the missing surface is. A missing section of a repetitive stone wall gets filled in nearly perfectly. A missing section with a unique mural gets filled in with something generic and wrong.
The technology is simultaneously more impressive than the demos suggest and more limited than the hype implies. It's in that awkward adolescent phase where it's clearly going to be transformative but isn't quite there yet for the most demanding applications.
That's exactly where we are in April twenty twenty-six. And the trajectory is steep. The papers coming out of CVPR and SIGGRAPH this year are pushing on all the fronts we've discussed, faster reconstruction, better hole-filling, temporal consistency, semantic integration. I wouldn't be surprised if by twenty twenty-eight, the distinction between "reconstructed" and "generated" becomes almost invisible for casual users.
Let's talk about one more use case that I think is underexplored, training data for other AI systems. If you can generate photorealistic, geometrically accurate three D worlds, you can use those worlds to train robots, autonomous vehicles, drones. Instead of collecting millions of miles of real-world driving data, you can simulate in a world that's indistinguishable from reality.
Simulation is arguably the killer app for world generation. Waymo and Tesla and the other autonomous vehicle companies have been building simulated worlds for years, but they're mostly hand-crafted or procedurally generated. The ability to automatically generate a high-fidelity simulation environment from a few hours of driving around a new city would dramatically accelerate deployment to new locations. And it's not just autonomous driving. Any AI system that needs to operate in the physical world, warehouse robots, delivery drones, construction equipment, can be trained more safely and more cheaply in simulation if the simulation is good enough.
"good enough" is the key phrase. The sim-to-real gap has been a persistent problem in robotics. You train a policy in simulation, it works perfectly, you deploy it on a real robot, and it fails because the simulation wasn't accurate enough. World generation models that can capture the actual geometry and appearance of a specific real-world environment could narrow that gap significantly.
There's a really interesting feedback loop here. World generation models are trained on real-world data to produce realistic environments. Those environments are then used to train robots. Those robots collect more real-world data, which improves the world generation models. It's a virtuous cycle, and it's one of the reasons I think this technology is strategically important beyond just games and AR.
We've covered games, AR, real estate, insurance, cultural preservation, urban planning, and simulation for robotics. Daniel was right, there are a huge number of applications. The common thread is that they all require some level of geometric fidelity to the real world, and the value of the AI layer is in making that fidelity cheaper and faster to achieve.
In enabling applications that were previously impossible, not just making existing applications cheaper. The personalized AR world I described, where your neighborhood is re-rendered in a different aesthetic while preserving its geometry, that's not something you could do before at any price. It's new.
Alright, let's try to land this. If Daniel wants to actually build a faithful three D model of Jerusalem, what should he do? What's the practical advice?
First, use a proper multi-view capture app, not a single-shot demo. Polycam or Luma AI are good starting points. Second, capture methodically. Walk slowly, overlap your frames by about seventy percent (I'll put concrete numbers on that before we wrap), and make sure you're covering every surface from multiple angles. For a city street, that means walking down both sides, capturing the facades head-on and at oblique angles. Third, pay attention to lighting. Overcast days are ideal because you get diffuse, even illumination without harsh shadows. Golden hour looks beautiful but creates strong shadows that can confuse the reconstruction. Fourth, accept that you'll need to do some cleanup. The raw output will have floating artifacts, holes, and blurry regions. Tools like Blender or specialized Gaussian splat editors can help you clean things up.
Fifth, be realistic about what you're going to get. It'll be a geometrically faithful representation of what you photographed, with AI-generated filler for the gaps. It'll look like Jerusalem, it'll feel like Jerusalem, but a sharp-eyed resident will spot differences. It's a tool for communication and experience, not for archival-grade documentation.
Not yet, anyway. Give it a few more years.
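Oh, and the concrete numbers I promised on the seventy percent overlap rule. The phone field of view here is a rough assumption, so treat this as back-of-the-envelope.

```python
import math

# For a facade at distance d, a camera with horizontal field of view fov
# sees a strip roughly 2 * d * tan(fov / 2) wide; for seventy percent
# overlap, each new frame steps sideways by only thirty percent of that.
def step_for_overlap(distance_m, fov_deg=70.0, overlap=0.7):
    footprint = 2 * distance_m * math.tan(math.radians(fov_deg) / 2)
    return (1 - overlap) * footprint

print(f"{step_for_overlap(10):.1f} m between shots at 10 m")  # ~4.2 m
print(f"{step_for_overlap(3):.1f} m between shots at 3 m")    # ~1.3 m
```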
Now: Hilbert's daily fun fact.
Hilbert: The city of Bangkok's full ceremonial name is Krung Thep Mahanakhon Amon Rattanakosin Mahinthara Ayuthaya Mahadilok Phop Noppharat Ratchathani Burirom Udomratchaniwet Mahasathan Amon Piman Awatan Sathit Sakkathattiya Witsanukam Prasit. It is the longest place name in the world and is listed in the Guinness Book of World Records.
That's quite a mouthful. I understand why they shortened it.
I'd love to see the envelope they write on their mail.
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop. If you want more episodes, find us at myweirdprompts dot com or search for My Weird Prompts on Spotify. We'll be back soon with another one.