Daniel sent us this one — he's asking about AI upscaling, that whole phenomenon where algorithms turn blurry, low-res images and video into something actually usable. And he's zeroed in on the practical side of this: what models actually work, how they work under the hood, and how marketers and content people can use them to salvage collateral that would otherwise end up in the trash. There's a real tension here between what these tools promise and what they actually deliver, and I think the video side especially is where things get interesting.
The video piece is fascinating because it's not just the image problem scaled up — it's a fundamentally different beast. But before we even get there, we should probably define what we're talking about when we say upscaling, because the term gets thrown around for everything from basic pixel stretching to full generative hallucination, and those are not the same thing at all.
So give me the quick taxonomy. What's the difference between the "enhance" button on a phone and what something like Real-ESRGAN is doing?
The fundamental split is between interpolation and super-resolution. Traditional upscaling — bilinear, bicubic interpolation — that's just math on a grid. You have a pixel here and a pixel there, and you're calculating what color the new pixel between them should be based on a weighted average of the neighbors. It's deterministic, it's fast, it's been around for decades, and the output always looks soft because you're not adding information, you're just smoothing the transitions between existing information.
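To make the "math on a grid" point concrete, here is a minimal sketch of bilinear interpolation on a tiny grayscale grid. The function name and the 2-by-2 test image are made up for illustration; real resizers handle edges and sampling positions more carefully.

```python
def bilinear_upscale(img, scale):
    """Upscale a 2D grid of grayscale values by an integer factor.

    Each output pixel is a weighted average of the four nearest source pixels.
    """
    h, w = len(img), len(img[0])
    out_h, out_w = h * scale, w * scale
    out = []
    for oy in range(out_h):
        # Map the output coordinate back into source space.
        sy = oy * (h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        fy = sy - y0
        row = []
        for ox in range(out_w):
            sx = ox * (w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            fx = sx - x0
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


tiny = [[0.0, 100.0],
        [100.0, 0.0]]
big = bilinear_upscale(tiny, 4)
values = [v for row in big for v in row]
# Every new value sits between the existing extremes: nothing is invented.
print(min(values), max(values))
```

The output never leaves the range of the input pixels, which is the soft look in numerical form: the transitions get smoother, but no new detail appears.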
It's the visual equivalent of putting a slightly damp cloth over the lens.
That's not bad. Super-resolution, on the other hand, is inventing pixels based on learned patterns. The model looks at a patch of the image and says, "Based on every brick wall I've ever seen in my training data, this blurry arrangement of tan and brown pixels is probably a brick wall, and I'm going to render it with actual brick texture." It's not recovering detail — the detail was never there. It's hallucinating plausible detail.
That's the key thing people get wrong, isn't it? This idea that AI upscaling is somehow pulling hidden information out of the original file like it was there all along, just waiting for a smarter algorithm to find it.
Completely wrong, and it's probably the number one misconception we should address. A 64 by 64 pixel face contains a specific amount of information — mathematically, information-theoretically — and no algorithm can recover eyelashes or pores that weren't captured. What super-resolution does is say, "I've seen faces at this resolution that upscale to faces at higher resolution, and here's my best guess at what those pores and eyelashes probably looked like." It's a reconstruction, not a recovery. Think of it like a forensic sketch artist working from a blurry security camera photo — they're not revealing what was actually there, they're drawing what they believe the person probably looked like.
Which means every upscaled image is a collaboration between you and someone else's training data. And that training data might have been mostly high-quality studio headshots when you're upscaling footage shot on a potato in a conference room.
And that mismatch is where a lot of the weirdness comes from. Now, this concept of AI-driven upscaling isn't new — it actually predates the whole generative AI explosion. There was Waifu2x back in twenty-fifteen, which was originally built for upscaling anime-style artwork using convolutional neural networks. And even earlier, SRCNN in twenty-fourteen — that's Super-Resolution Convolutional Neural Network — was one of the first papers showing that a neural network could learn an end-to-end mapping from low-res to high-res images and beat bicubic interpolation.
Wait, Waifu2x — this was an anime upscaler?
It was designed for two-dimensional animation art and it genuinely worked well for that domain. But it's interesting because it illustrates a principle that carries forward: these models work best on the kind of data they were trained on. Waifu2x was trained on anime, so it did clean lines and flat color regions beautifully. Try it on a photograph of a forest and you'd get something that looked like an oil painting.
The whole field is basically a series of increasingly sophisticated matching games between what the model was trained to expect and what you feed it. Tell me about where this really took off — you mentioned ESRGAN as the inflection point.
ESRGAN, the Enhanced Super-Resolution Generative Adversarial Network, came out in twenty-eighteen from Xintao Wang and collaborators, and it really marked a leap in perceptual quality. So let me unpack the architecture, because understanding this helps with understanding the limits. The core of ESRGAN is a generator network built out of these things called RRDB blocks — Residual-in-Residual Dense Blocks — which is a mouthful, but what it means in practice is that information flows through the network in a way that preserves detail from earlier layers and recombines it later. That's the residual part: instead of completely transforming the image at each layer, it learns the difference from the input, which is a much easier optimization problem.
Rather than building a high-res image from scratch internally, it's building corrections to a base, and the deep architecture lets it correct at multiple scales at once.
But the real secret sauce is the adversarial training — that's the GAN part. You have two networks playing a game. The generator creates upscaled images. The discriminator is trained to tell the difference between real high-resolution images and the generator's output. The generator's loss function includes this adversarial loss — it gets penalized not just for getting pixels wrong, but for creating images that the discriminator can identify as fake.
It's training to fool its own internal critic, and the critic keeps getting better too.
Yes, and critically, ESRGAN also leaned heavily on perceptual loss, an idea it inherited from earlier work and refined. Instead of only comparing pixel values, which tends to produce blurry outputs because averaging possible pixel values is the safest bet, they ran both the generated image and the ground truth through a pre-trained classifier network like VGG and compared the internal feature representations. This made the outputs much sharper and more texture-rich, because the model was optimizing for something closer to human perception of structure rather than per-pixel accuracy.
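The claim that averaging is "the safest bet" under a pixel loss can be checked with a toy example. This is a sketch with a made-up one-dimensional edge, not ESRGAN's actual loss: it assumes two equally likely sharp answers and compares the expected mean squared error of committing to one against hedging between them.

```python
def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Two equally plausible sharp "ground truths" for the same blurry input:
# a step edge that could sit at either of two positions.
edge_left = [0.0, 0.0, 1.0, 1.0, 1.0, 1.0]
edge_right = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
blurry_average = [(l + r) / 2 for l, r in zip(edge_left, edge_right)]

def expected_mse(pred):
    """Average pixel loss if either ground truth is equally likely."""
    return (mse(pred, edge_left) + mse(pred, edge_right)) / 2

print(expected_mse(edge_left))       # committing to one sharp edge
print(expected_mse(blurry_average))  # hedging with a soft ramp
```

The soft ramp scores better on pixel error even though neither true image contains a ramp. That is the pressure toward blur that perceptual and adversarial terms are meant to counteract.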
A clean mathematical downscale isn't what real-world low-res imagery looks like. That was the Real-ESRGAN leap?
That was the critical insight. The original ESRGAN was trained on a "bicubic downscaling" degradation model — you take a high-res image, shrink it cleanly with bicubic, and the network learns to reverse that. But real-world low-res imagery has sensor noise, compression artifacts from JPEG, motion blur, improper focus, chromatic aberration. If you feed real degraded images into a model trained only on clean synthetic degradation, the output is often terrible because the model doesn't recognize the degradation pattern and can't invert it.
Like training to recognize that someone's glasses are foggy when all your training pairs are crisp lenses in slightly lower light.
A very Corn metaphor. Real-ESRGAN, published by Xintao Wang's group in July twenty twenty-one, solved this by building what they called a "high-order degradation pipeline." Instead of one simple downscaling operation, they stacked multiple degradation processes in a randomized order — blur, noise, JPEG compression, resizing, more blur, more compression — with randomized parameters at each step. The network was forced to learn to handle this combinatorial explosion of degradation types, and the result is a model that generalizes remarkably well to real-world images it's never seen before.
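As a rough illustration of stacking degradations, here is a sketch on a single scanline. Everything in it is an assumption for demonstration: the function names, the parameter ranges, and especially the quantization step, which is only a crude stand-in for JPEG compression. It is not the Real-ESRGAN training code.

```python
import random

def box_blur(row, radius):
    out = []
    for i in range(len(row)):
        window = row[max(0, i - radius): i + radius + 1]
        out.append(sum(window) / len(window))
    return out

def add_noise(row, sigma, rng):
    return [min(255.0, max(0.0, v + rng.gauss(0, sigma))) for v in row]

def downsample(row, factor):
    return row[::factor]

def quantize(row, step):
    # Crude stand-in for lossy compression: snap values to a coarse grid.
    return [min(255.0, round(v / step) * step) for v in row]

def degrade_once(row, rng, factor):
    """One round of blur, resize, noise, compression with random strengths."""
    row = box_blur(row, rng.randint(0, 2))
    row = downsample(row, factor)
    row = add_noise(row, rng.uniform(0.0, 8.0), rng)
    row = quantize(row, rng.choice([1, 4, 8, 16]))
    return row

def high_order_degrade(row, rng, order=2):
    """Apply the whole degradation round more than once."""
    for _ in range(order):
        row = degrade_once(row, rng, factor=2)
    return row

rng = random.Random(0)
sharp = [0.0] * 16 + [255.0] * 16
low_res = high_order_degrade(sharp, rng)
print(len(sharp), "->", len(low_res))
```

Pairs of `sharp` and `low_res` generated this way, with different random draws each time, are the kind of training data the high-order idea is after: the model never sees the same damage recipe twice.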
It trained on the full disaster buffet and learned to reverse-engineer any particular combination. That's clever. But it's still hallucinating, and the hallucination has to be plausible. How do you tell a good hallucination from a bad one in practice?
This is what I've come to think of as the "hallucination budget." Every upscaler has to spend this budget — it invents high-frequency detail, full stop. A good hallucination is one that's consistent with the broader scene context and doesn't contradict any visible constraint in the source. So if you upscale a 64-by-64 face, and the model adds pore texture, individual eyelash strands, subtle skin tone variation — and those additions are consistent with the lighting direction, the age of the face, the expression — that's the model spending its budget well. A bad hallucination is when you upscale a brick wall and the model hallucinates a repeating tile pattern that doesn't exist in the original because it's trying too hard to impose a regular texture on something that might be irregular.
That's the brick wall failure mode. I've seen this happen — the upscaled version looks unnervingly like a wallpaper print of bricks instead of actual bricks.
And the reason for that gets deeper with something like SwinIR from twenty twenty-one and HAT from twenty twenty-three, which are transformer-based super-resolution models. These use attention mechanisms that let the model look at relationships between distant parts of the image. For most image content, this is great — it prevents the checkerboard artifacts you'd get from pure convolutional approaches that only look at local neighborhoods. But for a repetitive pattern like bricks or fabric weave, the attention mechanism can sometimes "detect" false long-range correlations and impose a grid-like regularity that the real bricks don't have.
It's seeing patterns that aren't there, which is a very human cognitive failure too. Before we jump to video — what do people actually chain together in practice? Because I see these GitHub repos, GFPGAN, CodeFormer — they're often used in sequence.
The open-source ecosystem for this has become remarkably sophisticated. The standard stack for faces — and this is what a lot of enthusiasts and even professionals use — goes roughly like this: Real-ESRGAN handles the general upscaling and artifact removal. Then GFPGAN, or Generative Facial Prior GAN, refines the face specifically. GFPGAN is fascinating because it injects a pretrained face generation prior into the upscaling process — it essentially knows what faces are supposed to look like, so when it encounters a degraded face, it can pull it toward the manifold of plausible face images while staying faithful to the identity and expression.
CodeFormer takes the idea of using a learned prior even further. It encodes the degraded face into a latent code, then refines that code by consulting a codebook of clean face features learned from a high-quality face dataset. It's basically saying, "This face's nose region looks like codebook entry three hundred seventy-two, but slightly corrupted." The refinement happens in this compressed latent space, which is more robust to severe degradation. The output tends to be more faithful to identity than raw generative approaches — less likely to give the person a different-shaped face.
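The codebook idea can be sketched as a nearest-neighbor lookup. This is the plain vector quantization version with a hypothetical four-entry codebook; as far as I understand, CodeFormer itself predicts the codebook indices with a transformer rather than by nearest distance, so treat this as the simplest possible picture of the concept.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest_code(feature, codebook):
    """Return the index of the codebook entry closest to a feature vector."""
    return min(range(len(codebook)),
               key=lambda i: squared_distance(feature, codebook[i]))

# Hypothetical 2-D "clean face feature" codebook; real ones are far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

degraded = [0.9, 0.2]       # a corrupted feature from a blurry input
index = nearest_code(degraded, codebook)
restored = codebook[index]  # swap in the clean entry
print(index, restored)
```

Because the output is assembled only from clean entries, severe corruption in the input cannot leak straight through to the result.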
There's this whole taxonomy of when you reach for which tool, and it depends on what you're solving for — texture plausibility versus identity preservation. Which brings us to the case study you mentioned. A product photo from twenty fifteen at something like eight hundred by six hundred, upscaled to four-K.
This is a practical scenario I think a lot of marketers will recognize. Company rebrands, new website, everything needs to be crisp four-K or at least high-res retina quality — and someone digs up a hero product shot that only exists at web resolution from eight years ago.
The original raw file is on a drive that hasn't been spun up since the Obama administration.
It's in a folder called "final final USE THIS ONE FINAL" and it's still eight hundred pixels wide. So you run it through Real-ESRGAN and Topaz Gigapixel separately to compare. On a fabric weave product shot — let's say a close-up of upholstery material — Real-ESRGAN does a phenomenal job on the texture. The weave looks plausibly tight, the thread variations are convincing, the surface appears tactile. Topaz Gigapixel, being a commercial tool with a highly refined inference pipeline, often produces slightly cleaner results on edges and color fidelity but sometimes oversmooths texture in favor of a more "photographic" look.
Which is interesting because it points to an aesthetic choice being embedded in the training process — what does a "real" photo look like? That's a cultural assumption more than a technical one.
And here's a concrete failure pattern that comes up a lot in marketing collateral. When you upscale an image that has text in it — signs in the background, product labels, anything with characters — the output is frequently garbled pseudo-text. It looks like letterforms from a distance, but close up it's gibberish. The model knows "this texture pattern in this context often corresponds to text," so it hallucinates text-like shapes. But because the semantic content — the actual letter identity — wasn't recoverable from the degradation, you end up with what I can only describe as alphabet soup rendered in plausible typography.
Like the AI equivalent of that dream where you try to read a sign and the words keep shifting.
And the mitigation for this is to mask text regions before upscaling and handle them separately — either vectorize the text or use models specifically trained on text super-resolution, which is a whole subfield. This is a classic case where just slapping Real-ESRGAN on the whole image and calling it done will embarrass you if the upscaled image gets examined closely.
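The mask-and-handle-separately idea might look something like this in outline. The upscaler here is a pixel-repeating placeholder, and the text box coordinates and the "clean text" patch are invented for the example; in practice the patch would come from re-typesetting or a text-specific model.

```python
def nearest_upscale(img, scale):
    """Placeholder upscaler: repeat each pixel. Swap in a real model here."""
    return [[v for v in row for _ in range(scale)]
            for row in img for _ in range(scale)]

def paste_region(canvas, patch, left, top):
    """Return a copy of canvas with patch written at (left, top)."""
    out = [list(row) for row in canvas]
    for dy, patch_row in enumerate(patch):
        for dx, value in enumerate(patch_row):
            out[top + dy][left + dx] = value
    return out

scale = 2
image = [[10] * 4 for _ in range(4)]
text_box = (1, 1, 3, 2)  # left, top, right, bottom in source pixels (hypothetical)

upscaled = nearest_upscale(image, scale)

# Stand-in for text re-rendered cleanly at the target size.
left, top, right, bottom = (c * scale for c in text_box)
clean_text = [[255] * (right - left) for _ in range(bottom - top)]

result = paste_region(upscaled, clean_text, left, top)
print(len(result), len(result[0]))
```

The design choice is that the general upscaler never gets the final say on pixels inside the text box, so it has no chance to invent letterforms there.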
So far we've been talking about single frames. But Daniel's prompt specifically asked about video, and the frame-by-frame approach is exactly where video upscaling goes sideways. Walk me through that.
Video upscaling as naive per-frame processing is a disaster in motion. Here's why. Picture a close-up of a person speaking at a conference. Frame one, the model hallucinates a specific pattern of pores and a specific catchlight reflection in the eye. Frame two — which is a thirtieth of a second later — the model runs independently and hallucinates a slightly different pore arrangement and a catchlight in a slightly different position. At thirty frames per second, these frame-to-frame inconsistencies produce visible flickering. Every part of the image that's model-invented, which is basically all the high-frequency detail, is subtly shifting and swimming between frames. It's unwatchable, or at least deeply uncanny.
The video upscaling problem isn't just the image upscaling problem times thirty per second — it's an entirely new constraint: temporal coherence. You need the model to understand that frame twenty-three and frame twenty-four represent mostly the same underlying reality.
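One simple way to put a number on flicker is to measure how much consecutive frames differ. The sketch below assumes a perfectly static scene and simulates "invented detail" as random offsets; the metric name and the setup are illustrative, not a standard benchmark.

```python
import random

def mean_abs_diff(frame_a, frame_b):
    return sum(abs(a - b) for a, b in zip(frame_a, frame_b)) / len(frame_a)

def flicker_score(frames):
    """Average frame-to-frame change; higher means more temporal flicker."""
    diffs = [mean_abs_diff(a, b) for a, b in zip(frames, frames[1:])]
    return sum(diffs) / len(diffs)

rng = random.Random(1)
base = [128.0] * 64  # a static scene: the true content never changes

# Per-frame upscaling: each frame gets its own independently invented detail.
independent = [[v + rng.uniform(-10, 10) for v in base] for _ in range(30)]

# Temporally coherent upscaling: the invented detail is chosen once and reused.
detail = [rng.uniform(-10, 10) for _ in base]
coherent = [[v + d for v, d in zip(base, detail)] for _ in range(30)]

print(flicker_score(independent))
print(flicker_score(coherent))
```

Both versions contain the same amount of invented detail per frame. Only the independent one shimmers, which is why temporal coherence is a separate constraint and not just a quality setting.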
We're only beginning to solve this effectively. Basic Video Super Resolution, or BasicVSR, and then BasicVSR Plus Plus in early twenty twenty-two, were major steps forward. These use something called bidirectional optical flow propagation. The idea is that instead of processing each frame in isolation, the model calculates how pixels move between frames — the optical flow — and propagates information along those motion paths, both forward and backward through time. If a face moves from left to right across the frame over two seconds, the model can pool observations from all those frames, aggregating information that would be too degraded in any single frame to reconstruct reliable detail.
The model is borrowing resolution from neighboring frames, essentially using the fact that the same object persists across time to build a better average.
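The "better average" intuition can be shown with a toy scanline. This sketch assumes the motion is already known (one pixel per frame) and that each frame is the same content plus independent noise; real models have to estimate the flow, which is the hard part.

```python
import random

def shift(row, offset):
    """Circularly shift a scanline to simulate simple horizontal motion."""
    offset %= len(row)
    return row[-offset:] + row[:-offset] if offset else list(row)

def align_and_average(frames, offsets):
    """Undo each frame's known motion, then average the aligned observations."""
    aligned = [shift(f, -o) for f, o in zip(frames, offsets)]
    n = len(aligned)
    return [sum(col) / n for col in zip(*aligned)]

def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

rng = random.Random(2)
truth = [float((i * 37) % 256) for i in range(64)]
offsets = list(range(16))  # the object moves one pixel per frame
frames = [[v + rng.gauss(0, 20) for v in shift(truth, o)] for o in offsets]

single = frames[0]
fused = align_and_average(frames, offsets)
print(mean_abs_error(single, truth))
print(mean_abs_error(fused, truth))
```

Sixteen aligned observations cut the noise substantially compared with any one frame, and that shared estimate is also what keeps the result stable from frame to frame.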
Yes, with the clever addition of a recurrent structure that maintains a hidden state across frames, similar to how recurrent neural networks process text. This allows it to "remember" what it's reconstructed previously and maintain consistency. The metrics are impressive — on the REDS4 dataset, a standard video super-resolution benchmark, BasicVSR Plus Plus achieved a 7.3 decibel peak signal-to-noise ratio improvement over simple bicubic interpolation, which in image quality terms is enormous.
The compute implications here are wild. What does this actually take to run?
That's where this goes from intriguing to humbling. RealBasicVSR, the practical implementation of this video super-resolution approach, requires roughly twelve gigabytes of VRAM for a standard 720p to 1080p conversion at thirty frames per second with default settings. That's manageable on a consumer card like the RTX 4080 or 4090, which have 16 and 24 gigs respectively. But if you're running full bidirectional propagation with a deep network at 4x scaling on longer content, the VRAM demands balloon. On an A100 with forty gigabytes of VRAM, you can run reasonably large batch sizes. On a consumer card, you're tiling, meaning processing the image in chunks, or severely reducing batch size, both of which affect quality or runtime.
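Tiling is mostly bookkeeping: cut the frame into overlapping boxes, upscale each one, and blend the seams. Here is a minimal sketch of the box layout only. The tile size and overlap are arbitrary example values, not recommendations from any particular tool.

```python
def tile_boxes(width, height, tile=512, overlap=32):
    """Return (left, top, right, bottom) boxes that cover the image with overlap."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap
    boxes = []
    top = 0
    while True:
        bottom = min(top + tile, height)
        left = 0
        while True:
            right = min(left + tile, width)
            boxes.append((left, top, right, bottom))
            if right == width:
                break
            left += step
        if bottom == height:
            break
        top += step
    return boxes

boxes = tile_boxes(1280, 720)
print(len(boxes))  # 6 tiles for a 1280x720 frame
```

The overlap exists so that each tile's edges, where the model has the least context, can be discarded or blended. Smaller tiles save memory but mean more seams and more total work.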
What's the runtime on a practical workflow? Say I have a ten-minute CEO Zoom keynote shot at low bitrate 720p, and I want to upscale it to 1080p for an investor sizzle reel.
Let me give you real numbers here. With Topaz Video AI's Chronos model — version four, released March twenty twenty-five — they claim 4x upscaling with temporal consistency in under two minutes per minute of footage on an RTX 4090. So for ten minutes of content, that's about twenty minutes of processing. The result will have good visual coherence, minimal flickering, and acceptable noise handling, because Topaz has put a lot of work into their proprietary temporal model and inference optimizations. If you were building a DIY pipeline — Real-ESRGAN per frame followed by a temporal smoothing post-process, maybe with RIFE frame interpolation in the mix — you'd be looking at an order of magnitude slower and more artifact-prone results, but you'd pay nothing in licensing fees.
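The arithmetic behind those estimates is simple enough to keep in a helper. The rate plugged in below is the two-minutes-per-minute figure quoted above, used as an upper bound; the tenfold multiplier for the DIY route is just the "order of magnitude" from the conversation, not a measurement.

```python
def processing_minutes(footage_minutes, minutes_per_footage_minute):
    """Wall-clock estimate from a throughput figure."""
    return footage_minutes * minutes_per_footage_minute

def frame_count(footage_minutes, fps=30):
    """Number of frames a per-frame pipeline has to touch."""
    return int(footage_minutes * 60 * fps)

commercial = processing_minutes(10, 2.0)
diy = commercial * 10  # "an order of magnitude slower"

print(commercial, "minutes at the quoted rate")
print(diy, "minutes for the DIY route, very roughly")
print(frame_count(10), "frames in ten minutes at 30 fps")
```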
RIFE being — remind me?
Real-time Intermediate Flow Estimation. It's a model for frame interpolation — creating in-between frames to increase framerate or smooth motion. In video upscaling pipelines, you sometimes interpolate first to create more temporal data, then upscale those denser frames, specifically to reduce the amount the model has to hallucinate between frames.
That brings us to the commercial landscape and this gap Daniel mentioned between local and cloud. Topaz, Magnific, and the open-source stuff — what's the actual state of play for someone who doesn't have a dedicated GPU workstation?
The landscape right now is fragmented. Topaz Video AI is the go-to for local inference if you have the GPU — it runs on Windows and Mac, it's been heavily optimized for consumer hardware, and the v4 Chronos model I mentioned brought substantial improvements in temporal coherence. It's a one-time purchase model, about three hundred dollars for the software, and you can run unlimited processing on your own hardware. The catch is that it's a black box — you get what their curated models produce, and the training data composition and architecture decisions aren't user-configurable the way they are with open-source models.
On the open-source side?
No polished all-in-one video upscaling tool yet. It's all DIY pipelines — you need to be comfortable with command-line tools, Python scripts, and some understanding of how the models interface. For a technical user, this is totally viable and the quality ceiling can equal or exceed commercial tools. A twenty twenty-four study found that human raters preferred Real-ESRGAN output over bicubic upscaling 87 percent of the time in blind A-B tests. And with careful hyperparameter tuning and chaining the right models, Real-ESRGAN plus some kind of temporal smoothing can match or exceed Topaz for certain content.
The tradeoff is time and expertise.
Time, expertise, and iteration speed. If I'm a marketer who needs to clean up six clips for a social media campaign by end of day, the commercial route pays for itself. If I'm building a pipeline that's going to process user-generated content at scale, investing in the open-source stack and automated quality controls starts to make more sense.
Then Magnific AI, which I see people post about on social media, is purely cloud-based?
Cloud-based, subscription model, higher quality than Topaz for pure image upscaling with substantial creative control. Magnific lets you adjust "creativity" versus "fidelity" sliders that control how aggressively the model hallucinates detail. But it's not really optimized for video — it's an image upscaler, and while you could theoretically batch your video frames through their API, it's not designed for temporal consistency, and the round-trip times and costs would be brutal for anything longer than a few seconds.
Cloud services are currently the weird middle child: accessible without hardware, great image quality, but not properly organized around video. Which feeds back to your earlier point about knowing when not to upscale. What's the failure pattern if you don't respect this?
There's one I see all the time that I call the "oil painting effect," and it's especially brutal with low-light video. Low-light footage means high sensor noise. Sensor noise looks like texture to these models — it's high-frequency variation that could be legitimate detail. So the upscaler goes to work enhancing all that noise, sharpening the random speckle into something that looks like deliberate texture. The result is a video where the person's face looks like it's rendered in pointillist brushstrokes, swimming and shimmering as the noise pattern changes frame to frame. If you can't get decent denoising before upscaling, which itself is a lossy process that softens real detail, you're in a trap.
Noise amplification seems like one part of a broader rule: know exactly what you're feeding the model. If your source was heavily compressed by the platform — your Zoom call was bitrate-starved, your conference footage went through WhatsApp's compression — aren't you building on quicksand?
Deeply so, and the JPEG artifact upscaling situation is a perfect illustration. Heavy JPEG compression produces characteristic 8-by-8-pixel blocking artifacts — those faint grid patterns, the edge ringing, the flat blocks where subtle gradients used to be. If these artifacts are firmly baked into your source before upscaling, the model either treats the block boundaries as real edges and enhances them into hard lines, or it hallucinates through them, creating plausibly detailed regions that nevertheless don't quite correspond to the underlying scene because the underlying scene was degraded years ago.
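A quick way to check whether a source has baked-in blocking is to compare pixel jumps at 8-pixel boundaries with jumps everywhere else. This is a rough heuristic sketch that only looks at horizontal jumps, and the two test images are synthetic; it is not a standard blockiness metric.

```python
def blockiness(img, block=8):
    """Ratio of average horizontal jump at block borders to jumps elsewhere."""
    border, interior = [], []
    for row in img:
        for x in range(1, len(row)):
            jump = abs(row[x] - row[x - 1])
            (border if x % block == 0 else interior).append(jump)
    interior_mean = sum(interior) / len(interior)
    border_mean = sum(border) / len(border)
    return border_mean / (interior_mean + 1e-9)

# A smooth horizontal gradient...
smooth = [[float(x) for x in range(32)] for _ in range(8)]

# ...and the same gradient mostly flattened inside 8-pixel-wide blocks.
blocky = [[(x // 8) * 8 + 3.5 + 0.1 * ((x % 8) - 3.5) for x in range(32)]
          for _ in range(8)]

print(blockiness(smooth))
print(blockiness(blocky))
```

A ratio near one suggests no grid; a ratio well above one means the block boundaries stand out, and an upscaler is likely to treat them as real edges.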
Which is why the first rule Daniel's prompt points toward is: always start with the highest quality source you have. Raid the backup drive. Open the raw file. Beg the videographer for the original export. Because nothing recovers detail that the first compression pass already threw away.
This is the single most important practical advice I'd give. I've seen content teams lose so much time upscaling conference footage shot on phones that was uploaded to some social platform, downloaded, re-encoded, and shared on Slack. At that point, you're not upscaling the original conference moment, you're upscaling a fifth-generation encoding of it, and the model has no idea.
Let me try to pull the pragmatic through-line here, because I think our marketing listener is really asking: give me a workflow. Here's the image I scraped from an archived version of our site that's 600 pixels across and slightly soft — the original no longer exists. Am I dead?
Definitely not dead, but you need to work systematically. For images, start with Real-ESRGAN if you want a free, open-source baseline that handles real-world degradation well. Feed it the best source version you can find. Upscale at 2x first, not 4x. This "two-pass rule" is crucial. At 2x, three out of every four output pixels are invented; going directly to 4x pushes that to fifteen out of sixteen, about ninety-four percent of the pixel grid in one shot, which massively strains hallucination plausibility. At 2x, the leap is smaller and the model makes fewer major structural errors. Then open that 2x intermediate, optionally refine edges or noise, and run a second 2x upscale.
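For reference, the share of output pixels that have no one-to-one counterpart in the source follows directly from the scale factor: it is 1 - 1/s² for a scale of s. A small sketch, using the 600-pixel-wide example with an assumed 450-pixel height:

```python
def invented_fraction(scale):
    """Share of output pixels with no one-to-one source pixel at a given scale."""
    return 1 - 1 / (scale * scale)

def two_pass_sizes(width, height):
    """Intermediate and final dimensions for two successive 2x passes."""
    mid = (width * 2, height * 2)
    final = (mid[0] * 2, mid[1] * 2)
    return mid, final

for s in (2, 4):
    print(f"{s}x: {invented_fraction(s):.2%}")  # 2x: 75.00%, 4x: 93.75%

print(two_pass_sizes(600, 450))
```

The two-pass route ends at the same final size as a single 4x pass. What changes is that each pass makes a smaller leap, with a checkpoint in between where you can inspect and correct.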
It's like proofreading — easier to catch and correct errors at each generation than to fix the final draft of something that was composed at full speed.
If faces are important in the image — executives, user-generated content, historical photos of your founders — run GFPGAN or CodeFormer on the 2x intermediate, before faces are fully committed. Then if you need 4x, run the second upscaler on the face-enhanced intermediate. This gives the final model a better version of the face to work with than it would have gotten at the heavily degraded original resolution.
For video, though, this multistep thing is a bit mad. You're adding much more compute for each pass.
Which is why for video I nearly always recommend testing cloud or commercial solutions unless you have the dedicated GPU, the time, and the risk tolerance. But if you're committed to a DIY route, do a single upscale pass with highly aggressive temporal smoothing, and test your output on someone who's never seen the original. Fresh eyes often catch the weirdness that your own familiarity with the footage will miss.
Then you need to evaluate whether any of this is good enough. Tell me about the metrics, because half the time the upscaler reports some score that's good, but real humans think it looks like something from an early deepfake experiment.
The standard full-reference metrics are PSNR, SSIM, and LPIPS: Peak Signal-to-Noise Ratio, the Structural Similarity Index Measure, and Learned Perceptual Image Patch Similarity. PSNR measures absolute pixel error; a higher number means less numerical deviation from ground truth. SSIM tries to measure structural degradation in a way that tracks human visual perception, comparing luminance, contrast, and structure as separate terms. The gold standard right now is LPIPS, which compares the two images in the feature space of a deep network, calibrated against human perceptual judgments, and there a lower score means the images look more alike. It's not counting wrong pixels, it's asking a learned model: "Would a human say these two images look different in a meaningful way?"
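PSNR is short enough to write out in full. This sketch uses flat lists of 8-bit values; the second helper converts a decibel gain into the corresponding drop in mean squared error, which helps when reading benchmark tables. The sample data is invented.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in decibels for two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)

def mse_reduction_factor(db_gain):
    """How many times smaller the mean squared error is for a given dB gain."""
    return 10 ** (db_gain / 10)

reference = [50.0, 100.0, 150.0, 200.0]
off_by_five = [v + 5.0 for v in reference]

print(round(psnr(reference, off_by_five), 2))
print(round(mse_reduction_factor(7.3), 2))
```

Because the scale is logarithmic, a gain of a few decibels corresponds to a multiple-fold reduction in squared pixel error, which is why modest-looking numbers matter.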
When researchers report a seven point three decibel PSNR improvement over bicubic, that's meaningful objectively, but does it correlate with what non-technical reviewers prefer?
Here's where it gets awkward, for exactly the reason you raise, and it gets at something I think Daniel's prompt is intuiting about where theory collides with the real world: PSNR often penalizes the precise kind of texture hallucination that humans actually like. A GAN-based upscaler might regrow convincing grass texture on a degraded lawn photo, creating grass blades that don't correspond to the original arrangement. PSNR will score that harshly, because pixel-for-pixel, it's "wrong" relative to the original high-res ground truth. But humans overwhelmingly prefer the hallucinated grass. So subjective evaluation, showing samples to target audiences, remains genuinely important in practice for marketing work, not just because "metrics can't capture artistry" but because the leading objective metric directly penalizes plausible-looking outputs.
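The grass example has a tidy one-dimensional analogue. In this sketch the "texture" is an alternating pattern, the "hallucination" is the same pattern one pixel out of phase, and the "blur" is flat gray. The data is contrived to make the effect stark.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in decibels for two equal-length pixel lists."""
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf
    return 10 * math.log10(peak ** 2 / mse)

truth = [0.0, 255.0] * 8    # fine alternating texture
shifted = [255.0, 0.0] * 8  # same texture, one pixel out of phase
flat = [127.5] * 16         # texture averaged away into a gray blur

print(psnr(truth, shifted))
print(psnr(truth, flat))
```

The flat gray strip wins on PSNR by about six decibels even though the shifted version is the one that still looks like the original texture. That is the metric rewarding blur over plausible detail.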
Relying solely on PSNR basically guarantees your upscaled images will be technically faithful and visually blurrier than they could be. And that buries the real question. Unless the image is going to be forensically picked apart by someone who used to analyze moon landing footage, all people care about is: does the fabric weave look like the real thing?
Right, the feeling of authenticity, not the underlying pixel math, is a completely different bar. But in contract photography or large print collateral for product marketing, the threshold might be literal pixel-level credibility. Understanding which contexts penalize invented detail is the real management decision.
I'd say our listener starts with Real-ESRGAN, since there's no price barrier, and then spends a few bucks on Topaz for serious video. And in every session, stop and ask: do I trust this output sitting alongside photos in an annual report, or should it stay as a slightly blurrier thumbnail people scroll past quickly? That distinction, where the media will actually be used, heads off almost all collateral panic.
That's exactly the framework: internal presentations, archival slide decks where someone wants a visual sweep — absolutely, upscale and salvage. Hero product images on an e-commerce product detail page where a return hinges on someone seeing the correct grain of leather? Don't upsample. Treat these pipelines as salvage equipment, not instruments in the polish cabinet.
Going forward, we know NVIDIA already has native RTX video upscaling shipping for browsers, which means hardware adoption will strip away pipeline costs on the playback side alone. For now that helps what viewers see at playback much more than it helps post-production. So the message to content directors sitting on dozens of dusty QuickTime files isn't about the current tooling. It's that next year's silicon may solve a lot of this passively.
That forward look is smart. If every stream has its compressed frames upscaled on decode, archival cleanup for marketing stops being its own separate workflow and just lives inside the display hardware. It takes the pressure off these still-janky, per-clip manual repair processes.
Test a 2x pass now with open tools. Don't sink archival time into problems the hardware will soon handle on-chip. But the decision about being transparent that something was upscaled still forces a candid conversation about the cost of invented detail, and that's a heavy one to have against quarterly delivery timelines.
And now: Hilbert's daily fun fact.
Hilbert: In June nineteen fifteen, an enormous squid surfaced off Trinity Bay in Newfoundland — it measured over thirty-two feet in length and had one eye reportedly larger than a man's head. Several specimens were collected during this era after decades during which some scientists considered giant cephalopod reports little more than sailor folklore.
Staring at it from the dock long enough to out a myth — I both admire and resist the draw.
We're generally speculating about ones that may be still undiscovered, somehow there being partial proof now just makes oceanic unknown comfortable but deeper unease absolute.
None of that makes things swim any less.
Being tracked partially contradicts our proximity default; they've never been in the calibration run anyway.
For maybe what all historical eyewitness saw always looks peaceful long past edible light; Hilbert picks quality break points this round if you ask me. Alright, closing mind periscope action: many personal decade-warming. And thank you to prolific supporter and kind eccentric treasure Hilbert Flumingtop for navigating archives on full dry creative.
Real advice embedded: which tiny photograph feels today's extension given no boundary gear holds — attempt test — and recognize upscaled trust negotiation becoming equivalent-lens art moral question emerging below silence and beyond code-pixel grain that you believed just past glass.
This has been My Weird Prompts; investigate calmly what code invited your screen to commission permanence onto never-captured detail.
Rate and review on whatever terrain renders fine recollection; talk people through what actual scaring-turned-realize emerged bottom-corner brightness map during close device visual wander happen somewhere next note glance around.
We preserve soft arrival. Practice lucidity scanning the clear frame missing past surface line. Tangibility near too.
Go listen intentionally.