#2534: Can AI Generate Diagrams Without Typo Disasters?

Why AI diagram tools still mangle text labels — and what to do about it today.

Episode Details

Episode ID: MWP-2692
Published:
Duration: 34:07
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Why AI Still Can't Draw a Diagram Without Misspelling "PostgreSQL"

If you've ever tried to generate a technical architecture diagram using a text-to-image AI, you've hit the wall: the model produces something that looks gorgeous, but the labels are garbled. "Authentication Service" becomes "Authntcaton Servce." "Redis Cache" turns into "Rdis Cache." It's close enough to be frustrating, but wrong enough to be unusable in production.

This isn't a niche complaint. Technical documentation, compliance diagrams, and client-facing architecture drawings all require precise text labels. A single misspelled component name undermines trust. And yet, the current generation of diffusion models — including impressive ones like NanoBanana 2 — still struggles with character-level accuracy, especially as diagrams grow more complex.

Why Text Is Hard for Diffusion Models

Traditional text-to-image models treat text as just another visual pattern to reconstruct from noise. The result looks word-shaped, but individual characters are probabilistic guesses. NanoBanana 2 introduced a character-aware rendering pipeline that tokenizes text at the character level and conditions the diffusion process on explicit character position embeddings. It's a genuine breakthrough — achieving roughly 90% accuracy on short labels — but that last 10% is where production use breaks down.

The failure modes are predictable: long strings, unusual font sizes, text near image edges, special characters, and mixed formatting. More critically, the problem compounds. Even if each label has a 95% chance of being correct, a diagram with 20 labels has only about a 36% chance of being completely clean. In technical documentation, "mostly right" is broken.
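To make the compounding concrete, here's the arithmetic as a few lines of Python (a back-of-the-envelope sketch that assumes label errors are independent):

```python
# P(all labels correct) = p ** n, assuming independent per-label errors.
def clean_diagram_probability(per_label_accuracy: float, num_labels: int) -> float:
    return per_label_accuracy ** num_labels

print(round(clean_diagram_probability(0.95, 20), 3))   # 0.358 -- about 36%
print(round(clean_diagram_probability(0.999, 20), 3))  # 0.98  -- what production needs
```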

The Hybrid Architecture That Might Fix It

The most promising research direction comes from ETH Zurich, which has published work on "structured canvas generation." Their approach decouples diagram structure from visual rendering: first, a model generates a semantic graph of components and their relationships; then, a dedicated typesetting module renders the text deterministically, while the diffusion model handles only the visual styling.

This separation of concerns is the key insight. Visual style is inherently probabilistic — there are many valid ways to shade a box or draw an arrow. Text is discrete and exact. By letting the generative model handle creative visual choices while pinning text to a guaranteed-accurate rendering pass, you eliminate the core failure mode.
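As a minimal sketch of the idea, not ETH Zurich's implementation: generate the styled canvas however you like, then draw the exact labels with a conventional renderer. The flat placeholder canvas below stands in for a diffusion output, and in a real pipeline the label positions would come from the semantic graph's layout:

```python
# Deterministic text pass over a generatively styled canvas.
from PIL import Image, ImageDraw

labels = {  # exact strings, fixed outside the generative step
    "PostgreSQL Primary": (80, 60),
    "Redis Cache": (80, 160),
}

canvas = Image.new("RGB", (400, 240), "#f4f6f8")  # stand-in for diffusion output
draw = ImageDraw.Draw(canvas)
for text, (x, y) in labels.items():
    draw.rectangle([x - 10, y - 10, x + 190, y + 30], outline="#4a7a9d", width=2)
    draw.text((x, y), text, fill="#1f2933")  # character-exact by construction

canvas.save("diagram.png")
```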

The tradeoff is flexibility: you can't have text warping around curved arrows or embedding in 3D perspective. But for technical diagrams, that's usually the right tradeoff. Clarity beats typographic acrobatics.

What's Shipping Today

Several tools are already attacking this problem from different angles. Eraser.io uses a pipeline approach where natural language descriptions are converted into structured representations, then rendered through their own diagramming engine. The output is consistent and editable — closer to "AI writes the Mermaid code for you" than "AI draws the diagram directly."
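Concretely, the intermediate output of such a pipeline is an editable text artifact rather than pixels. An illustrative sketch (the Mermaid below is hypothetical example output; Eraser.io's internal representation is its own format):

```python
# The structured representation is plain text you can edit and re-render.
mermaid_source = """\
flowchart LR
    lb[Nginx Load Balancer] --> auth[Auth Service]
    auth --> pg[(PG Primary)]
    auth --> cache[(Redis Cache)]
"""

with open("architecture.mmd", "w") as f:
    f.write(mermaid_source)  # render with: mmdc -i architecture.mmd -o base.png
```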

A newer player called Diagramly has built a custom model fine-tuned exclusively on technical diagrams, with a loss function that heavily penalizes text errors. They trained on a dataset of 200,000 verified diagrams, scoring outputs against an OCR-based text-similarity metric during training — producing a model that's much more text-faithful than anything general-purpose, at the cost of being useless for anything else. It's currently in beta.

Practical Prompting for Better Labels Today

For anyone trying to get work done with existing models, there are emerging best practices. The most effective technique is to explicitly separate text specifications from visual specifications in your prompt. Place exact text strings in a dedicated section with emphasis on character-level accuracy, rather than interleaving them with visual instructions.
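In practice that looks something like the template below. The section headers and wording are community conventions that seem to help, not a documented API:

```python
# Text spec and visual spec kept in separate, clearly marked sections.
prompt = """\
VISUAL STYLE: clean technical diagram, soft shadows, rounded rectangles
for services, cylinders for databases, muted blue and teal palette.

LAYOUT: load balancer on the left, services in the center, data stores right.

EXACT TEXT LABELS (render character-for-character, no substitutions):
1. "Nginx"
2. "Auth Service"
3. "PG Primary"
4. "Redis Cache"
Anything longer belongs in a numbered legend, not inside a box.
"""
```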

Keep individual labels under 15 characters where possible — "PG Primary" instead of "PostgreSQL Primary Database Server." For longer annotations, use numbered references that point to a legend or footnote elsewhere in the diagram. Work with the model's limitations rather than fighting them.

The Broader Lesson

This problem points to a larger shift happening in AI: the move away from one-model-to-rule-them-all generality toward specialized tools that do one thing reliably. A diagram generator doesn't need to also produce photorealistic sunsets. It needs to never, ever misspell "Kubernetes." That focus — and the architectural insights emerging from it — may ultimately produce tools that are far more useful for technical work than any general-purpose image generator could be.


Transcript

Corn
Daniel sent us this one — he's been deep in the weeds on technical diagramming, and he's hitting a frustration I think a lot of people recognize. You've got Mermaid and traditional tools that are reliable but visually boring, and then you've got generative text-to-image models that can produce something striking, but they keep hallucinating the text. His question is whether there's a model actually built for this hybrid thing — stylistic diagramming where the annotations have to be rock solid, not pseudo-text gibberish. And if that doesn't exist yet, what's the best path through the SaaS landscape or prompt engineering to get there.
Herman
This is one of those problems where the gap between what's technically impressive and what's actually usable in production is still a chasm. And Daniel's right to zero in on text reliability as the bottleneck. I've been tracking this closely because it touches so many adjacent fields — documentation pipelines, architecture decision records, even regulatory compliance where a mislabeled component in a diagram could be genuinely problematic.
Corn
Before we go further — quick note, today's script is coming from DeepSeek V4 Pro. So if anything sounds especially articulate today, that's why.
Herman
I'll try not to be intimidated.
Corn
You'll manage. So let's start with the state of play. Daniel mentioned NanoBanana 2 as the first model where he's seen text rendering approach production reliability. What's actually happening under the hood there?
Herman
NanoBanana 2 — and I should be precise here, since this model family has moved fast — represents a genuine shift in how text is handled in diffusion-based image generation. Traditional text-to-image models treat text as just another visual pattern to reconstruct from noise. You get something that looks word-shaped, but the individual characters are probabilistic guesses. NanoBanana 2 introduced what they call a character-aware rendering pipeline, where text strings are tokenized at the character level and the diffusion process is conditioned on explicit character position embeddings. It's not perfect, but it's the first time I've seen a model where you can say "label this component 'PostgreSQL Primary' and this one 'Redis Cache'" and actually get those exact strings back maybe ninety percent of the time.
Corn
Ninety percent is interesting because Daniel's experience is that it's almost there but not quite. Where does that last ten percent break down?
Herman
Long strings, unusual font sizes, text near the edges of the image, and anything with special characters or mixed formatting. If you ask for "PostgreSQL 16.3" you're more likely to get a garbled version number than if you just ask for "PostgreSQL." The character-aware pipeline has a practical limit around maybe thirty to forty characters per label before the attention mechanism starts to lose fidelity. And if you have dense annotations — like a complex microservices diagram with fifteen labeled components — the probability that at least one label is wrong climbs pretty quickly.
Corn
It's a compounding probability problem. Even if each individual label has a ninety-five percent chance of being correct, with twenty labels you're down to about a thirty-six percent chance of a completely clean diagram.
Herman
And that's the production problem. In technical documentation, a single wrong label isn't "mostly right" — it's broken. You can't ship an architecture diagram to a client where the authentication service is labeled "Authntcaton Servce" because the model dropped a few characters.
Corn
Which brings us to Daniel's core question — is there a model purpose-built for this? Not a general image generator pressed into service, but something that treats diagramming as a first-class task.
Herman
The short answer is no, not yet — at least not as a standalone model you can download or call via API the way you would NanoBanana or DALL-E or Midjourney. But the longer answer is more interesting. There are several research efforts and early-stage products that are attacking this from different angles. The most promising direction I've seen comes from a group at ETH Zurich that published work on what they call "structured canvas generation" — essentially a hybrid architecture where the model first generates a semantic graph of the diagram components and their relationships, then renders that graph into a visual layout, with text handled by a separate dedicated typesetting module rather than by the diffusion process itself.
Corn
Decoupling the structure from the rendering.
Herman
And that's the key insight. When you ask a diffusion model to generate both the visual style and the precise text, you're asking it to do two fundamentally different things at once. Visual style is inherently fuzzy and probabilistic — there are many valid ways to draw an arrow or shade a box. Text is discrete and exact — there's exactly one correct spelling of "Kubernetes." The ETH approach says, let the generative model handle the visual creativity, but pin the text to a deterministic rendering pass that's guaranteed to be character-accurate.
Corn
That sounds like it would solve the reliability problem, but at the cost of flexibility. If the text is rendered separately, you can't have it warp around a curved arrow or embed naturally in a three-dimensional perspective.
Herman
That's the tradeoff. And for technical diagrams, I'd argue it's usually the right tradeoff. Most architecture diagrams don't need perspective-warped text. They need clear, readable labels in consistent fonts. The visual interest comes from layout, color, iconography — not from typographic acrobatics.
Corn
Daniel also asked about the SaaS landscape. What's out there that's actually shipping?
Herman
There are a few players worth knowing about. Eraser.io has been making moves in this space — they started as a collaborative whiteboard tool but have built out an AI diagram generation feature that uses a pipeline approach. You describe what you want in natural language, it generates a structured representation, and then renders that using their own diagramming engine. The output looks consistent because it's not raw pixel generation — it's more like programmatic diagram creation with an AI planner on the front end.
Corn
It's closer to "AI writes the Mermaid code for you" than "AI draws the diagram directly."
Herman
And that approach has the huge advantage that the output is editable. If the AI gets something wrong, you're not stuck with a raster image you have to manually fix in Photoshop. You tweak the underlying representation and re-render. Another player is Excalidraw with their AI features — similar philosophy, different aesthetic. Then there's a newer company called Diagramly that's specifically targeting this "stylistic diagramming" niche Daniel described. They've built a custom model trained exclusively on technical diagrams, architecture drawings, and flowcharts, with a loss function that heavily penalizes text errors.
Corn
A custom model trained from scratch or fine-tuned from something larger?
Herman
Fine-tuned from a base diffusion model, but with a surprisingly aggressive training regimen. They curated a dataset of something like two hundred thousand technical diagrams with verified text annotations, and during training they used a specialized text-similarity metric as part of the loss function. If the model generates an image where the OCR extraction of the text doesn't match the prompt, it gets penalized heavily — way more than for visual style deviations. The result is a model that's much more text-faithful than anything general-purpose, but at the cost of being less visually flexible. It knows flowcharts, architecture diagrams, sequence diagrams, network topologies — and not much else.
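A sketch of what such an OCR-weighted loss term can look like. Diagramly's actual loss function isn't public, so this is only the general shape of the idea, with a plain character error rate standing in for their metric and `ocr_labels` assumed to come from an OCR pass over the generated image:

```python
# Character error rate as a text-fidelity penalty, weighted heavily
# relative to visual-style loss. Illustrative only.

def char_error_rate(expected: str, actual: str) -> float:
    """Levenshtein distance normalized by the expected label's length."""
    m, n = len(expected), len(actual)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1,
                        prev + (expected[i - 1] != actual[j - 1]))
            prev = cur
    return dp[n] / max(m, 1)

def total_loss(visual_loss: float, expected_labels, ocr_labels,
               text_weight: float = 10.0) -> float:
    text_penalty = sum(char_error_rate(e, a)
                       for e, a in zip(expected_labels, ocr_labels))
    return visual_loss + text_weight * text_penalty  # text errors dominate
```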
Corn
That's exactly the kind of task-specific tooling Daniel was advocating for. Is Diagramly publicly available?
Herman
It's in beta. You can request access, but they're being selective about it. The founder wrote a blog post a few months back that I thought was refreshingly honest — they said they're not trying to build a general-purpose image generator, they're trying to build the best technical diagram generator, and they're okay with being useless for anything else.
Corn
There's something almost old-fashioned about that approach that I find appealing. The whole industry has been chasing generality for the past few years — one model to rule them all — and there's a counter-current forming around specialized models that do one thing reliably.
Herman
It makes economic sense too. If you're generating diagrams for production documentation, you don't need a model that can also generate photorealistic sunsets or anime characters. You need a model that never, ever misspells "PostgreSQL." The generality is overhead you're paying for in compute cost and unreliability.
Corn
Let's talk about the path forward for someone like Daniel who's trying to get work done today, not waiting for betas to open up. He mentioned that even with NanoBanana 2, he finds edge cases where text breaks down. What does the prompting craft look like for maximizing text reliability?
Herman
There's an emerging set of practices that the power users have been developing. I've been collecting these from various forums and from my own experimentation. The first and most impactful thing is to explicitly separate your text specification from your visual specification in the prompt. Don't say "draw a diagram of a microservices architecture with a load balancer labeled Nginx." Say something like — here's the visual style I want, here's the layout I want, and here are the exact text strings that must appear, in a separate section of the prompt, with explicit emphasis on character-level accuracy.
Corn
You're essentially doing manually what the ETH approach does architecturally — decoupling the concerns.
Herman
And it works because you're helping the model's attention mechanism. When text requirements are interleaved with visual requirements, the model tends to blend them — it treats the text as part of the visual composition rather than as a precise constraint. By separating them, you're signaling that these are different kinds of instructions.
If you can keep every text string under about fifteen characters, your reliability goes way up. That might mean using abbreviations or acronyms where appropriate — "PG Primary" instead of "PostgreSQL Primary Database Server." It's a constraint on your diagram design, but it's one that often leads to cleaner diagrams anyway.
Corn
For cases where you need longer annotations?
Herman
Instead of trying to fit "Authentication Service with OAuth 2.0 and JWT Validation" into a single box, you label the box "Auth Service" and put a numbered reference that points to a legend or footnote elsewhere in the diagram. The model handles short labels much more reliably, and you can generate the legend separately or add it in post-processing.
Corn
That's clever — working with the model's limitations rather than fighting them. What about the visual language side of things? Daniel mentioned wanting something that sits between traditional Mermaid-style diagramming and full creative image generation.
Herman
This is where I think the most interesting prompting innovation is happening. There's a style that some practitioners are calling "augmented schematic" — it uses the structural clarity of traditional diagramming but adds visual depth through shading, subtle gradients, icon-like illustrations for components, and more organic connection lines. The key is that the visual embellishment is decorative rather than informational. The text and the topology carry the meaning; the styling makes it pleasant to look at.
Corn
You're not asking the model to be creative with the information architecture, just with the skin.
Herman
And that's a much more tractable problem for current models. You can get quite specific — "render this as a clean technical diagram with soft shadows, rounded rectangles for services, cylinders for databases, directional arrows for data flow, using a muted blue and teal color palette on a light gray background." The model can execute on that visual direction while the text constraints are handled more carefully.
Corn
I want to dig into something Daniel mentioned in passing — the idea that by the time you fix up a generated diagram manually, you might as well have just used Mermaid. Is that actually true, or is there a middle ground where AI-assisted diagramming saves net time even with some manual cleanup?
Herman
I think it depends heavily on the type of diagram. For a standard microservices architecture diagram with a dozen components and straightforward relationships, Mermaid or PlantUML is probably still faster and more reliable. You write the code, you get the diagram, it's perfectly accurate, and it takes maybe fifteen minutes. Using a generative model for that is overkill and introduces reliability risk for no real benefit.
Corn
When does the generative approach actually win?
Herman
First, when you need visual style that goes beyond what Mermaid can produce — if you're creating diagrams for a pitch deck, a conference presentation, or customer-facing documentation where aesthetics actually matter. Mermaid diagrams look like Mermaid diagrams, and everyone in tech recognizes them. Sometimes you want something that looks custom. Second, when you're diagramming something that doesn't fit neatly into the boxes-and-arrows paradigm — maybe you're illustrating a data pipeline that involves physical infrastructure, or a deployment topology that spans cloud and on-premise, or a system interaction that benefits from a more spatial or geographic representation.
Corn
The spatial representation point is interesting. Mermaid forces you into a particular abstraction — nodes and edges on a flat plane. But some systems are better understood spatially. A diagram of a CDN might benefit from showing geographic regions with actual relative positions.
Herman
Right, and that's where generative models shine. They can produce a world map with data centers marked, or a three-dimensional-ish representation of a network topology, or a layered diagram where depth conveys information. Those are things that are hard to do in Mermaid or even in traditional diagramming tools like Lucidchart. And the text requirements for those kinds of diagrams are often simpler — shorter labels, fewer of them — so the reliability problem is less acute.
Corn
Let's talk about the image-to-image angle Daniel mentioned. He asked specifically about image-to-image models for diagramming. What's the use case there?
Herman
Image-to-image is underrated for this workflow. The idea is you start with a rough sketch — could be a hand-drawn diagram, could be a Mermaid export, could be something you threw together in Excalidraw — and you use an image-to-image model to stylize it while preserving the structure. The advantage is that the text is already correct in your source image, and the model is being asked to enhance the visuals without changing the content.
Corn
Does that actually work in practice? My experience with image-to-image is that it can be unpredictable about what it preserves and what it "improves."
Herman
It's model-dependent. NanoBanana 2 in image-to-image mode with a low denoising strength — say 0.3 to 0.5 — is reasonably good at preserving text while upgrading the visual quality. The key parameter is the denoising strength. Too high and the model starts hallucinating new text over your correct labels. Too low and the visual improvement is negligible. Finding the sweet spot takes experimentation, but once you find it for a particular style of diagram, it tends to be fairly consistent.
Corn
A workflow might be: sketch in Mermaid to get the structure and text right, export as an image, then run through image-to-image with a style prompt and low denoising strength to get something visually polished.
Herman
That's exactly the workflow I'd recommend for someone who needs production reliability today. It's not the most elegant pipeline, but it's reliable, and the time investment is reasonable once you've dialed in your parameters. You're looking at maybe five minutes of Mermaid coding, thirty seconds of generation, and maybe two minutes of spot-checking the output. Compare that to an hour of fighting with a pure text-to-image approach or forty minutes of manual diagramming in a visual tool.
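Written out as code, that workflow might look like the sketch below. NanoBanana 2 has no public API to cite, so the styling step uses the open-source diffusers image-to-image pipeline as a stand-in for whatever model you use; mermaid-cli's `mmdc` is a real tool, and the file names are placeholders:

```python
import subprocess
from PIL import Image
from diffusers import AutoPipelineForImage2Image

# 1. Structure and exact text, deterministically, via mermaid-cli.
subprocess.run(["mmdc", "-i", "architecture.mmd", "-o", "base.png"], check=True)

# 2. Low-strength stylization: enough to restyle, not enough to repaint text.
pipe = AutoPipelineForImage2Image.from_pretrained("stabilityai/sdxl-turbo")
styled = pipe(
    prompt="clean technical diagram, soft shadows, muted blue and teal palette",
    image=Image.open("base.png").convert("RGB"),
    strength=0.4,           # the ~0.3-0.5 sweet spot discussed above
    num_inference_steps=4,  # sdxl-turbo convention
    guidance_scale=0.0,
).images[0]
styled.save("styled.png")
```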
Corn
I want to circle back to something you mentioned earlier about Diagramly and specialized models. Is there a world where we get an open-source model specifically for diagram generation? Something the community could fine-tune on their own diagram styles?
Herman
I think it's inevitable, and I'd actually be surprised if we don't see something in the next six to twelve months. The training data exists — there are millions of technical diagrams in public documentation, arXiv papers, engineering blogs. The annotation challenge is real — you need accurate text transcripts of the labels in those diagrams — but OCR has gotten good enough that you could bootstrap a dataset. The bigger question is whether there's enough demand to sustain an open-source effort versus everyone just waiting for the commercial offerings to mature.
Corn
The Mermaid community is pretty active. I could see someone in that ecosystem building a fine-tune that takes Mermaid code as conditioning and produces styled output directly.
Herman
That would be elegant. And Mermaid code is already a structured representation — it's essentially a domain-specific language for diagrams. Using it as the conditioning signal rather than natural language would give you much tighter control. You'd write your Mermaid as usual, and the model would render it with whatever visual style you specify. Best of both worlds — the reliability of code-based diagramming with the visual quality of generative rendering.
Corn
There's something almost poetic about Mermaid code becoming the "source of truth" that generative models decorate. It preserves the key property that made Mermaid popular in the first place — it's version-controllable, diffable, reviewable plain text. You can put it in a git repo and track changes over time. The generated image is an artifact, not the source.
Herman
That's a crucial point for production use. If you're generating diagrams directly from natural language prompts, you lose reproducibility. The same prompt run twice might give you two different layouts, two different visual styles. For documentation that needs to be consistent and auditable, that's a real problem. The code-as-source approach solves it neatly.
Corn
Daniel also asked about what to do if you're committed to prompting NanoBanana directly and want to maximize your chances of getting clean text. You mentioned separating text and visual specs, keeping labels short, and using numbered callouts.
Herman
A few more tactical things. One is to use what I've seen called "text anchoring" — you explicitly describe where each text label should appear relative to visual elements. "The word Database appears centered inside the cylinder shape. The word Cache appears above the rounded rectangle on the right." This helps the model's spatial attention focus text generation on specific regions.
Corn
Does that actually improve accuracy or just placement?
Herman
Both, in my testing. When the model has a clear spatial target for a text string, it seems to allocate more attention to getting that string right in that specific location. Without spatial anchoring, text can drift or get duplicated across the image.
Corn
What about the order of labels in the prompt?
Herman
There's some evidence that the order matters — labels mentioned earlier in the prompt get more reliable rendering than labels mentioned later. It's a recency and primacy effect in the attention mechanism. If you have one or two labels that are absolutely critical, put them first. If you have labels that would be nice to have but aren't essential, put them last and be prepared to fix them manually if needed.
Corn
It's a bit like the old advice about putting your most important points at the beginning and end of a presentation. The middle gets lost.
Herman
The analogy holds surprisingly well. And one more thing — repetition helps. If you mention a critical label twice in the prompt, once in the text specification section and once in the visual description, you get a modest reliability boost. Not dramatic, but measurable. Something like a five to ten percent reduction in character error rate.
Corn
Five to ten percent is meaningful when you're trying to push from ninety percent reliable to ninety-nine percent. What about post-processing? You mentioned OCR as a verification step.
Herman
That's becoming a standard part of serious diagramming workflows. You generate the diagram, run OCR on the output to extract all the text, and compare it to what you expected. If there's a mismatch, you flag it for manual correction or regenerate. Some people are building this into CI/CD pipelines — automatically validate that generated diagrams have correct text before they get published.
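A minimal version of that gate, assuming pytesseract with a local Tesseract install (the file name and labels are placeholders):

```python
import sys
import pytesseract
from PIL import Image

EXPECTED_LABELS = ["Nginx", "Auth Service", "PG Primary", "Redis Cache"]

# Extract all text from the generated diagram and check every expected label.
extracted = pytesseract.image_to_string(Image.open("styled.png"))
missing = [label for label in EXPECTED_LABELS if label not in extracted]

if missing:
    print(f"Diagram label check failed, missing or garbled: {missing}")
    sys.exit(1)  # block publication; regenerate or fix manually
print("All diagram labels verified.")
```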
Corn
That's the kind of thing that sounds like overengineering until you've shipped a documentation page with a diagram that says "Kubemetes" and someone screenshots it and it ends up on Hacker News.
Herman
The internet is unforgiving. And honestly, in regulated industries or for compliance documentation, a mislabeled diagram could have actual consequences beyond embarrassment. If you're submitting an architecture diagram as part of a SOC 2 audit and it has hallucinated component names, that's not great.
Corn
We've covered the current state, the specialized models on the horizon, the prompting craft, and the workflow strategies. Let's zoom out for a minute. Why is this problem — reliable text in generated images — so stubborn? We've solved convincingly photorealistic faces, complex scene composition, consistent lighting. But "spell this word correctly" remains hard.
Herman
It comes down to the fundamental architecture of diffusion models. These models learn to denoise images by understanding visual patterns at multiple scales. Text, at the pixel level, is just a particular arrangement of edges and curves. The model doesn't know that the arrangement of pixels that spells "database" is semantically different from an arrangement that spells "databese" — they're both plausible-looking patterns of edges and curves. The model has no intrinsic concept of a character, let alone a word. It's all just pixels.
Corn
Whereas a human, even a child, understands that letters are discrete symbols that must be reproduced exactly.
Herman
And that's why the character-aware approaches in NanoBanana 2 and the ETH work are important — they're essentially injecting a symbolic understanding of text into a system that otherwise operates entirely in continuous pixel space. It's a hybrid approach that acknowledges that some things are better represented discretely.
Corn
This connects to a broader theme in AI that I find fascinating — the tension between continuous and discrete representations. Language models work with discrete tokens and they handle text perfectly but struggle with visual consistency. Image models work with continuous pixels and handle visuals beautifully but struggle with text. The hybrid architectures are where the interesting progress is happening.
Herman
Diagrams are the perfect stress test for hybrid approaches because they demand both. You need the visual quality of a generative model and the text precision of a deterministic renderer. It's a microcosm of the larger challenge in making AI systems that are both creative and reliable.
Corn
Daniel mentioned that he's always advocating for task-specific tooling. I think his instinct is right here, and the market seems to be validating it. The general-purpose models will keep improving their text rendering, but there's a ceiling on how good they can get without fundamentally rethinking the architecture. The specialized tools — whether that's Diagramly's fine-tuned model or Eraser's pipeline approach or a future Mermaid-to-styled-diagram renderer — are attacking the problem from the right angle.
Herman
I'd add that even if the general-purpose models do eventually solve text reliability perfectly, there will still be a place for specialized diagramming tools. A diagram is not just an image with text on it. It's a structured representation of information. The ability to edit the structure, to change a database to a cache and have the diagram update accordingly, to generate multiple views of the same underlying model — those are things that a pure image generator can't do. They require the tool to understand the semantics of the diagram, not just its visual appearance.
Corn
A generated image is a dead end. A diagram with an underlying model is a living document. For any documentation that's going to be maintained over time, you want the latter.
Herman
This is where I think the Mermaid ecosystem has actually been underappreciated. The fact that you can version-control your diagrams, review changes in a pull request, and regenerate them automatically as part of a build pipeline — that's valuable engineering practice. The visual styling layer should be additive to that workflow, not a replacement for it.
Corn
If you're Daniel, or someone in Daniel's position, the practical advice is something like: keep your Mermaid or PlantUML as the source of truth, use generative tools as a styling pass rather than a creation pass, and keep an eye on the specialized tools emerging in the next year or so. Does that capture it?
Herman
I think so. And be strategic about when you reach for the generative tools. Not every diagram needs to be beautiful. Internal documentation, quick sketches for colleagues, anything that's going to be thrown away in a sprint — Mermaid is fine. Save the generative polish for the diagrams that are going to be seen by customers, executives, or the public.
Corn
There's a discipline to that. The temptation once you have a shiny tool is to use it everywhere, and you end up spending forty minutes making a pretty diagram for a Slack message that three people will see.
Herman
Guilty as charged. I once spent an entire afternoon generating a visually stunning diagram of our podcast recording setup. It was beautiful. Corn still makes fun of me for it.
Corn
It had drop shadows on the microphone stands, Herman.
Herman
They were tasteful drop shadows.
Corn
They were unnecessary drop shadows. But I take your point. The tool should fit the use case, not the other way around.

And now: Hilbert's daily fun fact.

The national animal of Scotland is the unicorn. It has been since the twelve hundreds, when it was used on the Scottish royal coat of arms. Scotland is one of the few countries whose national animal is a mythical creature.

Corn
So where does this leave us practically? If you're listening and you're dealing with technical diagramming, here's what I'd take away.
Herman
First, don't abandon Mermaid or PlantUML. They remain the most reliable way to produce accurate, maintainable technical diagrams, and they integrate beautifully with documentation workflows. The fact that they're "boring" is a feature when you need consistency and version control.
Corn
Second, if you need visual polish, use image-to-image rather than text-to-image. Start with a correct diagram — whether from Mermaid, a sketch, or a manual tool — and use a model like NanoBanana 2 with low denoising strength to enhance the aesthetics while preserving the text.
Herman
Third, if you're going to prompt for diagrams directly, separate your text specification from your visual specification. Keep labels short. Use spatial anchoring. Put critical labels first. And always, always verify the output — ideally with automated OCR checking if this is part of a production pipeline.
Corn
Fourth, keep an eye on the specialized tools. Diagramly, Eraser, and whatever emerges from the open-source community in the next year. The general-purpose models are impressive, but task-specific tooling is where reliability lives.
Herman
Finally, be thoughtful about when you need beauty and when you need accuracy. The best diagram in the world is the one that communicates clearly. Sometimes that's a Mermaid diagram with default styling. Sometimes it's a custom-generated visual masterpiece. The craft is knowing which situation you're in.
Corn
One thing I keep coming back to — and this is more forward-looking — is that we might be in an awkward intermediate phase. In five years, I suspect this whole conversation will seem quaint. Either the models will have solved text reliability, or the tooling will have evolved to the point where the distinction between "diagramming tool" and "generative model" disappears. You'll describe what you want, get an editable, semantically aware diagram, and the question of whether it was "generated" or "rendered" will be an implementation detail.
Herman
I think that's right, and I think it'll happen through the hybrid approaches we've been discussing. Not through diffusion models getting better at text, but through systems that combine generative visuals with deterministic text rendering and semantic understanding of diagram structure. The end result will feel like magic, but under the hood it'll be a carefully engineered pipeline.
Corn
Which is, honestly, how most technology that "feels like magic" actually works.
Herman
The best magic has a lot of engineering behind it.
Corn
This has been My Weird Prompts. Thanks to our producer Hilbert Flumingtop for keeping us on track. If you enjoyed this episode, leave us a review wherever you listen — it helps other people find the show. I'm Corn.
Herman
I'm Herman Poppleberry. We'll catch you next time.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.