Daniel sent us this one — he's been prototyping a classification model for voice notes, the kind of thing that could become a productivity tool. He used his actual voice notes for the demo, but ran them through an LLM first to strip out personally identifiable information. And now he's asking two things. First, what other use cases are people finding for this kind of synthetic data generation? And second, what frameworks actually work for generating credible synthetic data from scratch — say, five hundred voice notes or calendar appointments — without any PII exposure risk. There's a lot to unpack here.
By the way — DeepSeek V four Pro is writing our script today. Which feels appropriate given we're talking about synthetic generation.
Though I'm not sure whether to be flattered or concerned that a model is writing about models generating data. Feels like we're one step away from the snake eating its own tail.
That's actually a real problem we should get into — model collapse. But let me start with the use cases, because Daniel's example is a perfect entry point. What he did — taking real voice notes, running them through an LLM to swap out names and sensitive details, then using those as seed material — that's what researchers are now calling substitution anonymization. And the numbers on how well this works are genuinely striking.
There was a paper just last month — Albanese and colleagues, March twenty twenty-six — where they tested this approach on conversational data. Used on-premise local models, so nothing ever leaves your machine. GPT-oss twenty billion parameters, DeepSeek-r1 seven billion. They achieved zero point nine nine privacy recall. That means practically every piece of identifying information got caught and replaced. But here's the part that matters for Daniel — the downstream utility was preserved almost perfectly. Q and A accuracy held at ninety-five percent. Fine-tuning performance had a mean absolute error of just zero point zero two nine.
The anonymized text was basically as useful as the original for training purposes.
Whereas pure redaction — just deleting PII and leaving blanks — that destroyed utility. Same study, redaction scored zero point nine eight on privacy but only twenty-six percent on Q and A accuracy. The fine-tuning error jumped to zero point four one seven. You're left with Swiss cheese text that can't train anything useful.
Which makes intuitive sense. If you're classifying voice notes about project deadlines and you've blanked out every name and date, the classifier has nothing to learn from. It's just learning to recognize the word "meeting" surrounded by holes.
The substitution approach replaces "meeting with Sarah Chen next Tuesday" with "meeting with David Morrison next Thursday." Same structure, same semantic relationships, completely different real-world referents. And the models doing this are getting small enough to run locally. Hugging Face published work on what they're calling Anonymizer SLMs — small language models, down to six hundred million parameters. Their Qwen3 four billion parameter model scored nine point five five out of ten on anonymization quality, comparable to GPT-four point one, but you can run it on a laptop.
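To make that concrete for listeners who code, here's a minimal sketch of the substitution pattern using the Ollama Python client. The model name and the prompt wording are our own assumptions, not the setup from the Albanese paper; swap in whatever you run locally.

```python
# Minimal substitution-anonymization sketch using the Ollama Python client.
# Assumptions: Ollama is running locally and a model (here "llama3") is pulled.
# This illustrates the pattern, not the pipeline from the paper.
import ollama

ANONYMIZE_PROMPT = """Rewrite the following voice note. Replace every name,
phone number, email address, and street address with a plausible fake value.
Keep everything else, including dates, structure, and meaning, unchanged.
Return only the rewritten note.

Voice note:
{note}"""

def anonymize(note: str, model: str = "llama3") -> str:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": ANONYMIZE_PROMPT.format(note=note)}],
    )
    return response["message"]["content"].strip()

if __name__ == "__main__":
    original = "Meeting with Sarah Chen next Tuesday about the Q3 report."
    print(anonymize(original))
    # e.g. "Meeting with David Morrison next Tuesday about the Q3 report."
```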
Daniel's approach is validated. But he asked about other use cases. What else are people doing with this?
The survey literature is actually really rich here. There was a comprehensive survey on arXiv — twenty-five zero three dot one four zero two three — that catalogued use cases across text classification tasks, and the numbers on data augmentation alone are worth knowing. They found that taking just a hundred real training samples and augmenting with a hundred synthetic samples from GPT-three point five yielded accuracy improvements of three to twenty-six percent across tasks. The cost asymmetry is staggering. Labeling three thousand sentences for sentiment analysis cost roughly two hundred twenty to three hundred dollars and took about a thousand minutes of human time. GPT-three could generate six thousand examples for about twenty-nine dollars in forty-six minutes.
What was the actual performance difference?
That's the trade-off. With six thousand synthetic examples, they hit seventy-six percent accuracy. With three thousand human-labeled examples, eighty-eight percent. So synthetic is cheaper and faster, but you leave some accuracy on the table. The question becomes whether "good enough" synthetic data beats expensive perfect data for your specific use case. For prototyping, the answer is almost always yes. Daniel's building a classification model for a productivity tool — he doesn't need production-grade accuracy at the demo stage. He needs something that credibly shows the concept works.
The data he's generating — voice notes, calendar appointments — those are exactly the kind of semi-structured things where synthetic generation shines. A voice note has a predictable shape. "Remind me to call X about Y." "Pick up Z from the store." You can template those patterns and vary the slots.
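Here's roughly what that template-and-slots idea looks like in plain Python, standard library only. Every template and filler below is invented for illustration.

```python
# Template-based voice note generation: define patterns with slots,
# then sample slot values to produce varied synthetic notes.
# All templates and fillers are illustrative placeholders.
import random

TEMPLATES = [
    "Remind me to call {person} about {topic} on {day}.",
    "Pick up {item} from the store before {day}.",
    "Schedule a meeting with {person} to review {topic}.",
]

SLOTS = {
    "person": ["David Morrison", "Priya Nair", "the contractor"],
    "topic": ["the Q3 report", "the launch plan", "budget numbers"],
    "item": ["printer paper", "the dry cleaning", "batteries"],
    "day": ["Monday", "Thursday", "next week"],
}

def generate_note(rng: random.Random) -> str:
    template = rng.choice(TEMPLATES)
    # str.format ignores slot values the chosen template doesn't use.
    return template.format(**{k: rng.choice(v) for k, v in SLOTS.items()})

rng = random.Random(42)  # fixed seed for reproducibility
notes = [generate_note(rng) for _ in range(500)]
print(notes[:3])
```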
Which brings us to the second part of Daniel's question — the frameworks. And this is where I get excited, because the tooling has matured enormously in the last year. Let me start with SDG Hub from Red Hat. Released November twenty twenty-five, open source, installable with pip. It's a modular YAML-based framework where you chain what they call blocks into flows. You define a pipeline — say, generate a voice note transcript, then generate a classification label for it, then vary the tone and urgency — and it handles the orchestration. It supports local models through Ollama or vLLM, or you can point it at hosted APIs. And it comes with prebuilt pipelines, including ones for generating question-answer pairs from documents.
YAML-based, you said. So configuration as code.
And that's important because reproducibility matters when you're generating synthetic data. You want to be able to tweak a parameter and regenerate the whole dataset consistently. SDG Hub gives you that. But it's not the only option. Evidently AI released their synthetic data generator in August twenty twenty-five, version zero point seven point eleven, and it takes a different approach. It produces pandas DataFrames directly, which is very Pythonic and familiar to data scientists. And it has this concept of user profiles — you can specify role, tone, intent — and then generate data that matches those profiles.
If I want five hundred voice notes from a harried project manager versus five hundred from a relaxed creative director, I can dial that in.
And Evidently AI supports few-shot generation, so you give it a handful of real examples and it extrapolates. It also does multi-step pipelines. Their blog post showed an example that's wonderfully meta — generate a git diff, then generate a code review comment for that diff. Synthetic data generating synthetic code reviews. For Daniel's calendar appointment use case, you could chain appointment generation with classification label generation.
Daniel specifically asked about generating data without PII exposure risk. So the privacy angle matters as much as the generation angle.
Right, and that's where the differential privacy work comes in. There was a paper in December twenty twenty-five on what they called DP-fying your data — differentially private synthetic data. The recommended approach is a two-stage pipeline. First, generate representative synthetic data with DP parameters baked in. Then, audit the output for residual PII before you use it. The key insight is that training on DP synthetic data reduces attack surfaces compared to fine-tuning on raw data, even if you think you've anonymized the raw data. Because raw data can leak in subtle ways through model parameters. Synthetic data with DP guarantees gives you mathematical bounds on what can be extracted.
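As a flavor of that audit stage, here's a deliberately crude regex scan for leftover phone numbers and emails. A real pipeline would use a proper PII detector, but the shape is the same: generate first, audit before use.

```python
# Crude second-stage PII audit: scan generated text for residual
# phone numbers and email addresses before using the dataset.
# These regexes are deliberately simple; a production audit would
# use a dedicated PII detector (e.g. an anonymizer model) instead.
import re

PII_PATTERNS = {
    "phone": re.compile(r"\b(?:\+?\d{1,2}[\s.-]?)?(?:\(?\d{3}\)?[\s.-]?)\d{3}[\s.-]?\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def audit(records: list[str]) -> list[tuple[int, str, str]]:
    """Return (record index, PII type, matched text) for every hit."""
    hits = []
    for i, text in enumerate(records):
        for label, pattern in PII_PATTERNS.items():
            for match in pattern.finditer(text):
                hits.append((i, label, match.group()))
    return hits

flagged = audit(["Call Dr. Patel at 555-012-3456", "Lunch at noon"])
print(flagged)  # [(0, 'phone', '555-012-3456')]
```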
For structured stuff like calendar appointments, there are domain-specific tools. I saw something called medscheduler for generating outpatient appointment datasets — calendar slots, patient demographics, booking outcomes. That's obviously healthcare-focused, but the pattern generalizes. You define the schema, the constraints, the realistic distributions, and the tool populates it.
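You can hand-roll that pattern in a few lines, too. This sketch generates constrained calendar appointments with invented fields and distributions, just to show the shape.

```python
# Schema-driven synthetic calendar appointments: define fields,
# constraints, and rough distributions, then sample rows.
# Field names and distributions are illustrative assumptions.
import random
from datetime import datetime, timedelta

rng = random.Random(7)

KINDS = ["standup", "1:1", "client call", "dentist", "project review"]
DURATIONS = [15, 30, 30, 60, 60, 90]  # weighted toward 30/60 minutes

def generate_appointment() -> dict:
    # Constrain starts to weekday working hours on the half hour.
    day = datetime(2026, 3, 2) + timedelta(days=rng.randrange(0, 5))
    start = day.replace(hour=rng.randrange(9, 17), minute=rng.choice([0, 30]))
    duration = rng.choice(DURATIONS)
    return {
        "title": rng.choice(KINDS),
        "start": start.isoformat(),
        "end": (start + timedelta(minutes=duration)).isoformat(),
        "attendees": rng.randrange(1, 6),
        "location": rng.choice(["office", "video call", "offsite"]),
    }

appointments = [generate_appointment() for _ in range(500)]
print(appointments[0])
```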
NVIDIA's NeMo tools also do this for general-purpose synthetic calendar events. You configure seed data, define your columns, write prompts that capture the patterns you want, and it generates. For voice notes specifically, there's a whole pipeline that's emerged. An LLM generates diverse transcripts, then a TTS engine synthesizes the audio with variations in pitch, speed, and background noise, and then you train your classifier on the paired data. There's a public dataset of eighty-three thousand seven hundred WAV files of isolated words with those variations, used for training keyword spotting and voice command classifiers.
Daniel could generate his five hundred voice notes as text, verify they're PII-clean, then run them through a TTS engine with some parameter variation to get realistic audio files. And he never touches real user data.
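If he goes the audio route, a minimal local sketch with the pyttsx3 library could look like this. Note that pyttsx3 only exposes a few engine properties, like rate and volume; pitch shifts and background noise would need separate audio tooling.

```python
# Turn PII-clean synthetic transcripts into audio files with a local
# TTS engine (pyttsx3), varying speaking rate and volume per file.
# pyttsx3 exposes only a few properties; pitch variation and
# background noise would need additional audio tools.
import random
import pyttsx3

transcripts = [
    "Remind me to call David about the launch plan on Thursday.",
    "Pick up printer paper before Monday.",
]

rng = random.Random(0)
engine = pyttsx3.init()

for i, text in enumerate(transcripts):
    engine.setProperty("rate", rng.randrange(140, 200))  # words per minute
    engine.setProperty("volume", rng.uniform(0.7, 1.0))  # 0.0 to 1.0
    engine.save_to_file(text, f"note_{i:03d}.wav")

engine.runAndWait()  # processes all queued utterances
```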
If he wants to be extra careful, he can run the whole thing locally. The Anonymizer SLM models I mentioned — six hundred million to four billion parameters — those run on consumer hardware. Combine that with a local LLM for generation via Ollama, and a local TTS engine, and the entire pipeline stays on his machine. No API calls, no data leaving his network, no PII exposure risk whatsoever.
That's the surgical anonymization idea behind those Anonymizer SLMs you mentioned earlier. Targeted replacement rather than scorched-earth redaction.
And the Hugging Face blog post on this from twenty twenty-five was really clear about the philosophy. These models are trained via GRPO — group relative policy optimization — to perform what they call surgical PII replacements. They target only specific entities. Name, phone number, email, address. Everything else stays intact. So the context and the semantics survive. If your voice note says "The Q3 report needs Sarah's signature by Friday," only "Sarah" gets swapped. The business meaning is preserved.
Which is exactly what Daniel was intuiting when he ran his own voice notes through an LLM. He was doing manually what these frameworks now do systematically. But I want to circle back to something you mentioned in passing — model collapse. Because if we're telling people to generate synthetic data at scale, we should also tell them what breaks.
And this is where the survey paper I cited earlier gets really important. The risk is that if you train successive generations of models primarily on synthetic data from other models, you get model collapse. Loss of diversity, loss of factuality, loss of robustness. The outputs become bland, repetitive, eventually nonsensical. Each generation amplifies the biases and flattens the tails of the distribution. The mitigation the survey recommends is straightforward: blend synthetic and real data. Don't train exclusively on synthetic. Use synthetic for augmentation, not replacement.
Which Daniel is already doing, whether he realizes it or not. He started with his real voice notes. The synthetic data is an extension, not a substitute.
His approach is actually the gold standard. Real seed data, LLM-based anonymization, synthetic augmentation. That's the pattern that preserves both privacy and utility while avoiding collapse risk. And the frameworks I mentioned — SDG Hub, Evidently AI — they're designed with this blending approach in mind. You're not supposed to generate data in a vacuum. You're supposed to generate data that extends and varies your real data.
Let's talk about some of the more creative use cases, because Daniel asked what else people are doing. I've been reading about RAG evaluation datasets — retrieval augmented generation. You take a knowledge base, say a company's internal documentation, and you generate synthetic question-answer pairs from it. Those become your ground truth for evaluating whether your RAG system is retrieving the right documents and generating accurate answers. Without synthetic data, you'd have to manually write hundreds of test questions.
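A rough sketch of that question-answer loop against a local model, again via Ollama. The prompt and the one-chunk setup are our assumptions; frameworks like SDG Hub ship prebuilt flows for exactly this.

```python
# Generate synthetic question-answer pairs from document chunks to
# build a RAG evaluation set. Prompt and model name are assumptions.
import json
import ollama

QA_PROMPT = """Read the passage below and write one question a user might ask
that the passage answers, plus the correct answer. Respond as JSON with keys
"question" and "answer".

Passage:
{chunk}"""

def qa_pair(chunk: str, model: str = "llama3") -> dict:
    response = ollama.chat(
        model=model,
        messages=[{"role": "user", "content": QA_PROMPT.format(chunk=chunk)}],
        format="json",  # ask Ollama to constrain output to JSON
    )
    return json.loads(response["message"]["content"])

docs = ["Our refund policy allows returns within 30 days of purchase."]
eval_set = [qa_pair(chunk) for chunk in docs]
print(eval_set[0])
```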
Adversarial testing is another big one. You want to test your classifier or your chatbot against edge cases — prompt injections, toxic content, rare linguistic patterns. Those don't show up often in real logs, or if they do, you might not want to store them. Synthetic generation lets you create a test suite of adversarial examples on demand. The Evidently AI blog post from August showed exactly this — generating Twitter-style posts with varying toxicity levels to stress-test content moderation classifiers.
Customer support simulation is probably the most common enterprise use case. Before you deploy a chatbot, you generate thousands of synthetic user queries with different intents, tones, complexity levels, and languages. You run those through the bot and measure response quality. You find the failure modes before real users do.
One that I think is underappreciated — code review simulation. The Evidently AI example I mentioned, where you generate a synthetic git diff and then generate a code review comment. You can use that to train junior developers, or to evaluate code review tools, or to build training data for automated review systems. It's a multi-step pipeline where each step generates something that feeds the next step.
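The chaining itself is simple to express: each step's output becomes the next step's prompt. A hedged sketch, reusing the same local-model setup as before:

```python
# Two-step synthetic pipeline: generate a git diff, then generate a
# code review comment for that diff. Prompts and model name are
# illustrative, not the Evidently AI blog's exact setup.
import ollama

def ask(prompt: str, model: str = "llama3") -> str:
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

# Step 1: generate the synthetic artifact.
diff = ask("Write a small, realistic git diff for a Python bug fix. "
           "Output only the unified diff.")

# Step 2: feed step 1's output into the next generation step.
review = ask("You are a senior engineer. Write a concise code review "
             f"comment for this diff:\n\n{diff}")

print(review)
```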
We've got use cases. We've got frameworks. Let me ask the question I think Daniel is really driving at. If I'm an individual developer with an early-stage idea — not an enterprise, not a team — what's my practical starting point? What do I install on a Tuesday afternoon?
For Daniel's specific scenario — generating credible voice notes and calendar appointments — I'd start with Evidently AI. It's Python-native, it produces DataFrames, it has user profiles for varying tone and intent, and it supports few-shot generation. You write a handful of example voice notes, define the columns you want, specify the profiles, and it generates five hundred variants. The learning curve is maybe an afternoon. And because it's model-agnostic, you can point it at a local model through Ollama and keep everything private.
If he wants more structure, more reproducibility, more pipeline thinking?
Then SDG Hub from Red Hat. The YAML configuration is more upfront work, but once you've defined your flow, you can regenerate the entire dataset with different parameters in one command. It's designed for exactly the kind of iterative prototyping Daniel is doing. And because it's Red Hat, the documentation is solid and it's built with enterprise concerns in mind even though it's open source. Things like logging, versioning, reproducibility.
What about the tabular data side? Calendar appointments have structured fields — time, date, duration, attendees, location.
For the structured fields, Synthetic Data Vault — SDV — is the mature option. It handles tabular, relational, and time-series data using GANs, VAEs, and statistical methods. But I should be clear — SDV does not natively support unstructured text generation. So for the free-text content of a calendar appointment — the description, the agenda — you'd use SDV for the metadata and an LLM-based tool for the text. Or you'd use MOSTLY AI's Synthetic Data SDK, which handles tabular and sequential data with differential privacy support built in. That's pip install mostly-ai with the local flag.
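On the structured side, a minimal SDV sketch might look like this, assuming the version one single-table API. The seed table stands in for Daniel's real, already-anonymized appointment metadata.

```python
# Fit SDV's single-table synthesizer on a small seed table of
# appointment metadata, then sample 500 synthetic rows.
# Assumes SDV 1.x; the seed data is an illustrative stand-in.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

seed = pd.DataFrame({
    "duration_minutes": [30, 60, 30, 90, 15],
    "attendees": [2, 5, 1, 8, 2],
    "is_recurring": [True, False, True, False, False],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=seed)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(seed)

synthetic = synthesizer.sample(num_rows=500)
print(synthetic.head())
```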
The stack might be SDV or MOSTLY AI for the appointment structure, plus Evidently AI or SDG Hub for the natural language content, all running against local models.
If you want to go really deep on privacy, you layer in the Anonymizer SLM models as a final pass. Generate the synthetic data, then run it through the anonymizer to catch any residual PII that the generation step might have hallucinated. Because LLMs do hallucinate. They might generate a realistic-seeming phone number or email that happens to belong to a real person. The anonymizer catches that.
Synthetic data isn't automatically PII-free just because it's synthetic. The model might reproduce patterns from its training data that map to real entities.
That's exactly why the two-stage DP approach from that December paper matters. Generate with differential privacy guarantees, then audit. Don't assume. And for Daniel's use case, where he's generating voice notes that sound like real people talking about real projects, the risk of accidental PII generation is non-trivial. A generated voice note might say "call Dr. Patel at five five five zero one two three" and that number could actually belong to someone.
Alright, so we've covered the frameworks, the privacy approach, the use cases. What's the thing most people get wrong about this?
I think the biggest misconception is that synthetic data is a cheap substitute for real data in all contexts. It's not. The survey numbers tell the story — seventy-six percent accuracy with synthetic versus eighty-eight percent with human-labeled. That twelve-point gap matters in production. Where synthetic data shines is prototyping, augmentation, edge case generation, and privacy-sensitive scenarios. It's a tool in the toolbox, not a replacement for ground truth.
The other misconception I see is people thinking that generating synthetic data means you don't have to think about data quality. You absolutely do. If your prompts are sloppy, your synthetic data will be sloppy. If your seed examples are biased, your synthetic data will amplify that bias. The frameworks help with reproducibility and scale, but they don't replace judgment.
That connects to model collapse again. If you're not careful about diversity in your generation parameters — if you always generate the same kind of voice note with the same structure and the same vocabulary — your classifier will overfit to that narrow distribution. You need to deliberately inject variety. Different sentence lengths, different levels of formality, different implied urgencies. The user profile features in Evidently AI are designed for exactly this, but you have to use them.
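One cheap way to force that variety is to vary the prompt itself per sample: persona, formality, length. A sketch of that sweep, with invented profiles:

```python
# Deliberate diversity injection: vary persona, formality, and length
# across generation calls so the synthetic set doesn't collapse into
# one narrow style. Profiles and prompt wording are illustrative.
import itertools
import ollama

PERSONAS = ["a harried project manager", "a relaxed creative director"]
FORMALITY = ["terse and clipped", "chatty and informal"]
LENGTHS = ["one short sentence", "two or three sentences"]

def generate(persona: str, style: str, length: str, model: str = "llama3") -> str:
    prompt = (f"Write a voice note as {persona}. The tone is {style}. "
              f"Length: {length}. Topic: an upcoming work deadline.")
    response = ollama.chat(model=model, messages=[{"role": "user", "content": prompt}])
    return response["message"]["content"]

# Sweep every combination so each style is represented in the dataset.
notes = [generate(p, s, l) for p, s, l in
         itertools.product(PERSONAS, FORMALITY, LENGTHS)]
```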
To pull it all together for Daniel's specific case. He wants five hundred credible voice notes or calendar appointments without PII exposure. The pipeline I'd recommend: start with a small set of real examples — his own voice notes, anonymized through a local LLM using substitution, not redaction. Use those as few-shot examples in Evidently AI or SDG Hub. Generate the five hundred variants with deliberate variation in tone, length, and content. Run the output through an Anonymizer SLM as a safety pass. Then if he needs audio, pipe the transcripts through a TTS engine with parameter variation.
For calendar appointments, layer SDV or MOSTLY AI for the structured fields, use the LLM-based tools for the free-text descriptions, and apply the same anonymization pass. The whole thing can run locally. No API costs, no data leakage, full reproducibility.
The cost angle is worth underlining. Daniel's doing early-stage prototyping. He doesn't have a budget for data labeling. The GPT-three numbers from that survey — twenty-nine dollars for six thousand examples — that's within solo developer territory. And with local models, the marginal cost is essentially zero. You're trading compute time for data quality, and for prototyping, that's almost always the right trade.
The time savings. A thousand minutes of human labeling versus forty-six minutes of generation. That's the difference between iterating on your prototype this weekend versus next month.
And now: Hilbert's daily fun fact.
The collective noun for a group of sloths is a "bed" of sloths. However, sloths are mostly solitary, so a bed of sloths is almost never observed in the wild.
If I'm a listener who wants to try this, what's my first step?

I'd say install Evidently AI, write ten example voice notes that capture the patterns you care about, point it at a local model through Ollama, and generate a hundred. See if they look credible. If they do, scale to five hundred. If they don't, tweak your prompts and your profiles. The iteration cycle is minutes, not days.
Start thinking about your evaluation criteria now. How will you know if your synthetic data is good enough? For Daniel's classification model, the test is whether the classifier trained on synthetic data performs reasonably on real data. But you need a small holdout set of real, labeled examples to measure that. Don't generate five hundred synthetic examples and then have no way to validate them.
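That validation step fits in a dozen lines with scikit-learn: train on the synthetic notes, score on the small real holdout. The data and labels here are placeholders for illustration.

```python
# Validate synthetic training data: fit a simple classifier on the
# synthetic notes, then measure accuracy on a small REAL holdout set.
# Texts and labels below are placeholders.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

synthetic_texts = ["Remind me to call David Thursday.", "Buy batteries today."]
synthetic_labels = ["task", "errand"]

real_holdout_texts = ["Call mum back tonight.", "Grab milk on the way home."]
real_holdout_labels = ["task", "errand"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(synthetic_texts)
X_test = vectorizer.transform(real_holdout_texts)

clf = LogisticRegression().fit(X_train, synthetic_labels)
print("holdout accuracy:", accuracy_score(real_holdout_labels, clf.predict(X_test)))
```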
That's the audit step from the DP pipeline. Always validate against real data, even if the real data is just twenty examples you manually labeled over coffee. Without that ground truth check, you're flying blind.
The broader point here — and I think this is why Daniel's question resonated with me — is that synthetic data generation is quietly becoming one of the most practical applications of LLMs for individual developers. Everyone talks about chatbots and code generation, but the ability to spin up a credible prototype dataset in an afternoon, without touching real user data, without a budget, without a team — that's transformative for early-stage product development.
The tooling is finally catching up to the capability. A year ago, you'd be writing custom scripts for everything. Now you've got purpose-built frameworks with documentation and communities. SDG Hub, Evidently AI, the anonymizer models, the TTS pipelines. The pieces are there. You just have to assemble them.
One open question I have, and maybe Daniel can report back — how well do these synthetic voice notes hold up when you actually train a classifier and test it on real notes from people who aren't you? Daniel's seed data is his own voice notes. His speech patterns, his vocabulary, his typical note structure. The synthetic data will reflect those patterns. Will a classifier trained on that generalize to other people's voice notes? That's the next frontier.
That's exactly the right question. And the answer probably depends on how much variation you deliberately inject. If you use those user profiles aggressively — different roles, different tones, different levels of urgency — you might get enough diversity to generalize. But it's an empirical question. Someone should run that experiment.
Alright, we should wrap. Thanks to Hilbert Flumingtop for producing, as always.
This has been My Weird Prompts. Find us at myweirdprompts dot com or wherever you get your podcasts.
If you try any of these frameworks, let us know how it goes. We're curious.