#1925: The Plumbing That Keeps Science From Collapsing

Half of all links in academic papers are dead. Here’s the plumbing that keeps knowledge from vanishing.

Featuring: Daniel, Corn, Herman

Episode Details

Episode ID: MWP-2081
Published: Apr 2
Duration: 22:41
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: Gemini 3 Flash
Topics: digital-forensics data-redundancy knowledge-management

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

The Vanishing Knowledge Problem
Imagine trying to revisit a childhood home, only to find it replaced by a parking lot. A 404 error is the digital equivalent. A study from 2015 reported a staggering statistic: fifty percent of the links cited in scholarly articles published between 1997 and 2012 were dead. Half of the scientific record’s digital breadcrumbs have vanished. This is not just an inconvenience; it threatens the foundation of research, because results cannot be verified or built upon when their sources are unreachable. The solution to this crisis is a system that sounds dry but acts as the heavy-duty industrial plumbing of the internet: the Digital Object Identifier (DOI).

At its core, a DOI is a persistent, unique identifier for a digital object. It is fundamentally different from a URL. A URL is a location—an address that tells a browser where to find a file on a specific server. If that server moves or the file is renamed, the link breaks. A DOI, however, is like a Social Security number for a document; it never changes. When you click a DOI link, you are not going directly to a file. Instead, you are routed through a resolver service (like doi.org) that looks up the DOI in a global database and redirects you to the object's current location. If a journal changes its domain or a repository reorganizes its files, the publisher simply updates the registry once, and every DOI issued for their content continues to work forever.
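
The indirection described above can be illustrated with a toy in-memory registry. This is only a sketch of the idea, not how doi.org is implemented: the `Resolver` class, the `10.99999` prefix, and the example URLs are all made up for illustration.

```python
class Resolver:
    """Toy stand-in for a DOI resolver: maps identifiers to current URLs."""

    def __init__(self):
        self._registry = {}

    def register(self, doi, url):
        # DOI names are case-insensitive, so store a normalized key.
        self._registry[doi.lower()] = url

    def update(self, doi, new_url):
        # The publisher changes the target once; citations stay untouched.
        key = doi.lower()
        if key not in self._registry:
            raise KeyError(f"unknown DOI: {doi}")
        self._registry[key] = new_url

    def resolve(self, doi):
        return self._registry[doi.lower()]


resolver = Resolver()
resolver.register("10.99999/example.1", "https://old-journal.example/papers/1.pdf")

citation = "10.99999/example.1"  # what a paper would cite
before = resolver.resolve(citation)

# The journal moves to a new domain and updates the registry once.
resolver.update("10.99999/example.1", "https://new-journal.example/articles/1")
after = resolver.resolve(citation)

print(before)
print(after)
```

The citation string never changes; only the registry entry behind it does, which is the whole point of separating the identifier from the location.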

The system is built on a hierarchy managed by the International DOI Foundation. Beneath it are registration agencies like Crossref for academic journals and DataCite for research data. These agencies issue prefixes to organizations, which then generate unique suffixes for individual items. The underlying engine is the Handle System, a robust piece of legacy tech developed by the Corporation for National Research Initiatives. While the Handle System is a general-purpose architecture for persistent identifiers, the DOI system is its most famous implementation—the specific car everyone drives on this digital highway.

This infrastructure is becoming critical for addressing the reproducibility crisis in AI and open science. Simply citing a model repository like Hugging Face is no longer sufficient because models change; weights are updated, and repositories are reorganized. To ensure scientific rigor, researchers need to cite a specific snapshot of a model or dataset. Platforms like Hugging Face and Zenodo (operated by CERN) now integrate DOI generation, allowing researchers to assign a permanent ID to a specific version of a model or a dataset. The same mechanism turns a fleeting digital broadcast into a permanent research artifact: the "My Weird Prompts" community archive on Zenodo contains over 1,900 records, each with a DOI. Even if the podcast's website vanished, these records would remain accessible through CERN’s data centers.

The system relies on a social contract and federated trust. Organizations commit to long-term preservation plans. If a repository like Zenodo were to shut down, its data would be migrated to another archive, and the DOI resolver would be updated to point to the new location. This global network of libraries, universities, and publishers has a vested interest in keeping the system alive because their own citations depend on it.

This shift is changing the incentive structure of research. Traditionally, only published papers counted toward a researcher's career. Now, with DOIs for datasets and code, researchers can receive direct credit when others cite their raw materials. This encourages sharing and collaboration. However, the ease of obtaining a DOI raises questions about quality. While anyone can technically get a DOI for anything, reliable repositories have curation processes, and the value lies in the network of citations. Ultimately, DOIs are the backbone of a growing "Knowledge Graph" that maps the lineage of ideas, connecting people, organizations, and artifacts through persistent identifiers (PIDs) like ORCID for researchers and ROR for institutions. Without this plumbing, the structure of human knowledge would collapse into a pile of broken links.

You ever try to find an old bookmark from maybe five or six years ago, and you get that dreaded four-oh-four page not found? It is like the digital equivalent of walking back to your childhood home and finding out it has been replaced by a parking lot.

It is actually much worse than most people realize, Corn. There was a major study back in twenty-fifteen that looked at academic papers published between nineteen-ninety-seven and twenty-twelve. They found that fifty percent of the links in those scholarly articles were dead. Half of the scientific record’s digital breadcrumbs just vanished.

Fifty percent? That is a staggering amount of broken knowledge. If I am a researcher trying to build on work from a decade ago and the data source is just a dead link, the whole foundation of my project starts looking pretty shaky. Today’s prompt from Daniel is about how we actually fix that, or at least how the grown-ups in the room decided to fix it back in the day. We are talking about DOIs, the Digital Object Identifier system.

Herman Poppleberry here, and I have been diving into the infrastructure of this all morning. It is one of those things that sounds incredibly dry—like talking about the Dewey Decimal System for the internet—but it is actually the only reason modern science hasn't completely collapsed into a pile of broken URLs. By the way, fun fact for the listeners, today’s episode script is actually being powered by Google Gemini three Flash.

Nice. I like having a high-speed AI brain helping us navigate the plumbing of the internet. Because that is really what a DOI is, right? It is the heavy-duty industrial plumbing that keeps the information flowing when the surface-level stuff breaks.

That is a good way to put it. At its simplest level, a DOI is a persistent, unique identifier. Most people confuse them with URLs, but they are fundamentally different. A URL—a Uniform Resource Locator—is just an address. It tells your browser "go to this specific server, in this specific folder, and grab this file." If the webmaster renames that folder or moves to a new domain, that address is useless.

Right, it is the difference between an address and a Social Security number. If I move houses, my address changes, but my ID stays the same. If the internet wants to find me, it should look for the ID, not just the last place I lived.

Well, not exactly in the sense of a person, but for a digital object, the DOI is that permanent identity. When you click a DOI link, you aren't going straight to a website. You are going to a resolver service, usually at doi dot org. That service looks at the DOI, checks its global database, and says, "Ah, this object is currently living at this specific URL," and then it redirects you.

So if a journal changes its name from "The Journal of Very Important Sloths" to "Sloth Weekly" and moves all its files to a new server, they just update the registry once, and every DOI ever issued for their papers still works?

That is the magic of it. It shifts the burden of maintenance from the person citing the work to the person publishing it. It is a social contract as much as it is a technical one. The organization that issues the DOI—the registration agency—is making a professional promise to keep that redirect updated forever.

It feels like something we take for granted until you realize how much work is happening under the hood. You mentioned registration agencies. How does a piece of data actually get one of these? I assume I can’t just generate my own DOI in my basement and hope for the best.

No, it is a hierarchical system. At the top, you have the International DOI Foundation, which was founded in nineteen-ninety-eight, with the first DOI applications going live in the year two thousand. They oversee the whole thing. Beneath them, you have registration agencies like Crossref, which handles academic journals, or DataCite, which focuses on research data. These agencies give prefixes to organizations. So, a place like Zenodo or Hugging Face gets a prefix, and then they can generate suffixes for individual items.

I saw a DOI the other day that looked like a bunch of random gibberish. It was ten dot one-one-four-five slash, and then a string of numbers. Is there a secret code in there, or is it just a digital serial number?

It is structured! The "ten" part is the directory indicator, and it marks the string as a DOI within the Handle System. The digits after the dot, usually four or more, are the registrant code, which tells you who registered it. Together, "ten dot" plus that code make up the prefix. Everything after the forward slash is the suffix, which the registrant creates. It can be a simple number, or it can be a string that includes the volume and page number of a journal. The key is that it must be unique within that prefix.
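
To make that structure concrete, here is a minimal sketch that splits a DOI string at the first forward slash and then separates the leading "10" from the registrant's code. The function name `split_doi` and the sample suffixes are hypothetical, and the checks are deliberately loose rather than a full validator.

```python
from typing import NamedTuple


class DoiParts(NamedTuple):
    directory: str   # always "10" for DOIs
    registrant: str  # code assigned to the registering organization
    suffix: str      # chosen by the registrant, unique within the prefix


def split_doi(doi: str) -> DoiParts:
    """Split a bare DOI such as '10.1145/12345' into its parts."""
    # Only the first slash separates prefix from suffix; the suffix
    # itself is allowed to contain further slashes.
    prefix, slash, suffix = doi.strip().partition("/")
    if not slash or not suffix:
        raise ValueError(f"missing suffix in {doi!r}")
    directory, dot, registrant = prefix.partition(".")
    if directory != "10" or not dot or not registrant:
        raise ValueError(f"prefix does not start with '10.' in {doi!r}")
    return DoiParts(directory, registrant, suffix)


parts = split_doi("10.1145/12345.67890")
print(parts.directory, parts.registrant, parts.suffix)
```

Splitting on the first slash only is a deliberate choice here, since a registrant is free to use slashes inside its own suffixes.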

So the Handle System is the actual engine? I have heard that name pop up in tech circles before. It sounds like something from a nineteen-seventies mainframe.

It is a bit of a legacy tech that turned out to be incredibly robust. The Handle System was developed by the Corporation for National Research Initiatives. It is basically a general-purpose architecture for managing persistent identifiers. The DOI system is just the most famous implementation of it. Think of the Handle System as the engine and the DOI system as the specific car that everyone is driving.

I love that. Old-school tech holding up the modern world. But let's talk about why this is expanding so fast right now. It started with PDFs of academic papers, but Daniel mentioned that it is moving into machine learning and open science. Why does a model on Hugging Face need a DOI? Can't I just link to the repository?

This is where it gets critical for the "reproducibility crisis" in science. If you are writing a paper and you say, "I used the BERT base uncased model from Hugging Face," that is not specific enough. Models change. They get updated, weights are tweaked, or the repository might get reorganized. If I try to run your code a year later and the model has "drifted," I won't get the same results.

Ah, so the DOI on Hugging Face acts like a time capsule. It points to a specific version, a specific snapshot of those weights at a specific moment in time.

Precisely. Hugging Face recently integrated DOI generation so that researchers can cite a specific "commit" or version of a model. When you have a DOI for a model, you are saying, "I used exactly this artifact." It brings the same level of rigor to AI research that we have expected from biology or physics for decades.

It makes sense. If you are doing medical research using an AI model to predict protein folding, you can't have the model changing under your feet while you are trying to verify the results. That would be like trying to measure something with a ruler that keeps shrinking and growing.

And it is not just models. It is datasets, too. One of the biggest players in this space is Zenodo. It is operated by CERN, the folks with the Large Hadron Collider. They realized early on that if we want "Open Science" to work, we need a place to dump everything—code, charts, spreadsheets, audio files—and give them a permanent home.

We actually have a stake in this ourselves. We have the "My Weird Prompts" community on Zenodo. I checked it this morning, and we have over nineteen hundred records on there. Every time we upload an episode metadata file or an audio clip, Zenodo slaps a DOI on it.

It is a huge relief, honestly. If our main website ever went offline, or if the hosting service we use for the podcast vanished, those nineteen hundred records are still sitting in the CERN data centers. Anyone with the DOI could still find the "My Weird Prompts" archive. It makes the podcast a permanent research artifact rather than just a fleeting digital broadcast.

I like the sound of that. "Permanent research artifact." It makes me feel a lot more distinguished than just a sloth talking into a microphone. But what happens if Zenodo goes bankrupt? Or if Hugging Face gets bought by a company that decides to delete all the old stuff? Does the DOI just break?

That is the "social contract" part I mentioned. These organizations commit to a "long-term preservation" plan. For example, if Zenodo were to shut down, their agreement with the digital preservation community ensures that their data would be migrated to another library or archive. The DOI resolver would then be updated to point to the new location. The identifier is persistent even if the host is not.

That requires a lot of trust in institutions. In an era where everyone is skeptical of big organizations, the DOI system feels almost quaintly optimistic. It relies on everyone agreeing to play by the rules for the next hundred years.

It is a "federated" trust, though. It is not just one guy in an office. It is a global network of libraries, universities, and publishers. If one node fails, the others have a vested interest in keeping the system alive because their own citations depend on it. It is like the inter-bank lending system, but for facts.

Let's look at the second-order effects here. If everything has a DOI—the paper, the data, the code, the model—how does that change the way researchers actually work? Does it change the "incentive structure" Daniel mentioned?

Huge change. Traditionally, the only thing that "counted" for a researcher’s career was a published paper in a journal. You could spend three years collecting a massive, beautiful dataset, but if you didn't write a paper about it, it didn't help your career. But now, because you can get a DOI for a dataset on Zenodo, people can cite the dataset directly.

So I can get "academic credit" just for providing the raw materials for other people's brilliance?

Yes! There are tools now that track "Data Citations." You can see how many times your dataset has been used in other people's work. It encourages scientists to share their raw data early and often because they know they will get the credit for it. It turns the "publish or perish" mentality into something a bit more collaborative.

I can see the downside, though. If it is so easy to get a DOI, does the "gold standard" start to lose its luster? If I upload a picture of my lunch to Zenodo and get a DOI for it, am I technically a "published researcher"?

Well, technically, yes, but this is where metadata comes in. A DOI is only as good as the metadata attached to it. When you register a DOI, you have to provide information: who created it, when, what is it, what are the keywords? Reliable repositories like Zenodo have curation processes. Our "My Weird Prompts" community, for instance, has to be managed. We don't just let anyone dump anything in there.
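
As a rough illustration of why metadata matters, the sketch below checks a record for a handful of descriptive fields before it would be accepted. The field names in `REQUIRED_FIELDS` are assumptions chosen for this example, loosely inspired by the kinds of properties registration agencies ask for; they are not an official schema, and `missing_fields` is a hypothetical helper.

```python
# Hypothetical field names for illustration only; not an official schema.
REQUIRED_FIELDS = ("creators", "title", "publisher", "publication_year", "resource_type")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty, in a stable order."""
    return [name for name in REQUIRED_FIELDS if not record.get(name)]


lunch_photo = {"title": "My lunch", "creators": ["Corn"]}

episode_record = {
    "creators": ["Corn", "Herman"],
    "title": "The Plumbing That Keeps Science From Collapsing",
    "publisher": "Zenodo",
    "publication_year": 2026,
    "resource_type": "Audio",
}

print(missing_fields(lunch_photo))
print(missing_fields(episode_record))
```

A record that fails this kind of check still could be stored somewhere, but nobody searching by creator, year, or type would ever find it.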

True. And I suppose the "cites" are what really matter. If no one ever cites your lunch photo, the DOI is just a lonely number in a database. It is the network of links that creates the value.

And that network is becoming a "Knowledge Graph." There are these things called PIDs—Persistent Identifiers. A DOI is a PID for an object. An ORCID is a PID for a person—it is a unique ID for a researcher so you don't confuse "John Smith" the physicist with "John Smith" the chemist. Then you have RORs, which are IDs for organizations.

So you can eventually map out the entire world of human knowledge. This person at this university used this dataset to create this model which was cited by this paper. You can see the whole family tree of an idea.

It is beautiful when you see it visualized. It makes the scientific process transparent. You can trace the lineage of a discovery back to the exact line of code or the exact sensor reading that sparked it. Without DOIs, that lineage is broken every time a website gets redesigned.
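
The "family tree of an idea" can be sketched as a small graph of persistent identifiers. Everything below is hypothetical: the identifiers are placeholders rather than real DOIs, ORCIDs, or RORs, and the `lineage` function is just one simple way to walk such a graph.

```python
# Each key points to the identifiers it builds on or belongs to.
# All identifiers are made-up placeholders.
edges = {
    "doi:10.99999/paper.1": ["doi:10.99999/model.1"],
    "doi:10.99999/model.1": ["doi:10.99999/dataset.1"],
    "doi:10.99999/dataset.1": ["orcid:0000-0000-0000-0000"],
    "orcid:0000-0000-0000-0000": ["ror:00example0"],
}


def lineage(start, graph):
    """Walk from an artifact back through everything it depends on."""
    seen = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue  # guard against cycles and repeated nodes
        seen.append(node)
        # Reverse so that neighbors are visited in their listed order.
        stack.extend(reversed(graph.get(node, [])))
    return seen


for pid in lineage("doi:10.99999/paper.1", edges):
    print(pid)
```

Starting from the paper, the walk reaches the model, the dataset, the person, and finally the institution, which is the chain the hosts describe.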

You mentioned "Version Control" earlier. I think that is a really important nuance. If I am a developer and I am constantly updating my software on GitHub, how do I handle DOIs? I don't want a thousand different DOIs for every tiny bug fix.

Zenodo has a clever solution for this. They offer "Concept DOIs" and "Version DOIs." A Concept DOI always points to the latest version of your project. If someone just wants "the latest My Weird Prompts data," they use the Concept DOI. But if a researcher wants to cite the exact state of the project as it existed on April second, twenty-twenty-six, they use the Version DOI for that specific release.
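
The two kinds of identifier can be modeled in a few lines. This is a sketch of the behavior as described above, not Zenodo's implementation; the `VersionedRecord` class and all the DOIs in it are invented for illustration.

```python
class VersionedRecord:
    """Toy model: one concept DOI plus one DOI per published version."""

    def __init__(self, concept_doi):
        self.concept_doi = concept_doi
        self._versions = []  # (version_doi, release_label), oldest first

    def publish(self, version_doi, release_label):
        self._versions.append((version_doi, release_label))

    def resolve(self, doi):
        if doi == self.concept_doi:
            if not self._versions:
                raise LookupError("no versions published yet")
            # The concept DOI follows the most recent release.
            return self._versions[-1][1]
        for version_doi, release_label in self._versions:
            if version_doi == doi:
                # A version DOI stays pinned to one release.
                return release_label
        raise KeyError(f"unknown DOI: {doi}")


record = VersionedRecord("10.99999/project")
record.publish("10.99999/project.v1", "release 1.0")
record.publish("10.99999/project.v2", "release 2.0")

print(record.resolve("10.99999/project"))     # follows the latest release
print(record.resolve("10.99999/project.v1"))  # pinned to the first release
```

Citing the version DOI is what makes a result reproducible; citing the concept DOI is what you want in a README that should always lead to the newest release.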

That is smart. It handles the "dynamic" nature of digital stuff. A book is static; once it is printed, it doesn't change. But code is alive. It is constantly evolving. The DOI system had to evolve to handle that "aliveness" without losing the "persistence."

And we are seeing this move into even more "wild" territory. Think about AI-generated content. If an AI generates a massive dataset or a complex piece of software, who gets the DOI? How do we verify the provenance? These are the questions the International DOI Foundation is grappling with right now.

It feels like we are heading toward a "Digital Dark Age" if we don't get this right. We have so much data being produced every second, and if we don't have a way to tag it and keep it findable, ninety-nine percent of it will be gone in a decade. We will be the most documented generation in history, but our descendants won't be able to read any of it because the links are all broken.

That is the nightmare scenario. People usually call the link side of it "link rot," and there is a cousin problem, sometimes lumped in with "bit rot," where the file formats themselves become obsolete. DOIs don't solve the format problem, so you still need to make sure you aren't saving everything in a proprietary format that won't exist in ten years, but they at least solve the "where is it?" problem.

So, if I am a listener and I am working on a project—maybe I am a hobbyist coder or a student—what is the "pro tip" here? How do I use this info?

The first takeaway is: stop using "naked" URLs in your citations or your documentation if a DOI is available. If you are citing a paper or a dataset, look for that "ten dot" string. Use it. It ensures that anyone reading your work in the future can actually find what you are talking about.
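
In practice, DOIs show up in documents in several spellings: bare, with a "doi:" label, or as an older resolver link. The sketch below normalizes those into a single https://doi.org/ link. The helper name `doi_to_url` and the sample DOI are made up, and the sketch assumes the DOI contains no characters that would need URL-encoding.

```python
import re

# Strip a leading "doi:" label or an http(s) doi.org / dx.doi.org resolver prefix.
_LEADING = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def doi_to_url(raw: str) -> str:
    """Turn any common spelling of a DOI into a https://doi.org/ link."""
    doi = _LEADING.sub("", raw.strip())
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {raw!r}")
    return "https://doi.org/" + doi


for spelling in (
    "10.99999/example.1",
    "doi:10.99999/example.1",
    "http://dx.doi.org/10.99999/example.1",
):
    print(doi_to_url(spelling))
```

All three spellings come out as the same link, which is a convenient form to paste into documentation or a reference list.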

And if I am the one creating the data? If I have a cool dataset of, I don't know, sloth migration patterns?

Use a repository that issues DOIs. Don't just put a zip file on your personal website or a random Dropbox link. Upload it to Zenodo, or if it is a model, put it on Hugging Face. It gives your work a level of professional permanence. It says, "I care about this enough to make it a permanent part of the digital record."

It also makes you look a lot more legit. Having a DOI for your project is like having a verified badge on social media, but for your brain.

It really is. It is a signal of quality and commitment to open science principles. And for the ML folks listening, seriously, check out the Hugging Face DOI integration. If you are publishing a paper about a new fine-tuned model, there is no excuse for not having a DOI for that model. It makes your work auditable.

I am curious about the future of this, though. We are seeing things like IPFS—the InterPlanetary File System—and Arweave, which are decentralized ways of storing data forever using blockchain-like tech. Do DOIs play nice with those? Or are they competing systems?

That is the million-dollar question. Right now, they are complementary. A DOI is an identifier looked up in a centrally governed registry that points to a location. IPFS uses "content addressing," where the address is derived from the file itself. Some people are already using DOIs to point to IPFS hashes. The DOI provides the "human-friendly" persistent name, and IPFS provides the "decentralized" storage.
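
The contrast between a registry lookup and content addressing can be sketched with a plain hash. This is a simplification: real IPFS identifiers use a more elaborate encoding than a bare SHA-256 hex digest, and the `content_address` and `put` helpers, the in-memory `store`, and the DOI below are all invented for illustration.

```python
import hashlib

store = {}  # stand-in for content-addressed storage


def content_address(data: bytes) -> str:
    """Derive the address from the bytes themselves."""
    return hashlib.sha256(data).hexdigest()


def put(data: bytes) -> str:
    address = content_address(data)
    store[address] = data
    return address


original = b"sloth migration patterns, version 1"
edited = b"sloth migration patterns, version 2"

addr_original = put(original)
addr_edited = put(edited)

# A registry entry (the DOI side) can then point at a content address.
registry = {"10.99999/sloths.v1": addr_original}

fetched = store[registry["10.99999/sloths.v1"]]
print(fetched == original)
print(addr_original != addr_edited)
```

Because the address is computed from the content, any change to the bytes yields a different address, while the registry entry remains a human-assigned name that someone has to maintain.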

So the DOI is like the entry in the phone book, and IPFS is the actual physical location of the person. You can have both.

The DOI system is flexible enough to point to anything. It doesn't care if the target is a web server at Harvard or a block on a decentralized chain. Its job is just to be the "source of truth" for where that object lives today.

It is funny how much of our digital world relies on these invisible layers of agreement. We spend all our time looking at the flashy UI and the AI features, but none of it works if we can't find the underlying data. It is like the electrical grid—you don't think about the transformers and the high-voltage lines until the lights go out.

And the lights are going out all the time in the form of link rot. Every time a startup goes bust or a university reorganizes its website, a piece of our collective memory gets dim. DOIs are the backup generators.

I think we should talk a bit more about the "legal and IP protection" aspect Daniel mentioned. How does a DOI help you protect your work? It is not a copyright, is it?

No, it is not a copyright, but it provides a "third-party verified timestamp." If I upload my research to Zenodo and get a DOI, there is now a durable public record that says "Herman Poppleberry uploaded this specific file on April second, twenty-twenty-six." If someone tries to claim they invented it two months later, I have strong, independent evidence that my version existed first.

That is huge for independent researchers who might not have a big legal department at a university to back them up. It is a "poor man's patent" in a sense, or at least a very strong receipt.

It is also useful for "versioning" your intellectual property. If you have an idea that evolves, you can show the progression. You can prove exactly when each breakthrough happened. In a world where AI can generate a million variations of an idea in seconds, having a verified, human-authored timestamp is going to become more and more valuable.

I can see that. "Provenance" is going to be the biggest buzzword of the next five years. Knowing where something came from, who made it, and how it has changed. The DOI system was built for a simpler time, but it turns out to be exactly what we need for the AI era.

It is accidental foresight. The people who built this in nineteen-ninety-eight were just tired of broken journal links. They didn't know they were building the foundation for a global, machine-readable graph of all human knowledge, but that is exactly what they did.

It is a good reminder that solving a boring, practical problem can sometimes have massive, visionary consequences. "Fix the links" sounds like a task for an intern. "Build a permanent identity system for all digital artifacts" sounds like a mission for a god. They are the same thing.

It really is. And it is something we should all be more aware of. Next time you see a DOI, don't just think of it as a weird URL. Think of it as a tiny piece of the wall we are building against the "Digital Dark Age." It is a small victory for permanence in a world that is increasingly ephemeral.

I feel a lot better about our nineteen hundred records on Zenodo now. We aren't just hoarding data; we are "contributing to the persistent record of humanity."

We are! And anyone can go look at it. If you want to see what a "community" looks like in the DOI world, search for "My Weird Prompts" on Zenodo. You can see how the metadata is structured, how the files are versioned, and how each one has that unique "ten dot" identifier. It is a living example of everything we have talked about today.

And if you are feeling ambitious, maybe deposit some of your own work. Whether it is a dataset of your local weather, a piece of code you wrote for a hobby, or even your own podcast metadata. Get a DOI. Join the "social contract."

It is a great way to start thinking like a digital archivist. We are all creators now, which means we all have a responsibility to be our own librarians. The tools are there, they are free, and they are incredibly powerful.

I think that is a perfect place to wrap this up. We started with the tragedy of fifty percent link rot and ended with a global network of persistent knowledge. Not bad for a day's work.

It is a rare "tech story" that is actually optimistic. Usually, we are talking about how things are breaking, but the DOI system is a story about something that actually works and is getting better.

Well, before we go, we have to do the business. Huge thanks as always to our producer, Hilbert Flumingtop, for keeping the gears turning behind the scenes. And a big thanks to Modal for providing the GPU credits that power this show—including the Gemini model that helped write today's deep dive.

This has been "My Weird Prompts." If you found this useful, or if you now have a sudden urge to go register DOIs for all your old spreadsheets, let us know. You can reach us at show at myweirdprompts dot com.

And if you want to support the show and help us reach more people, the best thing you can do is leave a review on your podcast app. It actually makes a difference in the algorithms, helping other curious minds find our weird little corner of the internet.

We are also on Telegram if you want to get notified the second a new episode drops. Just search for "My Weird Prompts" there.

All right, Herman. I am going to go see if I can find a DOI for my favorite nap spot.

I don't think "under the big oak tree" has a prefix yet, Corn, but I will look into it for you. See you next time.

Catch you later.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.
