#2526: How Peer Review Actually Works (and Fails)

The history of peer review, the Lancet's biggest scandals, and why arXiv is changing everything.

Episode Details
Episode ID: MWP-2684
Published:
Duration: 38:40
Audio: Direct link
Pipeline: V5
TTS Engine: chatterbox-regular
Script Writing Agent: deepseek-v4-pro

AI-Generated Content: This podcast is created using AI personas. Please verify any important information independently.

Peer review is often described as the bedrock of scientific publishing — the gate that separates rigorous research from noise. But as this episode reveals, that bedrock is surprisingly young, deeply flawed, and currently being reshaped by preprint servers like arXiv.

A Surprisingly Recent Invention

The myth traces peer review back to the 1600s and the Royal Society's Henry Oldenburg, who circulated manuscripts to colleagues before publishing them in Philosophical Transactions. In reality, that was one editor using his personal network, not a systematic process. The formalized peer review we know today — anonymous reviewers, structured reports, revise-and-resubmit cycles — only became standard after World War II. Before that, editors like Max Planck made publication decisions themselves. Planck simply read Einstein's 1905 papers and said yes, no external review required.

Peer review became an institutional norm because science grew too large and specialized for any single editor to judge everything, and because spectacular failures demonstrated the cost of insufficient scrutiny.

The Lancet's Two Great Scandals

The Lancet, one of medicine's most prestigious journals, has been at the center of two landmark fraud cases that expose peer review's limits.

The first was Andrew Wakefield's 1998 paper linking the MMR vaccine to autism. Based on just twelve children, the paper launched a global vaccination panic that persists today. It took twelve years to retract, and only after an investigation found Wakefield had committed multiple ethical violations — unnecessary invasive procedures, undisclosed funding from lawyers suing vaccine manufacturers, and misrepresented patient histories. Peer review caught none of it. Reviewers didn't see raw data, didn't know about conflicts of interest, and couldn't verify the children's medical histories. As the episode notes, peer review checks methodology and plausibility, not honesty.

The second was Surgisphere in June 2020. A massive study claiming hydroxychloroquine increased COVID mortality was published, based on a proprietary database from a company called Surgisphere. When researchers examined the data, the paper collapsed: it reported more COVID deaths in Australia than Australia had recorded, claimed data from African hospitals without electronic health record systems, and the lead author had never seen the raw data. The Lancet retracted in twelve days — not because peer review caught the fraud, but because an army of researchers on social media and PubPeer tore it apart in real time.

The arXiv Revolution

arXiv launched in 1991 as a digitized version of physicists' existing preprint-sharing habits. It took over physics within a decade, and for artificial intelligence, it's now the primary publication venue. The advantages are clear: zero barrier to access, immediate community feedback from potentially hundreds of experts rather than two or three anonymous reviewers, and the possibility of diverse perspectives catching errors specialists might miss.

But there are real costs. There's no quality filter — anything passes basic screening. The burden of evaluating papers falls entirely on readers facing a firehose of preprints. Public feedback tends toward extremes: silence from junior researchers afraid to criticize senior colleagues, or pile-ons when controversial claims hit Twitter. A junior researcher whose preprint gets torn apart doesn't get a private revise-and-resubmit; they get a permanent public record of being wrong. Traditional peer review, for all its flaws, at least allows private failure.

What Peer Review Can't Do

The Hwang Woo-suk stem cell fraud at Science and the Jon Sudbø cancer data fabrication at the Lancet reinforce the same lesson: peer review is a check on methodology and plausibility, not on honesty. Fabricated data that's internally consistent will pass review. The only real safeguards are raw data transparency, pre-registration of studies, and post-publication scrutiny by a large community — which is exactly what arXiv enables at scale, while simultaneously creating new problems of its own.


Transcript

Corn
Daniel sent us this one — he's been thinking about peer review, specifically how the scientific world decides what counts as rigorous work and what doesn't. He's noticed the trend in artificial intelligence and arXiv toward pre-publication sharing, putting papers out before formal review, and he wants to understand the tradeoffs there. But he's also asking us to dig into the actual history of peer review, the scandals that made it necessary, and the uncomfortable fact that even the most prestigious journals have published work that turned out to be fraudulent. He specifically mentioned wanting examples involving the Lancet. There's a lot to unpack here.
Herman
Before we dive in — quick note, today's script is being written by DeepSeek V four Pro. Which I appreciate, because this topic requires some serious organizational horsepower. The history of peer review alone is several centuries deep, and then we're going to layer on the preprint revolution and the fraud cases. It's a three-course meal.
Corn
Daniel did basically hand us a syllabus. But it's a good one. Let's start with the history piece, because I think most people assume peer review is this ancient academic tradition, when it's actually surprisingly recent in its current form.
Herman
The myth is that peer review goes back to the sixteen hundreds, to the founding of the Royal Society. There's a grain of truth there. Henry Oldenburg, the first secretary, did circulate manuscripts to members for comment before publishing them in Philosophical Transactions. But that was one guy using his network, not a systematic process. It was more like asking a few smart friends what they thought.
Corn
Which is basically what arXiv does today, just with better distribution.
Herman
That's a sharp parallel. But the formalized peer review we think of today — anonymous reviewers, structured reports, revise-and-resubmit — that really only became standard after World War Two. Before that, journal editors made most publication decisions themselves. Einstein's Annus Mirabilis papers in nineteen oh five? Those went into Annalen der Physik with essentially no external review. The editor, Max Planck, just read them and said yes.
Corn
Planck had good instincts, apparently.
Herman
But the point is, peer review as an institution-wide norm is maybe seventy, eighty years old. It emerged partly because science got bigger and more specialized — no single editor could judge everything — and partly because of some spectacular failures that showed what happens when you don't have rigorous external scrutiny.
Corn
This is where the scandals come in. Daniel specifically asked about the Lancet, which has a genuinely wild history on this front.
Herman
The Lancet is one of the oldest and most prestigious medical journals in the world, founded in eighteen twenty-three. And it has published some of the most important medical research in history. But it's also been the vehicle for what I would argue is the single most damaging piece of medical fraud in the last century.
Corn
The Wakefield paper.
Herman
Andrew Wakefield and twelve co-authors published a study in the Lancet in February nineteen ninety-eight that claimed to find a link between the MMR vaccine and autism in children. Twelve children, a tiny case series, but the paper launched a global panic. Vaccination rates dropped. Measles came roaring back after being nearly eliminated in many countries.
Corn
What's wild is how long it took to fully unravel. The paper wasn't retracted until twenty-ten.
Herman
And the retraction only happened after a lengthy investigation by the UK General Medical Council, which found that Wakefield had committed multiple ethical violations. He'd subjected children to unnecessary invasive procedures like colonoscopies and lumbar punctures without proper ethical approval. He'd received funding from lawyers trying to sue vaccine manufacturers, which he never disclosed. And the paper misrepresented the children's medical histories — several of the children had documented developmental concerns before receiving the MMR vaccine, but the paper made it seem like everything appeared after vaccination.
Corn
The Lancet's editors later said the paper never should have been published. But here's the thing — it did go through peer review. It passed whatever screening the Lancet had at the time.
Herman
Right, and that's the uncomfortable question. Peer review caught none of this. The reviewers didn't see the raw data. They didn't know about the conflicts of interest. They couldn't verify that the children's histories were accurately reported. This gets to a fundamental limitation — peer review is a check on methodology and plausibility, not on honesty. If someone is willing to fabricate data, peer review is extremely unlikely to catch it.
Corn
Unless the fabrication is sloppy. Which brings us to another Lancet case, and this one is almost comically inept.
Herman
The Surgisphere one.
Corn
In June twenty-twenty, at the height of COVID, the Lancet published a major study on hydroxychloroquine. The study analyzed data from nearly a hundred thousand patients across hundreds of hospitals and concluded that hydroxychloroquine was associated with increased mortality. Major public health organizations paused clinical trials based on this paper.
Herman
Then people started looking at the data. The study was based on a proprietary database from a company called Surgisphere. The authors claimed Surgisphere had collected detailed electronic health records from hospitals around the world. But when researchers started asking questions, the whole thing fell apart almost immediately.
Corn
The numbers didn't add up. The paper reported more COVID deaths in Australia than Australia had actually recorded at that point. It claimed data from hospitals in Africa that turned out not to have electronic health record systems. The company's website was barely functional. And the lead author, a professor at Harvard, had never actually seen the raw data — he'd just analyzed what Surgisphere gave him.
Herman
The Lancet retracted the paper on June fourth, twenty-twenty, less than two weeks after publication. The New England Journal of Medicine simultaneously retracted a separate COVID paper based on the same Surgisphere data. That retraction speed was unprecedented. The Wakefield paper took twelve years. Surgisphere took twelve days. The difference was that in twenty-twenty, there was an army of researchers on social media and on platforms like PubPeer who tore the paper apart in public in real time.
Corn
Which is a form of post-publication peer review that didn't exist in nineteen ninety-eight. And that's where the arXiv model starts to look really interesting.
Herman
Let's talk about arXiv. It launched in nineteen ninety-one, started by Paul Ginsparg at Los Alamos National Laboratory. The original idea was simple — physicists were already sharing preprints by mailing paper copies to each other. Ginsparg just digitized that process. You upload your paper, it goes online, everyone can read it immediately. No paywall, no review delay.
Corn
It completely took over physics. By the late nineties, essentially every working physicist was posting to arXiv before submitting to journals. The journals didn't die — they adapted. Physical Review Letters and the rest just accepted that the preprint would already be public.
Herman
In computer science and artificial intelligence, arXiv is basically the primary publication venue now. Conference proceedings are still important, but if you want to know what's happening in AI right now, you're reading arXiv, not waiting for the journal version, which often appears a year later and is functionally a formality.
Corn
The pros are clear. Zero barrier to access. Anyone anywhere can read the latest research without an institutional subscription. And you get community feedback immediately, not after six months of anonymous review.
Herman
The feedback point is crucial. Traditional peer review typically involves two or three reviewers. They might catch problems, they might not. But when you post on arXiv, potentially hundreds of experts can scrutinize your work. The probability that someone will spot an error goes way up. And the feedback can be much more diverse — you might get comments from someone in a completely different field who sees a connection or a flaw that a specialist reviewer would miss.
Corn
There are real downsides, and I think the AI field is living through them right now. The biggest one is that there's no quality filter. Absolutely anything can go on arXiv. The moderators do some basic screening — they'll reject papers that are clearly nonsense or off-topic — but they don't evaluate scientific merit. So you end up with a firehose of preprints, and the burden of figuring out what's worth reading falls entirely on the reader.
Herman
There's a subtler problem. In traditional peer review, the reviewers are anonymous, which in theory lets them be brutally honest. But on arXiv, public feedback tends to be either very positive or very negative. People are reluctant to publicly criticize a colleague's work, especially junior researchers criticizing senior ones. So you get a lot of silence, or you get pile-ons.
Corn
The pile-on dynamic is real. Someone posts a paper making a controversial claim, and Twitter and the comment sections explode. That's not peer review. That's a reputational knife fight conducted in public. It selects for the most inflammatory takes, not the most careful ones.
Herman
It can destroy careers. A junior researcher who posts something that gets torn apart on social media doesn't get a revise-and-resubmit. They get a permanent public record of being wrong. Traditional peer review, for all its flaws, at least lets you fail privately.
Corn
That's an important point. The privacy of traditional review creates space for authors to make mistakes and fix them before the work is public. Preprint culture collapses that buffer. Everything is public from day one, and retractions or corrections, even when they happen, never fully erase the initial impression.
Herman
Let's talk about another scandal that's revealing in a different way. The Hwang Woo-suk case.
Corn
Hwang was a South Korean stem cell researcher who became a national hero in the early two thousands. In two thousand four and two thousand five, he published two papers in Science claiming to have created human embryonic stem cells through therapeutic cloning. These were enormous claims. The papers went through Science's peer review and were published to worldwide acclaim.
Herman
Then it all unraveled. The data was fabricated. The eggs used in the research had been obtained unethically, with junior researchers coerced into donating. Hwang was indicted for embezzlement and bioethics violations. But here's the part that connects to our discussion — one of the things that helped expose the fraud was that Korean investigative journalists and online forums started picking apart inconsistencies in the papers that peer reviewers had missed.
Corn
The whistleblower was a researcher named Ryu Young-joon, who went to the Korean television program PD Notebook. The journalists investigated, found that the stem cell lines didn't match the claimed genetic origins, and the whole thing collapsed. Science retracted both papers in January two thousand six. Hwang received a two-year suspended prison sentence.
Herman
The Lancet connection — around the same time, the Lancet published a paper by a Norwegian cancer researcher named Jon Sudbø, who claimed to have found a link between certain medications and reduced oral cancer risk. The paper was based on data from over nine hundred patients. Except the data was completely fabricated. The patients didn't exist. The entire dataset was invented.
Corn
Here's the embarrassing detail — one of the reviewers noticed something odd about the patient birth dates and asked for clarification. Sudbø sent back a response, and the reviewer accepted it. The paper was published.
Herman
The odd thing was that something like two hundred fifty of the nine hundred patients shared the same birthday. It was a glaring fabrication signal that any careful statistical check would have caught. The reviewer flagged it, Sudbø gave some hand-wavy explanation, and everyone moved on.
Corn
The paper was eventually retracted after a colleague of Sudbø's read it and realized the data had to be fake because some of the patients were described as having a subtype of cancer that the colleague knew didn't exist in that population. So it wasn't even the peer review process that caught it — it was a domain expert reading the published paper and thinking, wait, that's not possible.
Herman
Sudbø's case unraveled further. Investigators found he'd fabricated data in multiple papers. He lost his medical license, his PhD was revoked, and he was convicted of fraud. All of this happened after peer review at a top journal.
Corn
We have this pattern. Peer review at prestigious journals fails to catch outright fraud, repeatedly. The Lancet with Wakefield. Science with Hwang. The Lancet with Sudbø. The Lancet and the New England Journal of Medicine with Surgisphere. These aren't marginal journals with lax standards. These are the top of the pyramid.
Herman
The question is, does that mean peer review is broken, or does it mean we're expecting something from peer review that it was never designed to provide?
Herman
Peer review is fundamentally a plausibility check. Reviewers look at the methods, the analysis, the logic, and they ask — does this make sense given what we know? They can catch methodological errors, statistical mistakes, logical gaps. They're reasonably good at that. What they can't do is detect deliberate fraud by someone who understands the field well enough to fabricate plausible data. If you know what realistic data looks like, you can make up realistic-looking fake data.
Corn
Peer reviewers don't typically have access to the raw data. They're evaluating the paper, not redoing the analysis. There's a movement toward requiring data sharing as a condition of publication, and that would help. But even then, a determined fraudster can fabricate raw data files. The only real defense against fraud is replication — independent researchers trying to reproduce the results. And replication is expensive, time-consuming, and under-rewarded in academic careers.
Herman
Which brings us back to the preprint model. One argument for preprints is that they accelerate the replication process. If a paper is public immediately, anyone can try to reproduce it right away. You don't have to wait for the journal publication cycle. We saw this during COVID — the rapid sharing of preprints led to incredibly fast identification of problems. Surgisphere was exposed in days, not years.
Corn
There's a flip side to that speed. Bad preprints also spread misinformation at pandemic speed. There was a preprint claiming COVID was engineered with HIV sequences, which was completely wrong. The authors withdrew it within days, but conspiracy theorists had already picked it up, and the claim kept circulating for months. The damage was done.
Herman
That preprint is a perfect case study. It was posted on bioRxiv in January twenty-twenty, claimed to find insertions of HIV genetic material in the SARS-CoV-two genome, and implied this was evidence of engineering. Actual virologists looked at it and immediately recognized that the sequences were so short they were statistically meaningless — you could find them in lots of viruses by random chance. But by the time the scientific community had debunked it, the idea was already loose in the world.
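The "statistically meaningless" point is just arithmetic about short sequences. As a hedged sketch (the motif lengths below are illustrative choices, not the preprint's actual insert lengths), the expected number of chance occurrences of one specific motif in a random genome is:

```python
def expected_chance_matches(motif_len: int, genome_len: int, alphabet_size: int = 4) -> float:
    """Expected number of times a specific motif of length `motif_len`
    appears by chance in a uniformly random sequence of length
    `genome_len` over an alphabet of `alphabet_size` letters
    (4 for nucleotides A, C, G, T)."""
    positions = genome_len - motif_len + 1
    return positions * (1 / alphabet_size) ** motif_len

# For a genome of roughly coronavirus size (~30,000 bases):
print(expected_chance_matches(6, 30_000))   # a 6-base motif: several chance hits expected
print(expected_chance_matches(25, 30_000))  # a 25-base motif: essentially never by chance
```

The intuition the virologists applied is visible in the numbers: very short sequences are expected to match somewhere by pure chance, so finding them proves nothing, while only long exact matches would be surprising.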
Corn
Once an idea is loose, you can't put it back. Retractions don't go viral. The original false claim always has a bigger audience than the correction. There was a study on this — published in Nature — that looked at how retracted papers continue to be cited after retraction, often without any mention of the retraction. The false information persists because people don't check whether a paper they're citing has been retracted.
Herman
The preprint model amplifies both the good and the bad. Good science gets shared faster, bad science gets shared faster, and the correction mechanisms are asymmetric. The false claim spreads wide, the retraction reaches a fraction of that audience.
Corn
This is where I think the hierarchy of journals still serves a real function, even with all the flaws we've discussed. Journal reputation acts as a filtering mechanism. If I see a paper in Nature or Science or the Lancet, I have some prior expectation that it's been through a serious review process. That doesn't mean it's correct — we've just listed multiple cases where it wasn't — but it means someone with expertise has vetted it. If I see a random preprint on arXiv with no comments and no journal acceptance, I have no such signal.
Herman
The problem is that the journal hierarchy is also deeply flawed as a signal. There's a well-documented bias toward positive, novel, surprising results. Journals want to publish exciting findings. Null results, replication studies, incremental advances — those are much harder to publish in top venues. So the journal filter doesn't just filter for quality. It filters for excitement.
Corn
This is the file drawer problem. Studies that find no effect sit in file drawers, unpublished. The published literature is systematically biased toward positive findings. And that means even if every published paper passed rigorous peer review, the overall picture would still be distorted because you're only seeing the successes.
Herman
The preprint model, in theory, solves the file drawer problem. You can post your null result on arXiv. Someone doing a meta-analysis can find it. The full distribution of results becomes visible. In practice, people still don't post their null results very often. If you ran an experiment and got nothing, you're probably not going to spend time writing it up when you could be working on the next experiment that might yield a publishable finding.
Corn
Let's talk about another historical scandal Daniel's question points toward. The PACE trial controversy.
Herman
This one is more recent and more contested. The PACE trial was a large study of treatments for chronic fatigue syndrome, also called ME, published in the Lancet in twenty eleven. It claimed that cognitive behavioral therapy and graded exercise therapy were effective treatments. The paper went through peer review and was published with significant fanfare.
Corn
Then patient groups and independent researchers started digging into the data. They found that the authors had changed their outcome measures after the trial began. The original protocol defined "recovery" in one way, but the published paper used a much looser definition. Under the original criteria, the treatments showed much less benefit.
Herman
This is a classic problem that peer review often misses. Reviewers see the paper as submitted. They don't typically compare it against the original trial protocol or the pre-registered analysis plan. If the authors changed their definitions between protocol and publication, reviewers might never know.
Corn
The Lancet resisted calls for retraction. Patient groups waged a years-long campaign. Eventually, in twenty sixteen, the Lancet published a correction — not a retraction — but the controversy never really resolved. The paper is still in the literature, still cited, and the patient community remains deeply angry about it.
Herman
The PACE case illustrates something important. Journals are extremely reluctant to retract. Even when serious problems are identified, the default response is correction, clarification, expression of concern — anything short of full retraction. Retraction is seen as a nuclear option that damages the journal's reputation along with the authors'.
Corn
Which creates a perverse incentive. If retractions are rare and painful, the system is biased toward leaving flawed papers in the literature with minor corrections rather than removing them entirely. The literature accumulates errors over time.
Herman
There's a researcher named John Ioannidis who wrote a famous paper in two thousand five called "Why Most Published Research Findings Are False." He wasn't arguing that most researchers are frauds. He was arguing that the combination of small sample sizes, flexible analysis choices, publication bias toward positive results, and the structure of academic incentives makes it statistically likely that many published findings won't replicate.
Corn
The replication crisis hit psychology especially hard. Brian Nosek and the Open Science Collaboration attempted to replicate one hundred psychology studies published in top journals and found that only about forty percent replicated. These were studies published in the best journals, after peer review. The replication attempts were preregistered, meaning the analysis plan was locked in before data collection, so there was no flexibility to fish for significance. And more than half of the original findings couldn't be reproduced.
Herman
Where does this leave us? We have a peer review system that's relatively recent historically, that fails to catch fraud even at the most prestigious journals, that is biased toward positive and novel findings, and that produces a literature where a substantial fraction of published results may not replicate. And we have a preprint system that solves some of these problems — speed, access, reduced publication bias — but introduces new ones — no quality filter, potential for misinformation to spread rapidly, and public shaming dynamics.
Corn
I think the honest answer is that neither system is adequate on its own, and the future is probably some hybrid model. We're already seeing this emerge. Journals are adopting preprint-friendly policies. Some are moving to "publish then review" models where the paper goes online immediately and the reviews are published alongside it. eLife, for example, shifted to this model where they no longer make accept or reject decisions — everything that passes initial editorial screening is published, and the reviews and author responses are public.
Herman
eLife's model is interesting but controversial. Some researchers love it because it eliminates the wasteful cycle of submitting to multiple journals and getting rejected for reasons that have nothing to do with scientific quality. Other researchers hate it because it removes the prestige signal. If everything gets published, how do tenure committees evaluate candidates?
Corn
That's the practical problem. The journal hierarchy is deeply embedded in academic career incentives. Hiring, tenure, grants — all of these decisions rely heavily on publication venues. If you publish in Nature, that's a major career boost. If you post a brilliant preprint on arXiv that never gets journal-published, it might count for very little in a tenure review — even though the arXiv preprint might be more rigorously vetted by community feedback than a Nature paper that two reviewers glanced at.
Herman
The prestige signal and the quality signal are only loosely correlated. We know that prestigious journals publish flawed work. We know that excellent work appears in lower-tier journals or only on preprint servers. But the prestige heuristic persists because it's efficient. Nobody has time to read every paper and evaluate it on its merits. You need shortcuts.
Corn
That's where I think the AI field is doing something interesting, even if it's messy. The culture in machine learning is to post everything on arXiv and to evaluate work based on whether it's useful, whether it replicates, whether the code is available, whether other people build on it. The paper's venue matters less than whether it actually works.
Herman
The emphasis on code availability is a huge shift. In traditional academic publishing, the paper is the product. In machine learning, the paper is almost an advertisement for the code and the model. If you claim a result and don't release the code, people are skeptical. If you release the code and other people can reproduce your results, that's worth more than a Nature publication.
Corn
Even that has problems. Code availability doesn't guarantee reproducibility. Dependencies change, hardware configurations matter, random seeds affect results. There are papers where the authors released code but other researchers still couldn't reproduce the claimed results because of undocumented implementation details.
Herman
There's a great paper on this — "Troubling Trends in Machine Learning Scholarship" — that documented how many ML papers make unsupported claims, use unfair comparisons, or have errors in their mathematical derivations. And these are papers that passed peer review at top conferences like NeurIPS and ICML. The problems aren't unique to traditional journals.
Corn
Let's circle back to Daniel's question about the pros and cons of pre-publication sharing. I think the core tradeoff is speed versus filtering. Preprints give you immediate access to the latest research, they democratize access by removing paywalls, and they enable rapid community feedback. But they also flood the ecosystem with unvetted work, they can spread misinformation, and they don't provide the quality signal that journals offer.
Herman
I'd add that the quality signal from journals is weaker than most people assume, but it's not zero. A paper in a good journal has at least been read by someone with relevant expertise who didn't find obvious fatal flaws. That's not nothing. It's just a lot less than the public imagines.
Corn
The public imagines peer review as a kind of guarantee. "Published in a peer-reviewed journal" is treated as synonymous with "scientifically established." And that's just not what peer review means. It means a few people read it and thought it looked reasonable. That's it.
Herman
There's a famous quote from Richard Horton, the longtime editor of the Lancet. He said — and this is a direct quote — "The mistake, of course, is to have thought that peer review was any more than a crude means of discovering the acceptability, not the validity, of a new finding." This is the editor of one of the world's top medical journals saying that peer review is a crude filter, not a validation mechanism. He wasn't being defensive. He was being honest about the limitations of the system he oversaw.
Corn
What should a smart consumer of science do? If you can't trust the journal stamp, and you can't trust preprints, how do you evaluate scientific claims?
Herman
I think the answer is to look for converging evidence. One study, no matter where it's published, is just one data point. You want to see multiple independent groups finding the same thing. You want to see replications. You want to see meta-analyses that combine results across studies. You want to see whether the finding holds up when different methods are used.
Corn
You want to look at who has incentives to find what. If a drug company funds a study finding that their drug works, that doesn't mean the study is wrong, but it means you should want independent replication before you're confident. If a researcher's entire career is built on a particular theory, they have an incentive to find supporting evidence and to dismiss contradictory evidence.
Herman
The incentive structure in science is one of those things that everyone knows about but nobody has figured out how to fix. Publish or perish is real. The pressure to produce positive, novel findings is intense. And the peer review system, which is supposed to be the quality control mechanism, is staffed by the same people who are subject to those pressures. Reviewers are unpaid, overworked, and reviewing papers in their spare time. It's remarkable that the system works as well as it does.
Corn
I want to mention one more historical scandal that's too important to skip. The Schön scandal in physics.
Herman
Jan Hendrik Schön. Early two thousands. Schön was a researcher at Bell Labs who was publishing at an incredible rate — something like one paper every eight days at his peak — in top journals like Nature and Science. He claimed to have made breakthroughs in organic semiconductors and superconductivity. He was being talked about as a future Nobel laureate.
Corn
Then other researchers noticed something weird. The same graphs appeared in multiple papers, labeled as different experiments. The noise in the data was identical, which is statistically impossible. Independent researchers couldn't replicate any of his results. Bell Labs launched an investigation and found that Schön had fabricated data in at least sixteen papers.
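The "identical noise" red flag is something you can actually check numerically: measurement noise from genuinely independent experiments should be essentially uncorrelated, so a near-perfect correlation between the residuals of two supposedly separate datasets points to duplication. Here is a minimal sketch with synthetic data — the linear detrend and the noise scales are illustrative assumptions, not details from the actual Schön investigation:

```python
import numpy as np

rng = np.random.default_rng(0)

def residuals(y, x):
    """Residual noise after removing a linear trend (a crude stand-in
    for subtracting whatever physical model the paper claims)."""
    coeffs = np.polyfit(x, y, 1)
    return y - np.polyval(coeffs, x)

x = np.linspace(0.0, 1.0, 200)
shared_noise = rng.normal(scale=0.05, size=x.size)

# Two "independent" experiments that secretly reuse the same noise,
# as in the duplicated figures across Schoen's papers.
exp_a = 2.0 * x + shared_noise
exp_b = -1.0 * x + 0.5 + shared_noise

# An honestly independent experiment for comparison.
exp_c = 2.0 * x + rng.normal(scale=0.05, size=x.size)

r_dup = np.corrcoef(residuals(exp_a, x), residuals(exp_b, x))[0, 1]
r_indep = np.corrcoef(residuals(exp_a, x), residuals(exp_c, x))[0, 1]

print(f"duplicated noise:  r = {r_dup:.3f}")    # essentially 1.0
print(f"independent noise: r = {r_indep:.3f}")  # near 0.0
```

A correlation near one between residuals is exactly the kind of cross-paper check that reviewers, looking at one manuscript at a time, never ran.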
Herman
The investigation report was devastating. Schön had manipulated data, reused figures, and in many cases had never actually conducted the experiments he claimed to have done. Nature and Science retracted his papers. His PhD was revoked by the University of Konstanz. And the whole thing happened at Bell Labs, one of the most prestigious industrial research labs in history, with co-authors who were respected senior scientists.
Corn
The co-authors were cleared of misconduct, but they were criticized for not catching the fraud. They had put their names on papers describing experiments they hadn't personally verified. And that's another feature of the system — co-authorship is often distributed loosely, with senior researchers lending their names and credibility to work they haven't thoroughly checked.
Herman
The Schön case is particularly humiliating for peer review because the fabrication was not subtle. The same noise pattern appearing in different experiments is a glaring red flag. Any reviewer who had thought to compare figures across Schön's papers would have spotted it. But reviewers review one paper at a time. They don't typically cross-reference against the author's other publications.
Corn
Peer review failed to catch fraud that was hiding in plain sight, across multiple papers, at the most prestigious journals, for years. And it was eventually exposed by other researchers trying and failing to build on the published results.
Herman
Which brings us back to replication. Replication is the immune system of science. Peer review is more like a preliminary screening — it catches some obvious problems but misses many others. The real test is whether other researchers can reproduce the findings.
Corn
The preprint model accelerates the replication process. If a paper is public immediately, replication attempts can start immediately. If it sits in peer review for eight months before anyone outside the review circle sees it, that's eight months of delay before the immune system kicks in.
Herman
Again, the speed cuts both ways. A fraudulent or flawed preprint can do eight months of damage before anyone has a chance to debunk it. The same speed that accelerates correction also accelerates contamination.
Corn
I think the honest answer to Daniel's question is that pre-publication sharing is net positive for science but net negative for public understanding of science. It's good that researchers can share results quickly and get rapid feedback. It's bad that the public is exposed to unvetted claims that are presented as scientific findings.
Herman
That's a really clean way to put it. The scientific community has the expertise to evaluate preprints skeptically. They know that a preprint is a preliminary claim, not an established finding. But the public, and the media, often don't make that distinction. A preprint gets reported as "a new study finds," and the "not yet peer-reviewed" caveat, if it appears at all, is buried at the bottom.
Corn
During COVID, this was a massive problem. Preprints were driving news cycles. Policy decisions were being made based on preprints. The normal buffer between scientific claim and public acceptance collapsed entirely. There's a famous case — the Stanford study on COVID prevalence in Santa Clara County. The authors posted a preprint claiming the infection fatality rate was much lower than previously thought. It got enormous media attention, was cited in policy debates, and then the statistical problems were identified — selection bias, test specificity issues, confidence intervals much wider than the headline numbers suggested. The peer-reviewed version that eventually appeared was substantially more cautious. But the damage was done. The initial claim had already shaped public perception.
Herman
That's the asymmetry again. The initial claim gets the headlines. The corrections get a paragraph on page twelve.
Corn
What do we do about it? I don't think the solution is to go back to the pre-preprint era. That genie is out of the bottle. I think the solution is to improve scientific literacy so the public understands what a preprint is and isn't, and to improve the preprint infrastructure so that problems are flagged more visibly.
Herman
There are projects working on this. PubPeer allows post-publication commentary on papers, including preprints. Some preprint servers are experimenting with community review features. The idea is to make the review process more transparent and continuous, rather than a one-time gatekeeping event before publication.
Corn
There's a movement toward registered reports, where journals accept papers based on the research question and methodology before the results are known. This eliminates publication bias because the decision to publish doesn't depend on whether the results are positive or exciting. It also eliminates the incentive to massage the analysis to get significant results, because the analysis plan is locked in before data collection.
Herman
Registered reports are a clever innovation. They shift peer review to the point where it can actually improve the science — the design phase — rather than just evaluating the final product. And they solve the file drawer problem because the study gets published regardless of the results. The adoption has been slow, though. Only a fraction of journals offer registered reports, and many researchers are still unfamiliar with the format. Changing academic culture is hard.
Corn
Academic culture changes at the speed of tenure clocks.
Herman
Which is to say, very slowly.
Corn
Now — Hilbert's daily fun fact.
Herman
The average cumulus cloud weighs about one point one million pounds. That's roughly the weight of one hundred elephants, floating above your head.
Corn
If you're someone who reads scientific papers, whether in journals or on arXiv, what should you actually do with all of this? First, check whether the paper has been replicated. One study is a data point, not a conclusion. Look for meta-analyses, systematic reviews, or independent replications before you update your beliefs strongly.
Herman
Second, look at the preregistration. If the study was preregistered, you can check whether the authors analyzed the data the way they said they would. If the outcome measures or analysis methods changed between registration and publication, that's a red flag — not necessarily fraud, but it should make you more cautious.
Corn
Third, check the conflicts of interest. Who funded the study? Who stands to benefit from the results? This doesn't tell you whether the findings are correct, but it tells you what incentives were in play.
Herman
Fourth, look at the data and code availability. If the authors haven't shared their data or code, ask why not. In fields where data sharing is standard, failure to share is a warning sign.
Corn
Fifth, for preprints specifically, check whether the paper has been through any form of review. Has it been submitted to a journal? Have there been community comments? Has anyone tried to replicate it? A preprint with no engagement is just a claim. A preprint that has survived scrutiny is something more.
Herman
Sixth, be especially cautious with findings that align perfectly with what you already believe. Confirmation bias is powerful. If a study tells you exactly what you want to hear, that's when you should be most skeptical.
Corn
The bigger picture here is that science is a process, not a product. Peer review, preprints, replication, retraction — these are all parts of a system that corrects itself over time, imperfectly, messily, but eventually. The question isn't whether individual papers are trustworthy. The question is whether the system as a whole converges on truth.
Herman
Historically, it does. Not quickly, not smoothly, but it does. The Wakefield paper was eventually retracted. The Schön fraud was exposed. The Surgisphere data collapsed under scrutiny. The system works, just slower and messier than we'd like.
Corn
The preprint model makes the mess more visible. That's both its strength and its weakness. You see the sausage being made, and it's not always appetizing. But you also see the corrections happening in real time, and you see the scientific community doing what it's supposed to do — arguing, checking, replicating, refining.
Herman
I think Daniel's question gets at something fundamental. The way science communicates its findings shapes what science can discover. A system that rewards novelty and speed will produce different knowledge than a system that rewards rigor and replication. We're living through a transition in how that communication works, and we don't yet know what the equilibrium looks like.
Corn
The one thing I'm confident about is that no single mechanism is sufficient. Peer review alone isn't enough. Preprints alone aren't enough. Replication alone isn't enough. You need all of them, layered together, with the understanding that each layer has its own failure modes.
Herman
That's probably the best answer to Daniel's question. Pre-publication sharing is not a replacement for peer review. It's an additional layer in the system, with its own strengths and weaknesses. The trick is to use it for what it's good at — speed, access, broad feedback — while understanding what it's bad at — quality filtering, misinformation risk, and the loss of private failure.
Corn
Thanks to our producer Hilbert Flumingtop for keeping us on track. This has been My Weird Prompts. You can find every episode at myweirdprompts.com, and if you want to send us a question like Daniel did, that's the place to do it. We'll be back soon.
Herman
Take care, everyone.

This episode was generated with AI assistance. Hosts Herman and Corn are AI personalities.