What is claim-laundering in AI medical benchmarks?

Claim-laundering is how a narrow, guarded benchmark result loses its qualifiers as it travels from the paper to the press release to the headline. A rubric score on written vignettes, a diagnosis match on rare teaching cases, or a detection rate with no doctor in the study becomes, three steps later, 'AI outperforms doctors' in real medicine. It works by substituting the measured thing for a bigger one, quietly handicapping or omitting the human comparator, letting a conflicted party grade the exam, and reporting the optimistic ceiling instead of the default or the number that survives a new population.

Does a high diagnostic-AI accuracy or sensitivity number transfer to a clinic?

Not automatically. An accuracy or sensitivity figure depends on the operating threshold, the disease prevalence, and the population it was tuned on. An AI cervical-screening tool that performed well in the single-site data it was developed on detected only 60.1% of precancers when tested prospectively across five African countries — and HPV testing, the current recommended screen, caught 90.4% in the same study. For TB chest-X-ray AI validated on a South African cohort, borrowing a threshold from the literature dropped one tool to 64% sensitivity, and one product's specificity fell from about 87% to 37% in previously-treated patients. Ask for the threshold, the prevalence, and the false-positive cost.

How to Read an “AI Beats Doctors” Claim

Q: Why does an 'AI beats doctors' study usually not mean AI beats doctors?

Because the thing that was scored is almost never real clinical care. These studies grade the quality of a written answer to a pre-packaged text case — a rubric score, a match to a known diagnosis, a top-k list against a scripted differential — not whether a diagnosis was correct on an undifferentiated patient, a treatment was right, or anyone got better. In the 2026 Rwanda study (Nature Health), leading models scored about 4.5 out of 5 on an eleven-part answer-quality rubric, graded by six physicians on written vignettes, with no exam, no labs, and no patient outcome measured. A high answer-quality score is a real result; it is not a demonstration of clinical superiority.

Q: Was the human baseline in these AI studies given a fair chance?

Often not. In Microsoft's MAI-DxO study, the roughly 20% physician baseline came from 21 doctors solving the 56 held-out test cases of a 304-case NEJM benchmark while prohibited from using search engines, other AI, and online medical sources — tools clinicians normally rely on. In Google's AMIE study, the physicians were made to work through an unfamiliar text-chat interface the authors concede is not representative of usual practice, a format native to a language model. In the Rwanda study, some clinician answers were machine-translated before grading. When the human arm is denied its normal tools, format, or language, the reported gap partly measures the handicap, not the medicine.

Q: Is MAI-DxO really four times better than doctors?

The numbers are real; the framing strips their context. On SDBench — 304 rare NEJM case records — the default MAI-DxO system (paired with OpenAI's o3) reached about 80% diagnostic match, while the 21 physicians scored about 20% on the 56-case held-out split, barred from their usual references — the source of the 'four times' line. But an off-the-shelf o3 without Microsoft's orchestration already scores 78.6% on the same benchmark, so the specialized 'system' adds roughly one point. The widely quoted 'up to 85%' is a maximum-accuracy ensemble configuration costing about US$7,184 per case. 'Medical superintelligence' appears in the company blog, not the paper.

Q: Does this mean AI is useless in medicine?

No — and reading it that way is the opposite error. The models in these studies genuinely score well on hard tasks, and some may help clinicians when deployed as an aid and measured on real outcomes. The point of the seven questions is not that the technology is worthless but that a benchmark score is not a clinical outcome. 'Beats doctors on a rubric' and 'not proven at the bedside' are both claims to be checked, not verdicts. Demand the primary source, the matched comparison, and the patient outcome — and treat hype and dismissal with equal suspicion.

Every few weeks a study reports that an AI system matched or beat doctors — on a licensing exam, a set of hard cases, a screening image, a patient's question. The results are usually real, and some of these tools may genuinely help. But between "scored higher than clinicians on this test" and "is a better clinician" sits a stack of quiet substitutions, and the sentence that reaches a minister of health or a procurement committee has usually lost every one of them. You don't need to run the study to catch the gap. You need to know which questions a genuine "beats doctors" claim can survive — and which ones make a flattering number come apart. Every example below links to its primary source, so you can check the working yourself.

The one substitution to watch

Almost every “AI beats doctors” result swaps a measured thing for a bigger claimed thing: a rubric score for clinical quality, a diagnosis match on a curated case for competence on a real patient, a detection rate for a health outcome. The swap is usually true as far as it goes — which is exactly why it travels. The work of reading the claim is noticing what was swapped out.

Which questions separate a test score from a doctor?

Seven of them. Each starts from a claim you will actually hear, states what you are meant to conclude, shows what the sentence quietly leaves out, and ends with the one question that exposes the gap. One study runs through the whole guide: a 2026 Nature Health paper, funded by the Gates Foundation, reporting that large language models outperformed frontline clinicians in Rwanda. It is a careful study whose authors are candid about its limits — which is precisely why it shows how the qualifiers get lost downstream.

Question 01 — did the human get to practice medicine?

Physicians scored about 20%. Our AI scored up to 85% — four times better.

What you're meant to conclude

The machine is four times the diagnostician a trained doctor is.

What's hidden

What the doctors were allowed to do. That comparison is the headline of Microsoft's MAI-DxO study, which reports its orchestrated system reaching roughly 80% diagnostic accuracy against a ~20% average for physicians — the source of the "four times" line. But the physician baseline is 21 doctors working solo — with search engines, other AI, and online medical sources prohibited — on the held-out test set of hard New England Journal of Medicine cases (56 test cases, drawn from the 304-case SDBench benchmark). Real diagnosis is not a closed-book text quiz. A clinician takes a history, examines the patient, orders a test, asks a colleague, watches the illness evolve — and reaches for exactly the references this study took away. Strip a doctor of all of that and hand the same paragraph to a machine built to answer paragraphs, and you have not measured medicine; you have measured recall under a handicap the AI never faced.

The same move sits underneath the Rwanda study's setup, where clinicians wrote free-text answers to written vignettes — no patient in front of them, no examination, no follow-up. When you read that an AI "beat" clinicians, the first question is not how it scored. It is whether the humans were allowed to do the thing doctors actually do to be right.

The question to always ask

Did the human get to practice medicine — or just answer a paragraph?

Real medicine is information-gathering. If the clinician couldn't take a history, examine, order a test, or use the references they normally rely on, the written case handed the model the workup for free — and the "gap" is partly the handicap. Ask what tools, time, and information each side had.

Question 02 — what did the score actually grade?

Leading models outperformed local clinicians on every measure (P<0.001).

What you're meant to conclude

The AI gives better medical care than the region's doctors and nurses.

What's hidden

What "every measure" measures. In the Rwanda study, six senior physicians graded 506 written question-and-answer pairs on an eleven-part rubric — completeness, clarity, relevance, potential for harm, and the like — each on a 1-to-5 scale. The top models scored about 4.5 out of 5 and did so significantly above the local clinicians. But not one of those eleven metrics checks a diagnosis against a confirmed answer, measures a treatment, or asks whether a patient got better. It is an answer-quality rubric: it rewards a response that reads as complete, clear, and safe. Fluent, well-organised, appropriately-hedged prose is precisely what a large language model is built to produce. Scoring high on it is a real finding about the writing; it is not evidence that the writer would heal anyone.

The same substitution powers the leaderboards. OpenAI's HealthBench reports its best model at 60% — but that score is the fraction of physician-written rubric points that another model (GPT-4.1) judges a response to have met, not a count of correct diagnoses. "Better on the rubric" and "better for the patient" are different sentences, and only one of them was tested.

The question to always ask

What did the score actually grade — answer quality, or a patient outcome?

Completeness, empathy, and fluent prose are what models are optimized to produce. A right diagnosis checked against ground truth, the right treatment, a patient who recovered — that is a different, harder measurement. Ask which one the headline number is, and whether the study measured an outcome at all.

What “outperforms clinicians” actually scored: answer quality on a 1–5 rubric

The paper reports no single overall clinician score to set beside these models. The one human comparison it states is a gap: the top model averaged 0.83 of a point above local GPs (range 0.38–1.10) — on an eleven-part writing rubric, graded by six people, with no patient in the room.

Source: Nature Health (2026), “Large language models for frontline healthcare support in low-resource settings” — 6 physician raters, 11-metric rubric, 506 question–response pairs from 4 Rwandan districts. Funded by the Gates Foundation (INV-068056). pmc.ncbi.nlm.nih.gov/articles/PMC12880909/

Data table

Item	Mean rubric score (1–5; 5 = best) — English-prompted subset	Note
Gemini-2	4.56	s.d. 0.58
GPT-4o	4.53	s.d. 0.68
o3-mini	4.49	s.d. 0.58
DeepSeek R1	4.16	s.d. 1.06
Meditron-70B	3.99	s.d. 0.86

Read before citing

Every bar is a language model. There is NO single overall clinician score in the paper to place beside them — so no human bar can honestly be drawn. The 4.56 and 4.53 figures are Gemini-2 and GPT-4o, not doctors.
The one human comparison the paper does state is a gap, not a bar: the top model averaged 0.83 points above local GPs (range 0.38–1.10 across the 11 metrics).
These are subjective answer-quality scores on an 11-part written rubric (completeness, clarity, potential for harm, and so on), graded by six physicians on a 1–5 scale. None of the 11 metrics checks a diagnosis against ground truth, measures a treatment outcome, or asks whether any patient improved.
Scores shown are the English-prompted subset. Values are means; the paper reports a standard deviation (in the data table), not a confidence interval. Kinyarwanda prompting lowered the model scores by about 0.15 points.
Conditions were not matched. Clinicians answered from bilingual audio-plus-text and could replay the voice recording; the models received clean text prompts. Some clinician answers were machine-translated from Kinyarwanda to English before grading — a step the models, prompted in clean English text, never faced.
Single-turn question and answer only. The authors state this workflow “does not fully reflect the complexity of day-to-day practice” and does not guarantee patient-level benefit.

The chart is the whole claim in one frame: five models, clustered near the top of a five-point writing rubric, graded by six people on questions no patient ever asked in a room. The study reports no single overall clinician score to set beside them — so the widely-repeated "outperforms clinicians" rests on a stated 0.83-of-a-point average gap, not on a bar anyone can draw.

Question 03 — were the conditions matched?

In a head-to-head comparison, the AI beat the clinicians.

What you're meant to conclude

Same test, same rules — and the AI won cleanly.

What's hidden

That the two sides rarely take the same test. In the Rwanda study, the clinicians answered from bilingual audio-plus-text and could replay the voice recording of each case; the models received clean typed prompts. And some clinician answers were machine-translated from Kinyarwanda to English before grading — a step in the human pipeline that the models, prompted in clean English text, never faced. A translation artefact is not a clinical deficit, but it can land in the same column.

Google's AMIE study shows the format version of the same trap. AMIE was rated superior to primary-care physicians on 28 of 32 axes by specialist reviewers (and 24 of 26 by the patient actors) in a text-chat consultation — but the physicians were required to work through an unfamiliar synchronous chat interface the authors themselves call not representative of usual clinical practice. Typing a consultation is the native habitat of a language model and a foreign one for a doctor. When only one side is tested in its home format, in its first language, with its normal inputs, the gap you measure is partly the mismatch.

The question to always ask

Were the conditions matched?

Same inputs, same language, same format, same time, same grading rubric for human and AI? Or did the clinicians work in an unfamiliar interface, a second language, or through a translation step — while the model got the format it was built for? An unmatched comparison manufactures part of the gap.

Question 04 — a clean vignette, or an undifferentiated patient?

The AI solved cases that stumped experienced physicians.

What you're meant to conclude

It will handle the patients who walk into my clinic.

What's hidden

How pre-packaged the "cases" are. MAI-DxO's 304 cases are NEJM clinicopathological-conference records — chosen to be rare and instructive, written up retrospectively, with the correct answer already established and the relevant findings already assembled. The Rwanda inputs are curated written vignettes. A benchmark case arrives as a tidy paragraph: one patient, one question, the salient facts pre-selected, an answer that exists. A clinic delivers the opposite — undifferentiated, comorbid, mislabeled, mid-workup, the signal buried in noise that nobody has cleaned. A model that excels at the packaged version has shown it can do the crossword, not that it can find the clue.

This cuts both ways, and it is worth saying so: rare-case benchmarks are a legitimate stress test of reasoning, and doing well on them is not nothing. But "solved hard textbook cases" and "safe on the next unscheduled patient" are different competences, and the second is the one a deployment depends on.

The question to always ask

A clean vignette, or an undifferentiated patient?

A curated case with a known answer and the findings pre-assembled, or a real queue of mislabeled, comorbid, mid-workup people? Benchmarks are pre-cleaned; clinics are not. Ask whether the test cases resemble the patients the tool would actually see.

Question 05 — beats the doctor, or helps the doctor?

The AI outperformed doctors — so let it see patients.

What you're meant to conclude

We can replace the clinician with the model.

What's hidden

That "beats" and "helps" are different studies with opposite implications. Nearly every "beats doctors" result is the model alone, scored retrospectively on text. That tells you almost nothing about the only configuration anyone would responsibly deploy: a real clinician using the tool on a real patient, free to override it. Those studies exist, and they are more sobering. When a large clinician-randomized study in Nairobi actually put an AI copilot into live primary-care visits, it found improvements in the quality of documented decisions — and its one patient-outcome measure came back non-significant. We took that study apart in our investigation of the OpenAI–Penda Health "16% fewer errors" claim. A model that scores well solo is a reason to run the deployment study, not a substitute for it.

The question to always ask

Does it beat the doctor, or help the doctor?

Model-alone and clinician-plus-model are different studies with opposite deployment implications. A solo benchmark win says nothing about what happens when a real clinician can accept, edit, or ignore the output. Ask which was measured — and whether anyone tried it in a live clinic.

Question 06 — does an off-the-shelf model already score this high?

Our specialized medical-AI system reaches expert-level accuracy.

What you're meant to conclude

They built something new and powerful, made for medicine.

What's hidden

How much of the score the "system" is actually responsible for. Strip Microsoft's orchestration away and the off-the-shelf model underneath — OpenAI's o3 — already scores 78.6% on the same 304-case benchmark; the specialized MAI-DxO scaffolding lifts that to about 80%, a gain of roughly one point. Much of the "medical superintelligence" is the general model. And the teaching cases have circulated online for years, so the reader should also ask the contamination question: how much of the score is reasoning, and how much is having seen the answer?

“Four times better than doctors” — and how much of it is just the base model

The base model (o3, 78.6%) is most of the “system” (80%); the headline 85.5% is a ≈$7,184-per-case ensemble. And the 20% is a different, handicapped test — a 56-case held-out split with the doctors’ usual references taken away.

Source: Nori et al., “Sequential Diagnosis with Language Models,” arXiv:2506.22405 (2025); framing from Microsoft AI, “The Path to Medical Superintelligence.” arxiv.org/abs/2506.22405

Data table

Item	Diagnostic accuracy (%) on the SDBench NEJM cases	Note
Generalist physicians	20	56-case held-out set; no search / AI / online sources
Off-the-shelf o3 (no scaffold)	78.6	304-case benchmark
MAI-DxO + o3 (the “system”)	80	304-case benchmark; scaffold adds ~1.4 pts
Max-accuracy ensemble	85.5	≈ US$7,184 per case

Read before citing

This is NOT a like-for-like bar chart. The ~20% physician figure is measured on the 56-case held-out split; the AI figures are reported over the full 304-case SDBench benchmark. The two sides are not on the same set.
The physicians were barred from search engines, other AI, and online medical sources — references a working clinician normally uses. They also could not examine a patient, order a test, or follow the case over time.
Off-the-shelf o3, with none of Microsoft’s orchestration, already scores 78.6%. The specialized MAI-DxO “system” adds roughly one point to reach ~80%.
The widely quoted “up to 85%” (85.5%) is a maximum-accuracy ensemble configuration costing about US$7,184 per simulated case — not the default system, and more expensive than the physicians it is compared to.
SDBench is built from 304 deliberately rare NEJM clinicopathological-conference cases: retrospective, text-only, with the answer already known. No real patients, no outcomes, no safety measured. “Medical superintelligence” is the company blog’s phrase, not the paper’s.
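
The arithmetic behind these notes is easy to check by hand. The sketch below is a minimal Python illustration that uses only the figures quoted in the table; the function and variable names are illustrative and do not come from the paper. The ratio it computes inherits the mismatch flagged above, because the physician figure comes from the 56-case split and the model figures from the full 304 cases.

```python
# Figures as quoted in the data table above (percent diagnostic accuracy).
PHYSICIANS = 20.0        # 56-case held-out split, usual references barred
O3_PLAIN = 78.6          # off-the-shelf o3, no scaffold
MAI_DXO_DEFAULT = 80.0   # default MAI-DxO + o3 (approximate figure)


def headline_breakdown(baseline, base_model, system):
    """Split a 'system vs. doctors' gap into base-model and scaffold shares."""
    total_gap = system - baseline
    scaffold_gain = system - base_model
    return {
        "ratio": system / baseline,                 # the "four times" line
        "total_gap_pts": total_gap,                 # system minus physicians
        "scaffold_gain_pts": scaffold_gain,         # what the orchestration adds
        "scaffold_share_of_gap": scaffold_gain / total_gap,
    }


result = headline_breakdown(PHYSICIANS, O3_PLAIN, MAI_DXO_DEFAULT)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

On these inputs the scaffold accounts for only a small fraction of the headline gap; the rest is already present in the general-purpose model.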

Then ask who graded it. HealthBench is built by OpenAI, scored by an OpenAI model (GPT-4.1) acting as judge, and topped by an OpenAI model. That automated judge is an imperfect stand-in for a clinician: it matches the physicians' own rubric judgments at a macro F1 of only about 0.71. Physician-written rubrics are a real strength of its design; a conflicted, imperfect grader is a real weakness of its scores. When the same party writes the exam, marks it, and sits it, ask who checked the marking.
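
For readers unfamiliar with the metric, macro F1 is the plain average of the F1 score computed separately for each class (here, "criterion met" and "criterion not met"). The sketch below is a minimal pure-Python version run on made-up labels. The ten toy judgments are an assumption for illustration only and are not HealthBench data.

```python
def f1_for_class(truth, pred, cls):
    """F1 for one class: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(truth, pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(truth, pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(truth, pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no correct hits for this class
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(truth, pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(truth) | set(pred))
    return sum(f1_for_class(truth, pred, c) for c in classes) / len(classes)


# Hypothetical toy data: 1 = "rubric criterion met", 0 = "not met".
physician_labels = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
model_judge_labels = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]

print(f"macro F1: {macro_f1(physician_labels, model_judge_labels):.2f}")
```

A judge that agreed with the physicians on every item would score 1.0; disagreement on either class pulls the average down.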

The question to always ask

Does an off-the-shelf model already score this high — and who graded it?

If a plain model without the special scaffold nearly matches the "specialized" system, the novelty is thinner than the headline. Ask whether the test items could sit in the training data, and whether the grader is independent of the party being graded.

Question 07 — realistic prevalence, and the cost of a false positive?

The AI screen matches expert performance — deploy it and we'll catch the disease.

What you're meant to conclude

One accuracy number, and the screening problem is solved.

What's hidden

The threshold, the population, and the price of being wrong — the things a single number erases. Take AI screening for cervical precancer. An automated-visual-evaluation tool that performed strongly in the single-site data it was built on detected only 60.1% of precancers when tested prospectively across five African countries. It "beats" visual inspection with acetic acid (VIA, 36.6% sensitivity) — but VIA is a weak, increasingly deprecated comparator, and in the very same study HPV testing, the current recommended primary screen, reached 90.4%, about thirty points above the AI. "Beats the old method" is not "matches the standard of care."

“Beats VIA” — but the recommended screen beats the AI

“Beats VIA” clears a weak, deprecated bar. In the same women, HPV testing — the recommended primary screen — caught 90.4% of precancers, about 30 points above the AI.

Source: Five-country prospective study (Malawi, Rwanda, Senegal, Zambia, Zimbabwe), Lancet Global Health — AVE 60.1%, VIA 36.6%, HPV 90.4% sensitivity for CIN2+, 18,086 women. pubmed.ncbi.nlm.nih.gov/41386246/

Data table

Item	Sensitivity for precancer (%) — five-country, 18,086 women	Note
VIA (visual inspection with acetic acid)	36.6	the comparator AVE “beats”; WHO-deprecated
AI automated visual evaluation (AVE)	60.1	external, five-country
HPV testing (recommended primary screen)	90.4	same cohort

Read before citing

Sensitivity only — the ability to catch precancer. It says nothing about specificity, where the trade runs the other way: AVE buys sensitivity over VIA by producing more false positives, in settings where follow-up is scarce.
“Beats VIA” beats a weak, increasingly deprecated bar. In the same cohort, HPV testing — the current recommended primary screen — reached 90.4%, about 30 points above the AI.
The AI’s best combined arm (AVE-assisted VIA) reached 71.8%, still below HPV. No cancer-incidence, mortality, or over-treatment outcome was measured.
Single-site validations of AVE tools, drawn from the same population they were trained on, have reported substantially higher sensitivity; the drop to 60.1% is what happened on new, external populations — the classic single-site-optimism gap.
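
One way to make a sensitivity gap concrete is to convert it into missed cases. The sketch below is a minimal Python illustration using only the three sensitivities quoted above; it counts, per 1,000 women who actually have precancer, how many each screen would miss. The function name is illustrative, and the calculation says nothing about false positives.

```python
# Sensitivity for CIN2+ (percent), as quoted from the five-country study.
SENSITIVITY_PCT = {"VIA": 36.6, "AVE": 60.1, "HPV testing": 90.4}


def missed_per_1000(sensitivity_pct: float) -> int:
    """Women with precancer a screen would miss, per 1,000 who have it."""
    return round(1000 * (1 - sensitivity_pct / 100))


for screen, sensitivity in SENSITIVITY_PCT.items():
    print(f"{screen}: misses about {missed_per_1000(sensitivity)} per 1,000")

# Gap between the recommended screen and the AI tool, in percentage points.
gap_pts = SENSITIVITY_PCT["HPV testing"] - SENSITIVITY_PCT["AVE"]
print(f"HPV testing minus AVE: {gap_pts:.1f} points")
```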

TB chest-X-ray AI shows the threshold trap. Across twelve commercial products validated on a South African cohort — with no radiologist comparison arm at all, so the "beats doctors" framing does not even apply — borrowing an operating threshold quoted in the literature dropped one tool to 64% sensitivity, and one product's specificity fell from about 87% to 37% among previously-treated patients at a fixed cut-off. The authors' warning is the lesson: be wary of extrapolating thresholds quoted in the literature. A screening number that ignores prevalence and the false-positive tab — the confirmatory test, the referral, the anxious patient, in systems where follow-up is scarce — is a specification, not a result.

The question to always ask

Realistic prevalence — and what does a false positive cost?

On a balanced benchmark, high accuracy is cheap. At true clinic prevalence, ask the false-positive rate, whether the operating threshold was tuned on this population, and who pays for each wrong flag — the follow-up test, the referral, the worry, the alert fatigue.
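
The effect of prevalence and specificity on a positive result follows from Bayes' rule, and a short calculation shows why a single accuracy number hides it. The sketch below is a minimal Python illustration. The two specificities echo the 87% and 37% quoted above for one TB product, but the 90% sensitivity and 2% prevalence are assumptions chosen purely for illustration and are not taken from any of the studies.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive flag is a true case (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_positives_per_1000(specificity, prevalence):
    """Wrong flags generated per 1,000 people screened."""
    return 1000 * (1 - prevalence) * (1 - specificity)


ASSUMED_SENSITIVITY = 0.90  # hypothetical, for illustration only
ASSUMED_PREVALENCE = 0.02   # hypothetical: 2% of those screened have disease

for specificity in (0.87, 0.37):
    ppv = positive_predictive_value(
        ASSUMED_SENSITIVITY, specificity, ASSUMED_PREVALENCE
    )
    wrong_flags = false_positives_per_1000(specificity, ASSUMED_PREVALENCE)
    print(
        f"specificity {specificity:.0%}: PPV {ppv:.1%}, "
        f"about {wrong_flags:.0f} false positives per 1,000 screened"
    )
```

Under these assumptions most positive flags are false even at the higher specificity, and each one carries a follow-up cost. Changing the prevalence argument shows how quickly the picture shifts between populations.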

The four moves of claim-laundering

Read enough of these and the same laundering cycle appears every time — a guarded benchmark result shedding its qualifiers as it travels from paper to press release to headline:

Substitute the measured thing for a bigger thing. A rubric score, a match on rare teaching cases, a top-k list against a scripted differential, a detection rate with no doctor in the study — reported as clinical superiority.
Handicap or omit the human comparator. Doctors denied their normal tools, forced into an unfamiliar interface, graded through a translation step — or absent entirely, while "beats doctors" is implied.
Let a conflicted or absent party grade it. The vendor writes the test, supplies the model judge, and fields the winning model; or an automated grader stands in for a clinician who never reviewed a real output.
Report the ceiling, not the default. An expensive maximum-accuracy configuration, or a single-site score that craters on a new population, quoted as the everyday number.

The seven questions are just this cycle, run backwards.

Cut both ways

None of this means the models are weak or that medical AI is hype. The systems here genuinely score well on genuinely hard tasks, the physician-authored parts of these benchmarks are real craftsmanship, and a tool that helps a clinician — deployed as an aid, measured on outcomes — may do real good. The failure mode on the other side is just as expensive: reading every deflated benchmark as proof that "AI can't do medicine," or letting a fair critique become a reason to deny a useful aid to systems that are desperately short of clinicians.

The skill is neither credulity nor reflexive dismissal. It is asking, of any "beats doctors" number someone sets in front of you, which of the seven questions it can actually survive.

Evaluate a claim yourself

Take the seven questions to your own AI

Paste a study, benchmark, model card, or announcement into the prompt below, in place of {{CLAIM}}. The prompt turns the seven questions into instructions that make any assistant show its working: separate what was scored from what was claimed, check whether the human was fairly resourced, and name what's missing. Copy it into ChatGPT, Claude, or Gemini.


Evaluation prompt

You are evaluating a claim that an AI system matches or outperforms human clinicians ({{SUBJECT}}), using the "Ground Truth" method (groundtruth.health). I will give you a study, benchmark, model card, or announcement. {{MODE}}

1. Did the human get to practice medicine — or just answer a paragraph? Real medicine is information-gathering. If the clinician couldn't take a history, examine, or order a test, the written case handed the model the workup for free.
2. What did the score actually grade — answer quality, or a patient outcome? Completeness, empathy, and fluent prose are what models are optimized to produce. A right diagnosis, the right treatment, a patient who got better — that's a different, harder measurement.
3. Were the conditions matched? Same inputs, language, format, time, and grading rubric for human and AI? Or did the clinicians work in an unfamiliar interface, a second language, or through a translation step?
4. A clean vignette, or an undifferentiated patient? A curated case with a known answer, or a real queue of mislabeled, comorbid, mid-workup people? Benchmarks are pre-cleaned; clinics are not.
5. Does it beat the doctor, or help the doctor? Model-alone and clinician-with-the-tool are different studies with opposite deployment implications. Ask which was measured.
6. Does an off-the-shelf model already score this high? If a plain model without the special scaffold nearly matches the ‘specialized’ system, the novelty is thinner than the headline; and if the test cases could sit in its training data, part of the score may be contamination, not capability.
7. Realistic prevalence — and what does a false positive cost? On a balanced benchmark, high accuracy is cheap. At true clinic prevalence, ask the false-positive rate and who pays for it: the confirmatory test, the referral, the anxiety, the alert fatigue.

Prefer primary sources; if a number can only be traced to a press release or blog post, treat it as unverified. Above all, separate what was scored (often the quality of a written answer) from what was claimed (clinical performance on real patients). Do not fill gaps with assumptions — say "not stated" wherever the source is silent.

CLAIM / STUDY TO EVALUATE:
{{CLAIM}}
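
If you would rather fill the placeholders programmatically than by hand, the sketch below shows one way to do it in Python. It is an illustration, not part of the Ground Truth tool: the `fill_template` function, the shortened template string, and the example values (including the text used for `MODE`) are all assumptions made for the example.

```python
import re

# Matches placeholders written as {{NAME}}, e.g. {{SUBJECT}} or {{CLAIM}}.
PLACEHOLDER = re.compile(r"\{\{([A-Z_]+)\}\}")


def fill_template(template: str, values: dict[str, str]) -> str:
    """Replace every {{NAME}} in the template; fail loudly if one is missing."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder {name}")
        return values[name]

    return PLACEHOLDER.sub(substitute, template)


# Shortened stand-in for the full prompt above (seven questions omitted).
template_excerpt = (
    "You are evaluating a claim that an AI system matches or outperforms "
    "human clinicians ({{SUBJECT}}). {{MODE}}\n\n"
    "CLAIM / STUDY TO EVALUATE:\n{{CLAIM}}"
)

# Hypothetical example values.
prompt = fill_template(
    template_excerpt,
    {
        "SUBJECT": "a diagnostic benchmark",
        "MODE": "Answer each of the seven questions in turn.",
        "CLAIM": "Paste the study text or link here.",
    },
)
print(prompt)
```

Raising an error on a missing placeholder is a deliberate choice: a prompt sent with a literal `{{CLAIM}}` still in it would invite the assistant to fill the gap with assumptions.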

Ground Truth doesn't run the model for you — deliberately. The method is ours; the judgement stays yours.

The last word is a patient, not a percentage

Every trap in this guide is a substitution: a rubric score standing in for clinical quality, a handicapped baseline standing in for a working doctor, a curated vignette standing in for a real patient, a solo benchmark standing in for a deployment, a conflicted grader standing in for an independent one, a single-site ceiling standing in for the population you'd actually screen. Each swap is convenient, quotable, and usually true as far as it goes — which is why it survives the trip to the headline. Reading the claim well is mostly the work of noticing the swap and asking for the thing that was traded away.

A benchmark tells you a model wrote a good answer to a tidy question. Only a matched comparison, on real patients, measured on what happened to them, tells you whether it can take care of anyone.

So before an "AI beats doctors" result is funded, deployed, or used to argue that clinicians can be skipped, ask it the questions a headline can't answer: who was the human, what did you score, and did a patient ever get better? If those answers exist, everything else is context. If they don't, everything else is decoration.

Frequently asked questions

Why does an "AI beats doctors" study usually not mean AI beats doctors?

Because the thing that was scored is rarely real clinical care. These studies grade a written answer to a pre-packaged case — a rubric score, a match to a known diagnosis, a top-k list against a scripted differential — not whether a diagnosis was right on an undifferentiated patient or whether anyone got better. In the 2026 Rwanda study, leading models scored about 4.5 out of 5 on an eleven-part answer-quality rubric, graded by six physicians on written vignettes, with no exam, no labs, and no patient outcome. A high answer-quality score is real; it is not a demonstration of clinical superiority.

Was the human baseline given a fair chance?

Often not. In Microsoft's MAI-DxO study, the ~20% physician baseline came from 21 doctors solving the 56 held-out test cases (from a 304-case NEJM benchmark) while barred from search engines, other AI, and online medical sources. In Google's AMIE study, physicians were made to work through an unfamiliar text-chat interface the authors concede is not representative of usual practice. In the Rwanda study, some clinician answers were machine-translated before grading. When the human arm loses its normal tools, format, or language, the reported gap partly measures the handicap.

Is MAI-DxO really four times better than doctors?

The numbers are real; the framing drops their context. On SDBench — 304 rare NEJM cases — the default MAI-DxO system (with OpenAI's o3) reached about 80%, while the 21 physicians scored about 20% on the 56-case held-out split, barred from their usual references — the "four times" line. But off-the-shelf o3 already scores 78.6% on the same benchmark, so the specialized scaffolding adds roughly one point; the quoted "up to 85%" is a maximum-accuracy ensemble costing about US$7,184 per case; and "medical superintelligence" is the company blog's phrase, not the paper's.

What is "claim-laundering"?

It is how a narrow benchmark result loses its qualifiers as it travels from paper to press release to headline. A rubric score on written vignettes, a diagnosis match on rare teaching cases, or a detection rate with no doctor in the study becomes "AI outperforms doctors" three steps later. It works by substituting the measured thing for a bigger one, handicapping or omitting the human comparator, letting a conflicted party grade the exam, and reporting the optimistic ceiling instead of the number that survives a new population.

Does a high diagnostic-AI accuracy or sensitivity transfer to a clinic?

Not automatically — it depends on the operating threshold, the disease prevalence, and the population it was tuned on. An AI cervical-screening tool that did well on its own single-site data detected only 60.1% of precancers across five African countries, while HPV testing caught 90.4% in the same study. For TB chest-X-ray AI on a South African cohort, borrowing a literature threshold dropped one tool to 64% sensitivity, and one product's specificity fell from about 87% to 37% in previously-treated patients. Ask for the threshold, the prevalence, and the false-positive cost.
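To see why the false-positive cost matters, here is a rough sketch of what a specificity drop does to a screening programme. Only the two specificity values (87% and 37%) come from the TB study quoted above; the 5% prevalence, the 90% sensitivity, and the function name `screen_outcomes` are hypothetical, chosen for illustration and not taken from any of the cited papers.

```python
def screen_outcomes(sensitivity, specificity, prevalence, n=1000):
    """Expected counts when n people are screened with a binary test."""
    diseased = n * prevalence
    healthy = n - diseased
    true_pos = sensitivity * diseased          # sick people correctly flagged
    false_pos = (1 - specificity) * healthy    # healthy people wrongly flagged
    ppv = true_pos / (true_pos + false_pos)    # share of flags that are real
    return true_pos, false_pos, ppv


# Hypothetical setting: 5% prevalence, sensitivity held at 90% in both scenarios.
# The two specificities are the figures quoted in the text.
for label, spec in [("specificity 87%", 0.87), ("specificity 37%", 0.37)]:
    tp, fp, ppv = screen_outcomes(sensitivity=0.90, specificity=spec, prevalence=0.05)
    print(f"{label}: {tp:.0f} true positives, {fp:.0f} false positives "
          f"per 1,000 screened, PPV {ppv:.0%}")
```

Under these assumed inputs, the number of true cases found stays the same while false alarms rise from roughly 120 to roughly 600 per 1,000 people screened, and the share of positive results that are real falls from about 27% to about 7%. A real deployment would also see sensitivity and prevalence shift, so this is an illustration of the mechanism, not an estimate for any product.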

Does this mean AI is useless in medicine?

No — reading it that way is the opposite error. The models genuinely score well on hard tasks, and some may help clinicians when deployed as an aid and measured on real outcomes. The point of the seven questions is that a benchmark score is not a clinical outcome. "Beats doctors on a rubric" and "not proven at the bedside" are both claims to check, not verdicts. Demand the primary source, the matched comparison, and the patient outcome — and treat hype and dismissal with equal suspicion.

Sources

Every figure above, traceable to a public primary source

Rwanda answer-quality study — model rubric scores (Gemini-2 4.56 / GPT-4o 4.53 on the English subset; ~4.49 / 4.48 across all 11 metrics), the 0.83-point average gap over local GPs, the 11-metric rubric, six raters, 506 question–response pairs, bilingual audio inputs and pre-grading machine translation, and the authors' limitations. "Large language models for frontline healthcare support in low-resource settings," Nature Health (2026); funded by the Gates Foundation (INV-068056): pmc.ncbi.nlm.nih.gov
Microsoft MAI-DxO — ~20% for 21 physicians (barred from search, other AI, and online medical sources) on the 56-case held-out test set; MAI-DxO ~80% (with o3) and off-the-shelf o3 78.6% on the full 304-case SDBench benchmark; the 85.5% maximum-accuracy ensemble (~US$7,184/case). Nori et al., "Sequential Diagnosis with Language Models," arXiv:2506.22405: arxiv.org. "Medical superintelligence" framing: Microsoft AI, The Path to Medical Superintelligence: microsoft.ai
Google AMIE — superior on 28/32 axes (specialists) and 24/26 (patient actors), text-chat OSCE-style study with trained actors, and the "not representative of usual clinical practice" limitation. Tu et al., "Towards Conversational Diagnostic AI" (Nature, 2025; preprint): arxiv.org
OpenAI HealthBench — best model (o3) 60%; a model (GPT-4.1) grades responses against 48,562 physician-written rubric criteria and agrees with physicians at macro F1 ≈ 0.71. arXiv:2505.08775: arxiv.org; OpenAI announcement: openai.com
TB computer-aided detection — external validation of 12 commercial products on a South African cohort with no radiologist arm; 64% sensitivity when a literature threshold is borrowed; qXR specificity ~87%→37% in previously-treated patients at a fixed threshold; and the authors' caution against extrapolating thresholds: pmc.ncbi.nlm.nih.gov
Cervical AI (automated visual evaluation) — five-country prospective study: AVE 60.1% sensitivity vs VIA 36.6% and HPV 90.4%, 18,086 women analysed (Lancet Global Health; structured abstract): pubmed.ncbi.nlm.nih.gov
Deployment contrast — our own investigation of the OpenAI–Penda Health "16% fewer errors" claim (documentation-error reductions, non-significant patient outcome): groundtruth.health

Figures were checked against each primary source in July 2026, several via open-access preprints or structured abstracts where the journal version was paywalled. Numbers are quoted as reported by the original authors. Where a claim maps to a real study, we name the study only to teach, not to accuse. Corrections are welcome.

About Ground Truth

Ground Truth is an independent publication that scrutinizes AI claims in global health — the benchmarks, accuracy scores, and capability announcements that increasingly decide what gets funded, deployed, and believed. Our aim is not to praise or attack particular products, but to help readers judge the evidence for themselves.

Our standard: every factual claim is traced to a public primary source — the peer-reviewed paper, the preprint, the model card, the organization's own page. We separate what is verified from what is plausible but unconfirmed. We name organizations only when the public record supports it, and only to teach, not to accuse. We correct our own errors in public. We take no funding from, and hold no affiliation with, the companies, funders, or vendors whose work we examine.

Spotted something we got wrong, or a claim worth taking apart? Corrections and tips are welcome at corrections@groundtruth.health.