The scoreboard that can’t be drawn
Seven headline studies say AI matches or beats clinicians in global health. Line them up and there is no line to draw: they report on six different scales, only one offers a fair AI-versus-human comparison, and the only one that measured whether patients got better found no effect.
Every few weeks a study says an AI has matched or beaten doctors, and the obvious next move is a bar chart: AI against human, tallest bar wins. We tried to build that chart for the seven most-cited “AI-versus-clinician” results in global health. It can’t be built — and why it can’t is the finding. Each cell below is traced to the study’s primary source; nothing is averaged or ranked across rows.
| Study & claim | What the score actually graded | A comparable human number? | The gap, on the study’s own scale | A patient outcome? |
|---|---|---|---|---|
| Rwanda: LLMs “outperform” GPs (Nature Health 2026 · PMC12880909) | Answer quality on an 11-metric, 1–5 rubric (506 written Q&A pairs, 6 physician raters). | No single clinician score exists. The paper reports per-metric gaps, not one human bar to place beside the models. | Top model averaged +0.83 of a point over local GPs (range 0.38–1.10 across the 11 metrics); models ≈ 4.5/5. | No. |
| Microsoft MAI-DxO: “4× better than doctors” (SDBench · arXiv:2506.22405) | Diagnostic accuracy (%) on 304 deliberately rare NEJM text cases, answer already known. | Not on the same test. Physicians scored ≈ 20% on a 56-case held-out split, barred from references; the AI figures are on the full 304-case set. | o3 alone 78.6%, MAI-DxO+o3 80%, max ensemble 85.5% (≈ US$7,184/case) vs physicians ≈ 20%. | No. |
| Google AMIE: “superior” to PCPs (Nature 2025 · s41586-025-08866-7) | Consultation-quality rubric in a text-chat OSCE with simulated patient-actors. | Yes, but handicapped. The PCPs were made to work through an unfamiliar text-chat interface that the authors call not representative of usual practice. | AMIE rated higher on 30 of 32 axes by specialists and 25 of 26 by patient-actors (published Nature figures). | No. Simulated patients. |
| OpenAI HealthBench (arXiv:2505.08775) | A weighted, normalized score over 48,562 physician-written rubric criteria (some worth negative points), graded by an AI (GPT-4.1). | A physician-written response baseline exists, but it is a writing task, not clinical practice, and the grader is conflicted: OpenAI builds the benchmark, an OpenAI model marks it (judge–physician F1 ≈ 0.71), and an OpenAI model tops it. | Best model (o3) scored 0.60 on the weighted rubric; the human reference is physicians’ own written responses, not a clinician taking a diagnostic test. | No. |
| OpenAI–Penda Health: “16% fewer errors” (arXiv:2507.16947) | Physician-rated documentation errors on de-identified visit notes. | N/A: this is clinician-with-AI vs clinician-without (a help-the-doctor design, not beat-the-doctor). | 16% fewer diagnostic and 13% fewer treatment errors in the documented decisions. | Attempted, and null: 8-day “feeling better” 3.8% vs 4.3% (n.s.; ~60% unreachable; the authors call it exploratory). |
| Cervical AI (AVE): “beats VIA” (Lancet Global Health · PubMed 41386246) | Sensitivity (and specificity) for precancer (CIN2+) against biopsy, 18,086 women, five countries. | Yes, matched: VIA (a health-worker visual screen) at 36.6%. But the recommended screen, HPV, beats the AI. | AVE 60.1% sensitivity vs VIA 36.6% (+23.5 pts), and −30.3 pts vs HPV’s 90.4%. On specificity the trade reverses: AVE 81.9% vs VIA 94.2%. | No. No cancer incidence, mortality or over-treatment measured. |
| TB computer-aided detection, 12 products (external validation · PMC11339183) | Chest-X-ray TB detection, sensitivity/specificity on a South-African cohort. | None: no radiologist arm. The tools are validated against a microbiological reference, not against the humans they would replace. | ≈ 64% sensitivity at a borrowed literature threshold; one product’s specificity fell ≈ 87% → 37% in previously-treated patients. | No. |
A red cell flags what is missing or mismatched — no comparable human number, a handicapped human arm, a conflicted grader, or no patient outcome. Every value is traced to the primary source in the last section.
1 · Six scales, no ladder. The seven report on six non-comparable scales — rubric points, diagnostic-accuracy %, rubric axes won, an AI-judged rubric score, screening sensitivity, error reduction. There is no exchange rate between them, so no single ranking is honest.
2 · One fair human comparison. A matched AI-versus-human test on a shared scale exists for exactly one of the seven. The rest miss it a different way each: no single human score, a human on a different test set, a human boxed into a chat window, a baseline marked by a conflicted judge, no human arm at all, and a design that never pits AI against an unaided human.
3 · One patient outcome — and it was null. Six stop at a proxy; the one study that measured whether patients improved found no effect.
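The three counts above can be reproduced from the table with a few lines of code. What follows is a minimal sketch, not the released dataset: the `Row` fields and scale labels are our own hypothetical names, and filing the TB row under the same sensitivity scale as the cervical row is our reading of how seven studies come to share six scales.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Row:
    study: str
    scale: str                   # the kind of number the study reports
    fair_human_comparison: bool  # matched human arm, same cases, same scale
    patient_outcome: bool        # did it measure whether patients improved?


# One entry per table row. Labels are illustrative, not the dataset's schema.
ROWS = [
    Row("Rwanda LLMs vs GPs", "rubric points", False, False),
    Row("MAI-DxO / SDBench", "diagnostic accuracy %", False, False),
    Row("AMIE", "rubric axes won", False, False),
    Row("HealthBench", "AI-judged rubric score", False, False),
    Row("Penda Health", "error reduction", False, True),
    Row("Cervical AVE", "screening sensitivity", True, False),
    # Assumption: TB CAD shares the sensitivity/specificity scale.
    Row("TB CAD", "screening sensitivity", False, False),
]


def tally(rows):
    """Count studies, distinct scales, fair comparisons and patient outcomes."""
    return {
        "studies": len(rows),
        "scales": len({r.scale for r in rows}),
        "fair_human_comparisons": sum(r.fair_human_comparison for r in rows),
        "patient_outcomes": sum(r.patient_outcome for r in rows),
    }


print(tally(ROWS))
```

The tally deliberately has no "winner" field: with no exchange rate between scales, there is nothing to sort the rows by.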
Each of those is a question from our field guide on reading “AI beats doctors” claims, turned back on the studies that prompted it — the table is that guide applied at scale. Two rows are worth drawing out.
The one fair comparison goes to the human standard
Only the cervical-cancer screen pits AI against a real, matched human test on the same 18,086 women — and the AI (automated visual evaluation, 60.1% sensitivity) does beat the health-worker screen it was measured against (VIA, 36.6%). But VIA is a weak, deprecated bar: in the same cohort the recommended screen, an HPV test, reached 90.4%. The sensitivity gain is also bought with specificity — AVE 81.9% against VIA’s 94.2% — meaning more false positives, where a positive means a scarce, distant follow-up. “Beats a human” and “is the best option for the patient” are not the same sentence.
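To see what that trade means in head counts, the rates can be applied to the cohort. This is a back-of-envelope sketch only: it assumes the pooled sensitivity and specificity figures apply uniformly to the 18,086 women and the 526 CIN2+ cases listed in the Sources, which the study itself may break down differently. The function name `expected_counts` is ours.

```python
COHORT = 18_086              # women with confirmed status (see Sources)
DISEASED = 526               # CIN2+ cases among them
HEALTHY = COHORT - DISEASED  # women without CIN2+

# (sensitivity, specificity) as reported for each screen
TESTS = {
    "VIA": (0.366, 0.942),
    "AVE": (0.601, 0.819),
    "HPV": (0.904, 0.801),
}


def expected_counts(sensitivity, specificity, diseased=DISEASED, healthy=HEALTHY):
    """Approximate true positives and false positives for one screen."""
    true_pos = round(sensitivity * diseased)
    false_pos = round((1 - specificity) * healthy)
    return true_pos, false_pos


for name, (sens, spec) in TESTS.items():
    tp, fp = expected_counts(sens, spec)
    print(f"{name}: ~{tp} precancers caught, ~{fp} false positives")
```

Under those assumptions, AVE catches roughly 120 more precancers than VIA and sends roughly 2,000 more women without precancer to follow-up. These are illustrative approximations derived from the published rates, not counts reported by the study.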
Nobody has shown a patient is better off
Only the Penda Health study reached past a proxy to ask whether patients improved, using an eight-day “feeling better?” call, and the result was non-significant (3.8% vs 4.3%, with about 60% of patients unreachable; the authors call the measure exploratory). Across all seven, that is the only patient-outcome signal, and it is null.
How this was built
This dataset is an original synthesis: no prior source places these seven studies on one honest table and flags which “wins” have no comparable human number. Because the risk in an exercise like this is not a mistyped figure but a coding error — a study filed in the wrong column, a missing human arm quietly treated as a low one — we did not build it once and check it once. Two different frontier models (Anthropic’s Claude Opus 4.8 and OpenAI’s GPT-5.5, via the Codex CLI) coded all seven studies on all five dimensions independently, from the primary sources, without seeing each other’s work. We then compared the two tables cell by cell and resolved every disagreement back to the source — never by splitting the difference.
Two independent codings of all seven studies · the second model raised five corrections · every one held · each reconciled to its primary source.
The disagreements are recorded here rather than hidden, because at this publication the reconciliation is part of the evidence. The second model raised five corrections and was right on every one: (1) AMIE’s axis count, where the preprint reports 28 of 32 but the peer-reviewed Nature paper we now cite reports 30 of 32 (and 25 of 26 for the patient-actors); (2) how HealthBench actually scores, which is a weighted, AI-marked rubric, not a raw share of criteria met; (3) that HealthBench does keep a physician-written human reference, which our first pass had wrongly called absent; (4) the cervical study’s specificity figures, which are in fact reported (AVE 81.9%, VIA 94.2%); and (5) the count of distinct scales, which we sharpened from four to six. Corrections to any cell will continue to be made in public.
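The cell-by-cell comparison described in this section can be sketched in a few lines. This is an illustration of the idea under stated assumptions, not the tooling we ran: the `(study, dimension)` keys, the function names and the three sample cells are hypothetical, with values taken from the corrections listed above.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]   # (study, dimension), hypothetical key format
Coding = Dict[Cell, str]


def disagreements(first: Coding, second: Coding) -> List[Tuple[Cell, str, str]]:
    """List every cell where the two codings differ, including cells
    that only one coder filled in."""
    out = []
    for cell in sorted(set(first) | set(second)):
        a = first.get(cell, "<missing>")
        b = second.get(cell, "<missing>")
        if a != b:
            out.append((cell, a, b))
    return out


def reconcile(first: Coding, second: Coding, source_checked: Coding) -> Coding:
    """Keep agreed cells; settle disputed ones only from a recorded
    primary-source check. A dispute with no recorded check raises
    KeyError instead of guessing or splitting the difference."""
    resolved = {}
    for cell in set(first) | set(second):
        a, b = first.get(cell), second.get(cell)
        resolved[cell] = a if a == b else source_checked[cell]
    return resolved


first = {
    ("AMIE", "gap"): "28 of 32 axes",
    ("HealthBench", "human_number"): "absent",
    ("Penda", "patient_outcome"): "attempted, null",
}
second = {
    ("AMIE", "gap"): "30 of 32 axes",
    ("HealthBench", "human_number"): "physician-written baseline",
    ("Penda", "patient_outcome"): "attempted, null",
}
source_checked = {
    ("AMIE", "gap"): "30 of 32 axes",
    ("HealthBench", "human_number"): "physician-written baseline",
}

for cell, a, b in disagreements(first, second):
    print(cell, "first:", a, "| second:", b)
final = reconcile(first, second, source_checked)
```

Failing loudly on an unchecked dispute is the point of the design: a disputed cell is never allowed to enter the final table by default.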
What to take from the table
None of this says the systems are bad or the studies dishonest; most are careful and state their own limits. It says that “AI beats clinicians,” as a genre of headline, covers seven studies measuring seven different things on six different scales, and has yet to show a patient is better off. Before repeating one of these numbers, find its row. The portable version of the questions is the field guide.
Cite this analysis
Ground Truth (2026). The scoreboard that can’t be drawn: what seven “AI-beats-clinician” health studies actually measured. Ground Truth. https://groundtruth.health/ai-vs-clinician-scoreboard/
The underlying scoreboard is released as an open dataset under CC BY 4.0 — reuse it with attribution. The primary source for every cell is listed below.
Sources
- Rwanda / LLMs vs GPs — “Large language models for frontline healthcare support in low-resource settings,” Nature Health (2026); six physician raters, 11-metric rubric, 506 question–response pairs, four districts; top model +0.83 pt over local GPs (range 0.38–1.10); funded by the Gates Foundation (INV-068056). pmc.ncbi.nlm.nih.gov/articles/PMC12880909
- Microsoft MAI-DxO / SDBench — Nori et al., “Sequential Diagnosis with Language Models,” arXiv:2506.22405 (2025); physicians ≈ 20% on the 56-case held-out split (references barred), o3 78.6% / MAI-DxO+o3 80% / max ensemble 85.5% (≈ US$7,184/case) on the full 304-case benchmark. arxiv.org/abs/2506.22405
- Google AMIE — Tu et al., “Towards Conversational Diagnostic AI,” Nature (2025); randomised text-chat OSCE (159 scenarios, 20 PCPs) with trained patient-actors; AMIE rated superior on 30/32 specialist axes and 25/26 patient-actor axes; interface “not representative of usual clinical practice.” Published figures: nature.com/articles/s41586-025-08866-7 (preprint arXiv:2401.05654 reports the earlier 28/32, 24/26).
- OpenAI HealthBench — arXiv:2505.08775 (2025); best model (o3) scored 0.60 on a weighted, normalized rubric over 48,562 physician-written criteria (some worth negative points), marked by a GPT-4.1 judge that agrees with physicians at macro F1 ≈ 0.71; a physician-written response baseline exists. arxiv.org/abs/2505.08775
- OpenAI–Penda Health — arXiv:2507.16947 (2025); clinician-randomized quality-improvement study; 16% / 13% relative reductions in physician-rated diagnostic / treatment documentation errors; 8-day patient-reported outcome non-significant (3.8% vs 4.3%, ~60% unreachable, “exploratory”). See our full investigation. arxiv.org/abs/2507.16947
- Cervical AI (AVE) — five-country prospective study (Malawi, Rwanda, Senegal, Zambia, Zimbabwe), Lancet Global Health; sensitivity for CIN2+ AVE 60.1% / VIA 36.6% / HPV 90.4%, specificity AVE 81.9% / VIA 94.2% / HPV 80.1%, among 18,086 women with confirmed status (526 CIN2+). pubmed.ncbi.nlm.nih.gov/41386246
- TB computer-aided detection — external validation of 12 commercial CAD products on a South-African cohort with no radiologist arm; ≈ 64% sensitivity at a borrowed threshold; qXR specificity ≈ 87% → 37% in previously-treated patients at a fixed threshold. pmc.ncbi.nlm.nih.gov/articles/PMC11339183