In July 2025, OpenAI published a real-world clinical study run with Penda Health, a network of primary-care clinics in Nairobi. Clinicians given an AI “copilot” called AI Consult, the companies reported, had “a 16% relative reduction in diagnostic errors and a 13% reduction in treatment errors compared to those without.” The result is genuine, it is statistically significant, and it was graded by independent doctors. It is also narrower than the words wrapped around it — and the gap is worth seeing clearly, because this is the study that will be cited when ministries of health decide whether to put a large language model between a clinician and a patient.

The claim, and where it traveled

OpenAI’s own write-up is careful. Its headline is “Pioneering an AI clinical copilot,” its subhead says clinicians “made fewer errors,” and — to its credit — it reports that the study found no significant difference in how patients actually felt afterward, adding that “follow-up studies are needed.” The amplification is where the claim grew. TIME ran it as “AI Helps Prevent Medical Errors in Real-World Clinics,” writing that “AI can reduce medical errors by as much as 16%.” A trade outlet went further still, headlining “Safer Diagnoses With AI Tool” and calling it “early evidence that LLMs can improve patient safety.” Somewhere in that chain, a reduction in errors physicians spotted in the notes became a reduction in harm to patients. The study measured the first. It did not measure the second.

What the study actually measured

This was not a conventional randomized controlled trial. It was a quality-improvement study (the authors’ own term, reported under the SQUIRE 2.0 guideline), run at 15 Penda clinics from January to April 2025. Access to AI Consult was randomized at the clinician level — 57 clinicians got the tool, 49 did not — across nearly 40,000 visits (20,589 in the AI group, 18,990 without). Clinicians knew whether they had it.

Crucially, an “error” was never observed at the bedside. To score errors, 108 independent physicians, blinded to which group a visit came from, read the de-identified documentation from a random sample of 5,666 visits — the history, vitals, notes, investigations, diagnosis and plan — and rated four categories on a five-point scale. A rating of 1 or 2 was defined as a “clinically meaningful error.” So the outcome is a physician’s judgment of a written record, not a measurement of what happened to the person in the room. On that measure, four categories improved, and all four reached significance:

What actually reached significance: physician-rated reductions in documented errors
[Figure: bar chart of the relative reduction in the documented-error rate (%), with 95% confidence intervals and a no-effect line at 0, for four domains: History & examination, Diagnosis (the '16%'), Treatment (the '13%'), and Investigations.]
Source: Penda Health & OpenAI, arXiv:2507.16947 (2025) — independent physician review of de-identified visit documentation
Data table

| Domain | Relative reduction (%) | 95% CI | p | Sig. | What was scored |
|---|---|---|---|---|---|
| History & examination | 31.8 | 21.9 to 40.5 | p<0.001 | yes | physician review of notes |
| Diagnosis (the '16%') | 16 | 6.9 to 24.2 | p=0.001 | yes | physician review of notes |
| Treatment (the '13%') | 12.7 | 6.8 to 18.3 | p=0.001 | yes | physician review of notes |
| Investigations | 10.3 | 1 to 18.8 | p=0.034 | yes | CI nearly touches zero |

Read before citing
  • These are reductions in errors that independent physicians identified while reading de-identified visit NOTES. They measure the quality of documented decisions, not whether patients were harmed less or recovered.
  • The one measure that reached toward patient benefit — an 8-day 'are you feeling better?' phone call — was NOT statistically significant: 3.8% of AI-group patients vs 4.3% of non-AI-group patients said they were not better, with about 60% of patients unreachable. The authors call this analysis 'exploratory rather than confirmatory' and say the study was 'not powered' for it.
  • This was a clinician-randomized quality-improvement study, not a blinded randomized controlled trial; clinicians knew whether they had the tool.
  • Funded by OpenAI, which 'was involved in the study analysis and reporting'; the tool runs on OpenAI's GPT-4o. A secondary set of AI graders — OpenAI's own o3 and GPT-4.1 — scored even larger reductions than the human physicians did.
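
The arithmetic behind a figure like the 16% can be sketched in a few lines. The example below assumes the simplest possible estimator: count a visit as an error when its rating is 1 or 2, take the error rate in each arm, and report one minus their ratio. The ratings are invented for illustration, the function names are hypothetical, and the paper's own estimator may well differ (for instance by accounting for the clustering of visits within clinicians).

```python
from typing import Sequence

# Per the study's definition, a rating of 1 or 2 on the five-point
# scale counts as a "clinically meaningful error".
ERROR_THRESHOLD = 2


def error_rate(ratings: Sequence[int]) -> float:
    """Share of rated visits whose rating counts as an error."""
    if not ratings:
        raise ValueError("need at least one rating")
    return sum(r <= ERROR_THRESHOLD for r in ratings) / len(ratings)


def relative_reduction(control_rate: float, treated_rate: float) -> float:
    """Relative reduction of the treated rate against the control rate."""
    if control_rate <= 0:
        raise ValueError("control rate must be positive")
    return 1 - treated_rate / control_rate


# Invented ratings for illustration only; these are NOT study data.
no_ai_ratings = [1, 2, 2, 3, 4, 4, 5, 5, 5, 5]  # 3 of 10 rated 1 or 2
ai_ratings = [2, 2, 3, 3, 4, 4, 5, 5, 5, 5]     # 2 of 10 rated 1 or 2

reduction = relative_reduction(error_rate(no_ai_ratings), error_rate(ai_ratings))
print(f"relative reduction: {reduction:.1%}")
```

A relative reduction computed this way says nothing by itself about the absolute error rate in either arm, and every input is a rating of a written record, which is the limitation the notes above describe.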

The four numbers are real — take them seriously

It would be easy, and wrong, to wave this away. The reductions are prespecified, independently rated by more than a hundred physicians, and statistically significant at p = 0.001 for both headline categories, with confidence intervals that exclude zero. History-taking improved most (a 31.8% relative reduction); investigations least (10.3%, with a confidence interval that nearly touches the no-effect line). A tool that reliably nudges a busy clinician toward a more complete history and a better-documented plan is a real thing, and the study is better evidence for it than most vendor announcements ever offer. Ground Truth’s quarrel is not with the 16%. It is with what the 16% measures.
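
For readers who want to check the "excludes zero" claim themselves, here is a small sketch that encodes the four published intervals and tests each against the no-effect line. The figures are the ones reported in this piece; the `Result` class and its method are hypothetical names introduced only for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    domain: str
    reduction: float  # relative reduction, in percent
    ci_low: float     # lower bound of the 95% CI, in percent
    ci_high: float    # upper bound of the 95% CI, in percent

    def excludes_zero(self) -> bool:
        """True when the whole interval sits on one side of no effect."""
        return self.ci_low > 0 or self.ci_high < 0


# Figures as reported above (physician review of visit notes).
RESULTS = [
    Result("History & examination", 31.8, 21.9, 40.5),
    Result("Diagnosis", 16.0, 6.9, 24.2),
    Result("Treatment", 12.7, 6.8, 18.3),
    Result("Investigations", 10.3, 1.0, 18.8),
]

# List the domains from the narrowest margin above zero to the widest.
for r in sorted(RESULTS, key=lambda r: r.ci_low):
    print(f"{r.domain}: lower bound {r.ci_low}%, excludes zero: {r.excludes_zero()}")
```

Sorting by the lower bound puts Investigations first, which is the sense in which its interval "nearly touches" the no-effect line while still clearing it.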

“Fewer errors in the notes” and “fewer harmed patients” are different measurements. This study made the first with rigor. It did not make the second at all.

The one patient measure came back quiet

The study did reach, once, toward the outcome that matters: eight days after a visit, patients were phoned and asked whether they were feeling better. Here the gap shrank to something the study could not distinguish from chance: 3.8% of AI-group patients said they were not better, versus 4.3% without the tool, a difference the authors call “not statistically significant.” And they are unusually frank about why it can’t carry weight: about 60% of patients could not be reached, so the figure rests on a partial, complete-case sample. In their words, they “treat [it] as exploratory rather than confirmatory,” and note the study “was not powered to detect an effect of this magnitude.” This is not a finding that the tool failed patients. It is the absence of a finding either way, because the study was not built to see one.
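
To give a feel for what "not powered" means here, the sketch below applies the textbook normal-approximation formula for comparing two independent proportions to the 4.3% and 3.8% figures. The two-sided alpha of 0.05 and the 80% power target are assumptions of this illustration, not parameters taken from the paper, and the formula ignores the clustering of patients within clinicians, which would push the requirement higher. It is not the authors' calculation.

```python
import math
from statistics import NormalDist


def relative_difference(control: float, treated: float) -> float:
    """Relative difference of the treated rate against the control rate."""
    return (control - treated) / control


def n_per_arm(p_control: float, p_treated: float,
              alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate patients needed per arm to detect p_control vs p_treated.

    Textbook two-proportion formula (normal approximation, equal arms,
    independent observations). alpha and power are assumed defaults.
    """
    if p_control == p_treated:
        raise ValueError("the two proportions must differ")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return math.ceil((z_alpha + z_beta) ** 2 * variance
                     / (p_control - p_treated) ** 2)


# "Not feeling better" rates as reported above.
NOT_BETTER_NO_AI = 0.043
NOT_BETTER_AI = 0.038

rel = relative_difference(NOT_BETTER_NO_AI, NOT_BETTER_AI)
needed = n_per_arm(NOT_BETTER_NO_AI, NOT_BETTER_AI)
print(f"relative difference: {rel:.1%}")
print(f"approx. reached patients needed per arm: {needed:,}")
```

Under those simplified assumptions the requirement comes out at roughly 24,000 reached patients per arm. That is a rough order of magnitude rather than a design target, but it shows why a half-point gap in a rare outcome is hard to confirm or rule out.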

Who ran the study, and why it’s worth noting

None of this implies bad faith; the paper’s limitations section is candid, and its authors flag most of what this piece does. But a fair reader should still weigh who held the pen. The study was funded by OpenAI, which the paper says “was involved in the study analysis and reporting.” It is heavily OpenAI-authored — both senior co-authors are OpenAI employees — and AI Consult runs by default on OpenAI’s GPT-4o. And in a secondary analysis, the visit notes were re-graded not by physicians but by OpenAI’s own o3 and GPT-4.1 models — which found larger reductions than the human doctors did. The authors themselves raise the obvious worry: an AI-assisted tool, graded by AI models built by the tool’s funder, is a measurement that can drift toward the thing it is optimized to produce. The human-physician ratings, not the model ratings, are the result to trust.

What would move this verdict

The honest bar is the one the authors name themselves: “large studies powered for patient outcomes.” A trial that measured symptom resolution, correct referrals, or averted harm — with enough patients reached to detect a real effect — could turn “better notes” into “better care,” and this verdict with it. Until then, the defensible version of the claim is the modest one: with an AI copilot, clinicians at these clinics produced records that independent physicians judged to contain fewer errors. Whether their patients were safer is, on this evidence, not yet known. The distinction is not pedantry. It is the difference between a monitoring metric and an outcome — and the whole reason to read a health-impact claim slowly.

The bottom line

Believe the 16%. It is real, significant, and independently rated. Just hold it to what it measured: the quality of a documented decision, not the safety of a patient. The study’s own authors drew that line; the headlines erased it. When the powered patient-outcome trial arrives, we’ll read it here — and update this verdict in public if it earns a change.

Primary source: Penda Health & OpenAI, “AI-based Clinical Decision Support for Primary Care: A Real-World Study” (arXiv:2507.16947, 2025).
Claim as circulated: OpenAI · TIME · Digital Health News.
Ground Truth is independent of, and unaffiliated with, OpenAI and Penda Health. Corrections are made in public — see our policy.