The method — how to read an AI claim in global health

Reading a health-chatbot or digital-health impact claim

The eight questions that separate a real health effect from a flattering number. Full worked examples (built around the PROMPTS/Jacaranda randomized trial), sources, effect-size charts, and a copy-paste prompt live in the article: How to read a health-chatbot impact claim →

Is this a monitoring number or an impact number? Reach and engagement are monitoring; impact needs an outcome measured against a counterfactual.
Significant — but how big, in units I can picture? Ask for the effect size and its confidence interval.
Is this a measured health outcome, or a proxy for one? Separate self-report from a measured outcome; ask whether the study was powered to detect a change in it.
Has this been tested against a control — and what happened when it was? Ask for the randomized evidence.
What does the retention curve look like — not the install count? Reach is a moment; use is a curve.
Is this a randomized comparison, or a correlation among the already-engaged? A correlation among people who chose to engage can reflect who they are, not what the tool did.
Validated on what — and what happens when it's wrong? A benchmark is not a bedside; ask where the human fallback sits.
What's the cost per outcome — not per user? With the effect size in the denominator and the full economic cost in the numerator.
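
The second and eighth questions come down to arithmetic, sketched here as a minimal example. Every number in it is hypothetical and none comes from the PROMPTS/Jacaranda trial; the function names are illustrative, and the interval is a simple Wald approximation that assumes two independent randomized arms with a yes/no outcome.

```python
import math

def risk_difference_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Absolute risk difference (treatment minus control) with a Wald interval."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    diff = p_t - p_c
    # Standard error of a difference between two independent proportions.
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff, diff - z * se, diff + z * se

def cost_per_additional_outcome(total_cost, n_treated, diff):
    """Full programme cost divided by the extra outcomes attributable to it."""
    extra_outcomes = n_treated * diff
    if extra_outcomes <= 0:
        return math.inf  # no measurable effect: cost per outcome is unbounded
    return total_cost / extra_outcomes

# Hypothetical trial: 600 of 1,000 users reach the outcome vs 550 of 1,000 controls.
diff, lo, hi = risk_difference_ci(600, 1000, 550, 1000)

# Hypothetical full economic cost of serving the 1,000 users.
total_cost = 50_000
cost_per_user = total_cost / 1000
cost_per_outcome = cost_per_additional_outcome(total_cost, 1000, diff)

print(f"effect: {diff:.3f} (95% CI {lo:.3f} to {hi:.3f})")
print(f"cost per user: {cost_per_user:.0f}, cost per additional outcome: {cost_per_outcome:.0f}")
```

With these made-up inputs the effect is about 5 percentage points, with an interval running from roughly 0.7 to 9.3 points, and the cost per additional outcome is about twenty times the cost per user. A claim that reports only the cost per user leaves that multiplier out.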

Run these on a claim →

Reading a language or benchmark claim

The nine questions that separate a real capability from a headline number. Full worked examples, a per-language error explorer, primary sources, and a copy-paste evaluation prompt live in the article: How to read an African-language AI benchmark without getting fooled →

On what was it tested? Read or spontaneous speech, clean or noisy audio, whose accents, which subject matter.
What does the metric count as a mistake — and what does it ignore? For morphologically rich languages, ask for character error rate beside word error rate.
How was "right" decided, and by whom? A human native speaker, or an AI judge — and against an answer in which language?
Over how many items — and would the gap survive noise? A few dozen questions can't separate two close models.
Which languages, exactly — and on what base model? A leaderboard entry is not a shipped, working model.
Is this shipping or a preview — and is my language in the tested set? A language count is not a coverage guarantee.
Who produced the benchmark, and are they a player in it? A self-graded result needs an independent second opinion.
Is this a continental average hiding wide variance? Performance tends to track how much transcribed data a language has, not how difficult the language is.
Is a low score a verdict, or a specification? Which layer is failing — real misunderstanding, or an accent the recognizer mis-transcribes?
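
The second question, on what a metric counts as a mistake, can be illustrated with a minimal sketch of word error rate and character error rate, both computed here as Levenshtein edit distance divided by reference length. The function names are illustrative, the one-letter error in the Swahili phrase is made up, and this character error rate counts spaces as characters, which real toolkits may handle differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate: edit distance over word tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate: edit distance over characters, spaces included."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)

# One long word carries subject, tense, object, and verb; one letter is wrong.
reference = "ninakupenda sana"
hypothesis = "ninakupende sana"

print(f"WER: {wer(reference, hypothesis):.3f}")
print(f"CER: {cer(reference, hypothesis):.3f}")
```

A single wrong letter makes half the words wrong (word error rate 0.5) but only one character in sixteen (character error rate 0.0625). When words are long and carry a lot of grammar, the two metrics can tell very different stories about the same transcript.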

Run these on a claim →
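
For the question about item counts, a back-of-envelope sketch shows roughly how many items a benchmark needs before a gap between two models stands clear of sampling noise. It assumes both models score near the same base accuracy, treats their results as independent samples, and asks only that a 95% interval on the gap exclude zero. A paired analysis on the shared items would be more precise, so read the output as an order of magnitude. The function name and inputs are hypothetical.

```python
import math

def items_needed(gap, base_accuracy, z=1.96):
    """Rough item count at which an accuracy gap exceeds its own margin of error."""
    if gap <= 0:
        raise ValueError("gap must be positive")
    # Variance of the difference of two independent proportions, per item.
    variance = 2 * base_accuracy * (1 - base_accuracy)
    return math.ceil(z ** 2 * variance / gap ** 2)

# Two models near 70% accuracy, separated by 3 points vs 20 points.
print(items_needed(0.03, 0.7))
print(items_needed(0.20, 0.7))
```

Under these assumptions a 3-point gap needs on the order of 1,800 items, while a few dozen items can resolve only a gap of about 20 points.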

The question underneath all of them

Does this number measure the clean lab, or the messy point of care where the tool will actually be used? Every rubric here is a way of asking that one question in a form you can check. A score is a property of a measurement, never of a model; keeping those two apart is the whole job, and it is one anyone can do.