The method
How to read an AI claim in global health.
Whatever the claim (a benchmark score, an accuracy figure, an impact statistic), the work is the same: find what the number actually measured, and name what was left out. Ground Truth resolves that work into a few reusable questions. They are collected here, free to copy, cite, or run through your own AI assistant.
Reading a health-chatbot or digital-health impact claim
The eight questions that separate a real health effect from a flattering number. Full worked examples (built around the PROMPTS/Jacaranda randomized trial), sources, effect-size charts, and a copy-paste prompt live in the article: How to read a health-chatbot impact claim →
- Is this a monitoring number or an impact number? Reach and engagement are monitoring; impact needs an outcome measured against a counterfactual.
- Significant — but how big, in units I can picture? Ask for the effect size and its confidence interval.
- Is this a measured health outcome, or a proxy for one? Separate self-report from a measured outcome; ask if the study was powered for it.
- Has this been tested against a control — and what happened when it was? Ask for the randomized evidence.
- What does the retention curve look like — not the install count? Reach is a moment; use is a curve.
- Is this a randomized comparison, or a correlation among the already-engaged? Among self-selected users, a correlation can reflect who chose to engage rather than what the tool did.
- Validated on what — and what happens when it's wrong? A benchmark is not a bedside; ask where the human fallback sits.
- What's the cost per outcome, not per user? Put the full economic cost in the numerator and the effect size (the additional outcomes attributable to the tool) in the denominator.
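Two of the questions above, the effect size with its confidence interval and the cost per outcome, reduce to a few lines of arithmetic. The sketch below is one simple way to run them, assuming a two-arm trial with a binary outcome and a plain Wald interval. The function names and every number are hypothetical, chosen only for illustration; none of them come from the PROMPTS/Jacaranda trial.

```python
import math


def risk_difference(events_t, n_t, events_c, n_c, z=1.96):
    """Absolute risk difference (treatment minus control) with a Wald interval.

    z=1.96 gives an approximate 95% interval. The Wald form is a simple
    approximation and can mislead with small samples or rare outcomes.
    """
    p_t = events_t / n_t
    p_c = events_c / n_c
    diff = p_t - p_c
    se = math.sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    return diff, diff - z * se, diff + z * se


def cost_per_additional_outcome(total_cost, n_reached, diff):
    """Full cost divided by the outcomes attributable to the tool."""
    if diff <= 0:
        return math.inf  # no measurable gain, so no finite cost per outcome
    return total_cost / (n_reached * diff)


# Hypothetical trial: 330 of 500 reach the outcome with the tool, 300 of 500 without.
diff, lo, hi = risk_difference(events_t=330, n_t=500, events_c=300, n_c=500)

# Hypothetical programme: 50,000 in full economic cost, 10,000 users reached.
total_cost, n_reached = 50_000, 10_000
cost_per_user = total_cost / n_reached
cost_per_outcome = cost_per_additional_outcome(total_cost, n_reached, diff)

print(f"effect: {diff:+.1%} (95% CI {lo:+.1%} to {hi:+.1%})")
print(f"cost per user: {cost_per_user:.2f}")
print(f"cost per additional outcome: {cost_per_outcome:.2f}")
```

With these made-up inputs the interval only just excludes zero, so the result would be reported as significant even though the plausible effect runs from almost nothing to about twelve percentage points. The cost per additional outcome also comes out many times higher than the cost per user, which is the gap the last question is meant to expose.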
Reading a language or benchmark claim
The nine questions that separate a real capability from a headline number. Full worked examples, a per-language error explorer, primary sources, and a copy-paste evaluation prompt live in the article: How to read an African-language AI benchmark without getting fooled →
- On what was it tested? Read or spontaneous speech, clean or noisy audio, whose accents, which subject matter.
- What does the metric count as a mistake — and what does it ignore? For morphologically rich languages, ask for character error rate beside word error rate.
- How was "right" decided, and by whom? A human native speaker, or an AI judge — and against an answer in which language?
- Over how many items — and would the gap survive noise? A few dozen questions can't separate two close models.
- Which languages, exactly — and on what base model? A leaderboard entry is not a shipped, working model.
- Is this shipping or a preview — and is my language in the tested set? A language count is not a coverage guarantee.
- Who produced the benchmark, and are they a player in it? A self-graded result needs an independent second opinion.
- Is this a continental average hiding wide variance? Performance tracks the volume of transcribed data available for a language, not the language's inherent difficulty.
- Is a low score a verdict, or a specification? Which layer is failing — real misunderstanding, or an accent the recognizer mis-transcribes?
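Two of these questions can be checked by hand: what the metric counts as a mistake, and whether a gap survives noise. The sketch below is a minimal illustration, assuming a standard Levenshtein edit distance for word and character error rates and a Wilson score interval for accuracy on a small test set. The transcript strings, the model scores, and the function names are all hypothetical.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def wer(ref, hyp):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)


def cer(ref, hyp):
    """Character error rate: character-level edits divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def wilson_interval(correct, n, z=1.96):
    """Wilson score interval for a proportion (approximately 95% at z=1.96)."""
    p = correct / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half, centre + half


# Hypothetical transcript pair: one dropped character in a three-word reference.
reference = "tutaonana kesho asubuhi"
hypothesis = "tutaonana kesho asubui"
print(f"WER {wer(reference, hypothesis):.1%}, CER {cer(reference, hypothesis):.1%}")

# Hypothetical leaderboard: two models scored on the same 40 questions.
lo_a, hi_a = wilson_interval(correct=30, n=40)
lo_b, hi_b = wilson_interval(correct=33, n=40)
print(f"model A: 75.0% ({lo_a:.1%} to {hi_a:.1%})")
print(f"model B: 82.5% ({lo_b:.1%} to {hi_b:.1%})")
```

A single missing character costs a third of the words under WER but only a small fraction of the characters under CER, which is why the two are worth reading side by side when words are long. On 40 items the two intervals overlap heavily. Overlap is a rough screen rather than a formal significance test, but it is enough to show that a 7.5-point gap on a test this small does not settle which model is better.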
The question underneath all of them
Is this number measuring the world the tool will be used in — the clean lab, or the messy point of care? Every rubric here is a way of asking that one question in a form you can check. A score is a property of a measurement, never of a model; keeping those two apart is the whole job, and it is one anyone can do.