How to Read a Health-Chatbot Impact Claim

Q: Why does self-reported improvement not equal a health outcome?

Self-reported knowledge, intentions, or behavior are proxies for health, not health itself. Knowing more about newborn danger signs, or reporting a clinic visit, is a step removed from a measured outcome such as a diagnosis, a birthweight, or a death averted. The PROMPTS authors state plainly that all their outcomes were self-reported and the study 'was not powered to detect effects on health outcomes.' Ask whether the claim rests on a measured clinical outcome, and whether the study was even designed to detect one.

Q: Can a statistically significant result still be too small to matter?

Yes. Statistical significance answers 'is there any difference?'; it says nothing about size. With a large sample, a tiny effect becomes significant. In the pre-registered PROMPTS cluster RCT of a maternal-health messaging platform (N=6,139), the significant effects were +0.06 to +0.09 standard deviations on self-reported indices. Plotted as distributions, a 0.09 SD gap means the treatment and control arms overlap by about 96 percent. Real, but small. Always ask for the effect size and its confidence interval in interpretable units, not just the p-value.

Q: Why were the PROMPTS trial's effects so small?

Two things, one certain and one an inference. The certain part: the outcomes were self-reported indices in standard deviations, where a real effect of 0.06 to 0.09 SD is already a near-total overlap between arms. The inference: participants were pre-selected — recruited at health facilities during antenatal care, with about 80 percent (4,965 of 6,139) having already received prior antenatal care and nearly 90 percent owning a phone — so the sample starts near the ceiling on care-seeking, which plausibly compresses any effect. The trial does not test that ceiling directly, so treat it as context for reading the small numbers, not a proven cause, and as a reason to read marketing figures from the same engaged population with the denominator in mind.

Q: Does an engagement–outcome correlation prove a health tool works?

No. 'Our most engaged users had the best outcomes' is selection, not causation — the people who engage differ systematically from those who don't. In the Mobile WACh NEO trial's engagement analysis, messaging was self-selected (younger, more educated, first-time mothers messaged more), and the associations contradicted each other: more messaging predicted more danger-sign knowledge but lower odds of early breastfeeding. The authors state they cannot conclude messaging directly affected any health outcome. Only a randomized comparison isolates what the tool caused.

Funders and ministries buy reach; patients need outcomes. The two get quoted in the same breath — "we reached two million people and improved health" — but they are different kinds of evidence, and the second almost never follows from the first. You do not need to run a trial to catch the gap. You need to know which questions a real impact claim can survive, and which ones make a flattering number fall apart. Every example below links to a public primary source, so you can check the working yourself.

The one idea

Three numbers get dressed up as impact. None of them is a health outcome.

Monitoring, not impact. Reach, installs, messages, active users — proof of delivery, not of effect.
Significant, not large. With a big enough sample, a near-zero effect clears p<0.05. Ask the size.
Correlated, not caused. "Our engaged users did better" is selection — the engaged were different people to begin with.

The last word is a health outcome measured against a randomized counterfactual — not a dashboard.

Which questions actually separate impact from noise?

Eight of them. Each starts from a claim you will actually hear, states what you are meant to conclude, shows what the sentence quietly leaves out, and ends with the one question that exposes the gap. They are ordered from the easiest tell — a reach number — to the subtlest — a cost that never touches an outcome.

Question 01 — reach vs. impact

Our platform has reached over two million people and exchanged tens of millions of messages.

What you're meant to conclude

Look how many people it's helping — this is working.

What's hidden

That reach is a monitoring metric, and monitoring is not evaluation. The WHO guide to monitoring and evaluating digital health interventions draws the line explicitly: monitoring tracks the quality and fidelity of the intervention's inputs; evaluation asks about its outputs and impacts — user satisfaction, process improvements, health outcomes, cost-effectiveness. A count of users or messages sits entirely on the monitoring side. It tells you the intervention was delivered and opened; it says nothing about whether a single health outcome changed. The archetype is the voice-IVR helpline that announces it has reached millions of callers on a hotline — a real, impressive delivery number — with no trial behind it of whether those callers were any healthier than people who never called.

And notice what kind of count it is, because reach numbers are usually inflated by construction. A "two million reached" figure is almost always cumulative — everyone ever registered — not who is still there this month. Cumulative sign-ups, monthly active users, and genuinely-engaged users (people who came back and sent more than a message or two) are three very different numbers that can sit an order of magnitude or more apart, and it is the biggest of the three that reaches the headline. The person who tried it once, found it unhelpful, and left is counted in that total forever.

The question to always ask

Is this a monitoring number or an impact number?

Reach, installs, messages, active users, "engagement" — all monitoring. Impact needs an outcome, measured against what would have happened anyway. And pin the denominator before you even get there: cumulative registrations, or active users this month — and "active" by what definition (opened once? sent a message? returned the next week?). Better still, ask for the whole funnel: eligible → reached → enrolled → still subscribed → meaningfully engaged → outcome. Every arrow loses people, and the honest number is how many reach the far end, not how many entered the top — a headline that quotes the first box and implies the last is hiding the drop-offs in between. If the headline counts people or messages, it hasn't started measuring impact yet.
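The funnel can be laid out as a short calculation. This is a minimal sketch in Python; every count in it is hypothetical, invented only to show the arithmetic, and none comes from a program cited in this guide. It stops at "meaningfully engaged" because the final box, the outcome, cannot be counted from usage logs at all; it needs a comparison against a control.

```python
# Hypothetical funnel counts (illustrative only, not from any cited program).
funnel = [
    ("eligible", 5_000_000),
    ("reached", 2_000_000),
    ("enrolled", 800_000),
    ("still subscribed", 300_000),
    ("meaningfully engaged", 60_000),
]


def funnel_report(stages):
    """Return (stage, count, share of previous stage, share of first stage)."""
    rows = []
    first = stages[0][1]
    prev = first
    for name, count in stages:
        rows.append((name, count, count / prev, count / first))
        prev = count
    return rows


report = funnel_report(funnel)
for name, count, step_share, overall_share in report:
    print(f"{name:<22}{count:>10,}  {step_share:6.1%} of previous  "
          f"{overall_share:6.1%} of eligible")
```

With these made-up numbers the headline would quote the "reached" row, while the last row is a small fraction of it. The point of the layout is that each stage is reported next to the share it lost.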

Question 02 — significance vs. effect size

A randomized trial found statistically significant improvements across almost every domain.

What you're meant to conclude

The effects are big, and the tool clearly works.

What's hidden

How small "significant" can be. The PROMPTS cluster RCT — a genuinely rigorous, pre-registered, independently evaluated trial of a maternal-health messaging platform across 40 Kenyan facilities (N = 6,139) — found exactly this shape of result: statistically significant improvements of +0.06 to +0.09 standard deviations on five self-reported indices. The p-values are real, earned by a large sample. But an effect that size is a near-total overlap between the treatment and control groups. A p-value answers "is there any difference?"; it is silent on "how big?" — and here the honest answer to "how big" is "barely visible." The forest plot shows all six domains; the distribution plot shows what a 0.09 SD gap actually looks like.

There is also a plausible reason the effect looks small that is about who was measured, not what the messages did. Participants were recruited by enumerators at health facilities during antenatal care — a sample that had, by definition, already walked into the formal system. In the trial's own baseline, about 80% (4,965/6,139) had already received some antenatal care and nearly 90% owned a mobile phone. It is hard to move "routine care seeking" much in women who are already, demonstrably, seeking routine care: a group that starts near the ceiling leaves little room for an effect to register. That is a standard restricted-range concern, not something this trial sets out to prove — so read it as context for the small numbers, not their demonstrated cause. It is also the context that a reach total or a "twice as likely" headline, drawn from the same pre-selected, care-engaged, phone-owning population, never sets beside the number.

And notice who is in neither arm. Because enrollment happens at an antenatal visit, both the treatment and the control group are made of women who already attend care. Women who never reach antenatal care — often the most isolated and highest-risk — sit outside the trial entirely, so it can say nothing about them. A facility-enrolled tool is measured on the already-connected and stays silent about the group a maternal-health programme most needs to reach; an average effect that reads as "it works" can be an effect only for the people who were easiest to help.

What the PROMPTS RCT actually found: modest effects on self-reported indices

Source: Vatsa et al. (2025), PLOS Medicine 22(2):e1004527 — cluster RCT, 40 facilities, 8 counties, Kenya; N=6,139. doi:10.1371/journal.pmed.1004527

Data table

Domain	Effect (SD)	95% CI	p	Sig.	Index composition
Newborn care	0.09	0.07 to 0.12	p<0.001	yes	3 pre-registered outcomes
Knowledge	0.08	0.03 to 0.12	p=0.002	yes	4 pre-registered
Birth preparedness	0.08	0.02 to 0.13	p=0.018	yes	2 pre-reg + 1 exploratory
Routine care seeking	0.07	0.03 to 0.11	p=0.003	yes	6 pre-reg + 2 exploratory
Postpartum care content	0.06	0.01 to 0.12	p=0.043	yes	6 exploratory (none pre-registered)
Danger-sign care seeking	0.035	-0.01 to 0.08	p=0.096	no	the acute-safety domain — not significant

Read before citing

Effects are on self-reported, survey-based indices — not measured health outcomes. The authors state all outcomes were self-reported and the study “was not powered to detect effects on health outcomes.”
Estimates are adjusted intention-to-treat effects in SD units; p-values are Holm–Bonferroni family-wise-error-adjusted across the six domain indices.
The danger-sign (acute care-seeking) index — the one safety-critical domain — was not significant (95% CI includes 0; p=0.096). The abstract reports only its CI and p, not a point estimate; the 0.035 marker is the CI midpoint (hollow to signal null).
The postpartum-care-content index is composed entirely of exploratory (non-pre-registered) outcomes.

A “significant” effect can still be a near-total overlap

A 0.09 SD gap between the arms means the two distributions overlap by about 96%. Pick one treated woman and one control woman at random: the treated one scores higher about 53% of the time — versus 50% if the arms were identical.

The newborn-care index, the largest of the five statistically significant effects, moved the treatment arm 0.09 SD to the right of control. Plotted as two distributions, that gap is the whole effect.

Source: Effect size from Vatsa et al. (2025), PLOS Medicine 22(2):e1004527 (newborn-care index +0.09 SD, 95% CI 0.07–0.12). The curves are standard-normal distributions separated by that exact effect.
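The overlap and "53% of the time" figures can be checked in a few lines. This is a sketch under the same assumption the curves make: both arms are normal with equal variance, separated by the reported 0.09 SD. It uses only Python's standard library.

```python
from math import sqrt
from statistics import NormalDist

d = 0.09  # standardized effect reported for the newborn-care index

control = NormalDist(mu=0.0, sigma=1.0)
treatment = NormalDist(mu=d, sigma=1.0)

# Overlapping coefficient: the area shared by the two density curves.
overlap = control.overlap(treatment)

# Probability that a randomly chosen treated score exceeds a randomly
# chosen control score: the difference of two unit-variance normals
# has mean d and standard deviation sqrt(2).
p_superiority = NormalDist().cdf(d / sqrt(2))

print(f"overlap = {overlap:.3f}")
print(f"P(treated > control) = {p_superiority:.3f}")
```

Under these assumptions the script prints an overlap of about 0.964 and a probability of about 0.525, which round to the 96% and 53% quoted above.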

Data table

SD from control mean	Control arm density	Treatment (PROMPTS) density
-3	0.0044	0.0034
-2.5	0.0175	0.0139
-2	0.0540	0.0449
-1.5	0.1295	0.1127
-1	0.2420	0.2203
-0.5	0.3521	0.3352
0	0.3989	0.3973
0.5	0.3521	0.3668
1	0.2420	0.2637
1.5	0.1295	0.1476
2	0.0540	0.0644
2.5	0.0175	0.0219
3	0.0044	0.0058

Read before citing

Illustrative, not raw per-participant microdata. The two curves are standard normals separated by the paper's reported standardized effect (0.09 SD). Real index distributions aren't perfectly normal, but the separation — the thing that matters — is exact.
0.09 SD is the LARGEST of the five significant effects; the others (0.06–0.08 SD) overlap even more. The small p-values reflect the large sample (N=6,139), not a large gap between arms.
The paper's own figures agree — and one is reproduced just below: Figure 4 (antenatal/postnatal visit counts) and Figure 5 (postpartum-content components) plot the two arms directly and they nearly coincide. A p-value answers “is there any difference?”; these distributions answer “how big?”
Pre-selected sample (a caveat on interpretation, not a paper finding). Participants were recruited at health facilities during antenatal care, and about 80% (4,965/6,139) had already received prior ANC. A group already near the ceiling on care-seeking leaves little room for an effect to register — a plausible reason a small effect need not mean a weak tool, though the trial doesn't test it. Reach and “twice as likely” marketing draw on the same engaged population.
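The density table above can be regenerated from the one reported number. This sketch assumes, as the notes say, two unit-variance normal curves 0.09 SD apart; the grid and rounding simply match the table.

```python
from statistics import NormalDist

d = 0.09  # reported standardized effect, newborn-care index
control = NormalDist(0.0, 1.0)
treatment = NormalDist(d, 1.0)

# Same grid as the table: -3 to 3 SD in steps of 0.5.
grid = [x / 2 for x in range(-6, 7)]
rows = [(x, round(control.pdf(x), 4), round(treatment.pdf(x), 4)) for x in grid]

print("SD from control mean\tControl\tTreatment")
for x, c, t in rows:
    print(f"{x}\t{c:.4f}\t{t:.4f}")
```

Each printed row should agree with the corresponding table row to four decimal places, for example 0.3989 and 0.3973 at zero.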

Those curves are an illustration. The trial's own data say the same thing: here are two of the care-seeking domains it measured — antenatal and postnatal visit counts — with the treatment and control arms plotted directly. They sit almost on top of each other.

The paper's own data agree: the two arms almost completely overlap

Antenatal and postnatal visit counts, treatment (blue) vs control (grey), with the recommended minimum marked. The arms sit almost on top of each other — a slight rightward shift toward the recommended counts is the ~0.07 SD care-seeking effect, made visible.

Source: Reconstructed from Figure 4 of Vatsa et al. (2025), PLOS Medicine 22(2):e1004527 (open access, CC BY 4.0). doi:10.1371/journal.pmed.1004527

Data table

Panel	Visits	Treatment	Control
Antenatal care	0	0	0
Antenatal care	1	0.01	0.01
Antenatal care	2	0.03	0.045
Antenatal care	3	0.135	0.16
Antenatal care	4	0.285	0.305
Antenatal care	5	0.25	0.24
Antenatal care	6	0.15	0.145
Antenatal care	7	0.065	0.06
Antenatal care	8	0.028	0.025
Antenatal care	9	0.018	0.012
Antenatal care	10	0.006	0.004
Antenatal care	11	0.002	0.001
Antenatal care	12	0.001	0
Antenatal care	13	0	0
Antenatal care	14	0	0
Postnatal care	0	0.02	0.02
Postnatal care	1	0.5	0.565
Postnatal care	2	0.39	0.33
Postnatal care	3	0.06	0.055
Postnatal care	4	0.015	0.012
Postnatal care	5	0.005	0.004
Postnatal care	6	0.002	0.001
Postnatal care	7	0.001	0
Postnatal care	8	0	0
Postnatal care	9	0	0
Postnatal care	10	0	0

Read before citing

Bar heights are DIGITISED BY EYE from the paper's Figure 4 — the proportions are approximate and rounded, and the point is the shape, not any single bar. See the paper for the authoritative figure.
Both panels are self-reported visit counts, treatment vs control, with the recommended minimum — 4 antenatal, 2 postnatal — shown as a dashed line.
The distributions nearly coincide: slightly more treatment-arm women reach the recommended visit counts, which is what a ~0.07 SD routine-care-seeking effect looks like as a distribution. It is statistically detectable at N=6,139 and visually tiny — the same lesson as the curves above.
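One way to put the visit-count table into units a reader can picture is the share of women at or above the recommended minimum in each arm. The sketch below uses the digitised proportions from the table, so its output is approximate by construction and is no substitute for the paper's own figures; the renormalisation step is an assumption added because hand-read bars do not sum exactly to 1.

```python
# Approximate proportions digitised by eye (index = number of visits).
anc_treat = [0, 0.01, 0.03, 0.135, 0.285, 0.25, 0.15, 0.065,
             0.028, 0.018, 0.006, 0.002, 0.001, 0, 0]
anc_ctrl = [0, 0.01, 0.045, 0.16, 0.305, 0.24, 0.145, 0.06,
            0.025, 0.012, 0.004, 0.001, 0, 0, 0]
pnc_treat = [0.02, 0.5, 0.39, 0.06, 0.015, 0.005, 0.002, 0.001, 0, 0, 0]
pnc_ctrl = [0.02, 0.565, 0.33, 0.055, 0.012, 0.004, 0.001, 0, 0, 0, 0]


def share_at_or_above(props, minimum):
    """Share with at least `minimum` visits, renormalised because
    hand-digitised bars do not sum exactly to 1."""
    return sum(props[minimum:]) / sum(props)


results = {
    "antenatal (min 4)": (share_at_or_above(anc_treat, 4),
                          share_at_or_above(anc_ctrl, 4)),
    "postnatal (min 2)": (share_at_or_above(pnc_treat, 2),
                          share_at_or_above(pnc_ctrl, 2)),
}

for panel, (treat, ctrl) in results.items():
    print(f"{panel}: treatment {treat:.1%}, control {ctrl:.1%}, "
          f"difference {treat - ctrl:+.1%}")
```

A percentage-point difference of this kind is the sort of interpretable unit to ask for alongside an SD effect, provided it comes from the trial's data rather than from bars read off a chart.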

The question to always ask

Significant — but how big, in units I can picture?

Ask for the effect size and its confidence interval, in interpretable units. A large sample turns tiny differences significant; significance is a statement about your certainty that a gap exists, not about whether the gap is worth anything.
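How sample size alone moves a p-value can be seen with a simple two-sample z-test on a fixed effect. This is a textbook sketch, assuming equal arms, unit variance, and individually randomized participants. It does not reproduce the PROMPTS p-values, which come from a cluster-randomized design with adjusted estimates and Holm–Bonferroni correction.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(d, n_per_arm):
    """Two-sided p-value for a standardized mean difference d with
    n_per_arm participants in each of two equal arms (simple z-test,
    unit variance, no clustering or covariate adjustment)."""
    z = d * sqrt(n_per_arm / 2)
    return 2 * (1 - NormalDist().cdf(abs(z)))


d = 0.09  # the same gap throughout; only the sample size changes
for n in (100, 500, 1000, 3000):
    print(f"n per arm = {n:>5}: p = {two_sided_p(d, n):.4f}")
```

The gap between the arms never changes in this loop. Only the certainty that it is non-zero does, which is why the p-value cannot stand in for the effect size.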

A note on the example

PROMPTS is the one organization we name here, and on purpose: it is a teaching case, not a target. It commissioned a pre-registered trial, handed the evaluation to independent researchers at Harvard and Innovations for Poverty Action, reported confidence intervals, and published its modest and null results honestly. That is the opposite of hiding outcomes. The lesson is how to read an effect size — not that anyone behaved badly.

The contrast worth noticing is with marketing. The same platform's public page says mothers who received its messages were "almost twice as likely to start using PPFP" (postpartum family planning) — a large relative number, from an earlier and smaller trial, on a single behavior. A "twice as likely" headline and a 0.06–0.09 SD index effect are not the same kind of number, do not come from the same study, and answer different questions. Neither is dishonest; they just measure different things, and only one has the rigorous trial behind it.

Question 03 — self-reported vs. measured

Mothers in the program knew more about newborn danger signs and sought more care.

What you're meant to conclude

The program improved health.

What's hidden

That these were self-reported survey indices, not measured clinical outcomes — and that the study was never built to detect a health outcome in the first place. The PROMPTS authors say so in as many words: "All outcomes were self-reported, and the study was not powered to detect effects on health outcomes." Reporting that you know the danger signs of newborn illness, or that you attended a clinic, is a proxy — one or two steps upstream of the thing that matters: a complication caught, a birthweight, a death averted. It is entirely reasonable to measure proxies; knowledge and care-seeking are on the causal path. The error is reading a proxy as if it were the outcome. The trial's own conclusion calls for future work to establish health-outcome impact — an admission that this study, by design, did not.

The question to always ask

Is this a measured health outcome, or a proxy for one?

Separate knowledge, intentions and self-reported behavior from measured outcomes (a diagnosis, a lab value, a death averted). Then ask the harder question: was the study even powered to detect a health outcome, or only a proxy?

Question 04 — the null that rigor returns

SMS reminders improve retention and outcomes — the evidence is clear.

What you're meant to conclude

This is a settled, evidence-backed intervention. Fund more of it.

What's hidden

What happens when the same intervention is tested rigorously: it often doesn't move the outcome. In the WelTel PMTCT trial, a Kenyan RCT of weekly interactive text-messaging for mothers in HIV care across six clinics, the messaging arm did no better than control on retention — a risk ratio of 1.02 (95% CI 0.92–1.14, p = 0.697); the authors conclude it "was not associated with improved retention in PMTCT care." The Mobile WACh NEO trial, a two-way maternal-newborn SMS-plus-nurse-chat program, was likewise null on its primary outcomes: "the primary analysis revealed no significant differences in outcomes between the intervention and control groups." Null results are not the field failing. They are what honest measurement looks like when a plausible idea is tested against a real counterfactual.
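Reading a risk ratio and its interval takes two checks: does the interval include 1 (no effect), and how far is the estimate from 1 relative to its uncertainty. The sketch below backs out an approximate standard error from the quoted WelTel interval. Because the inputs are rounded to two decimals, the p-value it produces is only approximate and will not exactly match the reported 0.697.

```python
from math import log
from statistics import NormalDist

# WelTel PMTCT retention result as quoted above.
rr, lo, hi = 1.02, 0.92, 1.14

# A 95% CI for a ratio is roughly symmetric on the log scale,
# so its width gives an approximate standard error of log(RR).
z95 = NormalDist().inv_cdf(0.975)
se_log_rr = (log(hi) - log(lo)) / (2 * z95)

z = log(rr) / se_log_rr
p_approx = 2 * (1 - NormalDist().cdf(abs(z)))

includes_no_effect = lo <= 1.0 <= hi

print(f"approximate SE of log(RR) = {se_log_rr:.3f}")
print(f"approximate two-sided p   = {p_approx:.2f}")
print(f"interval includes RR = 1  : {includes_no_effect}")
```

An interval that straddles 1 this comfortably is what a null result looks like: the data are compatible with a small benefit, a small harm, or nothing.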

The question to always ask

Has this been tested against a control — and what happened when it was?

Ask specifically for the randomized evidence. A stack of positive before-and-after or observational studies sitting next to one null RCT usually means the RCT is the one telling the truth.

Question 05 — engagement vs. sustained use

Users are highly engaged — strong daily activity and thousands of active users.

What you're meant to conclude

People love it and keep coming back.

What's hidden

That engagement curves collapse, fast, and an "active users" headline is a snapshot taken at the top of the cliff. The largest independent look at real-world usage — Baumel and colleagues' panel study of 93 popular mental-health apps (median 100,000 installs), measured from actual device usage rather than developer reports — found, across the 59 of the 93 apps (63%) that had retention data, a median 15-day retention of 3.9% and 30-day retention of 3.3%; daily opens fell more than 80% between day 1 and day 10, and across all 93 apps a median of just 4.0% of installers opened the app on any given day. The authors' conclusion is the whole warning: "only a small portion of users actually used the apps for a long period of time." A big install or active-user number, quoted once, hides the shape of the curve underneath it.

Averages hide it too. Every one of those churned users still counts toward "reach," and an "average messages per user" is a mean — a handful of power users pull it up while the typical user sends one or two messages and never returns. In a cumulative total, a person who sent two messages, judged the answers low-quality, and quit looks identical to one who engaged for months. That is why the honest numbers are distributional: the median messages per user, the share who sent more than one, and the share still there after week one.
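The difference between a mean and the distributional numbers is easy to see on a toy dataset. The message counts below are hypothetical, made up to show the pattern described above: most users send one or two messages and two power users send hundreds.

```python
from statistics import mean, median

# Hypothetical messages-per-user counts for 20 users (illustrative only).
messages = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2,
            1, 1, 2, 1, 3, 1, 2, 4, 250, 400]

mean_msgs = mean(messages)
median_msgs = median(messages)
share_more_than_one = sum(m > 1 for m in messages) / len(messages)

print(f"mean messages per user   = {mean_msgs:.2f}")
print(f"median messages per user = {median_msgs}")
print(f"share sending more than 1 = {share_more_than_one:.0%}")
```

With these invented counts the mean is about 34 messages per user while the median is 1.5 and only half of users sent more than one message. The share still present after week one would need per-message timestamps, which this toy list does not include.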

Reach is not use: real-world health-app retention collapses within days

Source: Baumel, Muench, Edan & Kane (2019), J Med Internet Res 21(9):e14567 — 93 popular Android mental-health apps, independent usage-panel data. doi:10.2196/14567

Data table

Days since install	% of installers still active (median app)
0	100
15	3.9
30	3.3

Read before citing

Independent usage-panel data (not developer-reported) across 93 popular mental-health apps, median 100,000 installs. The 15-/30-day retention medians are over the 59 of 93 apps (63%) that had retention data — a subset the study reports as representative; the 4.0% daily-open-rate median is over all 93.
Endpoints are exact reported medians: 15-day retention 3.9%, 30-day 3.3%; day 0 = 100% by definition. The decline is FRONT-LOADED — the paper reports a >80% drop in daily open rates between day 1 and day 10 — so the true curve falls faster early and flatter late than the straight segments shown.
Median daily open rate (any given day) was 4.0%. These are consumer mental-health apps in high-income markets, not global-health chatbots; cited to show the reach-vs-retention gap, not a like-for-like rate.

The question to always ask

What does the retention curve look like — not the install count?

Ask for the retention curve, the median messages per user, and the share who returned after week one — not a cumulative total or a mean, both of which the one-and-done user quietly inflates. Reach is a moment; use is a curve. (These are high-income consumer apps, cited for the reach-vs-retention gap — a prompted, free clinical service can retain better, but the burden is to show the curve, not assert it.)

Question 06 — engagement vs. causation

Our most engaged users had the best health outcomes.

What you're meant to conclude

Using the tool more caused better outcomes — so drive engagement and scale it.

What's hidden

That the people who engage are different from the people who don't — so an engagement–outcome correlation measures who they are, not what the tool did. The Mobile WACh NEO engagement analysis is a clean dissection of this claim. Messaging was self-selected: younger, more educated, unmarried, first-time mothers messaged more. The engagement–outcome associations then pointed in opposite directions — more messaging went with a greater rise in newborn danger-sign knowledge (Adj Est 0.39; 95% CI 0.09–0.68), but with lower odds of early breastfeeding (aOR 0.62; 95% CI 0.45–0.86). Contradictory signs are the fingerprint of confounding. The authors decline the causal read directly: "we are unable to conclude that messaging directly impacted any health outcomes ... mothers who engaged more may have sought additional support" — that is, they were already the care-seeking type.
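A toy simulation shows how selection alone can manufacture an engagement–outcome gap. It is not the Mobile WACh NEO data: the model, the hidden "care-seeking" trait, and all parameters are assumptions chosen for illustration. The tool in this simulation has exactly zero causal effect on the outcome.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable


def simulate(n=20_000):
    """Simulate users whose hidden care-seeking trait drives both
    engagement and outcome. The tool itself has zero causal effect."""
    users = []
    for _ in range(n):
        trait = random.gauss(0, 1)                # unobserved confounder
        engaged = trait + random.gauss(0, 1) > 0  # self-selected engagement
        assigned = random.random() < 0.5          # coin-flip assignment
        outcome = trait + random.gauss(0, 1)      # depends on the trait only
        users.append((engaged, assigned, outcome))
    return users


def gap(users, index):
    """Mean outcome difference between users whose flag at `index`
    is True and those whose flag is False."""
    yes = [u[2] for u in users if u[index]]
    no = [u[2] for u in users if not u[index]]
    return mean(yes) - mean(no)


users = simulate()
engaged_gap = gap(users, 0)     # comparison among the self-selected
randomized_gap = gap(users, 1)  # comparison decided by chance

print(f"engaged vs not engaged:   {engaged_gap:+.2f}")
print(f"assigned vs not assigned: {randomized_gap:+.2f}")
```

The engaged group looks much better off even though nothing was done to them, while the coin-flip comparison sits near zero. That is the difference between measuring who people are and measuring what a tool did.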

The question to always ask

Is this a randomized comparison, or a correlation among the already-engaged?

An engagement–outcome correlation is selection, not effect. Only a randomized contrast — engaged-vs-not decided by chance, not by the user — isolates what the tool actually caused.

Question 07 — "safety validity" vs. clinical safety

The model was validated / tops a medical benchmark — so it's safe for patients.

What you're meant to conclude

It passed a test, so it's safe to answer real patients.

What's hidden

That acing a benchmark is a different test than being safe on open patient questions. When sixteen physicians red-teamed four leading public chatbots on 222 patient-posed medical questions (888 responses), the rate of problematic answers ran from 21.6% to 43.2% across models, and the rate of outright unsafe answers, ones a physician judged could lead to harm, ran from 5.0% to 13.5%. These are models that score well on medical-licensing-style exams. The authors' conclusion is blunt: "millions of patients could be receiving unsafe medical advice from publicly available chatbots." A benchmark score, or the word "validated," tells you a model cleared some fixed test; it does not tell you how often it is wrong on the messy questions real people ask, and it says nothing about whether there is a path to a human when it is.

Physicians red-teamed four public chatbots on patient questions

Source: Draelos et al. (2026), npj Digital Medicine 9:241 — 16 physician evaluators rated 888 responses to 222 patient-posed medical questions across four public chatbots. doi:10.1038/s41746-026-02428-5

Data table

Chatbot	Problematic responses (% of 222 patient-posed questions), physician-rated	Note
Claude (Anthropic)	21.6	5.0% unsafe
Gemini (Google)	27.5
GPT-4o (OpenAI)	31.5	13.5% unsafe
Llama-3 70B (Meta)	43.2	13.1% unsafe

Read before citing

Two thresholds. “Problematic” = a physician judged the answer not fully acceptable (incomplete, misleading, or unsafe); “unsafe” = a stricter subset judged capable of leading to harm. Bars show the broader problematic rate; unsafe rates are noted per bar where the paper reports them.
These are general-purpose consumer chatbots answering patient questions in free text — not validated medical devices, and with no clinician in the loop. The numbers say nothing about a purpose-built tool with a human-escalation pathway.
222 advice-seeking questions across internal medicine, women's health and paediatrics; 16 physician raters. A model that scores well on medical-exam benchmarks can still fail here — exam accuracy is a different test than safety on open patient questions. Unsafe answers ranged from 5.0% (Claude) to 13.5% (GPT-4o); Gemini's standalone unsafe rate was not reported.
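A percentage from 222 questions carries sampling uncertainty of its own. The sketch below backs out the implied counts from the table and attaches a Wilson score interval to each rate. It assumes the denominator is 222 per chatbot, as the table header states, and treats questions as independent draws; the paper may compute its intervals differently, so read these as a rough guide only.

```python
from math import sqrt
from statistics import NormalDist

N_QUESTIONS = 222  # denominator per chatbot, per the table header

problematic_pct = {
    "Claude": 21.6,
    "Gemini": 27.5,
    "GPT-4o": 31.5,
    "Llama-3 70B": 43.2,
}


def wilson_interval(k, n, level=0.95):
    """Wilson score interval for a binomial proportion k/n."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p = k / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


implied = {}
for name, pct in problematic_pct.items():
    k = round(pct / 100 * N_QUESTIONS)  # implied number of problematic answers
    lo, hi = wilson_interval(k, N_QUESTIONS)
    implied[name] = (k, lo, hi)
    print(f"{name:<12} {k:>3}/{N_QUESTIONS}  "
          f"95% interval {lo:.1%} to {hi:.1%}")
```

The intervals are wide enough that neighbouring models overlap, which is a reason to read the ranking loosely and the overall level seriously.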

The question to always ask

Validated on what — and what happens when it's wrong?

Ask what the safety claim was measured on, whether clinicians reviewed the actual outputs (not just an exam score), and where the escalation pathway to a human sits. A benchmark is not a bedside, and a tool with no human fallback is one bad answer from harm.

Question 08 — cost-per-user vs. cost-per-outcome

It's low-cost and scalable — cents per user.

What you're meant to conclude

It's a cost-effective way to improve health.

What's hidden

That a cost per participant is not a cost per outcome, and here the PROMPTS trial is unusually concrete. The same paper reports the estimate from Jacaranda, the organization behind the platform, that it "costs 74 cents per participant for their lifetime on the platform," covering message delivery, technical infrastructure, field enrollment and county engagement. That is a genuinely low unit cost, transparently itemised, though it is the program's own figure, and a cost per mother enrolled, not per mother helped. Divide 74 cents by a 0.06–0.09 SD effect on self-reported indices, and the cost per unit of health actually changed is unknown, which is why the same authors leave cost-effectiveness as an open question rather than a claim. Two things even this transparent unit cost doesn't value: the public-system opportunity cost behind each enrollment (the government nurses' and facilities' time the program leans on, which the figure doesn't price in), and any comparison to what the same money, or the same staff hours, would achieve spent another way. And the wider base of evidence is skewed: reviews find the economic evidence for digital health is "disproportionately concerned with high-income countries and hospital settings", with LMIC cost-effectiveness evidence, in one review's words, "urgently required." A "cents per mother" line borrows the credibility of a cost-effectiveness claim without doing the arithmetic that would earn it.

The question to always ask

What's the cost per outcome — not per user?

Ask for cost per health outcome achieved — per case averted, per death avoided — with the effect size in the denominator, and the full economic cost in the numerator: not just the platform's running cost but the health-worker and facility time it leans on, and what the same money or staff hours would buy elsewhere. A low cost per mother or per message is a good sign — not, on its own, a cost-effectiveness case.
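The arithmetic being asked for is short once an absolute effect on a measured outcome exists. In the sketch below only the 74-cent platform cost comes from the source; the health-system cost and the risk difference are hypothetical placeholders, since the trial reports neither.

```python
def cost_per_additional_outcome(cost_per_participant, risk_difference):
    """Cost to produce one additional outcome: cost per participant divided
    by the absolute difference in outcome probability (treatment - control)."""
    if risk_difference <= 0:
        raise ValueError("no positive effect: cost per outcome is undefined")
    return cost_per_participant / risk_difference


platform_cost_cents = 74        # the program's own figure, quoted above
system_cost_cents = 150         # HYPOTHETICAL: nurse and facility time per enrollment
risk_difference = 0.01          # HYPOTHETICAL: 1 percentage point more good outcomes

platform_only = cost_per_additional_outcome(platform_cost_cents, risk_difference)
full_economic = cost_per_additional_outcome(
    platform_cost_cents + system_cost_cents, risk_difference
)

print(f"platform cost only: {platform_only:,.0f} cents per additional outcome")
print(f"full economic cost: {full_economic:,.0f} cents per additional outcome")
```

With a one-percentage-point effect, every cent per participant becomes a hundred cents per outcome, and the answer moves again once health-worker time is priced in. Without a measured risk difference, the function has nothing to divide by, which is the gap the question above exposes.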

What does a good evaluation actually look like?

What stays fixed when the technology won't hold still

One complication first. A chatbot built on a large language model is a moving target: the model behind it can be swapped or upgraded in weeks, so a two-year pre-registered trial can hand down a verdict on a version nobody is running anymore. That is a real mismatch, and it means the machinery of evaluation has to move faster — rolling read-outs, continuously refreshed test sets, staged rollouts you measure as you go — rather than one frozen-version trial reported years later. New methods are genuinely needed. But a short list does not change with the technology, and it is the part to insist on:

A measured health outcome. The headline endpoint is a health outcome — or an honest, pre-named proxy — not a reach count, an engagement rate, or a benchmark score. If nobody's health was measured, nothing about health was evaluated.
A counterfactual to compare it against. The outcome is set beside what would have happened without the tool — a control group, randomized wherever that is feasible, because that is the only comparison that separates the tool's effect from the trend it is riding.
Enough sample to see an effect that matters. The study is large enough to detect a difference worth acting on, and states what size it could and couldn't have caught. (PROMPTS, at N=6,139, still wasn't powered for health outcomes — sample size is a claim to check, not assume.)
A human in the loop. For anything touching clinical safety, qualified people — clinicians, not an automated metric or the model grading itself — review real outputs, and there is a live path to a human when the model is wrong. A benchmark score doesn't see the harm; a physician does.
Independence. The people who judge the tool don't profit from the verdict. Vendor involvement is fine when it is declared and not the sole scorekeeper.
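The third item, enough sample to see an effect that matters, can be checked with the standard normal-approximation formula for a two-arm comparison. This is a generic sketch, not the PROMPTS power calculation: the intracluster correlation and cluster size used in the example are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(d, alpha=0.05, power=0.80, cluster_size=1, icc=0.0):
    """Approximate participants per arm needed to detect a standardized
    difference d in a two-arm trial, inflated by the design effect
    1 + (m - 1) * ICC when randomizing clusters of average size m."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    n = 2 * (z_alpha + z_beta) ** 2 / d ** 2
    design_effect = 1 + (cluster_size - 1) * icc
    return ceil(n * design_effect)


individual = n_per_arm(0.09)
# HYPOTHETICAL cluster design: 150 participants per facility, ICC of 0.02.
clustered = n_per_arm(0.09, cluster_size=150, icc=0.02)

print(f"d = 0.09, individually randomized: {individual} per arm")
print(f"d = 0.09, clustered (hypothetical): {clustered} per arm")
print(f"d = 0.50, individually randomized: {n_per_arm(0.5)} per arm")
```

Small effects need thousands of participants, and clustering multiplies that further. Rare health outcomes, such as deaths averted, need far more again, which is how a trial of more than six thousand women can still be underpowered for them.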

The classical RCT furniture — pre-registration, confidence intervals, a named stopping rule, cost per outcome — still helps, and PROMPTS is a useful exemplar of all of it: pre-registered (NCT05110521; AEARCTR-0008449), independently evaluated by Harvard and Innovations for Poverty Action, effect sizes reported with intervals, a null safety result published honestly. Hold it up as proof the core can be met — not as a template every fast-moving chatbot must copy. When the model changes monthly, keep the five above and let the rest of the method evolve to keep pace.

Cut both ways

This guide can read as an argument that nothing works. It isn't. The effects in the best-run trial here are real — modest, self-reported, but real — and the organization that produced them evaluated itself more honestly than most. The failure mode on the other side is just as costly: reading every null RCT as proof that "digital health doesn't work," or every small effect as a reason to defund and wait. A 0.06 SD improvement delivered to hundreds of thousands of people, at low marginal cost, can be worth funding — if someone has done the cost-per-outcome arithmetic. "Not proven" is a specification, not a verdict, and it deserves the same scrutiny as "revolutionary."

The skill is neither hype nor defeatism. It is asking, of any number someone sets in front of you — including your own — which of the eight questions it can survive.

Evaluate a claim yourself

Take the eight questions to your own AI

Paste a study, press release, evaluation, or vendor claim below. It turns the eight questions into a prompt that makes any assistant show its working — quote the number, name what's missing, and keep measured outcomes apart from monitoring metrics. Copy it into ChatGPT, Claude, or Gemini.

Form fields: Mode; Program or tool (optional); Claim, study, or link to evaluate

Evaluation prompt

You are evaluating an impact or effectiveness claim about a digital-health or AI health tool ({{SUBJECT}}), using the "Ground Truth" method (groundtruth.health). I will give you a study, press release, evaluation, or vendor claim. {{MODE}}

1. Is this a monitoring number or an impact number? Reach and engagement are monitoring; impact needs an outcome measured against a counterfactual.
2. Significant — but how big, in units I can picture? Ask for the effect size and its confidence interval.
3. Is this a measured health outcome, or a proxy for one? Separate self-report from a measured outcome; ask if the study was powered for it.
4. Has this been tested against a control — and what happened when it was? Ask for the randomized evidence.
5. What does the retention curve look like — not the install count? Reach is a moment; use is a curve.
6. Is this a randomized comparison, or a correlation among the already-engaged? Correlation is selection, not effect.
7. Validated on what — and what happens when it's wrong? A benchmark is not a bedside; ask where the human fallback sits.
8. What's the cost per outcome — not per user? With the effect size in the denominator and the full economic cost in the numerator.

Prefer primary sources and randomized evidence; if a claim rests on before-and-after or observational data, say so, and treat the effect as unconfirmed. Do not fill gaps with assumptions — say "not stated" wherever the source is silent.

CLAIM / STUDY TO EVALUATE:
{{CLAIM}}

Ground Truth doesn't run the model for you — deliberately. The method is ours; the judgement stays yours.
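For readers who want to fill the evaluation prompt in a script rather than by hand, here is a minimal sketch of the placeholder substitution. The function name `build_prompt`, the fallback text "unnamed tool", and the bracketed stand-in for the middle of the template are all assumptions made for illustration; only the `{{SUBJECT}}`, `{{MODE}}`, and `{{CLAIM}}` placeholders come from the prompt itself, and what the page actually inserts for each mode is not stated here.

```python
# Abbreviated copy of the prompt; the bracketed line stands in for the
# eight questions and the sourcing instruction, which are pasted unchanged.
TEMPLATE = (
    "You are evaluating an impact or effectiveness claim about a digital-health "
    'or AI health tool ({{SUBJECT}}), using the "Ground Truth" method '
    "(groundtruth.health). I will give you a study, press release, evaluation, "
    "or vendor claim. {{MODE}}\n\n"
    "[eight questions and sourcing instruction go here, unchanged]\n\n"
    "CLAIM / STUDY TO EVALUATE:\n{{CLAIM}}"
)


def build_prompt(claim: str, subject: str = "", mode_instruction: str = "") -> str:
    """Fill the three placeholders; the claim is required, the rest optional."""
    claim = claim.strip()
    if not claim:
        raise ValueError("claim text is required")
    # "unnamed tool" is a hypothetical fallback for the optional subject field.
    filled = TEMPLATE.replace("{{SUBJECT}}", subject.strip() or "unnamed tool")
    filled = filled.replace("{{MODE}}", mode_instruction.strip())
    # Substitute the claim last so pasted text is never scanned for placeholders.
    return filled.replace("{{CLAIM}}", claim)


print(build_prompt("App X reached one million users.", subject="App X"))
```

The script only assembles text. Reading the assistant's answer, and checking its quoted numbers against the source, remains the reader's job.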

The last word is an outcome, not a dashboard

Every trap in this guide is a substitution: a monitoring number standing in for an impact number, a p-value standing in for a magnitude, a proxy standing in for a health outcome, a correlation standing in for a cause, a benchmark standing in for safety, a cost-per-user standing in for a cost-per-outcome. Each substitution is convenient, quotable, and usually true as far as it goes — which is exactly why it travels. The work of reading a claim well is mostly the work of noticing the swap, and asking for the thing that was swapped out.

A reach curve tells you a tool was opened. An engagement chart tells you who came back. Only an outcome, measured against a randomized counterfactual, tells you whether anyone was better off.

So before a digital-health or AI tool is funded, scaled, or announced as a success, ask it for the one number a dashboard can't fake: the health outcome it changed, against the outcome that would have happened anyway. If that number exists, everything else is context. If it doesn't, everything else is decoration.

Frequently asked questions

What is the difference between a monitoring metric and an impact metric?

A monitoring metric measures whether an intervention was delivered and used — reach, installs, messages, active users. An impact metric measures whether it changed a health outcome, against a counterfactual. WHO's monitoring-and-evaluation framing splits the two: monitoring tracks the quality and fidelity of the intervention's inputs; evaluation asks about outputs and impacts, including health outcomes and cost-effectiveness. A reach number tells you a tool was used, never that anyone got healthier.

Can a statistically significant result still be too small to matter?

Yes. Significance answers "is there any difference?"; it says nothing about size. With a large sample a tiny effect becomes significant. In the pre-registered PROMPTS trial (N=6,139), the significant effects were +0.06 to +0.09 SD on self-reported indices — a 0.09 SD gap means the arms overlap by about 96%. Real, but small. Ask for the effect size and its confidence interval, not just the p-value.
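The overlap figure can be reproduced with a short calculation. This sketch assumes both arms are normally distributed with equal variance, which is a standard illustration rather than a property reported by the trial; under that assumption the shared area of two curves whose means differ by d standard deviations is 2Φ(−d/2). The function name `overlap_fraction` is chosen here for illustration.

```python
from statistics import NormalDist


def overlap_fraction(effect_size_sd: float) -> float:
    """Shared area of two unit-variance normal curves whose means differ by effect_size_sd."""
    # The curves cross midway between the two means; each tail beyond that
    # midpoint contributes Phi(-d/2) to the shared area.
    return 2 * NormalDist().cdf(-abs(effect_size_sd) / 2)


for d in (0.06, 0.09):
    print(f"{d:.2f} SD gap -> arms overlap by {overlap_fraction(d):.1%}")
```

For a 0.09 SD gap this gives roughly 0.96, matching the figure quoted above. Converting an effect size into overlap is one way to put it "in units I can picture".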

Why were the PROMPTS trial's effects so small?

Two things: one certain, one an inference. Certain: the outcomes were self-reported indices in standard deviations, where even a real 0.06–0.09 SD effect is a near-total overlap between arms. The inference: participants were pre-selected. They were recruited at health facilities during antenatal care (ANC), and about 80% (4,965/6,139) had already received ANC, so the sample starts near the ceiling on care-seeking, which plausibly compresses any effect. The trial doesn't test that ceiling directly, so treat it as context for the small numbers, not a proven cause.

Why does a self-reported improvement not equal a health outcome?

Self-reported knowledge, intentions or behavior are proxies for health, not health itself. The PROMPTS authors state that all their outcomes were self-reported and the study "was not powered to detect effects on health outcomes." Knowing about danger signs, or reporting a clinic visit, is upstream of a measured outcome such as a diagnosis or a death averted. Ask whether the claim rests on a measured outcome — and whether the study was designed to detect one.

Does an engagement–outcome correlation prove a health tool works?

No — it's selection, not causation. The people who engage differ systematically from those who don't. In the Mobile WACh NEO engagement analysis, messaging was self-selected, and the associations contradicted each other (more messaging predicted more danger-sign knowledge but lower odds of early breastfeeding). The authors state they cannot conclude messaging directly affected any outcome. Only a randomized comparison isolates what the tool caused.
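A toy simulation shows how self-selection alone can manufacture an engagement-outcome gap. Everything in it is invented for illustration (the "motivation" trait, the noise levels, the sample size); it uses no data from Mobile WACh NEO or any other study. By construction the tool has zero effect, yet comparing engaged with non-engaged users still shows a large gap, while a coin-flip assignment does not.

```python
import random
from statistics import mean


def simulate(n: int = 20_000, seed: int = 0) -> tuple[float, float]:
    """Return (gap among the self-selected, gap under random assignment)."""
    rng = random.Random(seed)
    by_engagement = {True: [], False: []}
    by_assignment = {True: [], False: []}
    for _ in range(n):
        motivation = rng.gauss(0, 1)                # unobserved trait
        outcome = motivation + rng.gauss(0, 1)      # the tool adds nothing to the outcome
        engaged = motivation + rng.gauss(0, 1) > 0  # motivated people choose to engage
        assigned = rng.random() < 0.5               # randomized arm, unrelated to motivation
        by_engagement[engaged].append(outcome)
        by_assignment[assigned].append(outcome)
    naive_gap = mean(by_engagement[True]) - mean(by_engagement[False])
    randomized_gap = mean(by_assignment[True]) - mean(by_assignment[False])
    return naive_gap, randomized_gap


naive_gap, randomized_gap = simulate()
print(f"engaged vs. not engaged: {naive_gap:+.2f} SD-units")
print(f"randomized arms:         {randomized_gap:+.2f} SD-units")
```

The first gap reflects who chose to engage, not what the tool did. Only the randomized comparison recovers the true effect, which here is zero.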

Does a validated or benchmark-topping medical AI mean it is clinically safe?

Not on its own. Passing a medical-exam benchmark is a different test from being safe on open patient questions. In a 2026 npj Digital Medicine study, sixteen physicians red-teamed four public chatbots on 222 patient questions and found problematic-answer rates of 21.6% to 43.2% and outright unsafe-answer rates of 5.0% to 13.5%. Ask what the safety claim was measured on, whether clinicians reviewed real outputs, and where the human-escalation path is.

Does a null trial mean digital health does not work?

No. A null randomized trial is what honest measurement looks like, not proof a field is useless. Rigorous SMS trials — WelTel PMTCT, Mobile WACh NEO — returned null primary results, while other pre-registered trials find small but real effects. The lesson is literacy, not cynicism: demand the randomized evidence and the effect size, and treat "it's solved" and "it's hopeless" alike as claims to check.

What does a good evaluation look like when the model changes every few weeks?

A short core holds no matter how fast the technology moves: a measured health outcome (not a reach or engagement count), a counterfactual to compare it against (randomized where feasible), a sample large enough to detect an effect that matters, a human in the loop (clinicians reviewing real outputs, plus a live escalation path), and independence from anyone who profits from the verdict. The classical RCT furniture — pre-registration, confidence intervals, a named stopping rule, cost per outcome — still helps, but a chatbot built on an LLM updates in weeks, so a two-year fixed-protocol trial can judge a version no one runs anymore. The cadence has to adapt — rolling read-outs, continuously refreshed test sets — even as that core stays fixed.
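What "a sample large enough to detect an effect that matters" means in practice can be sketched with the textbook normal-approximation formula for comparing two means, n per arm = 2(z₁₋α/₂ + z_power)² / d². This is a simplification under stated assumptions: individual randomization, equal arm sizes, a two-sided test, and no clustering. A cluster-randomized design such as PROMPTS would need more participants than this, by a design effect this sketch does not model. The function name `n_per_arm` is illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm(effect_size_sd: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Participants needed per arm to detect a standardized mean difference."""
    if effect_size_sd <= 0:
        raise ValueError("effect size must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # two-sided significance threshold
    z_power = z.inv_cdf(power)          # quantile for the desired power
    return ceil(2 * (z_alpha + z_power) ** 2 / effect_size_sd ** 2)


for d in (0.5, 0.09, 0.06):
    print(f"effect of {d} SD -> about {n_per_arm(d)} per arm")
```

Because the required sample grows with the inverse square of the effect, halving the effect you care about roughly quadruples the trial. Effects in the 0.06 to 0.09 SD range need thousands of participants per arm even before any clustering adjustment.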

Sources

Every figure above, traceable to a public primary source

PROMPTS / Jacaranda cluster RCT — effect sizes (+0.06 to +0.09 SD), null danger-sign domain, and the self-reported / "not powered for health outcomes" limitation (Vatsa et al., PLOS Medicine 2025; 22(2):e1004527): journals.plos.org. Pre-registered at ClinicalTrials.gov NCT05110521 and AEA RCT Registry AEARCTR-0008449. The same paper reports Jacaranda's own unit-cost estimate of ~US$0.74 per participant for their lifetime on the platform.
The "almost twice as likely to start using PPFP" marketing line (first-party): jacarandahealth.org
Retention decay — median 15-day 3.9% / 30-day 3.3%, >80% drop in daily opens by day 10, across 93 mental-health apps (Baumel et al., J Med Internet Res 2019; 21(9):e14567): jmir.org
Engagement is not causation — self-selected messaging and contradictory associations in a null trial (Peng et al., PLOS Digital Health 2025; 4(8):e0000968): journals.plos.org
Null SMS RCT — WelTel PMTCT, no effect on retention (RR 1.02, 95% CI 0.92–1.14, p=0.697) (Scientific Reports 2023; s41598-023-35817-x): nature.com
Physician red-teaming of patient-facing LLMs — problematic 21.6–43.2%, unsafe 5.0–13.5% (Draelos et al., npj Digital Medicine 2026; 9:241): nature.com
Monitoring vs. impact — WHO (2016), Monitoring and evaluating digital health interventions: apps.who.int
WHO (2019), Recommendations on digital interventions for health system strengthening — "digital health interventions are not a substitute for functioning health systems": who.int
Cost-effectiveness evidence skewed to high-income settings; LMIC evidence "urgently required" (behavior-change review, Interactive J Med Res 2023): i-jmr.org
Cost-effectiveness of digital health — systematic review noting evidence "disproportionately concerned with high-income countries and hospital settings" (Gentili et al., 2022): ncbi.nlm.nih.gov

Figures were re-checked against each primary source in July 2026. Effect sizes, confidence intervals and null results are quoted as reported by the original authors; the overlap figure in the distribution chart is a standard-normal illustration of the paper's own reported effect size. Corrections are welcome.

About Ground Truth

Ground Truth is an independent publication that scrutinizes AI and digital-health claims in global health — the reach figures, accuracy scores, and impact announcements that increasingly decide what gets funded, deployed, and believed. Our aim is not to praise or attack particular products, but to help readers judge the evidence for themselves.

Our standard: every factual claim is traced to a public primary source — the peer-reviewed paper, the trial registry, the WHO document, the organization's own page. We separate what is verified from what is plausible but unconfirmed. We name organizations only when the public record supports it, and only to teach, not to accuse. We correct our own errors in public. We take no funding from, and hold no affiliation with, the companies, funders, or vendors whose work we examine.

Spotted something we got wrong, or a claim worth taking apart? Corrections and tips are welcome at corrections@groundtruth.health. Read the full editorial standard, independence statement, and corrections policy.