How to Read an African-Language AI Benchmark

You do not need to run the models or read the papers to catch most misleading AI claims. You need to ask where the number came from. This guide walks through claims being made today about voice AI for African-language health (Kinyarwanda, Swahili, Hausa, Yoruba, Amharic, Fulani and others, where the users are patients on phones and the stakes are real) and uses each one to teach a question you can ask of any AI number you meet next, including numbers about your own work and numbers someone quotes at you in a meeting. Every example below links to its primary source, so you can check the working.

The one idea

Almost every headline AI score hides three things. Learn to ask for all three and most claims deflate to something honest and useful.

The conditions. What, exactly, was it tested on?
The metric. What does the score count as a mistake — and what does it quietly ignore?
The scoring. Who or what decided the answer was right, and over how many tries?

Nine claims, taken apart

Claim 01 — the conditions

3.2% word error rate — almost as good as English.

What you're meant to conclude

This system understands the language nearly perfectly.

What's hidden

That 3.2% comes from KinSPEAK, a Kinyarwanda system tested on clean, studio-recorded, read-aloud speech. Put the same language into spontaneous, real-world audio and the best systems land between the high single digits and the mid teens: the in-the-wild benchmark AfriVox-v2 reports Kinyarwanda holding around 8–14% word error rate. A general-purpose model that has not been carefully adapted to the language does worse still. Every one of those numbers is real. They describe completely different situations, and the clean-studio number is the one that reaches the headline, while the noisy-phone number is the one a patient actually lives with.

The question to always ask

On what?

Read speech or spontaneous? Quiet or noisy? Whose accents? Which subject matter? A score with no test conditions attached is not a weak claim. It is not a claim at all.

Claim 02 — the scoring

It answered health questions in Kinyarwanda with 88% accuracy.

What you're meant to conclude

88 out of 100 answers were correct, in Kinyarwanda.

What's hidden

In the evaluation this figure comes from — a set of low-resource voice tests run by Gooey.AI with the Gates Foundation, covering Kinyarwanda, Swahili and Kikuyu — native speakers recorded the questions, but the gold-standard answers were written in English, and a single AI model graded each response for how close it was, in meaning, to that English answer, over roughly 25–30 questions per language. The patient never hears the English. So the score never checked whether the reply was fluent or even grammatical in the target language, whether it was safe, or whether the recognizer had heard the question correctly — a capable model can smooth over a garbled transcript and still produce a plausible English-equivalent. The team is candid that they route through English because target-language evaluation is hard; that honesty is welcome, and it does not change what the number can tell you. "88% accuracy" resolves to "88% resemblance to an English sentence, in the opinion of one model." A useful screen — not evidence that the system is safe to put in front of a patient.

The question to always ask

Accuracy of what, judged how?

Break any accuracy claim into three parts: what was tested, how it was graded, and how many times. Accuracy is never a property of a model. It is a property of a measurement — and the measurement is often narrower than the word suggests.

We read the question set behind that number

There is a more basic problem than the grading: most of these are not health questions. We pulled the evaluation's own golden datasets. The Hindi set asks for the capital of India, the national currency, the country's longest river, the year of independence, the last time India won the cricket World Cup, and the name of a South Indian pancake — across its twenty items, not one is about health, and only one touches agriculture. The Kikuyu set asks who was president of Kenya in 2024 and which wildlife migration crosses the Maasai Mara.

The evaluation frames its own queries as spanning "health, agriculture, general knowledge" — general-purpose, not health-specific — and the golden sets we read are dominated by the last of the three. It is, in the main, a general-knowledge quiz. The program does run a separate, health-worker-specific evaluation for community health workers in Nigeria, with a clinical-safety rubric — but that is not the test these headline African-language accuracy figures come from. When a trivia score is read as progress on health, the number and the mission have come apart. Ask what was tested, and the "health" in "health accuracy" is the first thing that doesn't survive the question.

Claim 03 — the sample size

Model A beats Model B — 0.88 to 0.85.

What you're meant to conclude

A is better. Pick A.

What's hidden

How few tries produced that gap. Evaluations like the one above run on 25 to 30 questions. At 25 questions, each answer carries four points of the score, so a three-point gap is smaller than the weight of a single question and well inside the range you would get from luck: reshuffle the questions and the ranking can flip. You can watch the instability in the same evaluations: in a later round, one model family swung by more than sixty points between adjacent versions on the same test. When neighbouring results jump that far, the test is telling you it is too small to trust at this resolution. A short benchmark produces rankings. Rankings are not the same as real differences.

The question to always ask

Over how many items — and would the gap survive noise?

If a comparison doesn't state its sample size or a margin of error, treat "beats" as decoration, not a finding.
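To see how wide the uncertainty is at this sample size, here is a rough sketch. It assumes each question is scored as an independent pass or fail, which is a simplification (the evaluation discussed above grades similarity to a gold answer), and it takes 25 questions as the sample size. The helper name `wilson_interval` is illustrative and not part of any evaluation's tooling.

```python
import math


def wilson_interval(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a proportion seen over n items."""
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Assumption: 25 questions, the low end of the range quoted in the text.
n_questions = 25
model_a = wilson_interval(0.88, n_questions)
model_b = wilson_interval(0.85, n_questions)

# Two intervals overlap when each one starts before the other ends.
intervals_overlap = model_a[0] <= model_b[1] and model_b[0] <= model_a[1]

print(f"Model A: {model_a[0]:.2f} to {model_a[1]:.2f}")
print(f"Model B: {model_b[0]:.2f} to {model_b[1]:.2f}")
print("Intervals overlap:", intervals_overlap)
```

Under these assumptions both intervals span more than twenty points and each contains the other model's score, which is one way to check whether "beats" would survive a reshuffle of the questions.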

Claim 04 — the metric

15% word error rate, so 85% accurate — far behind English at 5%.

What you're meant to conclude

This language's technology is a long way behind.

What's hidden

What word error rate mechanically counts. It marks an entire word wrong for a single wrong letter or ending. Kinyarwanda, Swahili, Zulu, Amharic and many other African languages build long words by gluing pieces together — so one mistaken piece fails the whole word and inflates the score in a way that doesn't happen the same way in English. Measure the identical output by characters instead of whole words and the error rate often falls severalfold: in one Kinyarwanda scaling study the character error rate was under 2% where the word error rate was above 7%. The character number is usually available and almost never quoted, because the word number looks more dramatic.

The question to always ask

What does this metric count as a mistake — and who does that punish?

For languages with rich word-formation, ask for the character error rate beside the word error rate. And never line up word error rates across two different languages as if they mean the same thing.
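The difference between the two metrics can be sketched with a plain edit-distance calculation. This is a minimal illustration, not the scoring code of any benchmark cited here: the function names are hypothetical, the reference is a common Swahili phrase, and the hypothesis with one wrong final vowel is invented for the example.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference item
                curr[j - 1] + 1,     # insert a hypothesis item
                prev[j - 1] + cost,  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: edit distance over words, divided by reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance over characters (spaces included)."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "ninakupenda sana"   # one long word built from several pieces, plus one short word
hypothesis = "ninakupende sana"  # invented output with a single wrong character

print(f"WER: {wer(reference, hypothesis):.1%}")
print(f"CER: {cer(reference, hypothesis):.1%}")
```

One wrong character out of sixteen fails one word out of two, so the same output scores 50% by words and about 6% by characters. Real gaps are smaller than this toy case, but the mechanism is the same.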

What another 1,000 hours of speech actually buys


Kinyarwanda word error rate and character error rate as training data grows from 1 to 1,400 hours (Whisper large-v3, fine-tuned; log scale). Two things to read. The same system, two metrics: character error rate (slate) runs roughly three to four times below word error rate (red) at every volume, which is exactly the gap Claim 04 is about. What data buys: the first 200 hours do almost all the work (47.6% → 9.8% WER), while the next 1,200 buy under three more points (→ 7.1%), and one hour of fine-tuning actually pushes WER above the untuned baseline before fifty hours transform it. The authors' own caveats matter: the test set is clean, so real-world audio runs higher, and 38.6% of the worst errors traced to noisy transcripts, a matter of quality, not just quantity. Full series (hrs → WER / CER): 1 → 47.6/17.0, 50 → 12.5/3.3, 100 → 10.9/2.8, 150 → 10.2/2.6, 200 → 9.8/2.6, 500 → 8.2/2.2, 1,000 → 7.7/2.0, 1,400 → 7.1/1.9. Source: Akera et al., Sunbird AI (2025), arxiv.org/abs/2510.07221.
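The two readings in the caption can be checked directly from the quoted series. The sketch below uses only the figures given in the caption; the variable names are illustrative.

```python
# (training hours, WER %, CER %) as quoted in the figure caption
series = [
    (1, 47.6, 17.0), (50, 12.5, 3.3), (100, 10.9, 2.8), (150, 10.2, 2.6),
    (200, 9.8, 2.6), (500, 8.2, 2.2), (1000, 7.7, 2.0), (1400, 7.1, 1.9),
]

# Reading 1: how far the character metric sits below the word metric.
ratios = {hours: wer / cer for hours, wer, cer in series}

# Reading 2: what each block of extra data buys, in WER points.
wer_by_hours = {hours: wer for hours, wer, _ in series}
gain_first_200 = wer_by_hours[1] - wer_by_hours[200]
gain_next_1200 = wer_by_hours[200] - wer_by_hours[1400]

for hours, ratio in ratios.items():
    print(f"{hours:>5} h: WER is {ratio:.1f}x CER")
print(f"1 to 200 h:      {gain_first_200:.1f} WER points")
print(f"200 to 1,400 h:  {gain_next_1200:.1f} WER points")
```

The ratio stays between roughly 2.8 and 3.9 across the series, and the first 200 hours account for about 38 WER points against under 3 for the following 1,200.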

Claim 05 — the fine print

Microsoft released a speech model for African languages — so Kinyarwanda is covered.

What you're meant to conclude

The language is handled. Use the big-name model.

What's hidden

The actual list. Microsoft's Paza, released in early 2026, is often described as "a Whisper fine-tune for low-resource African languages." Both halves mislead. It is built on Phi-4 multimodal, not Whisper, and the versions it actually ships cover six languages (Swahili, Dholuo, Kalenjin, Kikuyu, Maasai and Somali, all spoken in Kenya), so Kinyarwanda is not among them. The confusion comes from a companion leaderboard, PazaBench, that scores 39 languages. Scoring a language on a leaderboard is not the same as shipping a model that works in it. "Supports African languages" is a category, not a promise that yours is inside it.

The question to always ask

Which languages, exactly — and built on what?

Open the model card and read the language list. A continent-sized claim usually resolves to a short, specific list, and your language is either on it or it isn't.

Claim 06 — shipping vs. demo

New model does real-time voice translation in 70+ languages.

What you're meant to conclude

It will handle a live phone conversation in the language I care about.

What's hidden

Which 70, in what tier, at what maturity. When Google launched Gemini 3.5 Live Translate in June 2026, the headline was real-time speech-to-speech in 70+ languages. But the documented pairs lean heavily toward major European and Asian ones; coverage differs depending on whether you use the consumer app, the meetings product, or the developer interface; and it shipped as a preview, with no reliability guarantee. For a lower-resource language — Kinyarwanda, Hausa, Amharic — "supported" can quietly mean "routed through general translation infrastructure," not the flagship real-time mode the headline is selling.

The question to always ask

Is my language in the tested set — and is this shipping or a demo?

A capability claim with no conditions attached — which languages, which surface, generally available or preview, verified how — is a headline, not a specification.

Claim 07 — the scorekeeper

This model tops the benchmark.

What you're meant to conclude

Independent testing crowned a winner.

What's hidden

Who built the benchmark. The model sitting at the top of the AfriVox-v2 African-speech benchmark, Sahara-v2, was built by Intron Health — the same company that built AfriVox-v2. That does not make the result wrong, and self-published benchmarks can be perfectly honest — but a scoreboard kept by one of the players is a starting point for scrutiny, not the end of it. The fix is not suspicion; it is corroboration from a test nobody in the race controls.

The question to always ask

Who produced this benchmark — and do they have a horse in the race?

When the scorekeeper is also a competitor, look for an independent result before you cite the ranking as settled.

Claim 08 — the continental average

African-language AI is good now.

What you're meant to conclude

The problem is basically solved across the continent.

What's hidden

Which African language. "African languages" is not one thing: Africa has over 2,000 of them, and AI performance across them is wildly uneven. On the same in-the-wild benchmark, Kinyarwanda and Swahili hold around 8–18% word error rate while Fulani, spoken by tens of millions across the Sahel, swings between 34% and 59%, with languages like Amharic and Yoruba degrading sharply outside clean audio. The gap is not that some languages are intrinsically harder; it is data. Kinyarwanda punches far above its population because of an unusually large open speech corpus of roughly 2,000 hours; Fulani has many more speakers and far less transcribed audio. A continental average hides a near-solved language and a barely-functional one inside the same number, as the full landscape in the data below shows.

The question to always ask

Which language, specifically — and does it have the data behind it?

"Works for African languages" almost always means "works for the two or three with the biggest datasets." Ask where your language sits, and how many hours of transcribed speech exist for it.

Claim 09 — the other direction

The models still can't understand the language well enough — we can't deploy yet, we just need to keep funding it.

What you're meant to conclude

Nothing is usable yet; the responsible move is to wait, and keep spending until the scores reach parity.

What's hidden

The difference between recognizing speech and understanding it. A word error rate is a transcription score, and when researchers decomposed where the gap for underserved speakers actually comes from, most of it sat in the acoustic layer: the models are thrown by accent, pronunciation, rhythm and pitch, not by the grammar or meaning of the language. The landmark study of speech-recognition disparities (Koenecke et al., PNAS 2020) traced the gap to the acoustic model, which was confused by phonetic and prosodic features rather than grammatical or lexical ones. The errors that remain concentrate in specific, high-stakes items (names and numbers) rather than spreading as general incomprehension: on the in-the-wild benchmark below, even the best model still misses roughly one in five numbers and one in four named entities. And an accent-shaped gap is narrow and closable: fine-tuning on accented clinical speech has cut medical transcription error by 25–34%.

So "not good enough" quietly does the mirror image of the hype claims. It takes the harshest number (verbatim accuracy on noisy, accented audio), reads it as "the AI can't understand the language," and turns that into a case for indefinite delay and open-ended budgets. The honest read runs the other way: comprehension is closer than the word error rate implies, the remaining gap is a specific accent-and-data problem you can name and cost, and the move is to deploy for what already works, with a read-back step to catch the names and numbers that matter, while funding the specific gap, not "the language" in the abstract.

The question to always ask

Not good enough for what — and is the gap comprehension, or accent?

A low score is a specification, not a verdict. Ask which layer is actually failing — real misunderstanding, or an accent the recognizer mis-transcribes — and whether "not ready" is a finding or a funding position. Hype and defeatism misread the same number in opposite directions.

Now try it on a real one

Everything above is a question to ask. Here is a real benchmark to ask them of: AfriVox-v2, which tested modern speech models on unscripted, in-the-wild audio across twenty African languages and ten domains. Below the grid, we put all nine of the guide's questions to this one benchmark — and show how to read each answer.

First, the scoreboard, and who keeps it (Claim 07). The best performer, Sahara-v2, was built by the team that built the benchmark. Note too that the frontier multimodal model, Gemini 3 Flash, trails the region-tuned Sahara-v2 by about eight points and is level with the largest speech-native Omnilingual model: text reasoning is not acoustic transcription. Average word error rate on in-the-wild African speech, lower is better:

Model	Average WER (%)
Sahara-v2 (region-tuned)	23.8
Gemini 3 Flash (multimodal LLM)	32.1
Omnilingual CTC 7B	32.2
Omnilingual CTC 1B	33.9
Omnilingual CTC 300M	39.2

Now that best model's full landscape: every language, every domain (Claim 08, made concrete). Rows sort best-to-worst by average, and each row carries the number of test sentences behind it. An asterisk marks a small, noisy test set.


Sahara-v2 word error rate (percent) by language and domain, from the AfriVox-v2 in-the-wild benchmark, July 2026. Lower is better. An asterisk marks a small or noisy test set (under 500 sentences); a dash means no or too few samples.
Language	n (sentences)	General	Health	Agriculture	Finance	Government	Education	Culture	Transport	Sports	Telecom	Average
Kinyarwanda	2597	8.4	14.1	8.3	9.3	9.2	9.3	10.6	12.0	10.7	13.4	10.5
Twi*	376	9.5	10.2	–	7.1	–	33.7	6.0	8.3	20.0	–	13.5
Swahili	2350	14.8	12.2	13.6	13.8	18.2	13.8	17.4	14.5	17.4	15.4	15.1
Yoruba	2064	15.9	16.7	13.7	16.1	17.6	14.9	16.4	15.8	16.6	19.6	16.3
Tswana	1939	12.0	15.8	9.0	14.1	14.3	14.8	20.5	13.9	32.2	18.5	16.5
Hausa	2083	17.7	18.4	13.9	16.9	18.8	17.7	18.5	17.3	17.3	18.9	17.5
Arabic	3366	18.5	11.4	26.6	13.2	22.5	13.1	24.7	17.5	12.9	20.4	18.1
Luganda	2131	21.0	18.0	18.9	17.1	16.8	16.3	20.9	23.0	18.7	16.1	18.7
Igbo	1363	16.4	20.4	15.0	17.3	18.8	17.9	17.7	21.9	22.3	20.5	18.8
French*	376	13.5	13.0	23.8	20.2	21.9	12.6	–	26.6	20.0	–	18.9
Zulu	2720	13.9	8.7	13.5	15.5	17.4	15.8	23.3	25.2	29.6	27.2	19.0
Sesotho	2769	15.4	20.2	9.2	17.1	16.1	21.6	24.6	21.0	21.8	30.2	19.7
Xhosa	2430	19.1	19.1	18.2	22.8	26.5	25.6	27.6	26.4	32.2	31.4	24.9
Afrikaans	3301	19.3	19.9	22.5	18.5	22.3	22.9	34.5	19.1	38.6	31.5	24.9
Shona	876	20.7	8.4	25.0	30.2	28.5	24.5	28.1	22.3	26.3	36.7	25.1
Akan*	449	21.3	24.8	44.5	21.6	–	24.5	20.1	19.7	23.0	30.6	25.6
Amharic*	277	24.6	21.7	28.5	30.2	30.6	26.4	19.7	30.0	30.3	31.9	27.4
Pedi	2450	15.3	22.8	17.5	29.6	27.3	28.5	29.0	27.9	32.8	50.0	28.1
Fulani*	46	35.4	34.7	–	59.4	44.1	42.2	43.2	36.8	39.3	50.0	42.8

* small test set (under 500 sentences): Amharic, Akan, French, Fulani, Twi. Read those rows as rough signals. Ga is omitted (too few samples in every domain). A dash means no data, not perfect accuracy.

Source: AfriVox-v2 (Awobade, Ashungafac & Olatunji, Intron Health, 2026), Tables 2, 5 and 7, arxiv.org/abs/2605.03590. A July 2026 snapshot of a fast-moving field; sample sizes are shown in every row and small test sets are marked with an asterisk.
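One way to read the grid without trusting a single average is to separate well-sampled rows from small ones and look at the spread. The sketch below copies the sample sizes and row averages from the table above; the 500-sentence threshold is the one the table's own asterisk uses, and the variable names are illustrative.

```python
from statistics import mean

# language: (test sentences, average WER %), copied from the table above
rows = {
    "Kinyarwanda": (2597, 10.5), "Twi": (376, 13.5), "Swahili": (2350, 15.1),
    "Yoruba": (2064, 16.3), "Tswana": (1939, 16.5), "Hausa": (2083, 17.5),
    "Arabic": (3366, 18.1), "Luganda": (2131, 18.7), "Igbo": (1363, 18.8),
    "French": (376, 18.9), "Zulu": (2720, 19.0), "Sesotho": (2769, 19.7),
    "Xhosa": (2430, 24.9), "Afrikaans": (3301, 24.9), "Shona": (876, 25.1),
    "Akan": (449, 25.6), "Amharic": (277, 27.4), "Pedi": (2450, 28.1),
    "Fulani": (46, 42.8),
}

SMALL_N = 500  # the table's asterisk threshold

small_rows = sorted(lang for lang, (n, _) in rows.items() if n < SMALL_N)
well_sampled = {lang: avg for lang, (n, avg) in rows.items() if n >= SMALL_N}

best = min(well_sampled, key=well_sampled.get)
worst = max(well_sampled, key=well_sampled.get)

all_averages = [avg for _, avg in rows.values()]
unweighted_mean = mean(all_averages)
spread_ratio = max(all_averages) / min(all_averages)

print("Rough-signal rows:", ", ".join(small_rows))
print(f"Well-sampled range: {best} {well_sampled[best]}% to {worst} {well_sampled[worst]}%")
print(f"Unweighted mean of row averages: {unweighted_mean:.1f}%")
print(f"Worst row is {spread_ratio:.1f}x the best row")
```

The unweighted mean lands near 21%, a figure that describes neither Kinyarwanda nor Fulani. Note that this simple mean of row averages is not the benchmark's own headline average, which may be weighted differently.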

How to read this benchmark: the nine questions, answered

You now have nine questions. Here is each one put to AfriVox-v2, with what the data says and how to read it — the same moves you’d run on any benchmark someone sets in front of you.

Claim 01 — the conditions
On what?
In the data — AfriVox-v2 is unscripted, in-the-wild audio — podcasts, interviews, real conversation — which is why Kinyarwanda sits at 8–14% here rather than the 3.2% a clean studio test reports.
How to read it — This is close to the world you'd deploy in, so trust it over a demo, but it's recorded media, not a stressed caller on a bad line, so treat these error rates as a realistic best case, not the worst case.
Claim 02 — the scoring
Accuracy of what, judged how?
In the data — The error rates are word error rate against verbatim transcripts written and double-checked by native speakers — a strong standard; but which domain column an utterance falls in was labelled by an AI at about 42% precision.
How to read it — Trust the numbers; treat the domain split as soft. The error rate is human-graded; the category is a machine's guess.
Claim 03 — the sample size
Over how many items?
In the data — Every row shows its sample size. Kinyarwanda rests on 2,597 sentences and Swahili on 2,350; Fulani on 46 and Amharic on 277 — which is why those rows carry an asterisk.
How to read it — A gap between two well-sampled languages is real; a single Fulani cell is a rough signal. Check the n= before you quote it.
Claim 04 — the metric
What does the metric punish?
In the data — These are whole-word error rates. For Kinyarwanda, Zulu and other languages that build long words from many pieces, one wrong piece fails the entire word, so the number runs harsher than the felt experience — and the gentler character error rate isn't shown.
How to read it — Read each cell as an upper bound on “how wrong.” For a morpheme-rich language the character-level error is usually several times lower than the word number here.
Claim 05 — the fine print
Which languages, exactly — on what base?
In the data — Twenty languages are listed, but the usable set is smaller: Ga had too few samples to report at all, five more are small-n, and the models that advertise “1,600+ languages” (the Omnilingual family) sit near the bottom of the scoreboard.
How to read it — Don't read the row list as “all twenty work.” Read which rows have both a real sample size and a low error rate.
Claim 06 — shipping vs. demo
Shipping, or a demo?
In the data — These are research numbers on model checkpoints, not a product with a service guarantee. The top model, Sahara-v2, is something you'd have to integrate; the one shipping API in the grid, Gemini 3 Flash, trails it.
How to read it — A strong cell means “this is achievable today,” not “there's a supported service in your language tomorrow.” A benchmark is not a product.
Claim 07 — the scorekeeper
Who keeps the scoreboard?
In the data — Sahara-v2 tops the grid — and Intron Health, which built Sahara-v2, also built this benchmark. The paper discloses it, and every model ran under identical conditions.
How to read it — Take the number-one spot as a credible starting point, not a settled verdict. Look for a result on a benchmark Intron doesn't control before citing the ranking as final.
Claim 08 — the continental average
Which language, specifically?
In the data — The grid is the answer to “African-language AI is good now”: Kinyarwanda averages about 10%, Fulani about 43% — a four-fold spread hiding inside one phrase — and it tracks data, not difficulty (Kinyarwanda's ~2,000-hour corpus versus Fulani's near-absent transcribed audio).
How to read it — Never accept the average. Find your language's row, then your domain's cell — Shona is 8.4% in Health but 36.7% in Telecom.
Claim 09 — the other direction
Not good enough for what — comprehension, or accent?
In the data — The benchmark breaks errors out by type: even the best model misses about 20% of numbers and 23% of named entities, against roughly 16% on general speech. The failures cluster in names and numbers — specific, catchable items — not in general understanding.
How to read it — Before concluding “it can't understand the language,” ask what's actually failing. Here it's mostly accent and entity transcription — the kind of thing a read-back step catches — not comprehension.

The bottom line

So — is African-language voice AI good enough? Read honestly, this one benchmark answers: for Kinyarwanda health, close; for Fulani finance, not remotely — measured on the harsher of two metrics, by a scorekeeper with a horse in the race, on domain labels a machine guessed, with the worst rows resting on a few dozen sentences.

Every one of those qualifications came from asking the nine questions. That is the entire skill. It is also why the honest answer to “is it good enough” is never a number — it is for which language, in which domain, measured how, and how do you know.

The field guide

Run these against any AI claim, in any domain

On what was it tested? Read or spontaneous speech, clean or noisy audio, whose accents, which topic. The conditions are the whole story.
What does the metric count — and what does it hide? Ask for character error rate alongside word error rate for languages with rich word-formation.
How was "right" decided, and by whom? A human native speaker? An AI judge? Against an answer written in which language?
Over how many items — and would the gap survive noise? A few dozen questions can't separate two close models. Small samples make rankings, not differences.
Which languages, exactly — and on what base model? Read the shipped model's language list. A leaderboard entry is not a working model.
Is this shipping or a preview — and is my language in the tested set? A language count is not a coverage guarantee.
Who produced the benchmark, and are they a player in it? A self-graded result needs an independent second opinion.
Is this a continental average hiding wide variance? "African languages" spans near-solved and barely-functional. Ask about your specific language, and how much data exists for it.
Is a low score a verdict, or a specification? Ask what it's not-good-enough for, and which layer is failing — real misunderstanding, or an accent the recognizer mis-transcribes. "Not ready" can be a funding position, just as "solved" can be a sales one.

And the one that sits underneath all of them: is this number measuring the world you'll deploy in — the clean lab, or the noisy phone in someone's hand? A figure from a fast-moving field also has a shelf life; a ranking from six months ago may already be obsolete. When in doubt, ask what world the number came from, and when.

What the score can't hear

Every number in this guide shares one limit, even a number you have fully taken apart: it was made in a lab, against a written transcript. Word error rate is a lexical metric — it counts mismatched words and weights them all equally, so a dropped “um” and a “not” flipped to “now” score the same, though one is nothing and the other reverses the meaning. Researchers have long shown that a lower word error rate does not reliably translate into a better experience for the person using the system. The metric has no notion of whether a reply was fluent in the language, whether it was safe, or whether the listener understood.
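The equal weighting is easy to demonstrate. In this small sketch the sentences are invented for illustration and the helper names are hypothetical: one hypothesis drops a filler word, the other flips "not" to "now", and both receive the same score.

```python
def word_edit_distance(ref_words: list[str], hyp_words: list[str]) -> int:
    """Fewest word insertions, deletions and substitutions to turn ref into hyp."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate against a written reference transcript."""
    ref_words = reference.split()
    return word_edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "um do not take two tablets"      # invented example sentence
drops_filler = "do not take two tablets"      # harmless: the filler is gone
flips_meaning = "um do now take two tablets"  # harmful: the instruction is reversed

harmless_score = wer(reference, drops_filler)
harmful_score = wer(reference, flips_meaning)

print(f"Dropped filler: {harmless_score:.1%}")
print(f"Flipped 'not':  {harmful_score:.1%}")
```

Both hypotheses contain exactly one word-level error out of six reference words, so the metric cannot tell the harmless transcript from the dangerous one.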

This is not an argument that the metric is worthless. For plain transcription, and for pulling facts back out of a recording, word error rate tracks reality well enough: retrieval holds up as long as word error rate stays below about 25%. But it is a proxy, and the further a task moves from “type out these words” toward “hold a safe, helpful conversation in someone’s language,” the less the proxy sees. Even the reference it is scored against is human and imperfect: on hard audio, a better model can be marked wrong for catching words the human transcriber missed.

The thing a benchmark cannot do is hear a conversation. It cannot tell you whether a nurse in Kigali, speaking Kinyarwanda at her normal pace over a bad line, was understood — or whether the answer she got back was one she could act on. The field knows this. It is why evaluation keeps moving out of the studio and into unscripted, in-the-wild recordings, why even a careful scaling study flags that its clean test set may not represent deployment, and why the teams that route their evaluations through English say plainly that judging the answer in the actual language is the hard, unfinished part.

Treat every score here the way you would treat a lab result before a diagnosis: necessary, clarifying, and not the last word. The last word — the ground truth — is a native speaker, in a real conversation, telling you whether it worked.

So before a voice system goes in front of the people who will depend on it, put it in front of them first, and listen. That is the test no leaderboard can run for you.

Frequently asked questions

How do I know if an AI accuracy claim is trustworthy?

Decompose it into what was tested (which data, in which conditions), how it was graded (a human native speaker, or an AI judge, and against an answer in which language), and how many items it covered. Accuracy is a property of a measurement, not of a model. If a claim omits the conditions, the metric, or the sample size, treat it as marketing until proven otherwise.

What does word error rate (WER) actually measure?

The percentage of words transcribed incorrectly, counting insertions, deletions and substitutions. It marks a whole word wrong for one wrong letter or ending, which penalizes languages that build long words from many pieces. For those languages, character error rate is usually lower and more honest — and rarely the number quoted.

Why can a 90% accuracy score be misleading?

Because "accuracy" and "percent" say nothing about what was measured or how. A score can be computed against an answer in a language the user doesn't speak, graded by a single model, over a few dozen questions. It can be high while the system is unsafe, unnatural in the target language, or running on a recognizer that mis-heard the question. The number describes the test, not the experience.

Are all African languages equally well supported by AI?

No. The variation is enormous. On the same in-the-wild benchmark, Kinyarwanda and Swahili hold around 8–18% word error rate, while Fulani ranges from 34% to 59%, and languages like Amharic and Yoruba degrade sharply outside clean conditions. The gap is driven mostly by how much transcribed speech data exists for each language. "Works for African languages" usually means it works for the two or three with the largest datasets.

Does "supports African languages" mean my language works?

Not necessarily. It is a category, not a guarantee. Open the model card, read the list of languages the shipped model actually covers, and check the base model. A language on a leaderboard, or in a headline count, is not the same as a released model that works in it at flagship quality.

Does a low benchmark score mean African-language AI is unusable?

No. A score is fitness for a specific purpose, not a verdict. For speech, much of the gap for underserved languages sits in the acoustic layer — accent, pronunciation, prosody — rather than in understanding grammar or meaning, and the errors that remain concentrate in specific items like names and numbers that a read-back step can catch. Fine-tuning on accented speech has cut medical transcription error by 25–34%. "Not ready" can be a motivated claim that justifies delay and budgets, and deserves the same scrutiny as "it's solved."

Does a low word error rate mean a voice AI system actually works?

Not on its own. Word error rate measures transcription against a written reference; it weights every word equally and doesn’t reliably predict whether a conversation was understood, safe, or useful — a dashboard can read “green” while clinical notes come out wrong. It’s a useful screen, not a verdict. The real test is native speakers using the system in a live conversation, in the conditions it will actually run in.

Sources

Every claim above, traceable to primary material

KinSPEAK — Kinyarwanda ASR, 3.2% / 15.9% WER and syllabic tokenization: arxiv.org/abs/2308.11863
AfriVox-v2 — in-the-wild African-language benchmark (Intron Health), cross-domain WER and Sahara-v2: arxiv.org/abs/2605.03590
Kinyarwanda data scaling — WER/CER vs training hours (Akera et al., Sunbird): arxiv.org/abs/2510.07221
Meta Omnilingual ASR — 1,600+ language recognition, CER reporting: arxiv.org/abs/2511.09690
Microsoft Paza — Phi-4-multimodal base, six Kenyan languages, PazaBench: Microsoft Research
Google Gemini 3.5 Live Translate — 70+ language real-time speech-to-speech: blog.google
Gooey.AI × Gates Foundation — low-resource language voice evaluations and methodology: gooey.ai/language-evaluation
Gooey.AI × Gates Foundation — the evaluation writeup, which describes the queries as "health, agriculture, general knowledge," grades against English gold answers via LLM-as-judge, and reports the per-language accuracy figures: published evaluation document, with the accompanying golden-answer datasets (Hindi, Swahili, Gikuyu, Kinyarwanda) — the general-knowledge question sets quoted above.
Racial disparities in automated speech recognition (Koenecke et al., PNAS 2020) — the recognition gap arises in the acoustic model, not the language model: pnas.org
Performant ASR for medical entities in accented speech (Olatunji et al.) — accent fine-tuning cuts medical WER 25–34%: arxiv.org/abs/2406.12387
Limitations of word error rate — WER is lexical and does not guarantee downstream or end-user improvement: arxiv.org/abs/2604.21928
Word error rate overview — where WER does and doesn’t predict task performance (retrieval robust below ~25%): ScienceDirect
“Word error rate is broken” — how WER can penalize more-accurate models and misjudge meaning: AssemblyAI

Figures reflect a fast-moving field and are current as of July 2026; specific model rankings and version numbers date quickly. Corrections and additions are welcome.

About Ground Truth

Ground Truth is an independent publication that scrutinizes AI claims in global health — the benchmarks, accuracy scores, and capability announcements that increasingly shape what gets funded, deployed, and believed. Our aim is not to praise or attack particular products, but to help readers read the evidence for themselves.

Our standard: every factual claim is traced to a primary source — the paper, the model card, the registry, the leaderboard. We separate what is verified from what is plausible but unconfirmed. We correct our own errors in public. We take no funding from, and hold no affiliation with, the companies, funders, or vendors whose work we examine.

Spotted something we got wrong, or a claim worth taking apart? Corrections and tips are welcome at corrections@groundtruth.health.