Against WER: Phonemic Evaluation, Orthographic Transparency, and the Script Advantage for Manding ASR
Abstract
Automatic speech recognition for Manding languages is usually reported through
Latin-script word error rate. This paper argues that the metric is scientifically
weak for the research question at hand. If the goal is to evaluate whether an ASR
system recognizes Bambara, Maninka, Dioula, or related Manding speech, then the
scoring units should preserve the acoustic-phonemic distinctions carried by the
language. Latin Bambara orthography is useful and socially real, but it is not a
lossless measurement interface: it uses digraphs for single phonemes, leaves tone
unmarked or inconsistently represented, and allows convention-dependent variation.
N'Ko, by contrast, was designed for Manding phonology and gives the ASR system a
more transparent character target.
The core contribution is a metric argument. I formalize the difference between a
transparent script map $f_N:\Phi \rightarrow \Sigma_N$ from phonemic units to
script units and a variable-length Latin transcription relation
$R_L \subset \Phi^* \times \Sigma_L^*$. Under normalization assumptions, edit
distance over a bijective or near-bijective script preserves phoneme-edit structure
more directly than word error rate over a many-to-many transcription convention. It
does not become a perfect phoneme error rate: tone policy, diacritics, punctuation,
Unicode normalization, reference quality, and scorer granularity still matter. But
N'Ko character error rate is more interpretable for Manding ASR than Latin WER
because a character substitution is closer to a sound-symbol substitution, while a
Latin word error can mix acoustic error, digraph segmentation, spelling convention,
and tokenization.
The paper also defines the claim boundary needed for the 20.57% N'Ko CER anchor in
the broader project. The anchor is meaningful because it is a direct N'Ko ASR
number over script-native output; it should not be translated into a Latin WER
leaderboard claim or used to assert that N'Ko beats Latin under every matched
condition. The rigorous conclusion is narrower and stronger: for Manding ASR,
N'Ko CER is the better primary measurement target when the scientific object is
phonemic speech recognition rather than agreement with a Latin orthographic
convention.
Introduction
ASR metrics are not neutral. They decide which errors count, which distinctions are
visible, and which writing system becomes the default infrastructure for speech
technology. In high-resource English ASR, word error rate is convenient because
there is a dominant written standard, large reference corpora, and a long benchmark
history. In Manding ASR, the situation is different. Bambara and related languages
are written in Latin orthographies and in N'Ko; these scripts do not preserve the
same information, and their scoring units do not have the same linguistic meaning.
This paper argues against treating Latin WER as the final metric for the current
research program. The goal is not merely to make a system output readable Latin
Bambara. The goal is to evaluate script-native recognition for Manding speech and
to preserve the phonemic structure that N'Ko was designed to encode. A word-level
Latin metric can be useful for applications that require Latin output. It is not
the right primary metric for deciding whether a direct N'Ko ASR system recognizes
speech.
The argument has three parts. First, the scripts differ structurally. Latin Bambara
uses variable-length conventions such as ny and ng for single
phonemic units, while N'Ko gives many Manding sound distinctions dedicated script
units or explicit combining machinery. Second, the metrics differ mathematically.
WER operates over whitespace-delimited tokens and is sensitive to segmentation,
orthographic convention, and word normalization. CER over normalized N'Ko operates
over smaller units that are closer to the acoustic-phonemic sequence. Third, the
research claims differ. A 20.57% N'Ko CER anchor is
not a universal leaderboard result. Its value is that it provides a concrete,
phonemically interpretable measurement regime.
Research Questions and Hypotheses
The paper is organized around four questions.
Caption: Research questions and evidence requirements.
| ID | Question | Evidence needed |
|---|---|---|
| RQ1 | Does the target script change what ASR errors mean? | Formal relation between phonemic units, script units, and scorer units. |
| RQ2 | Is Latin WER sufficient for Manding ASR? | Analysis of digraphs, tone, segmentation, and word-boundary assumptions. |
| RQ3 | Is CER equivalent to phoneme error rate? | Normalization and mapping scope conditions; explicit boundary between CER and PER. |
| RQ4 | How should the 20.57% N'Ko CER anchor be interpreted? | Edit arithmetic and non-comparability conditions. |
The corresponding alternative hypothesis is that N'Ko CER is more phonemically
interpretable than Latin WER for Manding ASR. The corresponding null is that script
choice only changes surface rendering and that Latin WER and N'Ko CER measure
equivalent ASR quality.
The null is implausible if a Latin word error can hide multiple phonemic
confusions, if a single phoneme can span multiple Latin characters, or if tone and
diacritic policy changes the reference without changing the acoustic event. The
paper does not need to prove that every N'Ko model beats every Latin model to
reject the stronger assumption that script choice is only cosmetic.
Claim taxonomy
The paper separates three claims that are often conflated. A *social script
claim* says that Latin and N'Ko are both legitimate writing practices for Manding
communities. A *measurement claim* says that the two scripts induce different
scoring units and therefore different error meanings. A *model-performance
claim* says that a particular ASR system achieves a particular error rate under a
specified script. This paper defends the measurement claim. It does not deny Latin
literacy, and it does not use metric theory alone to prove that one trained model
must outperform another.
Caption: Claim levels for script and metric arguments.
| Claim level | What it establishes | What it does not establish |
|---|---|---|
| Social validity | A community uses and recognizes a script. | That the script is the best scorer for every ASR task. |
| Orthographic transparency | Script units preserve more target phonemic contrasts. | That references are error-free or dialect-neutral. |
| Metric validity | Scorer units align with the scientific object being measured. | That a particular model has reached the best possible score. |
| Benchmark performance | A model reaches a reported score under fixed conditions. | That the score transfers to another script, split, or hyperparameter regime. |
Script Structure
N'Ko as a Manding script
N'Ko is encoded in Unicode at U+07C0--U+07FF [citation: unicode2006nko]. The script
was created for Manding languages and is associated with a literacy movement that
includes education, publishing, religious use, and digital writing
[citation: donaldson2017clear]. For ASR, the important property is not cultural
symbolism alone. It is the script's linguistic engineering: N'Ko organizes writing
around Manding sound structure, including vowels, consonants, and diacritic marks
that can express tone and related distinctions.
Latin Bambara as an adapted orthography
Latin Bambara is a legitimate orthography, but it is not a transparent measurement
interface. A single phoneme may be represented by a digraph such as ny or
ng. A sequence of Latin letters can therefore mean either one phonemic unit
or multiple adjacent units depending on context. Tone is not represented in the
same explicit way that N'Ko can represent it. Word boundaries, apostrophes,
diacritics, French-influenced conventions, and normalization choices can all affect
WER without corresponding cleanly to acoustic errors.
Caption: Representative metric-relevant script contrasts. The table is schematic; publication should include language-community review before treating any row as a complete orthographic standard.
| Feature | Latin Bambara metric effect | N'Ko metric effect |
|---|---|---|
| Digraphs | One phoneme may span multiple characters; CTC must learn segmentation. | Dedicated script units reduce digraph segmentation burden. |
| Tone | Often absent from standard Latin scoring; acoustic pitch distinctions can be discarded. | Combining marks can preserve tonal distinctions when references encode them. |
| Word boundaries | WER is highly sensitive to tokenization and spacing policy. | CER can be reported with explicit character denominator and normalization. |
| Spelling variation | Alternate Latin conventions can count as word errors. | Normalization still matters, but script units are closer to phonemic units. |
| Bridge conversion | Transliteration can add its own errors. | Direct N'Ko ASR avoids Latin-to-N'Ko conversion at inference. |
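The digraph row can be made concrete with a minimal sketch. It assumes a two-item digraph inventory and a greedy longest-match rule; both are illustrative choices, and the input strings are placeholders rather than a vetted word list.

```python
# Hypothetical digraph inventory for illustration; not a complete orthographic standard.
DIGRAPHS = ("ny", "ng")

def segment_latin(word, digraphs=DIGRAPHS):
    """Greedy longest-match split of a Latin string into scorer units."""
    units, i = [], 0
    while i < len(word):
        if word[i:i + 2] in digraphs:
            units.append(word[i:i + 2])  # treat the digraph as one unit
            i += 2
        else:
            units.append(word[i])        # ordinary single-letter unit
            i += 1
    return units

example = "nyama"  # illustrative string
units = segment_latin(example)
print(len(example), "characters ->", len(units), "units:", units)
```

The character count and the unit count differ, so a character-level scorer and a phoneme-oriented scorer use different denominators for the same string. The greedy rule also always reads `ng` as one unit, so it cannot represent a case where a writer intends `n` followed by `g`.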
Metric Problem
What WER assumes
First, WER is coarse. A one-letter error and a many-letter replacement both count
as one word substitution. Second, WER is sensitive to segmentation. A spacing
difference can create insertion and deletion errors even when the sound sequence is
similar. Third, WER inherits the orthography's information loss. If Latin references
do not encode tone, WER cannot reward a system for recognizing tone. Fourth, WER is
not script-neutral. It privileges whichever script has the accepted word
tokenization convention.
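For reference, the usual definition is $\mathrm{WER} = (S + D + I)/N$, where $S$, $D$, and $I$ count word substitutions, deletions, and insertions in a minimum-edit alignment and $N$ is the number of reference words. The sketch below illustrates the segmentation point under that definition. The strings are placeholder tokens, and `cer_no_space` is a hypothetical helper that strips spaces before scoring, which is only one possible whitespace policy.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate over whitespace-delimited tokens (non-empty reference assumed)."""
    ref_words, hyp_words = reference.split(), hypothesis.split()
    return edit_distance(ref_words, hyp_words) / len(ref_words)

def cer_no_space(reference, hypothesis):
    """Character error rate after removing spaces (an assumed whitespace policy)."""
    ref_chars = reference.replace(" ", "")
    hyp_chars = hypothesis.replace(" ", "")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)

ref = "ab cd ef"   # placeholder tokens
hyp = "abcd ef"    # same letters, one missing space
print("WER:", wer(ref, hyp))
print("CER without spaces:", cer_no_space(ref, hyp))
```

A single missing space costs two word edits out of three reference words, while the space-stripped character sequence is identical.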
What N'Ko CER measures
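A minimal sketch of the usual character error rate definition, assuming the scorer works over normalized script units (the symbols $S_c$, $D_c$, $I_c$, and $N_c$ are notation introduced here for illustration):

$$
\mathrm{CER} = \frac{S_c + D_c + I_c}{N_c}
$$

Here $S_c$, $D_c$, and $I_c$ are the numbers of unit substitutions, deletions, and insertions in a minimum-edit alignment of the hypothesis against the reference, and $N_c$ is the number of units in the normalized reference. What counts as a unit (codepoint, grapheme cluster, or normalized character class) is a protocol decision, so the same hypothesis can yield different CER values under different unit definitions.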
Formal Claim
Definition (Transparent script map).
Let $\Phi$ be the set of phonemic units for the target evaluation inventory and
$\Sigma$ the scorer units after normalization. A script map $f:\Phi\rightarrow
\Sigma$ is transparent for an evaluation protocol if it is injective over the
contrasts that the protocol claims to measure and if its inverse is defined for the
normalized reference units used by the scorer.
Proposition (Transparent-script edit preservation).
If a normalized script map $f_N:\Phi\rightarrow\Sigma_N$ is injective over the
phonemic contrasts measured by an ASR evaluation, then Levenshtein edit distance
over $f_N(\phi_{1:U})$ preserves phoneme-level substitutions, insertions, and
deletions up to explicitly modeled normalization choices. A variable transcription
relation $R_L\subset\Phi^*\times\Sigma_L^*$ does not in general preserve that edit
structure.
Proof.
Under injectivity, each measured phonemic unit has a distinct normalized script unit.
A substitution $\phi_i \rightarrow \phi_j$ with $i\ne j$ maps to one script-unit
substitution $f_N(\phi_i)\rightarrow f_N(\phi_j)$. A deletion or insertion similarly
maps to one deletion or insertion of the corresponding script unit, except where the
evaluation protocol explicitly collapses marks, punctuation, tone, or boundaries.
Thus script edit distance is aligned with phoneme edit distance after the chosen
normalizer.
For a variable relation, a single phonemic unit may map to multiple written units,
and a written sequence may admit multiple phonemic parses. A phoneme substitution
can become two character edits; a boundary error can become a word insertion and
deletion; and a tonal distinction can become invisible if the transcription omits
tone. Therefore the written edit distance no longer preserves phoneme-edit
structure in general.
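The contrast in the proof can be checked on a toy case. The sketch below assumes a four-phoneme inventory and two hypothetical maps: `TRANSPARENT` assigns one placeholder unit per phoneme (the uppercase letters are stand-ins, not real N'Ko codepoints), and `VARIABLE` writes two of the phonemes as digraphs.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

# Hypothetical inventory and maps, for illustration only.
TRANSPARENT = {"ɲ": "Y", "ŋ": "G", "m": "M", "a": "A"}  # injective, one unit each
VARIABLE = {"ɲ": "ny", "ŋ": "ng", "m": "m", "a": "a"}   # digraphs for two phonemes

def render(phonemes, table):
    """Write a phoneme sequence as a string under the given map."""
    return "".join(table[p] for p in phonemes)

ref = ["ɲ", "a"]
hyp = ["m", "a"]  # one phoneme substitution

phoneme_edits = edit_distance(ref, hyp)
transparent_edits = edit_distance(render(ref, TRANSPARENT), render(hyp, TRANSPARENT))
variable_edits = edit_distance(render(ref, VARIABLE), render(hyp, VARIABLE))
print(phoneme_edits, transparent_edits, variable_edits)
```

Under the injective map the single phoneme substitution stays a single script edit. Under the digraph map it becomes two character edits, which is the non-preservation the proposition describes.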
Orthographic Transparency and ASR Labels
CTC decoders learn alignments between acoustic frames and output labels
[citation: graves2006connectionist]. The output labels are not passive. They define the
units the model must emit. If a label corresponds directly to an acoustic-phonemic
unit, the alignment problem is cleaner. If a label sequence encodes one sound as two
letters, omits tone, or depends on spelling convention, the CTC model must learn
both speech recognition and orthographic composition.
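To make the role of output labels concrete, here is a minimal sketch of the standard CTC collapse rule used in greedy decoding: merge repeated frame labels, then drop blanks. The label indices and the blank index `0` are assumptions chosen for illustration.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank label

def ctc_collapse(frame_labels, blank=BLANK):
    """Merge consecutive repeated labels, then remove blanks."""
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]

frames = [0, 3, 3, 0, 0, 5, 5, 5, 0, 3]  # hypothetical per-frame argmax labels
print(ctc_collapse(frames))
```

With a dedicated script unit, one sound corresponds to one surviving label after collapse. With a digraph target, the same sound needs two distinct labels emitted in the right order within the same stretch of frames, which is part of the extra orthographic composition the model has to learn.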
Normalization Protocol
Metric validity depends on normalization. Without an explicit normalizer, two CER
values can differ because of Unicode form, combining-mark policy, punctuation, or
spacing rather than acoustic recognition. A publishable ASR paper should
therefore define the scorer pipeline before reporting the number.
Caption: Normalization decisions that must be declared for Manding ASR scoring.
| Decision | N'Ko scoring question | Latin scoring question |
|---|---|---|
| Unicode form | Are base letters and combining marks normalized consistently? | Are precomposed Latin characters and diacritics normalized consistently? |
| Scorer unit | Is a unit a codepoint, grapheme cluster, or normalized character class? | Is a unit a character, byte, word, or grapheme cluster? |
| Tone and diacritics | Are tonal marks retained, collapsed, or scored separately? | Are tone marks absent, optional, or normalized away? |
| Punctuation | Are punctuation, Quranic marks, apostrophes, and sentence symbols scored or stripped? | Are apostrophes, hyphens, commas, and French-style marks scored or stripped? |
| Whitespace | Are spaces included in CER or only used for segmentation? | Are word boundaries canonical enough for WER? |
| Digits and symbols | Are digits transliterated, normalized, or left unchanged? | Are Arabic, Latin, and local numeric conventions harmonized? |
| Dialect variants | Are alternate spellings treated as errors or accepted variants? | Are regional Latin conventions collapsed or scored separately? |
The normalizer is not a bureaucratic detail. It defines the scientific object. If
tone marks are retained, the metric asks whether the system recognizes or preserves
tone-bearing written distinctions. If tone marks are collapsed, the metric asks a
weaker question. Both may be legitimate, but they are different experiments.
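A minimal sketch of such a normalizer, using Python's standard `unicodedata` module. The function name `normalize`, the `keep_marks` flag, and the specific policy (NFD, optional removal of nonspacing marks, NFC, whitespace collapse) are assumptions for illustration, not the project's scorer.

```python
import unicodedata

def normalize(text, keep_marks=True):
    """Illustrative scorer normalizer; every policy choice here is an assumption."""
    text = unicodedata.normalize("NFD", text)      # canonical decomposition
    if not keep_marks:
        # Drop nonspacing combining marks (Unicode general category Mn).
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = unicodedata.normalize("NFC", text)      # recompose what remains
    return " ".join(text.split())                  # collapse runs of whitespace

composed = "\u00e9"      # e with acute as a single codepoint
decomposed = "e\u0301"   # e followed by a combining acute accent

print(normalize(composed) == normalize(decomposed))
print(normalize(decomposed, keep_marks=False))
```

With `keep_marks=True` the two encodings compare equal and the mark is still scored. With `keep_marks=False` the mark disappears from both reference and hypothesis, so the metric can no longer see an error on it. The example uses a Latin accent only because it is easy to display; the same mechanism applies to any combining mark.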
Caption: Metric-reporting checklist for Manding ASR. A paper that reports only WER or only rounded CER is under-specified.
| Field | Required detail |
|---|---|
| Script | Latin, N'Ko, or another target; include whether output is direct or post-converted. |
| Normalizer | Unicode normalization, punctuation policy, digit policy, tone/diacritic policy, casing if Latin. |
| Scorer units | Word, character, grapheme cluster, codepoint, or phoneme-derived unit. |
| Denominator | Exact reference word/character count, not only rounded percent. |
| Split | Train/validation/test row counts and split hash or split file. |
| Artifact | Prediction rows, reference rows, metrics file, vocabulary, and model checkpoint. |
| Comparability | Explicitly state whether compared runs share corpus, split, seed, optimizer, learning rate, and scorer. |
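One way to keep the denominator and edit counts attached to the reported percentage is to store them in a single record. The `MetricReport` class and its field names are hypothetical, and the counts below are placeholders, not results from any run.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class MetricReport:
    """Hypothetical record holding the fields a score needs to be reproducible."""
    script: str
    scorer_unit: str
    normalizer: str
    substitutions: int
    deletions: int
    insertions: int
    reference_units: int

    @property
    def error_rate(self):
        # The rate is derived from exact counts rather than stored as a rounded percent.
        edits = self.substitutions + self.deletions + self.insertions
        return edits / self.reference_units

# Placeholder counts for illustration only.
report = MetricReport("N'Ko", "codepoint", "NFC, marks kept", 3, 1, 1, 50)
print(f"{report.error_rate:.2%} over {report.reference_units} reference units")
```

Because the rate is computed from the counts, a reader can recompute it and can tell whether two rounded percentages rest on comparable denominators.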
Metric Failure Modes
Latin WER and N'Ko CER fail in different ways. WER can be too coarse for phonemic
analysis, while N'Ko CER can be too literal if normalization is not linguistically
controlled. A rigorous paper should name these failure modes rather than pretending
one scalar metric eliminates all ambiguity.
Caption: Representative metric failure modes and how to report them.
| Failure mode | Why it matters | Required mitigation |
|---|---|---|
| Latin digraph split | A single phoneme written as two letters can create multiple character edits. | Report whether Latin CER, WER, or phoneme-derived scoring is being used. |
| Tone omission | A tonal acoustic distinction may be invisible in a Latin reference. | State tone policy and avoid claiming PER when tone is not scored. |
| Combining-mark drift | Unicode mark order or composition can change character edits. | Normalize before scoring and publish the normalizer. |
| Boundary instability | Space or punctuation differences can dominate WER. | Report word-boundary policy and consider CER for script-native output. |
| Variant spelling | Community-acceptable alternatives may be counted as errors. | Use variant lexicons only when defined before evaluation. |
| Reference uncertainty | Human transcripts may contain errors or dialect choices. | Separate model error from reference disagreement where possible. |
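The combining-mark row also interacts with the choice of scorer unit. The sketch below groups each base character with the combining marks that follow it, using `unicodedata.combining`. This is a simplified approximation of grapheme clustering, not a full implementation of Unicode text segmentation, and the helper name `clusters` is an assumption.

```python
import unicodedata

def clusters(text):
    """Group each base character with the combining marks that follow it."""
    out = []
    for ch in text:
        if out and unicodedata.combining(ch):
            out[-1] += ch      # attach the mark to the preceding base character
        else:
            out.append(ch)     # start a new cluster
    return out

sample = "e\u0301a"  # e + combining acute accent, then a
print(len(sample), "codepoints,", len(clusters(sample)), "clusters")
```

The same string has three codepoints but two clusters, so a dropped mark is one edit out of three under codepoint scoring and one substitution out of two under cluster scoring. Reporting the unit removes that ambiguity.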
CER, PER, and the Proxy Boundary
N'Ko CER is more phonemically interpretable than Latin WER, but it should not be
renamed PER without additional machinery. PER requires a phoneme inventory, a
phonemic parser, dialect policy, tone policy, and a mapping from written units to
phonemic units. Let $g:\Sigma_N^*\rightarrow\Phi^*$ be a normalized
script-to-phoneme parser. A true PER score would compute edit distance after
applying $g$ to hypothesis and reference. CER is a proxy for PER when $g$ is close
to a one-to-one, unit-by-unit relabeling for the measured contrasts. It ceases to be
a proxy when combining marks, loanwords, dialectal variants, or orthographic
conventions break that correspondence.
The publishable formulation is therefore:
> N'Ko CER is a script-native character metric whose scorer units are closer to
> Manding phonemic units than Latin word tokens are. It is not automatically identical
> to PER unless the paper defines and validates the script-to-phoneme mapping used for
> evaluation.
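The proxy condition in this section can be written out as a short sketch. The symbols $h$, $r$, and $d_{\mathrm{Lev}}$ are notation introduced here: $h, r \in \Sigma_N^*$ are the normalized hypothesis and reference, and $d_{\mathrm{Lev}}$ is Levenshtein distance.

$$
\mathrm{PER}_g(h, r) = \frac{d_{\mathrm{Lev}}\bigl(g(h),\, g(r)\bigr)}{|g(r)|},
\qquad
\mathrm{CER}(h, r) = \frac{d_{\mathrm{Lev}}(h, r)}{|r|}
$$

If $g$ acts unit by unit through an injective map from script units to phonemic units, then $|g(r)| = |r|$ and two units are equal before applying $g$ exactly when they are equal afterward, so $d_{\mathrm{Lev}}(g(h), g(r)) = d_{\mathrm{Lev}}(h, r)$ and the two rates coincide. When $g$ merges units, splits them, or depends on context, either the numerator or the denominator can change, and CER is then only an approximation whose error has to be measured rather than assumed.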
Matched Evaluation Protocol
Future work that compares Latin and N'Ko must hold more than the audio constant.
It must hold the corpus snapshot, train/validation/test split, feature extraction,
optimizer, learning rate, seed, patience, scorer code, and artifact export
constant. Otherwise a script difference may be confounded with ordinary training
variation.
Caption: Minimum protocol for a matched Latin--N'Ko ASR evaluation.
| Control | Requirement |
|---|---|
| Corpus | Same audio rows, same inclusion/exclusion policy, same quality filters. |
| Split | Same train/validation/test row identities and split hash. |
| Features | Same acoustic encoder, feature cache, tensor-shape policy, and feature count. |
| Training | Same architecture class, optimizer, learning rate, batch size, dropout, patience, seed schedule, and stopping rule. |
| Targets | Script-native references produced by a documented alignment or transcription policy, not post-hoc conversion. |
| Scoring | Published normalizer, edit counts, denominators, and row-level prediction/reference exports. |
| Reporting | Separate within-script score, cross-script comparison, and claim boundary. |
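The split-hash control can be implemented in several ways. The sketch below is one option, assuming each row has a stable string identifier: it hashes the sorted identifiers with SHA-256 from Python's standard `hashlib`. The function name `split_hash` and the row IDs are hypothetical.

```python
import hashlib

def split_hash(row_ids):
    """Order-independent SHA-256 fingerprint of a split's row identities."""
    digest = hashlib.sha256()
    for row_id in sorted(row_ids):
        digest.update(row_id.encode("utf-8"))
        digest.update(b"\n")  # separator so ("ab", "c") differs from ("a", "bc")
    return digest.hexdigest()

latin_test = ["utt-003", "utt-001", "utt-002"]  # hypothetical row IDs
nko_test = ["utt-001", "utt-002", "utt-003"]    # same rows, different order
print(split_hash(latin_test) == split_hash(nko_test))
```

Publishing this fingerprint for each of the train, validation, and test partitions lets a reader confirm that a Latin run and an N'Ko run were scored on the same rows without redistributing the data.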
The 20.57% N'Ko CER Anchor
The archived 20.57% N'Ko CER result is
the clearest example of the N'Ko scoring regime. It is a direct N'Ko CTC output scored
as character sequences. Its artifact metadata records a 290,596-pair corpus
snapshot, a 232,476/29,060/29,060 split, learning rate 0.0003, batch size 32,
dropout 0.1, seed 42, best validation loss 0.6358872798606507, and 47 trained
epochs.
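The split arithmetic in the recorded metadata can be checked directly. This sketch uses only the counts stated above and verifies that the three partitions sum to the corpus size.

```python
# Counts taken from the recorded artifact metadata.
total = 290_596
train, validation, test = 232_476, 29_060, 29_060

# The partitions should account for every pair in the snapshot.
assert train + validation + test == total

for name, count in [("train", train), ("validation", validation), ("test", test)]:
    print(f"{name}: {count} rows, {count / total:.4%} of the snapshot")
```

The proportions come out at roughly 80/10/10. The character-level edit counts and reference character denominator behind the 20.57% figure are not listed in this section, so they are not reproduced here.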
The anchor should be described as:
> an archived trajectory ASR checkpoint reporting 20.57% N'Ko CER under the
> recorded settings.
It should not be described as a TAR or TTT result, as an AGP result, or as a proof
that N'Ko beats Latin under every matched hyperparameter setting. Those claims
are different. The metric claim is already strong enough: N'Ko CER gives the
project a script-native, phonemically interpretable anchor.
[figure: figures/fig1_cer_comparison.pdf]
Caption: CER comparison figure from the N'Ko ASR line of work. In the final series this figure is used as context, not as an unqualified matched superiority proof.
Claim Boundaries
The paper intentionally avoids three overclaims. First, it does not say N'Ko CER
is identical to phoneme error rate. A future PER metric should define grapheme
cluster handling, combining marks, tone, nasalization, dialect variants, and
punctuation policy explicitly. Second, it does not say Latin output is useless.
Latin Bambara is socially real and may be required for many applications. The claim
is about measurement validity for a script-native Manding ASR research question.
Third, it does not use the 20.57% anchor as proof that N'Ko beats
Latin. The later low-learning-rate runs were not comparable to the anchor because
they used a different learning-rate regime.
Conclusion
For Manding ASR, metric choice is a scientific decision. Latin WER measures
agreement with a Latin word-token convention. N'Ko CER measures script-native
character agreement in a writing system designed around Manding phonology. The two
metrics do not answer the same question.
The immediate conclusion is that public discussion of the 20.57% result should
center the metric. The result matters because it is a direct N'Ko ASR anchor with
explicit edit arithmetic, not because it can be collapsed into an ordinary Latin
WER leaderboard. A rigorous research program should report both when needed, but it
should not pretend they are equivalent.
Promotion Decision
Compile/render the source, verify references and figures, then add to the curated atlas.
Source Anchor
nko-brain-scanner/paper/final/02-phonemic-evaluation/paper.tex
Detected Structure
Latex · Abstract · Method · Evaluation · References · Math · Figures · Architecture