Grand Diomande Research · Full HTML Reader

Featural Acoustic Coding with N'Ko — Program Documentation

Language as Infrastructure proposal experiment writeup candidate score 32 .md

Full Public Reader

Featural Acoustic Coding with N'Ko — Program Documentation

What this is

This document is the dense, self-contained account of a research direction that began as a reaction to a single paper and turned into a structural extension of an existing N'Ko language-technology program. The reaction paper is Lexical Acoustic Coding, abbreviated LAC, which proposes that a short sound can be transmitted between two language-model agents as a readable English sentence and then re-rendered from that sentence, trading exact sample recovery for perceptual similarity. The reaction was that natural English is the weakest part of that design and that a script engineered for sound, specifically N'Ko, would carry the same information more faithfully and far more compactly. That reaction was developed into a method called Featural Acoustic Coding, abbreviated FAC, a written paper that compiles cleanly, a battery of experiments that were actually run rather than asserted, and a strategic reconciliation with the brain-scanner research program that already produced a script-native N'Ko speech recognizer at roughly twenty percent character error rate. The honest conclusion, stated up front so nothing here oversells, is that FAC has real merit but not as a standalone competitor to LAC; its merit is highest as the acoustic tone channel and the generative dual of the recognizer that already exists, and the standalone codec framing should be demoted to a section rather than promoted to a flagship.

The origin and the precise disagreement with LAC

LAC works by analyzing a waveform into a small set of interpretable acoustic descriptors that span temporal, spectral, harmonic, and psychoacoustic properties, quantizing each descriptor into a word-like label drawn from a shared vocabulary, and composing those labels into an English sentence such as the one about a mid-power punch with a moderate-onset envelope that stays front-loaded and clipped. A receiver parses the sentence back into acoustic intervals and synthesizes the nearest waveform, with a closed-loop refinement step that nudges the synthesis toward the target. LAC situates itself against three neighbors in the audio representation space: captions, which are readable but too loose to resynthesize from; neural codecs such as SoundStream, EnCodec, and the Descript Audio Codec, which reconstruct extremely well but emit opaque integer tokens with no human meaning; and raw acoustic descriptor vectors, which are interpretable but live in a continuous space that neither people nor models communicate naturally. LAC's contribution is to quantize the descriptor vector into a lexical codebook and serialize it as natural-language text so that it is simultaneously readable, editable, grounded, and native to the text channel a model already speaks.

The disagreement is narrow and it is about the carrier, not the idea. Natural orthographies were never designed to encode sound; they accreted over centuries and they encode it badly in three specific ways. First, their labels are arbitrary, so the word warm has no acoustic relationship to spectral warmth and the mapping from word to descriptor bin must be invented and shared rather than read off the script. Second, English writing has no native notation for pitch at all, so a falling tone has to be spelled out as a phrase rather than written as a mark. Third, English graphemes do not decompose along acoustic axes, so onset, spectral tilt, harmonicity, and pitch, which vary independently in the signal, get flattened into a clause whose grammar carries none of that structure. LAC succeeds in spite of its carrier by leaning on the language competence a frontier model already has, and the claim of this work is that a designed featural script removes all three problems by construction.

Why N'Ko specifically, and the featural mapping

A featural script is one whose symbols and their composition correspond to phonetic features rather than to whole words or arbitrary spellings. Korean Hangul is the textbook example. N'Ko is the example that matters here because Solomana Kanté engineered it in 1949 for the tonal Manding languages, which means it carries explicit, systematic notation for exactly the dimension every natural orthography lacks, namely pitch, and it does so with a clean one-to-one phoneme-to-character mapping and no spelling exceptions. The crucial empirical observation of the FAC paper is that several of LAC's major descriptor axes fall directly out of the existing N'Ko script rather than needing to be invented. The seven N'Ko vowels partition the formant plane from the bright, high-second-formant front vowel to the dark, low back vowel, so vowel identity is a ready-made, perceptually ordered code for spectral centroid or color. Consonant manner of articulation is, acoustically, onset character: plosives are sharp percussive attacks, fricatives are noisy onsets, nasals are soft resonant onsets, and approximants are smooth glides, so the onset consonant slot encodes attack-transient type. The tone inventory is the correction that governs the current implementation: Unicode defines seven N'Ko combining tone marks from U+07EB through U+07F1: short high, short low, short rising, long descending, long high, long low, and long rising. Folding length into class yields high, low, rising, falling, and unmarked mid, and U+07EE is the native falling or descending mark. The older five-mark high/low/rising/long/very-long reading is a codebook compatibility artifact, not the Unicode truth. A nasal coda adds a resonant tail, which maps to a sustain feature. Only the higher timbral axes, harmonicity, spectral spread, roughness, and dynamics, have no native N'Ko symbol, and for those the paper defines a small, disciplined set of designed combining diacritics, which is a far smaller design problem than LAC's, since LAC must invent vocabulary for every axis while FAC only invents it for the timbral remainder.

The codec and the existing substrate

Formally, FAC defines an encoder that maps a waveform to a string of N'Ko acoustic syllables and a decoder that synthesizes audio from that string, where each syllable is a tuple of onset, nucleus, coda, tone, length, and an optional stack of timbral diacritics. The base tuple indexes a codebook that already exists in the released N'Ko toolkit: the file syllable_codebook.py enumerates roughly three thousand six hundred forty tonal syllables under a strict consonant-vowel and consonant-vowel-nasal grammar, storing for each one its onset, nucleus, coda, tone, pattern, international phonetic alphabet value, and a unique index, and it describes itself explicitly as the retrieval target for a joint-embedding speech recognizer. That single sentence in the existing code is the hinge of the entire strategic argument, because it means the sound-to-syllable direction already exists in the recognizer and FAC is simply the inverse, the syllable-to-sound direction. The encoder reuses the same descriptor front end LAC validates and then quantizes into the featural codebook rather than into English words, mapping spectral centroid to the nearest vowel, onset descriptors to a manner class and therefore an onset, fundamental-frequency register and contour to a tone mark, duration to a length mark, sustain to a coda, and the residual timbre to the diacritic stack. The decoder inverts each syllable to descriptor intervals and drives a synthesizer toward the nearest match, with the same kind of closed-loop refinement LAC uses, except that because the code is featural the error signal is per-slot and interpretable, so a tone mismatch is unambiguously a pitch error and a nucleus mismatch is unambiguously a spectral-color error.

The experiments that were actually run, and their honest results

The load-bearing claim, called hypothesis two, is that FAC has an efficient pitch channel because pitch is written with a tone mark rather than approximated with adjectives. This was tested first on controlled synthetic tonal contours in a rate-distortion framing, comparing three codec families at matched bit budget: a level-only lexical baseline that spends all its bits on pitch register and reconstructs a flat contour, a lexical baseline that additionally spends a word on contour, and FAC-native, which uses the native N'Ko tone inventory including rising and falling. The first attempt produced a fake-perfect result of about eighteen cents that was traced to a grid-alignment artifact in the synthetic corpus, where tones had been placed at exactly the three register centers so whichever codec happened to receive a three-level grid won by coincidence; this was caught and the corpus was rewritten with continuous register so no codec could win by grid luck. The corrected synthetic result is clean and it keeps its nuance: when half the events are contour tones, a level-only summary hits an error floor near eighty cents that no amount of register budget can break, because it physically cannot represent within-event pitch movement, whereas the contour-carrying codes reach about eighteen cents, a four-point-four-fold reduction. FAC-native and the contour-augmented lexical baseline tie on pitch error, which means the fidelity win is not magic but rather the act of making contour first-class. N'Ko's advantage over a well-designed lexical code is token cost and orthographic decomposition: N'Ko writes register and contour in one glyph while the lexical code needs two words. In the null case with no contour tones every codec ties as it should.

The decisive escalation was to real audio, and the real audio existed locally: the brain-scanner corpus contains a recording of the researcher's parents speaking Manding, segmented per speaker and transcribed into N'Ko. Two extraction problems surfaced and one was fixed. The pitch tracker was first run with a floor of eighty hertz, which railed the father's entire range because he is a low male voice with a median near seventy-one hertz, pinning ninety-one percent of frames at the floor; lowering the floor to fifty hertz recovered his true distribution. The mother's track is octave-halved by the tracker, reading sixty-six to one-hundred-twenty-one hertz when a female speaker truly sits near one-hundred-thirty to two-hundred-forty, so her track and the combined number were excluded as untrustworthy and flagged for a per-speaker floor. On the clean speaker the predicted ordering held but the effect was small, about five percent, with a level-only floor near sixty-two cents and the contour-carrying code near fifty-nine, and the reason is decisive for the whole program: conversational speech carries only about forty cents of within-syllable pitch movement at the syllable scale, so register quantization error dominates and is shared by both codes. This forced a reframing that strengthens rather than weakens the case, namely that FAC's durable contribution is token efficiency and native-for-free representation rather than a universal pitch-fidelity win, and that raw fundamental-frequency error superiority is conditional on tonal density, so the decisive test requires tone-bearing material where pitch does linguistic work rather than casual conversation that under-exercises the very axis the script encodes.

That tone-bearing material was then obtained from the babamamadidiane channel of numbered N'Ko teaching lessons, whose video titles themselves carry tone marks, and running the same pitch-fidelity test on a lesson produced a level-only floor near sixty-six cents against an extended-code value near fifty-six, a fifteen percent reduction at a within-syllable contour depth of seventy-seven cents. Taken together the three regimes trace a clean scaling law that makes the whole story coherent: the synthetic contours at three hundred sixty cents of excursion gave a four-point-four-fold advantage, the didactic lesson speech at seventy-seven cents gave a one-point-two-fold advantage, and the conversational speech at forty cents gave essentially none, so the FAC pitch advantage grows monotonically with tonal density and is modest on real speech, which is exactly why the token-efficiency argument, not the fidelity argument, must lead the paper.

The tone seam and the relationship to the brain-scanner recognizer

The reason this stopped being a codec paper and became part of a larger program is the discovery, by reading the canonical brain-scanner paper, that the recognizer and FAC are duals sharing one machine. The recognizer is an anticipatory Transformer connectionist-temporal-classification decoder: frozen Whisper large-v3 features are projected to seven-hundred-sixty-eight dimensions and temporally downsampled, then a six-layer Transformer connectionist-temporal-classification head emits N'Ko characters, and a trajectory module estimates a seven-dimensional state per timestep, comprising commitment, uncertainty, transition pressure, recovery margin, phase stiffness, novelty, and stability, which is injected as an additive attention-logit bias before emission. The archived anchor reports roughly twenty percent character error rate on a two-hundred-thirty-two-thousand-pair Bambara corpus, and the paper is careful that this is an anchor with provenance rather than a leaderboard claim, that its true spine is a measurement-theoretic argument that N'Ko's bijective phoneme-to-character mapping makes character error rate phonemically interpretable in a way Latin word error rate is not, and that the single largest hedge in the entire paper is tone, because the recognizer is toneless and tone is meant to be recovered downstream by a separate text-context language model trained on optically-character-recognized N'Ko, which is the planned but unbuilt Paper Eight.

This is precisely the hole FAC fills. The recognizer uses the trajectory of Whisper features to recognize content and therefore discards acoustic detail, while FAC uses the same features to preserve content and therefore keeps the tone and timbre the recognizer throws away, and the trajectory state is a dynamics descriptor of the same path that FAC describes with content features. The tone-resolution layer is the one place these two meet: Paper Eight resolves tone from text context, which is a prior over what tone makes linguistic sense, while FAC resolves tone from the acoustic fundamental frequency, which is the evidence sitting in the signal that the text-only model ignores, so the two are not competitors but the prior and the likelihood of a single estimator. The honest causal path from FAC to a lower character error rate runs through the self-improving loop rather than through a decoder hack: the loop trains the recognizer, labels a quarter-million AfVoices utterances, optically character-recognizes toned N'Ko from YouTube, uses a tone model to fix the labels, and retrains on the cleaner data, and FAC supplies acoustic tone evidence to the weakest, most hand-waved step in that loop, so better tone resolution yields cleaner labels which yield a better retrain which yields lower error, with the caveat from the real-speech pilot that the gain should be expected in tone-diacritic accuracy rather than as a dramatic global error drop.

How FAC attaches to the architecture, with the option to reject

There are three ways to attach FAC to the anticipatory Transformer and the choice must be deliberate. The option to reject is replacing the connectionist-temporal-classification target with the full FAC featural codebook of thousands of tonal syllables plus diacritics, because that target is data-starved and would raise the error rate, which is the same tension the program already found when a morpheme-constrained subword tokenizer improved boundary preservation but needed a larger corpus to compete on compression. The option to build first, because it is low risk and leaves the twenty-percent anchor untouched, is a parallel acoustic head on the shared encoder, so that head A remains the existing toneless connectionist-temporal-classification recognizer and head B is an acoustic tone and register predictor trained on fundamental-frequency targets with a combined multitask loss, after which head B's tone posteriors feed the Paper Eight fusion alongside the text prior; this also lets the reconstruction objective act as an auxiliary regularizer on a low-resource encoder. The ambitious option, which is the genuinely novel architecture worth a paper of its own, is to fold FAC's interpretable acoustic descriptors directly into the trajectory state, so that the seven learned and abstract dynamics scalars are augmented with named acoustic coordinates such as fundamental-frequency register, pitch contour, spectral centroid, onset transient, and harmonicity, and the attention bias is then conditioned on actual acoustic geometry rather than only on latent dynamics, which grounds the trajectory in interpretable features in the same spirit as the transparent-script measurement thesis and is uniquely available to this program because it owns both the recognition and the reconstruction halves.

The program, how the papers converge, and how to release

The correct way to see the whole body of work is one thesis with pillars, where the thesis is that N'Ko is computational infrastructure for Manding language technology, controlling what models can represent, how acoustic evidence aligns to symbols, whether a reported error rate is meaningful, and how sound itself can be encoded. The first pillar is representation, asking whether a model can see N'Ko, and it contains the Dead Circuits and Script Invisibility work on tokenization, activation geometry, and circuit duplication. The second pillar is recognition and measurement, asking whether a model can hear N'Ko and whether the metric is real, and it contains the Living Speech construction of the recognizer, the script-advantage and generalization and trajectory-attention papers numbered four through six, and the transparent-script proposition that is the program's strongest single idea. The third pillar is reconstruction and tone, asking whether a model can render N'Ko and resolve its tone, and it contains Paper Eight and FAC, which together are the prior and the evidence of tone resolution and the generative dual of recognition. Orthogonal to all three is a systems track, the neural-engine offload, distributed training, and low-resource retrieval papers numbered seven, nine, and ten, which are how the work was trained cheaply rather than part of the linguistic claim and should be narrated separately so they do not muddy the thesis. The release strategy that follows from this is that the canonical consolidation is the flagship and should be released first because it is mature, carefully hedged, and self-contained around the anchor and the measurement thesis, that it must not be gated on FAC, that focused companion papers should go to their natural venues with recognition aimed at speech conferences and representation aimed at computational-linguistics conferences and tone-and-reconstruction held until the tone-seam experiment yields a real number, and that FAC is not a standalone flagship but the reconstruction pillar and the acoustic half of tone, with its lexical-coding comparison demoted to a section or a workshop note. The pillars should be labeled by function as representation, recognition, measurement, governance, and reconstruction, and the program named for the infrastructure thesis.

The tooling built and the corpus harvest

Several reusable pieces were built and verified rather than sketched. The pitch-fidelity experiment exists as a single self-contained module with a rate-distortion harness, a synthetic tonal-corpus generator, and the four codecs, and it runs and prints its tables. The real-corpus harness extracts fundamental frequency with the probabilistic-YIN tracker, segments voiced regions into syllable-scale events, calibrates register per speaker, and reuses the same four codecs, and it is what produced the parents-audio and lesson numbers. The tone-seam version-zero harness contains an acoustic tone classifier that maps fundamental frequency to one of high, mid, low, rising, and falling using register-relative per-speaker normalization with no training, a controlled validation that reaches one hundred percent tone-class accuracy on multi-speaker synthetic syllables and therefore proves the mechanism and the normalization rather than a real-world number, a real and tested parser that extracts gold tone classes from the tone marks in toned N'Ko text, a scoring function that is ready to compute a tone-diacritic error rate the moment aligned audio and gold tone exist, and a real-audio path that shows the pipeline flowing on the parents recording. The video acquisition was unblocked by installing a current yt-dlp through Homebrew after the seven-month-old user-site copy was rejected by YouTube with a forbidden error, and an examination of a downloaded frame established the important structural fact that the lessons are screen-shared documents of typed and handwritten toned N'Ko with a teacher who explains rather than narrates them, which means the lessons supply Paper Eight's toned text corpus and a reservoir of tone-bearing audio but not aligned acoustic-tone ground truth, so the first real tone-diacritic error rate requires read speech of known toned text rather than lesson commentary. Because no Tesseract N'Ko model exists, optical character recognition is performed through the Gemini vision-language interface using the available key, which is the same vision approach Paper Eight planned, and a single validating frame yielded one hundred eighty-three N'Ko characters carrying forty-seven tone marks of coherent toned text, after which a harvest script was written to pull several lessons, extract sixteen-kilohertz audio and sampled frames, optically recognize each frame, deduplicate, and accumulate a toned-text corpus with per-entry character and tone-mark counts.

Honest status and what remains

Nothing here is dressed up beyond what was verified. The FAC paper compiles cleanly to six pages with resolved citations. The synthetic and real pitch experiments ran and produced the scaling law described, with the synthetic artifact caught and corrected rather than reported. The tone-seam mechanism is proven on controlled data and runs on real audio, but no real tone-diacritic error rate has been computed because that requires read-speech ground truth that lesson videos structurally cannot provide, and that gap is stated rather than papered over. The mother's pitch track needs a per-speaker floor before her data is usable. The full encoder beyond the pitch slice and the synthesizer that would enable a perceptual-similarity test do not yet exist. The single most valuable next step is to obtain a short read-speech recording of a known toned N'Ko passage so the harness can produce the first genuine tone-diacritic error rate, after which the parallel acoustic head is the architecture step that puts acoustic tone into the recognizer, and the trajectory-state fusion is the writeup that unifies the recognizer and the reconstruction dual into a single interpretable object. The meaning of the whole effort is that a designed indigenous script is not a cosmetic rendering but computational infrastructure that improves representation, recognition, measurement, and now reconstruction, and that FAC is the piece which makes that claim generative and which closes the tone gap the strongest existing paper currently has to hedge.

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

nko-acoustic-coding/DOCUMENTATION.md

Detected Structure

Method · Evaluation · Code Anchors · Architecture