Grand Diomande Research · Full HTML Reader

Acoustic World Models and the N'Ko Speech Inscription Bridge

An acoustic world model is the deeper architecture implied by the work so far. It is not merely a better recognizer, and it is not simply another decoder sitting after Whisper. It is a different way of deciding what speech is inside the system. A normal ASR pipeline treats speech as a signal whose purpose is to become text. The model receives audio, compresses it into features, predicts symbols, and then the rest of the system tries to clean up those symbols. An acoustic world model treats speech as a world of acoustic events. Text becomes one possible explanation of that world, not the world itself.

This shift matters because the current N'Ko system has already shown the limit of the recognizer-only framing. The iPhone proof established that the acoustic serving stack can run on device. Audio can move through a Whisper-style mel frontend, a split CoreML encoder, an anchor CTC head, greedy decoding, bounded correction, and a Swift ranker. The live harness can record microphone input, route it into the same machinery, and produce files that can be copied back from the phone. That is a real serving milestone. But the visible live output also showed that a completed inference path can still produce repetitive garbage. The system can finish the computation and still fail to know what it is allowed to say.
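The gap between "the computation finished" and "the output is admissible" is easier to see with the greedy CTC step written out. The following is a minimal sketch of standard greedy CTC decoding, not the project's Swift or CoreML code; the blank index `0` and the helper names are assumptions made for illustration.

```python
from typing import List, Sequence

BLANK = 0  # assumption: blank token id; the real anchor CTC head may use another index


def argmax_path(logits: Sequence[Sequence[float]]) -> List[int]:
    """Pick the highest-scoring class index for every frame."""
    return [max(range(len(frame)), key=frame.__getitem__) for frame in logits]


def ctc_greedy_decode(path: Sequence[int], blank: int = BLANK) -> List[int]:
    """Collapse consecutive repeats, then drop blanks (standard greedy CTC)."""
    out: List[int] = []
    prev = None
    for tok in path:
        if tok != prev and tok != blank:
            out.append(tok)
        prev = tok
    return out


def nonblank_density(path: Sequence[int], blank: int = BLANK) -> float:
    """Fraction of frames whose argmax is not blank."""
    if not path:
        return 0.0
    return sum(1 for tok in path if tok != blank) / len(path)


path = [0, 1, 1, 0, 1, 2, 2, 0]
print(ctc_greedy_decode(path))  # [1, 1, 2]
print(nonblank_density(path))   # 0.625
```

Decoding always returns a sequence, whatever the audio contained. Nothing in this step can say "no", which is why a separate measure such as nonblank density has to be carried forward to a decision layer.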

That is the moment where an acoustic world model becomes necessary. The failure is not only a decoder failure. It is an ontology failure. If the system believes that the primary object is the transcript, then any decoded string looks like an answer. If the system believes that the primary object is acoustic evidence, then the decoded string is only a hypothesis. The hypothesis can be accepted, rejected, deferred, or archived. The Speech Inscription Bridge v0 is the first implementation of that second worldview.

The current bridge says that every live or recorded ASR run should compile into audio evidence, a typed transcript decision, an acoustic and FAC claim scaffold, and a controlled N'Ko proof rendering. That structure is important because it changes the meaning of a run. A run is no longer a black box that produces text. A run becomes an evidential event. The captured waveform, prepared waveform, logits, argmax path, hashes, manifest, and proof rendering all become parts of a durable record. The transcript is not allowed to float away from the evidence that produced it.
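One way to picture "a run becomes an evidential event" is a manifest that binds the decision to content hashes of every artifact. This is a sketch under stated assumptions: the field names (`schema_version`, `run_id`, `source`, `decision`, `artifacts`), the file names, and the helper functions are hypothetical and are not taken from the project's actual v2 manifest schema.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(run_id: str, source: str, decision: str,
                   artifacts: Dict[str, bytes]) -> dict:
    """Bind a transcript decision to hashes of the evidence that produced it."""
    return {
        "schema_version": 2,   # assumption: integer version field
        "run_id": run_id,      # hypothetical identifier
        "source": source,      # e.g. "live_mic"
        "decision": decision,  # e.g. "rejected_overfire"
        "artifacts": {name: sha256_hex(blob)
                      for name, blob in sorted(artifacts.items())},
    }


def manifest_digest(manifest: dict) -> str:
    """Digest of a canonical JSON form, so any later mutation is detectable."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return sha256_hex(canonical.encode("utf-8"))


manifest = build_manifest(
    "run-001", "live_mic", "rejected_overfire",
    {"captured.wav": b"...", "prepared.wav": b"...", "logits.bin": b"..."},
)
print(manifest_digest(manifest))
```

Because the digest covers the decision and the artifact hashes together, the transcript decision cannot be separated from its evidence without the change being visible.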

The acoustic world model is the future representation engine that belongs inside this evidential structure. It would learn the structure of acoustic reality before the system commits to symbols. In the current system, the Whisper-style encoder and CTC head are doing most of the representational work. They produce hidden states, logits, and candidate symbols. A future acoustic world model would go further. It would represent the speech event as a structured latent state containing timing, rhythm, amplitude, speaker conditions, noise, boundaries, tone, voicing, place, manner, nasality, vowel shape, duration, and confidence geometry. It would give the system a way to reason about what happened acoustically before asking what should be written.

The mathematical picture is straightforward. Let E be the observed evidence. E includes the captured audio, the prepared audio, the mel representation, encoder states, CTC logits, argmax paths, timing, thresholds, hashes, and provenance. Let Z be the latent acoustic world state that explains E. Z is not text hidden inside sound. Z is a structured account of the acoustic event itself. Let H be the set of possible linguistic explanations supported by Z. A transcript is one possible member of H. A rejection is also a valid outcome when the evidence does not support any explanation strongly enough. Let D be the governed decision that determines what the system is allowed to preserve as transcript, what it must reject, and what it should send to labeling or future calibration.

The old ASR question is "What text T maximizes P(T given E)?" The acoustic world model question is "What acoustic state Z explains E, what hypotheses H does Z support, and what decision D survives governance?" That extra structure is the difference between a transcript machine and an evidential speech system. It is also the reason the bridge is not a cosmetic patch. The bridge is the beginning of the system's legal structure. It decides what counts as admissible evidence, what counts as a claim, and what kind of surface the system is allowed to show.
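The two questions can be set side by side in notation. This is one possible formalization, not a derivation from the project: the point estimate of Z, the threshold τ, and the governance function g are assumptions introduced only to make the contrast explicit.

```latex
\begin{align*}
\text{recognizer-only:}\quad
  & T^{*} = \arg\max_{T} P(T \mid E) \\[4pt]
\text{acoustic world model:}\quad
  & \hat{Z} = \arg\max_{Z} P(Z \mid E), \\
  & H = \{\, h : P(h \mid \hat{Z}) \ge \tau \,\}, \\
  & D = g(H, E), \qquad
    D \in \{\text{accept},\ \text{reject},\ \text{defer},\ \text{archive}\}
\end{align*}
```

In the first line an answer always exists, because an argmax always returns something. In the second formulation H may be empty, and g may reject even a non-empty H, so refusal is a well-defined outcome rather than an error.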

The live CTC overfire screenshots make this concrete. When CTC emits hundreds of nonblank frames, the raw output can look like repeated symbolic noise. A normal UI might display it. A slightly better UI might truncate it. The bridge does something more meaningful. It classifies the event as rejected_overfire when emission density or scalar length violates the admissibility threshold. It can classify rejected_low_audio when the signal is too weak or too short. It can classify rejected_non_nko when the candidate contains inadmissible symbols. It can classify rejected_unstable when the signal exists but the stable N'Ko evidence is not strong enough. It can classify needs_label when the run has value but cannot yet become training data without human grounding. These states are not just error labels. They are the first categories of the acoustic world.
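The decision states above can be sketched as a small typed classifier. The state names come from the text; everything else is an assumption for illustration: the threshold values, the order of the checks, the `RunEvidence` fields, the `stability` score, and the rule that an otherwise admissible run without a human label becomes `needs_label`. The N'Ko check uses the Unicode N'Ko block, U+07C0 through U+07FF.

```python
from dataclasses import dataclass
from enum import Enum


class TranscriptDecision(Enum):
    ACCEPTED = "accepted"
    REJECTED_OVERFIRE = "rejected_overfire"
    REJECTED_LOW_AUDIO = "rejected_low_audio"
    REJECTED_NON_NKO = "rejected_non_nko"
    REJECTED_UNSTABLE = "rejected_unstable"
    NEEDS_LABEL = "needs_label"


@dataclass
class RunEvidence:           # hypothetical summary of one run
    duration_s: float
    rms: float               # signal level of the prepared audio
    total_frames: int
    nonblank_frames: int     # CTC frames whose argmax is not blank
    candidate: str           # decoded candidate string
    stability: float         # hypothetical 0..1 stability score
    has_label: bool          # whether a human label is attached


# Hypothetical thresholds; real values would come from calibration.
MIN_DURATION_S = 0.3
MIN_RMS = 0.01
MAX_EMISSION_DENSITY = 0.6
MAX_SCALARS = 200
MIN_STABILITY = 0.7


def is_nko(text: str) -> bool:
    """True if every non-space character lies in the Unicode N'Ko block."""
    return all(ch.isspace() or 0x07C0 <= ord(ch) <= 0x07FF for ch in text)


def classify(ev: RunEvidence) -> TranscriptDecision:
    if ev.duration_s < MIN_DURATION_S or ev.rms < MIN_RMS:
        return TranscriptDecision.REJECTED_LOW_AUDIO
    density = ev.nonblank_frames / max(ev.total_frames, 1)
    if density > MAX_EMISSION_DENSITY or len(ev.candidate) > MAX_SCALARS:
        return TranscriptDecision.REJECTED_OVERFIRE
    if not is_nko(ev.candidate):
        return TranscriptDecision.REJECTED_NON_NKO
    if ev.stability < MIN_STABILITY:
        return TranscriptDecision.REJECTED_UNSTABLE
    if not ev.has_label:
        return TranscriptDecision.NEEDS_LABEL
    return TranscriptDecision.ACCEPTED
```

Returning an enum rather than a string or a boolean keeps every outcome a first-class value that can be archived and counted, which is what lets the rejection states act as categories rather than error messages.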

That is why the AudioEvidenceBundle matters. It is not just a debug folder. It is the memory of the event. In a low-resource speech system, experience is precious even when recognition fails. A failed recording may reveal microphone gain problems. It may reveal CTC overfire. It may reveal English speech being fed into a Malinke or N'Ko-oriented path. It may reveal that trimming removed too much context. It may reveal that the head is unstable under live audio even though fixtures pass. If those failures are only seen on screen and forgotten, the system learns nothing. If they are archived as evidence packets, they become material for future governance, calibration, and training.

This is also where the memory constraint has influenced the architecture in a productive way. A system with unlimited conversational memory could keep explaining what went wrong without forcing itself to preserve evidence. Here, the pressure has pushed the work toward manifests, hashes, validators, audit files, proof summaries, and durable packets. That external memory is not incidental. It is the institutional memory of the acoustic system. The model should not merely remember in weights. It should remember through evidence.

A true acoustic world model would use those evidence packets as the training substrate. It would not only learn from accepted transcripts. It would learn from rejected events, from unstable emissions, from low-audio failures, from overfire patterns, from human labels, and from future reinterpretations. This is crucial for low-resource languages because the data desert is not only a lack of examples. It is a lack of structured, reusable machine experience. Every governed packet turns one moment of speech into an artifact that can be replayed, audited, labeled, compared, and eventually absorbed into a better model.

Featural Acoustic Coding fits naturally into this. FAC argues that phonemes are not atomic labels. A phoneme is a bundle of features. Voicing, place, manner, nasality, tone, length, vowel height, vowel backness, and rounding combine to produce the categories that writing systems represent as symbols. Human listeners exploit these features constantly. They can hear shared structure between sounds even when the sounds are not identical. A normal ASR model often learns labels as if each class were separate. An acoustic world model would learn the feature field beneath those labels.
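The idea that a phoneme is a bundle of features can be made concrete with a toy table. This sketch uses a handful of textbook consonants with IPA-style feature values; it is not the N'Ko inventory and not the project's FAC encoding, and the names `PHONEME_FEATURES`, `shared_features`, and `feature_distance` are hypothetical.

```python
from typing import Dict, Set

# Toy feature bundles (general phonetics, chosen for illustration only).
PHONEME_FEATURES: Dict[str, Dict[str, str]] = {
    "p": {"voicing": "voiceless", "place": "bilabial", "manner": "stop", "nasality": "oral"},
    "b": {"voicing": "voiced", "place": "bilabial", "manner": "stop", "nasality": "oral"},
    "m": {"voicing": "voiced", "place": "bilabial", "manner": "stop", "nasality": "nasal"},
    "t": {"voicing": "voiceless", "place": "alveolar", "manner": "stop", "nasality": "oral"},
    "s": {"voicing": "voiceless", "place": "alveolar", "manner": "fricative", "nasality": "oral"},
}


def shared_features(a: str, b: str) -> Set[str]:
    """Names of the features on which two phonemes agree."""
    fa, fb = PHONEME_FEATURES[a], PHONEME_FEATURES[b]
    return {name for name in fa if fa[name] == fb[name]}


def feature_distance(a: str, b: str) -> int:
    """Number of features on which two phonemes differ."""
    return len(PHONEME_FEATURES[a]) - len(shared_features(a, b))


print(feature_distance("p", "b"))  # 1 (voicing only)
print(feature_distance("p", "s"))  # 2 (place and manner)
```

A classifier over atomic labels treats confusing "p" with "b" and confusing "p" with "s" as equally wrong. A feature-level view records that the first pair differs in a single feature, which is the kind of shared structure the paragraph above describes.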

The current v0 bridge does not claim full FAC. That honesty matters. The present system reserves FAC slots but mainly asserts the evidence it actually has, such as CTC stability and boundary behavior. It does not yet claim that it has independently detected tone, manner, place, or voicing. But reserving those slots is not empty bureaucracy. It defines the future shape of the world model. It says that the system should eventually reason about speech as feature trajectories, not merely as emitted characters.

Tone is one of the clearest reasons this matters for N'Ko and Manding speech. Text priors can help, but tone is acoustic. A language model can guess what tone is likely from context, but the speaker produced pitch, contour, duration, and register. The future system should not throw those away before the decision layer has a chance to use them. The acoustic world model should keep the tone evidence alive long enough for the governance layer to compare acoustic evidence against textual hypotheses. In that sense, FAC is not a side note. It is the acoustic half of the N'Ko substrate.

The N'Ko surface then plays a different role from the one it plays in ordinary ASR. It is not just the output alphabet. It is a controlled symbolic substrate where claims can be rendered, audited, and compared. In the inscription architecture, the surface is not the source of truth. The typed claim and the evidence chain are the source of truth. The surface is the human-visible commitment. Speech v0 applies this same law to audio. The raw CTC string is not truth. The corrected string is not automatically truth. The accepted or rejected transcript decision, backed by archived evidence, is the truth the system is allowed to record.

This is why CC Inscriptions and the Speech Inscription Bridge belong together. CC Inscriptions supplies the pattern: evidence, then typed claim, then controlled surface, then proof scaffold. The speech bridge applies that pattern to microphone evidence. The provenance witness ties the visible rendering back to hashes and replayable files. The manifest sidecar prevents the system from silently mutating its own past. The validator refuses stale or malformed packets. The audit refuses to call the goal complete when only old v1 evidence exists. These behaviors are not administrative overhead. They are the governance machinery that makes an acoustic world model scientifically usable.
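A validator of the kind described here can be sketched in a few lines. The manifest layout (`schema_version`, `source`, `decision`, and an `artifacts` map of file name to SHA-256 hex digest) is an assumption for illustration, not the project's actual schema, and `validate_packet` is a hypothetical name.

```python
import hashlib
from typing import Dict, List

REQUIRED_FIELDS = ("schema_version", "source", "decision", "artifacts")
CURRENT_SCHEMA = 2  # assumption: v2 is the current manifest version


def validate_packet(manifest: dict, artifacts: Dict[str, bytes]) -> List[str]:
    """Return a list of problems; an empty list means the packet is admissible."""
    errors = [f"missing field: {name}"
              for name in REQUIRED_FIELDS if name not in manifest]
    if errors:
        return errors  # malformed: do not inspect further

    if manifest["schema_version"] != CURRENT_SCHEMA:
        errors.append(f"stale schema_version: {manifest['schema_version']}")

    # Every recorded hash must match the bytes that are actually present.
    for name, expected in manifest["artifacts"].items():
        blob = artifacts.get(name)
        if blob is None:
            errors.append(f"missing artifact: {name}")
        elif hashlib.sha256(blob).hexdigest() != expected:
            errors.append(f"hash mismatch: {name}")
    return errors
```

Returning the full list of problems, rather than raising on the first one, means a refused packet still yields a record of why it was refused.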

Codec-token speech models are relevant because they point toward richer acoustic representations. A CTC recognizer collapses speech into symbols quickly. A codec-token model preserves enough information to reconstruct speech, which means it tends to retain timing, prosody, speaker cues, and tone-bearing structure. That is closer to what an acoustic world model needs. But codec tokens alone are not the answer. Preservation is not governance. A speech-token transformer may keep more signal, but it still needs a decision system that asks which explanations survive the evidence. The N'Ko architecture needs both richer acoustic state and stricter claim governance.

The word "world" should be taken seriously. A world model learns the regularities of a domain. In vision, a world model might learn that objects persist, move, occlude one another, collide, and reappear. In speech, an acoustic world model should learn that speakers produce continuous articulatory gestures, that sounds coarticulate, that vowels carry formant structure, that tone can stretch across syllables, that silence and breath mark boundaries, that microphones distort amplitude and spectrum, that noise changes confidence, and that the same phoneme may appear differently across speakers and contexts. These are not words. They are acoustic facts.

Once the system has that kind of internal model, recognition becomes a governed explanation of acoustic reality rather than a direct symbol prediction. The system hears an event. It constructs latent acoustic structure. It proposes feature trajectories. It composes phoneme hypotheses. It renders possible N'Ko surfaces. It tests those surfaces against the evidence. It accepts, rejects, or defers. That is the architecture we are moving toward.

This also reframes translation. The current bridge does not mean that reliable real-time English translation already exists. Translation requires stable recognition or a direct speech-to-meaning model. At the moment, the proved layer is on-device acoustic serving plus governed inscription. That is progress toward translation because it prevents the system from building translation on top of unstable hallucinated transcripts. But it is not translation yet. The correct claim is that the system can run the acoustic path on device, archive live evidence, classify transcript admissibility, and refuse garbage as language. Real-time translation comes after the acoustic evidence can support stable meaning.

Speaking Malinke into the app is still the correct direction. English phrases like "testing testing testing" can test microphone capture, pipeline latency, and failure behavior, but they cannot validate a Malinke or N'Ko-oriented recognizer. If the acoustic head is trained for N'Ko or Manding evidence, English should not be expected to produce meaningful N'Ko output. The right calibration process is short Malinke phrases, expected labels when available, and saved packets for every run. The early goal is not to be correct every time. The early goal is to make every run become evidence rather than confusion.

This is where low-resource language work becomes different from ordinary product ASR. In a high-resource setting, one might discard failed runs because there is plenty of labeled data elsewhere. In a low-resource setting, failed runs are part of the path. A rejected_overfire packet teaches a different lesson from a rejected_low_audio packet. A needs_label packet invites human knowledge into the system. A rejected_unstable packet may become useful later when the acoustic world model improves. The system should not only optimize for immediate success. It should optimize for governed accumulation of experience.

The acoustic world model also explains why ANE, TurboQuant, and latency are separate proof lanes. They matter for serving, but they are not correctness. Faster garbage is still garbage. Hardware acceleration helps only after the evidence and decision layers are trustworthy. TurboQuant may eventually reduce cost or latency, but it does not decide whether an acoustic explanation is true. The world model and inscription bridge define what can be claimed. Serving optimizations define how efficiently the claim machinery runs.

The current state is therefore neither failure nor completion. It is a transition point. The on-device ASR serving proof is real. The live mic harness is real. The v2 manifest schema is real. The typed transcript decision system is real. The controlled proof rendering is real. The validator and audit gates are real. The remaining live proof gap is also real. The latest copied iPhone packets were old v1 calibration packets, not fresh v2 live_mic packets. The audit correctly refused completion. That refusal is part of the architecture working.
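The refusal described above amounts to a completion gate over the copied packets. A minimal sketch, assuming each packet manifest carries a `schema_version` integer and a `source` string; those field names and the function `audit_live_proof` are hypothetical.

```python
from typing import Iterable, Mapping, Tuple

REQUIRED_SCHEMA = 2          # assumption: v2 manifests carry schema_version == 2
REQUIRED_SOURCE = "live_mic"


def audit_live_proof(manifests: Iterable[Mapping]) -> Tuple[bool, str]:
    """Pass only if at least one fresh v2 live_mic packet is present."""
    packets = list(manifests)
    fresh = [m for m in packets
             if m.get("schema_version") == REQUIRED_SCHEMA
             and m.get("source") == REQUIRED_SOURCE]
    if fresh:
        return True, f"{len(fresh)} fresh v2 live_mic packet(s) found"
    return False, f"no v2 live_mic packets among {len(packets)} copied packet(s)"


# Old v1 calibration packets alone do not close the gap.
print(audit_live_proof([{"schema_version": 1, "source": "calibration"}]))
```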

The deeper point is that the model is becoming more than a neural network. It is becoming a governed relationship between sound, evidence, explanation, decision, surface, and memory. The CoreML encoder is a serving component. The CTC head is a proposal mechanism. The ranker is a bounded correction mechanism. The Speech Inscription Bridge is a governance mechanism. The future acoustic world model is the representation layer that will make the proposals more grounded and the decisions more informative.

If this succeeds, the system will not merely transcribe Malinke into N'Ko. It will preserve acoustic experience in a form that can improve over time. It will hear speech as structured evidence. It will learn feature trajectories. It will compose phonemic explanations. It will render N'Ko surfaces only when the evidence supports them. It will reject unsupported outputs without shame. It will recycle admissible packets into future training. That is a very different architecture from a transcript printer.

For N'Ko, that distinction is central. The script is not only an output format. It is a phonemic and cultural substrate that can make speech evidence auditable. The acoustic world model supplies the missing acoustic depth beneath that substrate. The inscription bridge supplies the governance that prevents the system from lying. Together, they point toward a low-resource speech system that does not depend only on scale. It depends on evidence, compositional features, controlled surfaces, and memory.

That is the real meaning of the acoustic world model in this architecture. It is the move from speech as text prediction to speech as governed acoustic reality. It says the system should not rush from waveform to words. It should first understand the world of sound that produced the waveform, preserve that evidence, construct possible explanations, govern those explanations, and only then render a claim. The current bridge is the first concrete version of that law. The future model will make the law more intelligent.

Source Anchor

nko-brain-scanner/experiments/acoustic_gate/ACOUSTIC-WORLD-MODEL-ESSAY.md