Acoustic World Models and the N'Ko Speech Inscription Bridge

Extracted abstract or opening context

An acoustic world model is the deeper architecture implied by the work so far. It is not merely a better recognizer, and it is not simply another decoder sitting after Whisper. It is a different way of deciding what speech is inside the system. A normal ASR pipeline treats speech as a signal whose purpose is to become text. The model receives audio, compresses it into features, predicts symbols, and then the rest of the system tries to clean up those symbols. An acoustic world model treats speech as a world of acoustic events. Text becomes one possible explanation of that world, not the world itself.

This shift matters because the current N'Ko system has already shown the limit of the recognizer-only framing. The iPhone proof established that the acoustic serving stack can run on device. Audio can move through a Whisper-style mel frontend, a split CoreML encoder, an anchor CTC head, greedy decoding, bounded correction, and a Swift ranker. The live harness can record microphone input, route it into the same machinery, and produce files that can be copied back from the phone. That is a real serving milestone.

But the visible live output also showed that a completed inference path can still produce repetitive garbage. The system can finish the computation and still fail to know what it is allowed to say. That is the moment where an acoustic world model becomes necessary. The failure is not only a decoder failure. It is an ontology failure. If the system believes that the primary object is the transcript, then any decoded string looks like an answer. If the system believes that the primary object is acoustic evidence, then the decoded string is only a hypothesis. The hypothesis can be accepted, rejected, deferred, or archived. The Speech Inscription Bridge v0 is the first implementation of that second worldview.
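To make the "decoded string is only a hypothesis" idea concrete, here is a minimal Python sketch. It pairs standard greedy CTC decoding (per-frame argmax, collapse repeats, drop blanks) with a typed decision. The names `Decision`, `Hypothesis`, and `classify`, the blank index of 0, and the distinct-token threshold are all assumptions made for illustration; they are not taken from the bridge's actual code, and the repetition check is only a stand-in heuristic for whatever gating the real system applies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Sequence


class Decision(Enum):
    # The four outcomes named in the text; the enum itself is hypothetical.
    ACCEPTED = "accepted"
    REJECTED = "rejected"
    DEFERRED = "deferred"
    ARCHIVED = "archived"


@dataclass(frozen=True)
class Hypothesis:
    tokens: List[int]
    decision: Decision
    reason: str


def greedy_ctc_decode(logits: Sequence[Sequence[float]], blank: int = 0) -> List[int]:
    """Greedy CTC: take the argmax per frame, collapse repeats, drop blanks."""
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    tokens: List[int] = []
    prev = None
    for idx in path:
        if idx != prev and idx != blank:
            tokens.append(idx)
        prev = idx
    return tokens


def classify(tokens: List[int], min_distinct_ratio: float = 0.3) -> Hypothesis:
    """Assumed heuristic: treat a decode dominated by very few distinct
    tokens as repetitive output and refuse to promote it to a transcript."""
    if not tokens:
        return Hypothesis(tokens, Decision.DEFERRED, "no non-blank tokens decoded")
    distinct_ratio = len(set(tokens)) / len(tokens)
    if distinct_ratio < min_distinct_ratio:
        return Hypothesis(tokens, Decision.REJECTED, "decode is highly repetitive")
    return Hypothesis(tokens, Decision.ACCEPTED, "passed repetition check")


if __name__ == "__main__":
    frames = [[0.1, 0.8, 0.1], [0.1, 0.8, 0.1], [0.9, 0.05, 0.05], [0.1, 0.1, 0.8]]
    print(classify(greedy_ctc_decode(frames)))
```

In this sketch `ARCHIVED` is never produced by `classify`; it is left for a later storage step, since the text lists it as a possible outcome without describing when it applies.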
The current bridge says that every live or recorded ASR run should compile into audio evidence, a typed transcript decision, an acoustic and FAC claim scaffold, and a controlled N'Ko proof rendering. That structure is important because it changes the meaning of a run. A run is no longer a black box that produces text. A run becomes an evidential event. The captured waveform, prepared waveform, logits, argmax path, hashes, manifest, and proof rendering all become parts of a durable record. The transcript is not allowed to float away from the evidence that produced it.

The acoustic world model is the future representation engine that belongs inside this evidential structure. It would learn the structure of acoustic reality before the system commits to symbols. In the current system, the Whisper-style encoder and CTC head are doing most of the representational work. They produce hidden states, logits, and candidate symbols. A future acoustic wor
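The durable record described above (waveforms, logits, argmax path, hashes, manifest) can be illustrated with a small Python sketch that hashes each artifact and then hashes the manifest itself, so the transcript decision stays tied to the bytes that produced it. The field names, the artifact file names, and the functions `build_manifest` and `verify_manifest` are hypothetical; the bridge's real manifest schema is not given in the text.

```python
import hashlib
import json
from typing import Dict


def sha256_hex(data: bytes) -> str:
    """Hex-encoded SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


def _canonical(body: dict) -> bytes:
    # Sorted keys and fixed separators give a stable byte form to hash.
    return json.dumps(body, sort_keys=True, separators=(",", ":")).encode("utf-8")


def build_manifest(run_id: str, artifacts: Dict[str, bytes], decision: str) -> dict:
    """Record a digest and size per artifact, then seal the whole record."""
    entries = {
        name: {"sha256": sha256_hex(blob), "bytes": len(blob)}
        for name, blob in sorted(artifacts.items())
    }
    body = {"run_id": run_id, "decision": decision, "artifacts": entries}
    return {**body, "manifest_sha256": sha256_hex(_canonical(body))}


def verify_manifest(manifest: dict, artifacts: Dict[str, bytes]) -> bool:
    """True only if the manifest is unmodified and every artifact matches."""
    body = {k: v for k, v in manifest.items() if k != "manifest_sha256"}
    if sha256_hex(_canonical(body)) != manifest.get("manifest_sha256"):
        return False
    if set(artifacts) != set(body["artifacts"]):
        return False
    return all(
        sha256_hex(blob) == body["artifacts"][name]["sha256"]
        for name, blob in artifacts.items()
    )


if __name__ == "__main__":
    # Placeholder bytes stand in for real audio and logits files.
    run = {"captured.wav": b"raw", "prepared.wav": b"prep", "logits.bin": b"\x00\x01"}
    manifest = build_manifest("run-0", run, "deferred")
    print(verify_manifest(manifest, run))
```

Because the decision is part of the sealed body, changing either an artifact or the recorded decision after the fact makes verification fail in this sketch.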

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.