Grand Diomande Research · Full HTML Reader

Stage 0: Research — Voice-First Agent Architecture

Agents That Account for Themselves · architecture · technical paper candidate · score 32 · .md

System Under Study

The mesh currently has 8 distinct voice subsystems spread across iOS apps, macOS services, and backend flows. They are architecturally isolated — no subsystem talks to another. The terminal agents (Claude panes, Prefect flows, Discord bots) communicate exclusively through text. Voice exists at the edge (phone, glasses) but doesn't penetrate the mesh core.

Fact Inventory

1. Eight Voice Subsystems

| # | System | Location | Voice Capability | State |
|---|--------|----------|------------------|-------|
| 1 | OpenClawHub DirectVoice | iOS app | Full STT→intent→Clawdbot→TTS loop | WORKING |
| 2 | OpenClawHub Fleet Voice | iOS (Glasses Gateway) | Fleet control: status, resume, inject, kill, spawn | WORKING |
| 3 | OpenClawHub QuadView | iOS | Push-to-talk → pane delegation | UI complete |
| 4 | SpeakFlow | macOS + iOS keyboard | Global hotkey→STT→text injection + "hey claw" trigger | WORKING |
| 5 | SecuriClaw WakeWord | iOS | Continuous wake phrase detection | WORKING |
| 6 | Spore VoiceCapture | iOS | Voice→idea creation with keyword extraction | WORKING |
| 7 | Voice Task Daemon | Mac1 LaunchAgent | Supabase mac_tasks → TTY pane injection | ACTIVE |
| 8 | Transcription Intel | Prefect flow | Video transcripts → intelligence extraction | ACTIVE |

2. Voice Models/APIs in Use

| Model/API | Where | Purpose |
|-----------|-------|---------|
| Apple SFSpeechRecognizer (on-device) | All iOS/macOS apps | STT |
| ElevenLabs eleven_turbo_v2_5 | OpenClawHub (primary TTS) | High-quality agent speech |
| AVSpeechSynthesizer | SpeakFlow, OpenClawHub fallback | System TTS |
| Gemini 2.0 Flash Live (WebSocket) | OpenClawHub GeminiObserver | Ambient audio+video intelligence |
| OpenAI TTS (6 voices) | Speak CLI reader | File reading |
| macOS `say` + Edge TTS | LearnNKo | Language learning pronunciation |
| CoreML MohamedSpeakerID | OpenClawHub | Owner voice identification |
| Custom MFCC (vDSP FFT) | OpenClawHub | Speaker voiceprint matching |

3. Current Voice Data Flow

iPhone mic → Apple STT → VoiceRouter (35 intents) → Clawdbot → Claude → ElevenLabs → speaker
iPhone mic → FleetVoiceRouter (fleet intents) → Aura gateway → mesh dispatch
Phone/glasses → /quick → Supabase mac_tasks → voice_task_daemon → TTY injection
Video content → ab-browser reel_ingest → transcription_intel_pipeline → agent tasks
SpeakFlow hotkey → STT → AX text injection OR "hey claw" → Clawdbot HTTP

4. Intent Classification Systems

VoiceRouter (iOS): 35 intents mapping spoken keywords to ThreadCategory. Covers all projects (Koji, Milkmen, Spore, Serenity, CreativeDirector, CompCore, CogTwin, NKo, etc.). Also handles explicit channel routing.
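A minimal Python sketch of this kind of keyword routing, to make the idea concrete. It is not the iOS implementation: the keyword table lists only a handful of the project names mentioned above, and the "send to X" phrasing for explicit channel routing is an assumption.

```python
import re
from typing import Optional, Tuple

# Hypothetical keyword table. The real router maps 35 intents; only a few
# project names from the text are shown here.
KEYWORDS = {
    "koji": "Koji",
    "milkmen": "Milkmen",
    "spore": "Spore",
    "serenity": "Serenity",
    "nko": "NKo",
}

# Assumed phrasing for explicit channel routing, e.g. "send to spore: ...".
EXPLICIT = re.compile(r"^(?:send|route) to (\w+)\b[:,]?\s*(.*)$", re.IGNORECASE)


def route(transcript: str) -> Tuple[Optional[str], str]:
    """Return (category, message); category is None when nothing matches."""
    text = transcript.strip()
    match = EXPLICIT.match(text)
    if match and match.group(1).lower() in KEYWORDS:
        # Explicit routing wins and strips the routing prefix from the message.
        return KEYWORDS[match.group(1).lower()], match.group(2)
    # Otherwise the first spoken keyword found in the transcript decides.
    for word in re.findall(r"[a-z0-9']+", text.lower()):
        if word in KEYWORDS:
            return KEYWORDS[word], text
    return None, text
```

For example, `route("send to spore: idea about caching")` yields the `Spore` category with the prefix removed, while an utterance with no known keyword returns `None` and is left for a fallback handler.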

FleetVoiceRouter (iOS): Fleet-specific intents: `status`, `resume(target)`, `resumeAll`, `inject(target, command)`, `converge`, `diverge(prompt)`, `focus(target)`, `spawn(prompt)`, `kill(target)`, `fleetHealth`, `checkAbsorbing`, `unstickPane`, `checkFocus`, `checkTrajectory`. Has pronoun resolution (`lastTarget`) and destructive command confirmation (1.5s cancel window).
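The pronoun resolution and cancel window can be sketched in Python as follows. Only the `lastTarget` idea and the 1.5s window come from the description above; the class names `FleetRouter` and `FleetCommand`, the pronoun list, the "verb target" grammar, the spoken word "cancel", and the choice of `kill` as the only destructive intent are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass
from typing import Optional

TARGETED = {"resume", "focus", "kill"}   # intents that take a single target
DESTRUCTIVE = {"kill"}                   # assumption: only kill needs confirmation
PRONOUNS = {"it", "that", "them"}        # assumed pronoun set
CANCEL_WINDOW_S = 1.5                    # cancel window stated in the text


@dataclass
class FleetCommand:
    intent: str
    target: Optional[str] = None
    argument: Optional[str] = None


class FleetRouter:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.last_target: Optional[str] = None
        self.pending = None  # (command, release_time) for a destructive command

    def _resolve(self, word: str) -> Optional[str]:
        # A pronoun refers back to the last named target; a name updates it.
        if word in PRONOUNS:
            return self.last_target
        self.last_target = word
        return word

    def parse(self, utterance: str) -> Optional[FleetCommand]:
        words = utterance.split()
        if not words:
            return None
        verb = words[0].lower()
        if verb == "status":
            return FleetCommand("status")
        if verb in TARGETED and len(words) >= 2:
            target = self._resolve(words[1].lower())
            return FleetCommand(verb, target) if target else None
        if verb == "inject" and len(words) >= 3:
            target = self._resolve(words[1].lower())
            if target is None:
                return None
            return FleetCommand("inject", target, " ".join(words[2:]))
        return None

    def submit(self, utterance: str) -> Optional[FleetCommand]:
        """Return a command to run now, or None if queued, cancelled, or unknown."""
        if utterance.strip().lower() == "cancel":
            self.pending = None
            return None
        command = self.parse(utterance)
        if command is not None and command.intent in DESTRUCTIVE:
            self.pending = (command, self.clock() + CANCEL_WINDOW_S)
            return None
        return command

    def poll(self) -> Optional[FleetCommand]:
        """Release the pending destructive command once its window has elapsed."""
        if self.pending and self.clock() >= self.pending[1]:
            command, _ = self.pending
            self.pending = None
            return command
        return None
```

The clock is injected so the cancel window can be exercised without sleeping. A caller would invoke `poll()` on a timer and dispatch whatever it returns.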

SpeakFlow VoiceCommandService: ~15 commands: "delete that", "new line", "select all", "switch to NKo", "hey claw/claude [command]".

5. Speaker Identification

OpenClawHub SpeakerIDService (418 lines):
- Custom MFCC pipeline: vDSP FFT → 26-filter Mel filterbank → 13 MFCC coefficients → cosine similarity
- Optional CoreML model: `MohamedSpeakerID`
- Enrolled voiceprint in UserDefaults
- Used as "speaker gate" in always-on mode — unknown speakers filtered
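The pipeline above can be approximated in pure Python to show the shape of the computation. This is a sketch, not the Swift service: a naive DFT stands in for the vDSP FFT, a single frame stands in for whatever framing and averaging the real code does, and the 16 kHz sample rate, Hamming window, and 0.85 gate threshold are assumed values. The filter and coefficient counts (26 and 13) are taken from the list.

```python
import cmath
import math
from typing import List

N_FILTERS = 26  # Mel filters, from the text
N_MFCC = 13     # coefficients kept, from the text


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame: List[float]) -> List[float]:
    """Hamming-windowed power spectrum for bins 0..N/2 (naive DFT, not an FFT)."""
    n = len(frame)
    windowed = [x * (0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_bins: int, sample_rate: int) -> List[List[float]]:
    """Triangular filters evenly spaced on the mel scale from 0 Hz to Nyquist."""
    nyquist = sample_rate / 2
    top = hz_to_mel(nyquist)
    # Filter edges as fractional positions on the spectrum's bin axis.
    edges = [mel_to_hz(top * i / (N_FILTERS + 1)) / nyquist * (n_bins - 1)
             for i in range(N_FILTERS + 2)]
    bank = []
    for f in range(N_FILTERS):
        lo, mid, hi = edges[f], edges[f + 1], edges[f + 2]
        row = []
        for b in range(n_bins):
            if lo < b <= mid:
                row.append((b - lo) / (mid - lo))   # rising slope
            elif mid < b < hi:
                row.append((hi - b) / (hi - mid))   # falling slope
            else:
                row.append(0.0)
        bank.append(row)
    return bank


def mfcc(frame: List[float], sample_rate: int = 16000) -> List[float]:
    spec = power_spectrum(frame)
    bank = mel_filterbank(len(spec), sample_rate)
    # Floor the energies so an empty filter does not produce log(0).
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in bank]
    # DCT-II over the log filterbank energies, keeping the first 13 terms.
    return [sum(e * math.cos(math.pi * k * (m + 0.5) / N_FILTERS)
                for m, e in enumerate(log_e))
            for k in range(N_MFCC)]


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_owner(frame: List[float], enrolled: List[float], threshold: float = 0.85) -> bool:
    """Speaker gate: accept the frame only if it is close to the enrolled voiceprint."""
    return cosine_similarity(mfcc(frame), enrolled) >= threshold
```

In this sketch enrollment amounts to storing the output of `mfcc` for a reference recording, and the gate compares later frames against that stored vector.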

6. Critical Gaps

1. Terminal agents have NO voice I/O. Claude in a pane cannot speak or listen.
2. No mesh-level TTS. Agent task completion → text to Discord. No spoken announcements.
3. No Mac-side microphone listener. voice_task_daemon bridges phone→pane but Mac itself has no mic daemon. Cannot say "hey claw" at the desk without the phone.
4. SpeakFlow is not mesh-connected. Routes to Clawdbot but doesn't know pane states, task queues, or orchestrator.
5. VoiceRouter and FleetVoiceRouter are iOS-only. No Python equivalent for Mac/cloud.
6. Speaker ID is device-local. MFCC voiceprint in UserDefaults, not shared across mesh.
7. Gemini Live observer is siloed. Ambient audio+video context feeds OpenClawHub only, not Perception Mesh or Evolution World.
8. No voice memory. Voice conversations in DirectVoice use in-memory history, not written to Obsidian vault or knowledge graph.
9. No voice interruption. Cannot say "stop" to halt a running Claude pane.
10. Serenity "voice" is content style, not audio. voice_spec/voice_score/voice_gate governs writing tone, not speech.

7. Existing Integration Points (Live Wires)

| Integration | What It Does |
|-------------|--------------|
| VoiceRouter (iOS) | Intent → ThreadCategory (35 intents) |
| FleetVoiceRouter (iOS) | Fleet command + pronoun resolution |
| voice_task_daemon.py | Supabase mac_tasks → TTY injection |
| Aura gateway :8095 /dispatch | Phone → mesh command |
| Clawdbot gateway :18789 | Universal agent endpoint |
| NUMU WebSocket :7890 | Cross-service event bus |
| Mesh Event Bus :8600 | Cross-machine propagation |
| Pane injector | Terminal TTY injection |
| ElevenLabs API | TTS (voice ID configured) |
| GeminiLive WebSocket | Real-time audio+video |
| SpeakerIDService | Voiceprint enrollment |
| Mac4/Mac5 exo cluster | Local compute for Whisper |
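As an illustration of how a Mac-side component might reach one of these integration points, the sketch below builds a POST request for the Aura gateway. Only the port and the `/dispatch` path come from the list above. The host, the JSON field names, and the function names are hypothetical, since the actual request schema is not documented here.

```python
import json
import urllib.request
from typing import Optional, Tuple

# Port and path from the integration list; the host is an assumption.
AURA_DISPATCH_URL = "http://localhost:8095/dispatch"


def build_dispatch(intent: str, target: Optional[str] = None,
                   source: str = "voice") -> urllib.request.Request:
    """Build (but do not send) a dispatch request with a hypothetical payload."""
    payload = {"intent": intent, "target": target, "source": source}
    return urllib.request.Request(
        AURA_DISPATCH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def dispatch(request: urllib.request.Request, timeout: float = 5.0) -> Tuple[int, str]:
    """Send the request and return (HTTP status, response body)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status, response.read().decode("utf-8")
```

Keeping request construction separate from sending lets the payload be inspected or logged before anything reaches the mesh.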

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage0-research.md
