Grand Diomande Research · Full HTML Reader

Stage 0: Research — Voice-First Agent Architecture

The mesh currently has eight distinct voice subsystems spread across iOS apps, macOS services, and backend flows. They are architecturally isolated: no subsystem talks to another. The terminal agents (Claude panes, Prefect flows, Discord bots) communicate exclusively through text. Voice exists at the edge (phone, glasses) but does not reach the mesh core.

System Under Study

Fact Inventory

1. Eight Voice Subsystems

#	System	Location	Voice Capability	State
1	OpenClawHub DirectVoice	iOS app	Full STT→intent→Clawdbot→TTS loop	WORKING
2	OpenClawHub Fleet Voice	iOS (Glasses Gateway)	Fleet control: status, resume, inject, kill, spawn	WORKING
3	OpenClawHub QuadView	iOS	Push-to-talk → pane delegation	UI complete
4	SpeakFlow	macOS + iOS keyboard	Global hotkey→STT→text injection + "hey claw" trigger	WORKING
5	SecuriClaw WakeWord	iOS	Continuous wake phrase detection	WORKING
6	Spore VoiceCapture	iOS	Voice→idea creation with keyword extraction	WORKING
7	Voice Task Daemon	Mac1 LaunchAgent	Supabase mac_tasks → TTY pane injection	ACTIVE
8	Transcription Intel	Prefect flow	Video transcripts → intelligence extraction	ACTIVE

2. Voice Models/APIs in Use

Model/API	Where	Purpose
Apple SFSpeechRecognizer (on-device)	All iOS/macOS apps	STT
ElevenLabs eleven_turbo_v2_5	OpenClawHub (primary TTS)	High-quality agent speech
AVSpeechSynthesizer	SpeakFlow, OpenClawHub fallback	System TTS
Gemini 2.0 Flash Live (WebSocket)	OpenClawHub GeminiObserver	Ambient audio+video intelligence
OpenAI TTS (6 voices)	Speak CLI reader	File reading
macOS `say` + Edge TTS	LearnNKo	Language learning pronunciation
CoreML MohamedSpeakerID	OpenClawHub	Owner voice identification
Custom MFCC (vDSP FFT)	OpenClawHub	Speaker voiceprint matching

3. Current Voice Data Flow

iPhone mic → Apple STT → VoiceRouter (35 intents) → Clawdbot → Claude → ElevenLabs → speaker
iPhone mic → FleetVoiceRouter (fleet intents) → Aura gateway → mesh dispatch
Phone/glasses → /quick → Supabase mac_tasks → voice_task_daemon → TTY injection
Video content → ab-browser reel_ingest → transcription_intel_pipeline → agent tasks
SpeakFlow hotkey → STT → AX text injection OR "hey claw" → Clawdbot HTTP
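
The third flow (Supabase mac_tasks → voice_task_daemon → TTY injection) can be read as a poll-and-inject loop. The following is a minimal sketch of one poll cycle, not the daemon's actual code. The `MacTask` row shape, the status values, and the `fetch_pending` and `inject` callables are all hypothetical stand-ins, since the source does not show the table schema or the Supabase and pane-injector calls.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class MacTask:
    # Hypothetical row shape; the real mac_tasks schema is not given in the source.
    id: int
    pane: str
    command: str
    status: str = "pending"


def drain_once(
    fetch_pending: Callable[[], Iterable[MacTask]],
    inject: Callable[[str, str], bool],
) -> List[int]:
    """Run one poll cycle: inject every pending task, return the ids that succeeded."""
    done = []
    for task in fetch_pending():
        if task.status != "pending":
            continue  # already handled in an earlier cycle
        if inject(task.pane, task.command):
            task.status = "done"
            done.append(task.id)
        else:
            task.status = "failed"
    return done


# In-memory stand-ins for the Supabase table and the pane injector.
queue = [
    MacTask(1, "pane-2", "run tests"),
    MacTask(2, "pane-9", "git status"),
]
injected = []


def fake_inject(pane: str, command: str) -> bool:
    if pane == "pane-9":  # pretend this pane does not exist
        return False
    injected.append((pane, command))
    return True


completed = drain_once(lambda: queue, fake_inject)
```

Passing the fetch and inject steps in as callables keeps the loop testable without a database or a live terminal, which is the main reason for structuring the sketch this way.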

4. Intent Classification Systems

VoiceRouter (iOS): 35 intents mapping spoken keywords to ThreadCategory. Covers all projects (Koji, Milkmen, Spore, Serenity, CreativeDirector, CompCore, CogTwin, NKo, etc.). Also handles explicit channel routing.
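
To make the keyword-to-category mapping concrete, here is a small sketch of how such a router might look. It is illustrative only: the `ThreadCategory` cases are a subset of the projects named above plus an assumed `GENERAL` fallback, and the keyword table and the "to <project>" phrasing for explicit routing are hypothetical, since the source does not reproduce the real 35-intent table.

```python
import re
from enum import Enum


class ThreadCategory(Enum):
    # A few of the projects named above; GENERAL is an assumed fallback case.
    KOJI = "koji"
    SPORE = "spore"
    SERENITY = "serenity"
    NKO = "nko"
    GENERAL = "general"


# Hypothetical keyword table; the real intent table is not shown in the source.
KEYWORDS = {
    "koji": ThreadCategory.KOJI,
    "spore": ThreadCategory.SPORE,
    "idea": ThreadCategory.SPORE,
    "serenity": ThreadCategory.SERENITY,
    "nko": ThreadCategory.NKO,
}


def route(transcript: str) -> ThreadCategory:
    """Map a transcript to a category, preferring explicit 'to <project>' routing."""
    words = re.findall(r"[a-z0-9']+", transcript.lower())
    # Explicit routing wins over incidental keyword hits earlier in the sentence.
    for prev, word in zip(words, words[1:]):
        if prev == "to" and word in KEYWORDS:
            return KEYWORDS[word]
    # Otherwise the first keyword that appears decides the category.
    for word in words:
        if word in KEYWORDS:
            return KEYWORDS[word]
    return ThreadCategory.GENERAL
```

For example, `route("Send this idea to Koji")` resolves to `ThreadCategory.KOJI` even though "idea" appears first, because the explicit routing pass runs before the keyword pass.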

FleetVoiceRouter (iOS): Fleet-specific intents: `status`, `resume(target)`, `resumeAll`, `inject(target, command)`, `converge`, `diverge(prompt)`, `focus(target)`, `spawn(prompt)`, `kill(target)`, `fleetHealth`, `checkAbsorbing`, `unstickPane`, `checkFocus`, `checkTrajectory`. Has pronoun resolution (`lastTarget`) and destructive command confirmation (1.5s cancel window).
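
Pronoun resolution and the cancel window are easier to follow in code. The sketch below parses a handful of the fleet intents listed above and is not the iOS implementation. The intent names and the 1.5 s window come from the text; the spoken phrase grammar, the pronoun set, and treating only `kill` as destructive are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional

DESTRUCTIVE = {"kill"}            # assumption: which intents need confirmation
PRONOUNS = {"it", "that", "them"}  # assumption: which words refer to lastTarget
CANCEL_WINDOW_S = 1.5              # cancel window stated in the text


@dataclass
class FleetCommand:
    intent: str
    target: Optional[str] = None
    payload: Optional[str] = None
    delay_s: float = 0.0  # time to wait for a spoken "cancel" before executing


@dataclass
class FleetParser:
    last_target: Optional[str] = None

    def _resolve(self, token: str) -> Optional[str]:
        # A pronoun reuses the previous target; a name becomes the new one.
        if token in PRONOUNS:
            return self.last_target
        self.last_target = token
        return token

    def parse(self, transcript: str) -> Optional[FleetCommand]:
        text = transcript.lower().strip()
        if text in {"status", "fleet status"}:
            return FleetCommand("status")
        if text == "resume all":
            return FleetCommand("resumeAll")
        m = re.fullmatch(r"(resume|kill|focus) (\S+)", text)
        if m:
            target = self._resolve(m.group(2))
            if target is None:
                return None  # pronoun with nothing to refer to
            intent = m.group(1)
            delay = CANCEL_WINDOW_S if intent in DESTRUCTIVE else 0.0
            return FleetCommand(intent, target, delay_s=delay)
        m = re.fullmatch(r"(?:tell|inject) (\S+) (?:to )?(.+)", text)
        if m:
            target = self._resolve(m.group(1))
            if target is None:
                return None
            return FleetCommand("inject", target, payload=m.group(2))
        return None
```

With this shape, "focus alpha" followed by "kill it" yields a `kill` command for `alpha` with `delay_s` set to 1.5, while "kill it" on a fresh parser returns `None` because there is no previous target to resolve.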

SpeakFlow VoiceCommandService: ~15 commands: "delete that", "new line", "select all", "switch to NKo", "hey claw/claude [command]".

5. Speaker Identification

OpenClawHub SpeakerIDService (418 lines):
- Custom MFCC pipeline: vDSP FFT → 26-filter Mel filterbank → 13 MFCC coefficients → cosine similarity
- Optional CoreML model: `MohamedSpeakerID`
- Enrolled voiceprint in UserDefaults
- Used as "speaker gate" in always-on mode — unknown speakers filtered
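
The pipeline above (spectrum → 26-filter Mel filterbank → 13 MFCCs → cosine similarity) can be sketched in plain Python. This is a rough single-frame illustration rather than the SpeakerIDService code: it swaps the vDSP FFT for a naive DFT, and the sample rate, frame size, Hann window, and the 0.9 gate threshold are all assumptions not stated in the source.

```python
import cmath
import math
from typing import List

SAMPLE_RATE = 16_000  # assumption
FRAME = 256           # assumption
N_FILTERS = 26        # from the text
N_MFCC = 13           # from the text


def hz_to_mel(hz: float) -> float:
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel: float) -> float:
    return 700.0 * (10 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame: List[float]) -> List[float]:
    """Hann-window the frame and return its one-sided power spectrum (naive DFT)."""
    n = len(frame)
    windowed = [s * (0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)))
                for i, s in enumerate(frame)]
    spec = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                  for i, x in enumerate(windowed))
        spec.append(abs(acc) ** 2 / n)
    return spec


def mel_energies(spec: List[float]) -> List[float]:
    """Apply N_FILTERS triangular filters spaced evenly on the mel scale, then log."""
    nyquist = SAMPLE_RATE / 2
    top = hz_to_mel(nyquist)
    edges = [mel_to_hz(top * i / (N_FILTERS + 1)) for i in range(N_FILTERS + 2)]
    energies = []
    for m in range(1, N_FILTERS + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        total = 0.0
        for k, p in enumerate(spec):
            f = k * nyquist / (len(spec) - 1)
            w = min((f - lo) / (mid - lo), (hi - f) / (hi - mid))
            if w > 0:
                total += w * p
        energies.append(math.log(total + 1e-10))  # floor avoids log(0)
    return energies


def mfcc(frame: List[float]) -> List[float]:
    """DCT-II of the log mel energies, keeping the first N_MFCC coefficients."""
    log_e = mel_energies(power_spectrum(frame))
    return [sum(e * math.cos(math.pi * c * (m + 0.5) / N_FILTERS)
                for m, e in enumerate(log_e))
            for c in range(N_MFCC)]


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def is_owner(frame: List[float], enrolled: List[float], threshold: float = 0.9) -> bool:
    """Speaker gate: accept the frame only if it is close to the enrolled voiceprint."""
    return cosine(mfcc(frame), enrolled) >= threshold
```

A practical voiceprint would presumably average many frames before comparison; the single-frame form here only shows how the stages connect.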

6. Critical Gaps

1. Terminal agents have NO voice I/O. Claude in a pane cannot speak or listen.
2. No mesh-level TTS. Agent task completion → text to Discord. No spoken announcements.
3. No Mac-side microphone listener. voice_task_daemon bridges phone→pane but Mac itself has no mic daemon. Cannot say "hey claw" at the desk without the phone.
4. SpeakFlow is not mesh-connected. Routes to Clawdbot but doesn't know pane states, task queues, or orchestrator.
5. Voice Router and Fleet Router are iOS-only. No Python equivalent for Mac/cloud.
6. Speaker ID is device-local. MFCC voiceprint in UserDefaults, not shared across mesh.
7. Gemini Live observer is siloed. Ambient audio+video context feeds OpenClawHub only, not Perception Mesh or Evolution World.
8. No voice memory. Voice conversations in DirectVoice use in-memory history, not written to Obsidian vault or knowledge graph.
9. No voice interruption. Cannot say "stop" to halt a running Claude pane.
10. Serenity "voice" is content style, not audio. voice_spec/voice_score/voice_gate govern writing tone, not speech.

7. Existing Integration Points (Live Wires)

Integration	What It Does
VoiceRouter (iOS)	Intent → ThreadCategory (35 intents)
FleetVoiceRouter (iOS)	Fleet command + pronoun resolution
voice_task_daemon.py	Supabase mac_tasks → TTY injection
Aura gateway :8095 /dispatch	Phone → mesh command
Clawdbot gateway :18789	Universal agent endpoint
NUMU WebSocket :7890	Cross-service event bus
Mesh Event Bus :8600	Cross-machine propagation
Pane injector	Terminal TTY injection
ElevenLabs API	TTS (voice ID configured)
GeminiLive WebSocket	Real-time audio+video
SpeakerIDService	Voiceprint enrollment
Mac4/Mac5 exo cluster	Local compute for Whisper
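
For any Python-side component that needs to reach these live wires, the ports in the table could be collected in one small registry. The sketch below is only a convenience illustration. The ports and the `/dispatch` path come from the table; the short names, the URL schemes (in particular `http` for the Mesh Event Bus), and the host passed to `url()` are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Endpoint:
    name: str
    port: int
    scheme: str  # assumed; the table lists ports, not protocols
    path: str = ""

    def url(self, host: str) -> str:
        return f"{self.scheme}://{host}:{self.port}{self.path}"


ENDPOINTS = {
    e.name: e
    for e in (
        Endpoint("aura", 8095, "http", "/dispatch"),  # phone -> mesh command
        Endpoint("clawdbot", 18789, "http"),          # universal agent endpoint
        Endpoint("numu", 7890, "ws"),                 # cross-service event bus
        Endpoint("mesh_bus", 8600, "http"),           # cross-machine propagation
    )
}
```

Calling `ENDPOINTS["aura"].url(host)` with whichever host runs the gateway gives the dispatch URL without scattering port numbers through the code.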

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage0-research.md