Stage 0: Research — Voice-First Agent Architecture
System Under Study
The mesh currently has 8 distinct voice subsystems spread across iOS apps, macOS services, and backend flows. They are architecturally isolated — no subsystem talks to another. The terminal agents (Claude panes, Prefect flows, Discord bots) communicate exclusively through text. Voice exists at the edge (phone, glasses) but doesn't penetrate the mesh core.
Fact Inventory
1. Eight Voice Subsystems
| # | System | Location | Voice Capability | State |
|---|---|---|---|---|
| 1 | OpenClawHub DirectVoice | iOS app | Full STT→intent→Clawdbot→TTS loop | WORKING |
| 2 | OpenClawHub Fleet Voice | iOS (Glasses Gateway) | Fleet control: status, resume, inject, kill, spawn | WORKING |
| 3 | OpenClawHub QuadView | iOS | Push-to-talk → pane delegation | UI complete |
| 4 | SpeakFlow | macOS + iOS keyboard | Global hotkey→STT→text injection + "hey claw" trigger | WORKING |
| 5 | SecuriClaw WakeWord | iOS | Continuous wake phrase detection | WORKING |
| 6 | Spore VoiceCapture | iOS | Voice→idea creation with keyword extraction | WORKING |
| 7 | Voice Task Daemon | Mac1 LaunchAgent | Supabase mac_tasks → TTY pane injection | ACTIVE |
| 8 | Transcription Intel | Prefect flow | Video transcripts → intelligence extraction | ACTIVE |
2. Voice Models/APIs in Use
| Model/API | Where | Purpose |
|---|---|---|
| Apple SFSpeechRecognizer (on-device) | All iOS/macOS apps | STT |
| ElevenLabs eleven_turbo_v2_5 | OpenClawHub (primary TTS) | High-quality agent speech |
| AVSpeechSynthesizer | SpeakFlow, OpenClawHub fallback | System TTS |
| Gemini 2.0 Flash Live (WebSocket) | OpenClawHub GeminiObserver | Ambient audio+video intelligence |
| OpenAI TTS (6 voices) | Speak CLI reader | File reading |
| macOS `say` + Edge TTS | LearnNKo | Language learning pronunciation |
| CoreML MohamedSpeakerID | OpenClawHub | Owner voice identification |
| Custom MFCC (vDSP FFT) | OpenClawHub | Speaker voiceprint matching |
3. Current Voice Data Flow
iPhone mic → Apple STT → VoiceRouter (35 intents) → Clawdbot → Claude → ElevenLabs → speaker
iPhone mic → FleetVoiceRouter (fleet intents) → Aura gateway → mesh dispatch
Phone/glasses → /quick → Supabase mac_tasks → voice_task_daemon → TTY injection
Video content → ab-browser reel_ingest → transcription_intel_pipeline → agent tasks
SpeakFlow hotkey → STT → AX text injection OR "hey claw" → Clawdbot HTTP
4. Intent Classification Systems
VoiceRouter (iOS): 35 intents mapping spoken keywords to ThreadCategory. Covers all projects (Koji, Milkmen, Spore, Serenity, CreativeDirector, CompCore, CogTwin, NKo, etc.). Also handles explicit channel routing.
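A Python counterpart of this keyword routing might look like the sketch below. The keyword table, the `route` function, and the `"general"` fallback are illustrative assumptions; the actual 35-intent mapping lives in the iOS VoiceRouter and is not reproduced in this note.

```python
import re

# Hypothetical keyword table: spoken keyword -> thread category.
# Only project names mentioned in this note are used; the real
# 35-intent table is not shown here.
KEYWORD_TO_CATEGORY = {
    "koji": "Koji",
    "milkmen": "Milkmen",
    "spore": "Spore",
    "serenity": "Serenity",
    "nko": "NKo",
}

DEFAULT_CATEGORY = "general"  # assumed fallback when no keyword matches


def route(transcript: str) -> str:
    """Return the category of the first known keyword in the transcript."""
    words = re.findall(r"[a-z0-9]+", transcript.lower())
    for word in words:
        if word in KEYWORD_TO_CATEGORY:
            return KEYWORD_TO_CATEGORY[word]
    return DEFAULT_CATEGORY
```

First-match-wins keeps the behavior predictable when a transcript names two projects; a real router may prefer scoring or explicit channel phrases instead.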
FleetVoiceRouter (iOS): Fleet-specific intents: `status`, `resume(target)`, `resumeAll`, `inject(target, command)`, `converge`, `diverge(prompt)`, `focus(target)`, `spawn(prompt)`, `kill(target)`, `fleetHealth`, `checkAbsorbing`, `unstickPane`, `checkFocus`, `checkTrajectory`. Has pronoun resolution (`lastTarget`) and destructive command confirmation (1.5s cancel window).
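The pronoun resolution and cancel window described above can be sketched in Python as follows. This is a minimal illustration, not the iOS implementation: the `FleetParser` class, the pronoun set, the phrase grammar, and the choice of `kill` as the only destructive intent are assumptions; only the intent names, `lastTarget`, and the 1.5 s window come from this note.

```python
import re
import threading
from dataclasses import dataclass
from typing import Callable, Optional

DESTRUCTIVE = {"kill"}         # assumed set of destructive intents
CANCEL_WINDOW_S = 1.5          # cancel window stated in the note
PRONOUNS = {"it", "that"}      # hypothetical pronoun set


@dataclass
class FleetCommand:
    intent: str
    target: Optional[str] = None


class FleetParser:
    """Parses a few targeted intents and remembers the last target."""

    def __init__(self) -> None:
        self.last_target: Optional[str] = None

    def parse(self, text: str) -> Optional[FleetCommand]:
        text = text.strip().lower()
        if text == "resume all":
            return FleetCommand("resumeAll")
        match = re.match(r"(resume|kill|focus)\s+(.+)$", text)
        if not match:
            return None
        intent, target = match.group(1), match.group(2).strip()
        if target in PRONOUNS:
            if self.last_target is None:
                return None  # nothing to resolve the pronoun against
            target = self.last_target
        else:
            self.last_target = target
        return FleetCommand(intent, target)


def dispatch(cmd: FleetCommand,
             execute: Callable[[FleetCommand], None],
             window_s: float = CANCEL_WINDOW_S) -> Optional[threading.Timer]:
    """Run safe commands immediately; delay destructive ones.

    For destructive commands the returned timer can be cancelled
    within the window to abort execution.
    """
    if cmd.intent not in DESTRUCTIVE:
        execute(cmd)
        return None
    timer = threading.Timer(window_s, execute, args=(cmd,))
    timer.start()
    return timer
```

Calling `cancel()` on the returned timer before the window elapses prevents the destructive command from running, which mirrors the spoken "cancel" behavior.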
SpeakFlow VoiceCommandService: ~15 commands: "delete that", "new line", "select all", "switch to NKo", "hey claw/claude [command]".
5. Speaker Identification
OpenClawHub SpeakerIDService (418 lines):
- Custom MFCC pipeline: vDSP FFT → 26-filter Mel filterbank → 13 MFCC coefficients → cosine similarity
- Optional CoreML model: `MohamedSpeakerID`
- Enrolled voiceprint in UserDefaults
- Used as "speaker gate" in always-on mode — unknown speakers filtered
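The pipeline above can be approximated in pure Python as a sketch. It follows the same stages (spectrum, 26-filter Mel filterbank, 13 coefficients, cosine similarity) but is not the SpeakerIDService code: a naive DFT stands in for the vDSP FFT, windowing and pre-emphasis are omitted, and the 16 kHz sample rate and 0.8 gate threshold are assumptions.

```python
import cmath
import math


def hz_to_mel(hz):
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Naive DFT power spectrum for bins 0..N/2 (stand-in for an FFT)."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                  for t, x in enumerate(frame))
        spectrum.append(abs(acc) ** 2 / n)
    return spectrum


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centers spaced evenly on the Mel scale."""
    top = hz_to_mel(sample_rate / 2)
    mels = [top * i / (n_filters + 1) for i in range(n_filters + 2)]
    bins = [int((n_fft + 1) * mel_to_hz(m) / sample_rate) for m in mels]
    bank = []
    for j in range(1, n_filters + 1):
        lo, mid, hi = bins[j - 1], bins[j], bins[j + 1]
        row = [0.0] * (n_fft // 2 + 1)
        for k in range(lo, mid):       # rising edge
            row[k] = (k - lo) / (mid - lo)
        for k in range(mid, hi):       # falling edge
            row[k] = (hi - k) / (hi - mid)
        bank.append(row)
    return bank


def mfcc(frame, sample_rate=16000, n_filters=26, n_coeffs=13):
    """Log Mel energies followed by a DCT-II, keeping n_coeffs terms."""
    spec = power_spectrum(frame)
    bank = mel_filterbank(n_filters, len(frame), sample_rate)
    log_e = [math.log(max(sum(w * p for w, p in zip(row, spec)), 1e-10))
             for row in bank]
    return [sum(e * math.cos(math.pi * c * (j + 0.5) / n_filters)
                for j, e in enumerate(log_e))
            for c in range(n_coeffs)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def is_owner(candidate, enrolled, threshold=0.8):
    """Speaker gate: threshold is a hypothetical value, not from the app."""
    return cosine_similarity(candidate, enrolled) >= threshold
```

In practice the enrolled voiceprint would be an average over many frames rather than a single frame, and the threshold would be tuned against recordings of other speakers.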
6. Critical Gaps
1. Terminal agents have NO voice I/O. Claude in a pane cannot speak or listen.
2. No mesh-level TTS. Agent task completion → text to Discord. No spoken announcements.
3. No Mac-side microphone listener. voice_task_daemon bridges phone→pane but Mac itself has no mic daemon. Cannot say "hey claw" at the desk without the phone.
4. SpeakFlow is not mesh-connected. Routes to Clawdbot but doesn't know pane states, task queues, or orchestrator.
5. Voice Router and Fleet Router are iOS-only. No Python equivalent for Mac/cloud.
6. Speaker ID is device-local. MFCC voiceprint in UserDefaults, not shared across mesh.
7. Gemini Live observer is siloed. Ambient audio+video context feeds OpenClawHub only, not Perception Mesh or Evolution World.
8. No voice memory. Voice conversations in DirectVoice use in-memory history, not written to Obsidian vault or knowledge graph.
9. No voice interruption. Cannot say "stop" to halt a running Claude pane.
10. Serenity "voice" is content style, not audio. voice_spec/voice_score/voice_gate govern writing tone, not speech.
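Gap 2 could be narrowed on the Mac side with the `say` command that LearnNKo already uses. The sketch below is one possible shape for a spoken completion announcement; `build_say_command` and `announce` are hypothetical helpers, and how a task-completion event would reach them is left open.

```python
import shutil
import subprocess
from typing import List, Optional


def build_say_command(text: str, voice: Optional[str] = None) -> List[str]:
    """Build the argument list for macOS `say`, optionally with -v VOICE."""
    cmd = ["say"]
    if voice:
        cmd += ["-v", voice]
    cmd.append(text)
    return cmd


def announce(text: str, voice: Optional[str] = None) -> bool:
    """Speak text aloud; return False when `say` is not installed."""
    if shutil.which("say") is None:
        return False
    subprocess.run(build_say_command(text, voice), check=True)
    return True
```

Passing the command as a list avoids shell quoting problems when task names contain spaces or quotes.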
7. Existing Integration Points (Live Wires)
| Integration | What It Does |
|---|---|
| VoiceRouter (iOS) | Intent → ThreadCategory (35 intents) |
| FleetVoiceRouter (iOS) | Fleet command + pronoun resolution |
| voice_task_daemon.py | Supabase mac_tasks → TTY injection |
| Aura gateway :8095 /dispatch | Phone → mesh command |
| Clawdbot gateway :18789 | Universal agent endpoint |
| NUMU WebSocket :7890 | Cross-service event bus |
| Mesh Event Bus :8600 | Cross-machine propagation |
| Pane injector | Terminal TTY injection |
| ElevenLabs API | TTS (voice ID configured) |
| GeminiLive WebSocket | Real-time audio+video |
| SpeakerIDService | Voiceprint enrollment |
| Mac4/Mac5 exo cluster | Local compute for Whisper |
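As an illustration of how a Mac-side component might reach the Aura gateway row above, the sketch below builds a dispatch request with the standard library. Only the port and path come from the table; the `localhost` host, the POST method, and the JSON fields `intent` and `target` are assumptions, since the gateway's schema is not documented in this note.

```python
import json
import urllib.request
from typing import Optional

# Port and path are from the integration table; the host is assumed.
AURA_DISPATCH_URL = "http://localhost:8095/dispatch"


def build_dispatch_request(intent: str,
                           target: Optional[str] = None,
                           url: str = AURA_DISPATCH_URL) -> urllib.request.Request:
    """Build (but do not send) a JSON POST for a fleet command."""
    payload = {"intent": intent, "target": target}  # hypothetical schema
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending would be a call to `urllib.request.urlopen(request, timeout=...)`; keeping construction separate makes the payload easy to inspect and test without a running gateway.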
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
evo-cube-output/voice-first-agent-architecture/stage0-research.md