Stage 2: Compound Architecture — Voice-First Agent Architecture
> Sequential synthesis: each step inherits all prior context.
Step 1: Mac Ear Daemon + Unified Router (inherits Stage 0 + Paths A + C)
Decision: Path A's Mac Ear is the entry point — you need to be able to talk to the mesh from the desk. But Path C's unified router is needed immediately because we don't want a third intent classifier (Mac) alongside iOS VoiceRouter and FleetVoiceRouter. Build the server-side router first, then point the Mac Ear at it.
Implementation:
Server-side Unified Voice Router (Python, on Mac1 or cloud-vm):
    # [home-path]
    class UnifiedVoiceRouter:
        # Merge all 3 intent taxonomies:
        # - VoiceRouter: 35 project intents
        # - FleetVoiceRouter: fleet control intents
        # - SpeakFlow: system commands
        # Total: ~55 unique intents

        def classify(self, transcript: str, context: VoiceContext) -> Intent:
            # 1. Exact keyword match (from FleetVoiceRouter patterns)
            # 2. Fuzzy keyword match (from VoiceRouter patterns)
            # 3. Embedding similarity fallback (Gemini embedding vs intent descriptions)
            # 4. Pronoun resolution from context
            # 5. Destructive command flagging
            ...

Mac Ear Daemon (Python, LaunchAgent on Mac1):
    # [home-path]
    # - sounddevice for audio capture (16kHz mono)
    # - WebRTC VAD for speech detection
    # - mlx-whisper for transcription (local, no API call)
    # - POST to voice-router /classify
    # - Dispatch result to NUMU event bus

Client migration path:
- Phase 1: Mac Ear uses unified router directly
- Phase 2: OpenClawHub switches from local VoiceRouter/FleetVoiceRouter to HTTP POST
- Phase 3: SpeakFlow switches from local VoiceCommandService to HTTP POST
- Phase 4: Decommission client-side classifiers (keep as offline fallback)
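The classification cascade outlined for the unified router can be sketched with the standard library alone. This is a minimal illustration, not the router itself: the phrase table, the `DESTRUCTIVE_INTENTS` set, the `last_target` argument, and the 0.7 threshold are all assumptions made for the example, and the embedding fallback (step 3) is left as a placeholder.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional

# Hypothetical phrase table; the merged router would carry ~55 intents.
INTENT_PHRASES = {
    "fleet.status": ["fleet status", "show fleet"],
    "project.build": ["build project", "run build"],
    "pane.kill": ["kill pane", "close pane"],
}
DESTRUCTIVE_INTENTS = {"pane.kill"}  # assumed: intents that need confirmation
PRONOUNS = {"it", "that", "this"}


@dataclass
class Intent:
    name: str
    confidence: float
    target: Optional[str] = None
    destructive: bool = False


def classify(transcript: str, last_target: Optional[str] = None,
             fuzzy_threshold: float = 0.7) -> Intent:
    text = transcript.lower().strip()
    best_name, best_score = "unknown", 0.0
    for name, phrases in INTENT_PHRASES.items():
        for phrase in phrases:
            # Steps 1-2: an exact substring match scores 1.0,
            # otherwise fall back to a fuzzy similarity ratio.
            if phrase in text:
                score = 1.0
            else:
                score = SequenceMatcher(None, phrase, text).ratio()
            if score > best_score:
                best_name, best_score = name, score
    if best_score < fuzzy_threshold:
        # Step 3 (embedding similarity) would run here; omitted in this sketch.
        return Intent("unknown", best_score)
    # Step 4: a pronoun in the transcript resolves to the last known target.
    target = last_target if PRONOUNS & set(text.split()) else None
    # Step 5: flag destructive intents so the caller can ask for confirmation.
    return Intent(best_name, best_score, target, best_name in DESTRUCTIVE_INTENTS)
```

Comparing the whole transcript against each phrase is a crude form of fuzzy matching; a real router would likely match per keyword, but the ordering of the stages is the point here.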
Step 2: Mesh Voice Output (inherits Step 1 + Path B)
Decision: Path B's spoken mesh is essential for bidirectional voice. After Step 1 gives the mesh an ear, it needs a mouth. But Path B's approach of subscribing to ALL events is too noisy. Use a priority filter.
Implementation:
Voice Output Service (Python, LaunchAgent on Mac1):
    # [home-path]
    class MeshVoice:
        def __init__(self):
            self.numu = NumuClient("ws://localhost:7890")
            self.tts = DualTTS(
                critical=ElevenLabsTTS(voice_id="TmSgyk1vGAD9YzdtJV3V"),
                routine=SystemTTS(voice="Samantha"),
            )
            self.muted = False
            self.focus_mode = False

        SPOKEN_EVENTS = {
            # Critical: always spoken (ElevenLabs)
            "security.alert": ("critical", "Security alert: {message}"),
            "build.failure": ("critical", "{app} build failed."),
            "pane.absorbing": ("critical", "Pane {pane} is absorbing. Intervention needed."),
            # Normal: spoken unless muted (system TTS)
            "pulse.complete": ("normal", "Pulse complete on {project}."),
            "build.success": ("normal", "{app} build succeeded."),
            "flow.complete": ("normal", "Flow {name} completed."),
            # Low: only spoken when explicitly asked for status
            "pane.spawn": ("low", "Pane spawned for {project}."),
            "ew.mutation": ("low", "Evolution mutation on {target}."),
        }

Mute/unmute via voice:
- "Quiet" / "mute" → suppresses normal+low events
- "Unmute" / "speak" → resumes
- Critical events always spoken regardless of mute state
- "What did I miss?" → summarize suppressed events
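The mute rules above amount to a small priority gate in front of the TTS engines. The sketch below is one way to express it; `SpeechGate` and its method names are hypothetical, the event table is a subset of the one in the source, and it assumes every unspoken event (muted normal events and all low events) is queued for the "What did I miss?" summary.

```python
from typing import Optional

# Subset of the SPOKEN_EVENTS table from the source.
SPOKEN_EVENTS = {
    "build.failure": ("critical", "{app} build failed."),
    "build.success": ("normal", "{app} build succeeded."),
    "pane.spawn": ("low", "Pane spawned for {project}."),
}
MUTE_WORDS = {"quiet", "mute"}
UNMUTE_WORDS = {"unmute", "speak"}


class SpeechGate:
    """Decides whether an event is spoken now or held back (hypothetical helper)."""

    def __init__(self) -> None:
        self.muted = False
        self.missed = []

    def handle_command(self, phrase: str) -> None:
        word = phrase.lower().strip()
        if word in MUTE_WORDS:
            self.muted = True
        elif word in UNMUTE_WORDS:
            self.muted = False

    def on_event(self, event_type: str, payload: dict) -> Optional[str]:
        """Return the text to speak, or None if the event is suppressed."""
        entry = SPOKEN_EVENTS.get(event_type)
        if entry is None:
            return None
        priority, template = entry
        message = template.format(**payload)
        # Critical events ignore the mute flag; normal events respect it.
        if priority == "critical" or (priority == "normal" and not self.muted):
            return message
        # Assumption: anything not spoken is kept for a later summary.
        self.missed.append(message)
        return None

    def what_did_i_miss(self) -> str:
        summary = " ".join(self.missed) or "Nothing missed."
        self.missed.clear()
        return summary
```

Keeping the gate separate from the TTS backends means the same decision logic can be exercised without audio hardware.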
Step 3: Voice Memory (inherits Steps 1-2 + Path D)
Decision: Path D's voice transcript persistence is critical for the mesh's memory model. Voice is currently the only interaction channel that isn't persisted. This creates a gap in the knowledge graph.
Implementation:
Transcript persistence (automatic, every voice session):
1. Mac Ear daemon captures transcript segments
2. Each segment stored in Supabase `voice_transcripts` table
3. Gemini embedding generated for each segment
4. Segment ingested into RAG++ (project: "voice-sessions")
5. Graph Kernel entity extraction runs on accumulated transcripts (hourly batch)
Cross-modal continuity:
- When a terminal Claude session starts, the session_start_hook queries voice_transcripts from the last 24h
- Relevant voice context appears in the smart gateway injection
- "We discussed this earlier" works across voice and text
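As a rough illustration of the persistence step and the 24-hour lookback, the sketch below uses an in-memory SQLite table as a stand-in for the Supabase `voice_transcripts` table. The column names, the `store_segment` and `recent_segments` helpers, and the use of Unix timestamps are assumptions; embedding generation, RAG++ ingestion, and entity extraction are not shown.

```python
import sqlite3
import time

# Hypothetical schema; the real Supabase table may use different columns.
SCHEMA = """
CREATE TABLE IF NOT EXISTS voice_transcripts (
    id INTEGER PRIMARY KEY,
    session_id TEXT NOT NULL,
    text TEXT NOT NULL,
    created_at REAL NOT NULL  -- Unix timestamp in seconds
)
"""


def store_segment(conn, session_id, text, created_at=None):
    """Persist one transcript segment."""
    stamp = time.time() if created_at is None else created_at
    conn.execute(
        "INSERT INTO voice_transcripts (session_id, text, created_at) VALUES (?, ?, ?)",
        (session_id, text, stamp),
    )
    conn.commit()


def recent_segments(conn, now=None, window_hours=24.0):
    """Return segment texts from the lookback window, oldest first."""
    now = time.time() if now is None else now
    cutoff = now - window_hours * 3600
    rows = conn.execute(
        "SELECT text FROM voice_transcripts WHERE created_at >= ? ORDER BY created_at",
        (cutoff,),
    ).fetchall()
    return [row[0] for row in rows]
```

A session start hook could call `recent_segments` and pass the result to whatever relevance filter feeds the gateway injection.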
Voice session summaries:
- At end of voice session (5 min silence or explicit "end session"):
  - Claude Haiku summarizes the session
  - Summary posted to Obsidian vault (Daily/voice-session-{timestamp}.md)
  - Action items extracted and appended to active-tasks.md
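The session boundary rule (five minutes of silence or an explicit "end session") can be sketched as a small state holder. `VoiceSession` and `summary_note_path` are hypothetical names, timestamps are plain seconds, and the timestamp format in the note path is not specified by the source, so it is passed in as an opaque string.

```python
SILENCE_TIMEOUT_S = 5 * 60  # from the source: 5 minutes of silence
END_PHRASE = "end session"


class VoiceSession:
    """Tracks segments and decides when a voice session is over."""

    def __init__(self, started_at: float) -> None:
        self.last_speech_at = started_at
        self.segments = []
        self.ended_explicitly = False

    def add_segment(self, text: str, at: float) -> None:
        self.last_speech_at = at
        if END_PHRASE in text.lower():
            # The closing command itself is not kept as session content.
            self.ended_explicitly = True
        else:
            self.segments.append(text)

    def is_over(self, now: float) -> bool:
        silent_for = now - self.last_speech_at
        return self.ended_explicitly or silent_for >= SILENCE_TIMEOUT_S


def summary_note_path(timestamp: str) -> str:
    """Vault-relative path for the session summary note."""
    return f"Daily/voice-session-{timestamp}.md"
```

Once `is_over` returns true, the accumulated `segments` would be handed to the summarizer and the result written to the path above.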
Step 4: Ambient Intelligence Integration (inherits Steps 1-3 + Path E)
Decision: Path E's Gemini Live integration is the most forward-looking component. Wire it in, but with strong privacy controls and a conservative default.
Implementation:
Gemini Live → NUMU bridge (in OpenClawHub):
- GeminiObserverFeature already produces context hints
- New: emit hints to NUMU event bus as `gemini.hint` events
- Mac Ear daemon receives hints and uses them to enrich voice responses
Conservative defaults:
- Gemini Live observer is opt-in (default OFF)
- Only activates when user says "observe" or "ambient mode"
- Auto-deactivates after 30 minutes (battery/privacy)
- Hints are used for context enrichment only — never spoken aloud unprompted
- No audio/video stored — only text hints persisted
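The opt-in and auto-deactivation defaults can be captured in a small gate that the bridge consults before emitting hints. This is a sketch under assumptions: `ObserverGate` is a hypothetical name, activation uses naive substring matching on the transcript, and time is passed in explicitly as seconds.

```python
OBSERVE_TIMEOUT_S = 30 * 60  # from the source: auto-deactivate after 30 minutes
ACTIVATION_PHRASES = ("observe", "ambient mode")


class ObserverGate:
    """Default-off switch for the ambient observer."""

    def __init__(self) -> None:
        self.activated_at = None  # None means the observer is off

    def on_transcript(self, transcript: str, now: float) -> None:
        text = transcript.lower()
        if any(phrase in text for phrase in ACTIVATION_PHRASES):
            self.activated_at = now

    def is_active(self, now: float) -> bool:
        if self.activated_at is None:
            return False
        if now - self.activated_at >= OBSERVE_TIMEOUT_S:
            # Timed out: switch off until the user opts in again.
            self.activated_at = None
            return False
        return True
```

Because the gate starts with `activated_at = None`, a freshly started daemon never observes until the user explicitly asks for it.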
Integration with voice router:
When a voice command is ambiguous, Gemini context disambiguates:
    Voice: "Fix that" (ambiguous: fix what?)
    Gemini hint: "User is looking at Xcode with SecuriClaw open, build error visible"
    Router: intent = project.fix(target="SecuriClaw", context="build error")

Step 5: Multi-Node Voice (inherits Steps 1-4 + Path F, deferred)
Decision: Path F's distributed voice mesh is architecturally correct but premature. Mac4 and Mac5 are headless compute nodes without microphones. Cloud-vm is remote. Only Mac1 needs voice hardware today. But the NUMU event types from Path F should be defined now for future expansion.
Deferred implementation, but define the protocol:
    # NUMU voice event types (reserve now, implement when needed)
    VOICE_EVENTS = [
        "voice.command",     # Voice command detected
        "voice.speak",       # Request TTS on a node
        "voice.relay",       # Forward voice between nodes
        "voice.mute",        # Suppress voice on a node
        "voice.unmute",      # Resume voice on a node
        "voice.transcript",  # New transcript segment
        "voice.status",      # Voice daemon health
        "gemini.hint",       # Ambient intelligence hint
    ]

Current scope: Mac1 only. Voice events emitted to NUMU with `machine: "mac1"`. When Mac4/Mac5 get audio hardware (USB mic), adding them is just deploying the daemon and changing the machine ID.
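To show how the reserved event types and the `machine` field might fit together, here is a minimal sketch of an event envelope. Only the type names and the `machine` field come from the text above; the `make_voice_event` helper and the `type`, `ts`, and `payload` keys are assumptions, since the NUMU wire format is not specified here.

```python
import json
import time

VOICE_EVENTS = [
    "voice.command", "voice.speak", "voice.relay", "voice.mute",
    "voice.unmute", "voice.transcript", "voice.status", "gemini.hint",
]


def make_voice_event(event_type, payload, machine="mac1", ts=None):
    """Build a JSON-encoded voice event, rejecting unreserved types."""
    if event_type not in VOICE_EVENTS:
        raise ValueError(f"unknown voice event type: {event_type}")
    envelope = {
        "type": event_type,
        "machine": machine,  # the only field that changes per node
        "ts": time.time() if ts is None else ts,
        "payload": payload,
    }
    return json.dumps(envelope)
```

With this shape, bringing up a second node would only require passing a different `machine` value, which matches the expansion path described above.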
Final Compound Architecture
    ┌─────────────────────────────────────────────────────┐
    │                     VOICE LAYER                     │
    │                                                     │
    │  ┌──────────┐  ┌──────────┐  ┌────────────────┐     │
    │  │ Mac Ear  │  │ iPhone   │  │ SpeakFlow Mac  │     │
    │  │ (daemon) │  │ (OCH)    │  │ (hotkey)       │     │
    │  │ Whisper  │  │ Apple STT│  │ Apple STT      │     │
    │  └────┬─────┘  └────┬─────┘  └───────┬────────┘     │
    │       │             │                │              │
    │       └─────────────┼────────────────┘              │
    │                     │                               │
    │             ┌───────▼────────┐                      │
    │             │ Unified Voice  │  HTTP POST /classify │
    │             │ Router (55     │                      │
    │             │ intents)       │                      │
    │             └───────┬────────┘                      │
    │                     │                               │
    │        ┌────────────┼────────────┐                  │
    │        │            │            │                  │
    │ ┌──────▼──┐  ┌──────▼──┐   ┌─────▼──────┐           │
    │ │ Fleet   │  │ Project │   │ System     │           │
    │ │ Control │  │ Tasks   │   │ Commands   │           │
    │ └────┬────┘  └────┬────┘   └─────┬──────┘           │
    │      │            │              │                  │
    │      └────────────┼──────────────┘                  │
    │                   │                                 │
    │           ┌───────▼────────┐                        │
    │           │ NUMU Event Bus │  voice.command events  │
    │           └───────┬────────┘                        │
    │                   │                                 │
    │     ┌─────────────┼───────────────┐                 │
    │     │             │               │                 │
    │ ┌───▼────┐ ┌──────▼──┐    ┌───────▼──────┐          │
    │ │ Pane   │ │ Voice   │    │ Voice        │          │
    │ │Injector│ │ Memory  │    │ Output (TTS) │          │
    │ │ (TTY)  │ │ (Supa+  │    │ ElevenLabs + │          │
    │ │        │ │  RAG++) │    │ system say   │          │
    │ └────────┘ └─────────┘    └──────────────┘          │
    │                                                     │
    │  ┌──────────────────────────────────────────┐       │
    │  │ Ambient Intelligence (opt-in)            │       │
    │  │ Gemini Live → gemini.hint → context      │       │
    │  └──────────────────────────────────────────┘       │
    └─────────────────────────────────────────────────────┘

Code changes estimate:
- Unified Voice Router: ~250 lines (Python, server-side)
- Mac Ear Daemon: ~200 lines (Python, LaunchAgent)
- Mesh Voice Output: ~150 lines (Python, NUMU subscriber + TTS)
- Voice Memory: ~100 lines (Supabase table + RAG++ ingestion)
- Gemini Live NUMU bridge: ~50 lines (Swift, iOS)
- NUMU voice event types: ~30 lines (protocol definition)
- Total: ~780 lines of new code
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
evo-cube-output/voice-first-agent-architecture/stage2-compound.md