Stage 3: Expand + Master Plan — Voice-First Agent Architecture
3a. AUDIT
What Holds Strong
1. The unified router eliminates the triple-classifier problem. Having three intent classifiers with incompatible taxonomies is the root cause of inconsistent voice behavior across devices. One server-side router, shared by all clients, fixes this permanently. The ~55 merged intents cover all existing use cases.
2. Mac Ear Daemon is the single biggest UX improvement. Eliminating the phone dependency for voice interaction changes the relationship with the mesh. Walk to the desk, say "status", get a spoken briefing. No phone required. mlx-whisper on M2 handles transcription locally with no API cost.
3. Voice memory closes the biggest persistence gap. Every other interaction channel (text prompts, Discord, code, Obsidian) is persisted. Voice isn't. Storing transcripts in Supabase + RAG++ means "we discussed this earlier" works across modalities. This is a force multiplier for the entire knowledge system.
4. The ElevenLabs integration is already production-ready. Voice ID configured, API key active, streaming playback works in iOS. Extending this to Mac1 TTS is ~20 lines of Python (HTTP POST + audio playback).
What Breaks Under Pressure
1. Whisper on Mac1 competes for compute.
Mac1 is already running: 7 LaunchAgents, Xcode builds, SSH tunnels, the pane orchestrator, and terminal Claude sessions. Adding continuous audio capture + Whisper inference adds CPU/memory pressure.
Mitigation: VAD is ultra-lightweight (~1
2. NUMU event bus is single-threaded for voice.
Voice events (command, speak, transcript, hint) add to NUMU's event load. The bus was designed for periodic events (spawns, completions), not real-time audio-adjacent traffic.
Mitigation: Voice events are text-only (transcripts, not audio bytes). Each voice command generates ~3 events (command, dispatch, response). At 10 voice commands/hour, this adds ~30 events/hour — negligible vs. existing NUMU traffic. Audio bytes never touch NUMU — they stay local to the capturing daemon.
3. ElevenLabs has latency and cost.
ElevenLabs TTS: ~500ms first-byte, ~$0.30/1000 chars. For critical alerts this is fine. For routine status updates, 500ms latency is acceptable but the cost adds up.
Mitigation: Use system `say` for routine events (instant, free). Reserve ElevenLabs for: agent responses to voice questions, critical alerts, and "personality" moments. Estimated monthly cost: ~$5-10 (conservative usage).
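The split between `say` and ElevenLabs can be written as a small policy function. This is a minimal sketch, not the daemon's actual code: the event-kind labels, `choose_tts_backend`, and `estimate_monthly_cost` are hypothetical names, and the price constant only restates the ~$0.30/1000 chars figure quoted above.

```python
# Hypothetical sketch of the TTS routing policy; all names here are assumptions.
ELEVENLABS_USD_PER_1000_CHARS = 0.30  # figure quoted in this section

# Event kinds that justify the paid, higher-quality voice.
PREMIUM_KINDS = {"agent_response", "critical_alert", "personality"}


def choose_tts_backend(kind: str) -> str:
    """Return "elevenlabs" for premium event kinds, otherwise the system `say`."""
    return "elevenlabs" if kind in PREMIUM_KINDS else "say"


def estimate_monthly_cost(chars_per_day: int, days: int = 30) -> float:
    """Estimate ElevenLabs spend in USD for a given daily character volume."""
    return chars_per_day * days / 1000 * ELEVENLABS_USD_PER_1000_CHARS
```

Under these assumptions, about 1,000 premium characters per day works out to roughly $9 per month, which sits inside the estimated range.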
4. Speaker ID doesn't transfer to Mac.
The MFCC voiceprint is in iOS UserDefaults. The Mac Ear Daemon has no speaker authentication. Anyone near the Mac mic can issue commands.
Mitigation (Phase 1): Mac Ear requires wake phrase only — physical proximity is the auth factor. Phase 2: Port SpeakerIDService to Python (MFCC + cosine similarity is ~50 lines). Enroll from Mac mic. Store voiceprint in Supabase for cross-device sync.
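The Phase 2 comparison step could look roughly like the sketch below. It assumes the voiceprint is a fixed-length vector (for example, averaged MFCC coefficients) and leaves MFCC extraction out entirely. The function names and the 0.85 threshold are illustrative assumptions, not values taken from SpeakerIDService.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("voiceprints must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector carries no speaker information
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(
    candidate: list[float], enrolled: list[float], threshold: float = 0.85
) -> bool:
    """Accept the speaker when similarity reaches the (assumed) threshold."""
    return cosine_similarity(candidate, enrolled) >= threshold
```

The threshold would need tuning against real enrollments from the Mac mic, since room acoustics differ from the phone.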
What's Missing
1. Voice latency budget.
End-to-end voice command latency: audio capture (~0ms) + VAD detection (~200ms silence) + Whisper transcription (~1-2s) + router classification (~50ms) + dispatch (~100ms) + action execution (variable) + TTS generation (~500ms ElevenLabs, ~0ms `say`) + audio playback (~0ms).
Total for system `say`: ~1.5-2.5s. Total for ElevenLabs: ~2-3s.
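The totals can be checked by summing the fixed stages. The sketch below copies the per-stage estimates from the budget above and deliberately excludes action execution, which is listed as variable. The dictionary layout and `latency_range` are illustrative names, not part of the system.

```python
# Stage estimates in seconds as (low, high), copied from the budget above.
# Action execution is omitted because the budget lists it as variable.
STAGES = {
    "vad_silence": (0.2, 0.2),
    "whisper": (1.0, 2.0),
    "router": (0.05, 0.05),
    "dispatch": (0.1, 0.1),
}
TTS = {"say": (0.0, 0.0), "elevenlabs": (0.5, 0.5)}


def latency_range(tts_backend: str) -> tuple[float, float]:
    """Return the (low, high) end-to-end estimate for one TTS backend."""
    tts_low, tts_high = TTS[tts_backend]
    low = sum(lo for lo, _ in STAGES.values()) + tts_low
    high = sum(hi for _, hi in STAGES.values()) + tts_high
    return round(low, 2), round(high, 2)
```

The fixed stages sum to about 1.35-2.35s with `say` and 1.85-2.85s with ElevenLabs, so the rounded totals leave a small allowance for action execution.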
Solution: This is acceptable for command-response but not for conversation. For multi-turn voice dialog, use streaming: start TTS playback of the first sentence while generating the rest. The iOS app already does this with ElevenLabs streaming.
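Sentence-level streaming starts by cutting the response into sentences so the first one can be handed to TTS right away. The sketch below shows only that chunking step, under the assumption that a simple punctuation split is good enough. `sentences` and `speak_streaming` are hypothetical helpers, and overlapping synthesis with playback would still need a queue and a worker.

```python
import re
from typing import Callable, Iterator

# Split after sentence-ending punctuation followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def sentences(text: str) -> Iterator[str]:
    """Yield sentences one at a time so playback can begin with the first."""
    for part in _SENTENCE_END.split(text.strip()):
        if part:
            yield part


def speak_streaming(text: str, speak: Callable[[str], None]) -> int:
    """Pass each sentence to speak() in order; return how many were spoken."""
    count = 0
    for sentence in sentences(text):
        speak(sentence)
        count += 1
    return count
```

Because `sentences` is a generator, the first chunk is available before the rest of the text has been scanned.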
2. Error handling for voice commands.
What happens when a voice command is misclassified? Misclassifying "Spawn Spore" as "kill Spore" would be catastrophic.
Solution: Destructive commands (kill, inject, diverge) always require spoken confirmation:
User: "Kill the Spore pane"
Mesh: "Confirm: kill Spore pane? Say 'yes' to proceed."
User: "Yes" → execute
User: [anything else or silence] → cancel
FleetVoiceRouter already has `isDestructive` flags; port these to the unified router.
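The confirmation exchange can be modeled as a small two-step gate. This is a sketch under stated assumptions: `ConfirmationGate` and `PendingCommand` are hypothetical names, the destructive set mirrors the kill/inject/diverge list above, and only a spoken "yes" counts as confirmation.

```python
from __future__ import annotations

from dataclasses import dataclass

# Actions that must be confirmed, per the list above (names are assumed).
DESTRUCTIVE_ACTIONS = {"fleet.kill", "fleet.inject", "fleet.diverge"}


@dataclass
class PendingCommand:
    action: str
    target: str


class ConfirmationGate:
    """Holds one destructive command until the user confirms or cancels."""

    def __init__(self) -> None:
        self.pending: PendingCommand | None = None

    def submit(self, action: str, target: str) -> PendingCommand | None:
        """Return the command if it may run now; otherwise hold it and return None."""
        command = PendingCommand(action, target)
        if action in DESTRUCTIVE_ACTIONS:
            self.pending = command
            return None
        return command

    def reply(self, transcript: str | None) -> PendingCommand | None:
        """Resolve the held command: "yes" releases it, anything else cancels."""
        command, self.pending = self.pending, None
        if command and transcript and transcript.strip().lower().rstrip(".!") == "yes":
            return command
        return None
```

Clearing `pending` on every reply means silence or an unrelated utterance cancels the command, so a stale confirmation cannot fire later.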
3. Offline fallback.
If the server-side unified router is down, voice commands fail.
Solution: Ship a minimal local fallback classifier in both the Mac Ear daemon and iOS apps. It handles: mute/unmute, status, help, and "route to Clawdbot" for everything else. Full classification resumes when the router comes back.
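A local fallback of this kind can be very small. The sketch below is one possible shape, not the shipped classifier: `classify_offline` and the `system.help` action name are assumptions, while the other action names follow the ones used elsewhere in this plan. It matches whole words so that "unmute" is not mistaken for "mute".

```python
import re

# Minimal offline intent table; "system.help" is a hypothetical action name.
FALLBACK_COMMANDS = {
    "system.unmute": ("unmute",),
    "system.mute": ("mute",),
    "fleet.status": ("status",),
    "system.help": ("help",),
}


def classify_offline(transcript: str) -> str:
    """Classify with whole-word keywords; everything else goes to Clawdbot."""
    words = set(re.findall(r"[a-z']+", transcript.lower()))
    for action, keywords in FALLBACK_COMMANDS.items():
        if words & set(keywords):
            return action
    return "clawdbot.query"
```

Anything routed to `clawdbot.query` while offline can be queued or answered with a spoken "router unavailable" message, depending on how the client handles the outage.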
3b. EXPAND — Implementation Specs
Subsystem 1: Unified Voice Router Service
File: `[home-path]` (~250 lines)
```python
import re

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class VoiceCommand(BaseModel):
    transcript: str
    session_id: str = ""
    device: str = "mac"
    speaker_embedding: list[float] | None = None


class Intent(BaseModel):
    action: str  # "fleet.status", "project.task", "system.mute", etc.
    target: str | None = None  # optional: system intents have no target
    params: dict = {}
    is_destructive: bool = False
    spoken_response: str
    confidence: float


# Intent patterns (merged from VoiceRouter + FleetVoiceRouter + SpeakFlow).
# Each pattern is a regular expression matched on word boundaries.
FLEET_PATTERNS = {
    "fleet.status": ["status", "what's running", "fleet health", "how are things"],
    "fleet.spawn": ["spawn", "start a pane", "open", "create"],
    "fleet.kill": ["kill", "stop", "close", "terminate", "shut down"],
    "fleet.inject": ["inject", "send to", "tell .* to"],
    "fleet.converge": ["converge", "sync up", "bring together"],
    "fleet.focus": ["focus on", "switch to", "go to"],
    "fleet.unstick": ["unstick", "unblock", "fix absorbing"],
}

SYSTEM_PATTERNS = {
    "system.mute": ["quiet", "mute", "shut up", "silence"],
    "system.unmute": ["unmute", "speak", "resume voice"],
    "system.repeat": ["repeat", "say that again", "what did you say"],
    "system.summary": ["what did i miss", "summary", "catch me up"],
}

# Project keywords (from VoiceRouter's 35-intent map)
PROJECT_KEYWORDS = {
    "spore": "Spore", "koji": "Koatji", "securiclaw": "SecuriClaw",
    "creative director": "CreativeDirector", "serenity": "Serenity Soother",
    "nko": "LearnNKo", "speakflow": "SpeakFlow", "nexus": "Nexus Portal",
    # ... remaining projects
}

# Intents that require spoken confirmation before execution
DESTRUCTIVE_INTENTS = {"fleet.kill", "fleet.inject", "fleet.diverge"}


def matches(pattern: str, transcript: str) -> bool:
    """Match on word boundaries so that "mute" does not fire on "unmute"."""
    return re.search(rf"\b{pattern}\b", transcript) is not None


# Illustrative placeholders: the full helpers are not shown in this excerpt.
def extract_target(transcript: str, pattern: str) -> str | None:
    """Treat the first known project keyword in the transcript as the target."""
    for keyword, project in PROJECT_KEYWORDS.items():
        if keyword in transcript:
            return project
    return None


def generate_response(intent: str, target: str | None = None) -> str:
    """Build a short spoken acknowledgement from the intent name."""
    verb = intent.split(".", 1)[-1]
    return f"{verb} {target}." if target else f"{verb}."


@app.post("/classify")
async def classify_voice(cmd: VoiceCommand) -> Intent:
    transcript = cmd.transcript.lower().strip()

    # 1. Fleet pattern matching
    for intent, patterns in FLEET_PATTERNS.items():
        for pattern in patterns:
            if matches(pattern, transcript):
                target = extract_target(transcript, pattern)
                return Intent(
                    action=intent, target=target,
                    is_destructive=intent in DESTRUCTIVE_INTENTS,
                    spoken_response=generate_response(intent, target),
                    confidence=0.9,
                )

    # 2. System commands
    for intent, patterns in SYSTEM_PATTERNS.items():
        for pattern in patterns:
            if matches(pattern, transcript):
                return Intent(action=intent, spoken_response=generate_response(intent), confidence=0.95)

    # 3. Project routing
    for keyword, project in PROJECT_KEYWORDS.items():
        if keyword in transcript:
            return Intent(
                action="project.task", target=project,
                params={"raw_command": cmd.transcript},
                spoken_response=f"Routing to {project}.",
                confidence=0.8,
            )

    # 4. Fallback: route to Clawdbot as free-form question
    return Intent(
        action="clawdbot.query",
        params={"raw_command": cmd.transcript},
        spoken_response="Let me think about that.",
        confidence=0.5,
    )
```

Subsystem 2: Mac Ear Daemon
File: `[home-path]` (~200 lines)
```python
import queue
import subprocess

import numpy as np
import requests
import sounddevice as sd
import whisper

WAKE_PHRASES = ["hey claw", "hey claude"]
SAMPLE_RATE = 16000
SILENCE_THRESHOLD = 0.01  # RMS energy below this counts as silence
SILENCE_DURATION = 1.5  # seconds of silence that end an utterance
ROUTER_URL = "http://localhost:8650/classify"


class MacEarDaemon:
    def __init__(self):
        self.model = whisper.load_model("base")  # or mlx_whisper
        self.buffer = []
        self.is_speaking = False
        self.silence_frames = 0
        # Finished utterances are handed to the main thread so that Whisper
        # never runs inside the real-time audio callback.
        self.utterances = queue.Queue()

    def audio_callback(self, indata, frames, time, status):
        """Called by sounddevice for each audio chunk."""
        energy = np.sqrt(np.mean(indata**2))
        if energy > SILENCE_THRESHOLD:
            self.is_speaking = True
            self.silence_frames = 0
            self.buffer.append(indata.copy())
        elif self.is_speaking:
            self.silence_frames += 1
            self.buffer.append(indata.copy())
            if self.silence_frames > int(SILENCE_DURATION * SAMPLE_RATE / frames):
                self.utterances.put(np.concatenate(self.buffer).flatten())
                self.buffer = []
                self.is_speaking = False
                self.silence_frames = 0

    def process_buffer(self, audio):
        """Transcribe one buffered utterance and check for a wake phrase."""
        result = self.model.transcribe(audio.astype(np.float32), language="en")
        transcript = result["text"].strip().lower()
        for phrase in WAKE_PHRASES:
            if phrase in transcript:
                # Drop the wake phrase and the punctuation Whisper adds around it.
                command = transcript.split(phrase, 1)[-1].strip(" ,.!?")
                if command:
                    self.dispatch_command(command)
                return

    def dispatch_command(self, command: str):
        """Send to unified voice router."""
        resp = requests.post(
            ROUTER_URL,
            json={"transcript": command, "device": "mac", "session_id": "mac-ear"},
            timeout=5,
        )
        resp.raise_for_status()
        intent = resp.json()
        # Speak response
        subprocess.run(["say", intent["spoken_response"]])
        # Dispatch action via NUMU (emit_numu is defined elsewhere in the daemon)
        self.emit_numu("voice.command", {
            "intent": intent["action"],
            "target": intent.get("target"),
            "transcript": command,
            "machine": "mac1",
        })

    def run(self):
        """Capture mono audio and transcribe utterances as they complete."""
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1, callback=self.audio_callback):
            while True:
                self.process_buffer(self.utterances.get())
```

Subsystem 3: Voice Memory Table
```sql
CREATE TABLE voice_transcripts (
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    session_id TEXT NOT NULL,
    speaker TEXT DEFAULT 'mohamed',
    role TEXT CHECK (role IN ('user', 'agent', 'system')),
    content TEXT NOT NULL,
    intent TEXT,
    device TEXT,
    embedding vector(768),
    created_at TIMESTAMPTZ DEFAULT now(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX idx_vt_session ON voice_transcripts(session_id);
CREATE INDEX idx_vt_created ON voice_transcripts(created_at DESC);
CREATE INDEX idx_vt_embedding ON voice_transcripts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 10);
```

Subsystem 4: Mesh Voice Output
File: `[home-path]` (~150 lines)
Subscribes to NUMU events, speaks critical/normal events via TTS. Supports mute/unmute. Queues events during mute for summary on unmute.
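The mute-and-summarize behavior described above could be sketched as follows. The NUMU subscription itself is not shown because its API is not specified here. `MeshVoiceOutput`, `on_event`, the severity labels, and the 50-event cap are all assumptions made for illustration.

```python
from collections import deque
from typing import Callable

SPOKEN_SEVERITIES = {"critical", "normal"}


class MeshVoiceOutput:
    """Speaks events via a TTS callable; queues them while muted."""

    def __init__(self, speak: Callable[[str], None], max_missed: int = 50) -> None:
        self.speak = speak
        self.muted = False
        # Bounded queue: the oldest missed events are dropped first.
        self.missed: deque[str] = deque(maxlen=max_missed)

    def on_event(self, severity: str, message: str) -> None:
        """Handle one bus event; other severities are ignored."""
        if severity not in SPOKEN_SEVERITIES:
            return
        if self.muted:
            self.missed.append(message)
        else:
            self.speak(message)

    def mute(self) -> None:
        self.muted = True

    def unmute(self) -> None:
        """Resume speaking and summarize whatever arrived while muted."""
        self.muted = False
        if self.missed:
            summary = "While muted: " + "; ".join(self.missed)
            self.missed.clear()
            self.speak(summary)
```

Passing `speak` in as a callable keeps the class independent of whether the backend is `say` or ElevenLabs.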
3c. MASTER PLAN
### Phase 1: Foundation (Days 1-7)
| # | Task | Automatable |
|---|------|-------------|
| 1 | Build Unified Voice Router (FastAPI, ~55 intents merged) | Yes |
| 2 | Deploy router on Mac1 (:8650) as LaunchAgent | Yes |
| 3 | Build Mac Ear Daemon (sounddevice + Whisper + VAD) | Yes |
| 4 | Test wake phrase detection + command extraction | Yes |
| 5 | Wire Mac Ear → Router → NUMU dispatch | Yes |
| 6 | Build Mesh Voice Output daemon (NUMU subscriber + TTS) | Yes |
| 7 | Define voice NUMU event types (7 new event types) | Yes |
Gate: "Hey Claw, status" from Mac1 mic → spoken fleet status response.
### Phase 2: Memory + Persistence (Days 7-14)
| # | Task | Automatable |
|---|------|-------------|
| 8 | Create `voice_transcripts` Supabase table + indexes | Yes |
| 9 | Store all voice transcripts (command + response) in Supabase | Yes |
| 10 | Generate Gemini embeddings for voice transcript segments | Yes |
| 11 | Ingest voice transcripts into RAG++ (project: "voice-sessions") | Yes |
| 12 | Add voice context to session_start_hook smart gateway query | Yes |
| 13 | Build voice session summarizer (end-of-session → Obsidian note) | Yes |
| 14 | Add 6 Prometheus metrics for voice system | Yes |
Gate: Voice transcripts searchable in RAG++. Terminal sessions see "recent voice context."
### Phase 3: iOS Migration (Days 14-21)
| # | Task | Automatable |
|---|------|-------------|
| 15 | OpenClawHub: Replace VoiceRouter with HTTP POST to unified router | Yes |
| 16 | OpenClawHub: Replace FleetVoiceRouter with HTTP POST to unified router | Yes |
| 17 | Add offline fallback classifier to OpenClawHub | Yes |
| 18 | SpeakFlow: Replace VoiceCommandService with HTTP POST + fallback | Yes |
| 19 | Wire Gemini Live hints to NUMU (gemini.hint events) | Yes |
| 20 | Test cross-device voice continuity (phone → mac → phone) | Semi |
Gate: All devices use unified router. Intent classification is consistent everywhere.
### Phase 4: Polish + Safety (Days 21-28)
| # | Task | Automatable |
|---|------|-------------|
| 21 | Implement destructive command confirmation flow (spoken yes/no) | Yes |
| 22 | Port SpeakerIDService to Python for Mac Ear authentication | Yes |
| 23 | Build "what did I miss" summary command | Yes |
| 24 | Build mute/unmute/quiet voice commands | Yes |
| 25 | Add Nexus Portal `/voice` dashboard (session history, intent distribution) | Yes |
| 26 | Tune VAD sensitivity, silence thresholds, wake phrase accuracy | Semi |
| 27 | Load test: 50 voice commands in 10 minutes, measure latency p95 | Yes |
Gate: Destructive commands require confirmation. Speaker auth works. Latency p95 < 3s.
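The latency part of this gate reduces to a percentile check over the load-test samples. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions. `percentile` and `passes_latency_gate` are hypothetical helper names, and the 3.0s default restates the gate above.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def passes_latency_gate(latencies_s: list[float], limit_s: float = 3.0) -> bool:
    """True when the p95 latency (in seconds) is below the gate limit."""
    return percentile(latencies_s, 95) < limit_s
```

With 50 samples the p95 is the 48th-fastest measurement, so two slow outliers do not fail the gate but three do.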
Summary
| Metric | Current | Target |
|---|---|---|
| Voice input devices | 2 (iPhone, Mac via SpeakFlow) | 3+ (Mac Ear, iPhone, SpeakFlow) |
| Intent classifiers | 3 (incompatible) | 1 (unified, server-side) |
| Voice memory persistence | 0% (ephemeral) | 100% |
| Mesh TTS output | 0 events spoken | Critical + normal events |
| Voice-to-fleet commands | iOS only | Any device |
| End-to-end latency (system TTS) | N/A | <2.5s p95 |
| End-to-end latency (ElevenLabs) | N/A | <3.5s p95 |
| Speaker authentication | iOS only | Mac + iOS |
| NUMU voice event types | 0 | 7 |
Total tasks: 27 | Automatable: 25 (2 semi-automatable) | Duration: 28 days | New code: ~780 lines
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
evo-cube-output/voice-first-agent-architecture/stage3-expand-master-plan.md