Grand Diomande Research · Full HTML Reader

Stage 3: Expand + Master Plan — Voice-First Agent Architecture

Agents That Account for Themselves architecture technical paper candidate score 40 .md

3a. AUDIT

What Holds Strong

1. The unified router eliminates the triple-classifier problem. Running three intent classifiers with incompatible taxonomies is the root cause of inconsistent voice behavior across devices. One server-side router, shared by all clients, removes that divergence at the source. The ~55 merged intents cover all existing use cases.

2. Mac Ear Daemon is the single biggest UX improvement. Eliminating the phone dependency for voice interaction changes the relationship with the mesh. Walk to the desk, say "status", get a spoken briefing. No phone required. mlx-whisper on M2 handles transcription locally with no API cost.

3. Voice memory closes the biggest persistence gap. Every other interaction channel (text prompts, Discord, code, Obsidian) is persisted. Voice isn't. Storing transcripts in Supabase + RAG++ means "we discussed this earlier" works across modalities. This is a force multiplier for the entire knowledge system.

4. The ElevenLabs integration is already production-ready. Voice ID configured, API key active, streaming playback works in iOS. Extending this to Mac1 TTS is ~20 lines of Python (HTTP POST + audio playback).

What Breaks Under Pressure

1. Whisper on Mac1 competes for compute.
Mac1 is already running: 7 LaunchAgents, Xcode builds, SSH tunnels, the pane orchestrator, and terminal Claude sessions. Adding continuous audio capture + Whisper inference adds CPU/memory pressure.

Mitigation: VAD is ultra-lightweight (~1

2. NUMU event bus is single-threaded for voice.
Voice events (command, speak, transcript, hint) add to NUMU's event load. The bus was designed for periodic events (spawns, completions), not real-time audio-adjacent traffic.

Mitigation: Voice events are text-only (transcripts, not audio bytes). Each voice command generates ~3 events (command, dispatch, response). At 10 voice commands/hour, this adds ~30 events/hour — negligible vs. existing NUMU traffic. Audio bytes never touch NUMU — they stay local to the capturing daemon.

3. ElevenLabs has latency and cost.
ElevenLabs TTS: ~500ms first-byte, ~$0.30/1000 chars. For critical alerts this is fine. For routine status updates, 500ms latency is acceptable but the cost adds up.

Mitigation: Use system `say` for routine events (instant, free). Reserve ElevenLabs for: agent responses to voice questions, critical alerts, and "personality" moments. Estimated monthly cost: ~$5-10 (conservative usage).
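The routing rule above can be sketched as a small backend selector plus a cost check. This is an illustration only: the `EventKind` names and function names are assumptions, and the price constant is the ~$0.30/1000 chars figure quoted above.

```python
from enum import Enum

ELEVENLABS_USD_PER_1K_CHARS = 0.30  # figure quoted in the text above


class EventKind(Enum):
    # Hypothetical event categories, named after the cases listed above.
    ROUTINE = "routine"
    VOICE_ANSWER = "voice_answer"
    CRITICAL_ALERT = "critical_alert"
    PERSONALITY = "personality"


# Kinds that justify the ElevenLabs latency and cost.
PREMIUM_KINDS = {
    EventKind.VOICE_ANSWER,
    EventKind.CRITICAL_ALERT,
    EventKind.PERSONALITY,
}


def choose_tts_backend(kind: EventKind) -> str:
    """Return "elevenlabs" for premium events and "say" for everything else."""
    return "elevenlabs" if kind in PREMIUM_KINDS else "say"


def monthly_cost_usd(chars_per_month: int) -> float:
    """Estimated ElevenLabs spend for a monthly character volume."""
    return chars_per_month / 1000 * ELEVENLABS_USD_PER_1K_CHARS


def chars_for_budget(budget_usd: float) -> int:
    """How many spoken characters a monthly budget buys."""
    return int(budget_usd / ELEVENLABS_USD_PER_1K_CHARS * 1000)
```

At the quoted price, the $5-10 estimate corresponds to roughly 16,000 to 33,000 spoken characters per month, which gives a concrete ceiling to monitor.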

4. Speaker ID doesn't transfer to Mac.
The MFCC voiceprint is in iOS UserDefaults. The Mac Ear Daemon has no speaker authentication. Anyone near the Mac mic can issue commands.

Mitigation (Phase 1): Mac Ear requires wake phrase only — physical proximity is the auth factor. Phase 2: Port SpeakerIDService to Python (MFCC + cosine similarity is ~50 lines). Enroll from Mac mic. Store voiceprint in Supabase for cross-device sync.
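The Phase 2 comparison step might look like the sketch below. It assumes MFCC extraction has already reduced each utterance to a fixed-length vector (the extraction itself is not shown), and the 0.85 threshold is a placeholder that would need tuning against real enrollment data.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector matches nothing
    return dot / (norm_a * norm_b)


def is_enrolled_speaker(
    candidate: list[float],
    voiceprint: list[float],
    threshold: float = 0.85,  # assumed value, not taken from SpeakerIDService
) -> bool:
    """Accept the speaker when the candidate is close enough to the voiceprint."""
    return cosine_similarity(candidate, voiceprint) >= threshold
```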

What's Missing

1. Voice latency budget.
End-to-end voice command latency: audio capture (~0ms) + VAD detection (~200ms silence) + Whisper transcription (~1-2s) + router classification (~50ms) + dispatch (~100ms) + action execution (variable) + TTS generation (~500ms ElevenLabs, ~0ms `say`) + audio playback (~0ms).
Total for system `say`: ~1.5-2.5s. Total for ElevenLabs: ~2-3s.

Solution: This is acceptable for command-response but not for conversation. For multi-turn voice dialog, use streaming: start TTS playback of the first sentence while generating the rest. The iOS app already does this with ElevenLabs streaming.
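As a worked check of the budget, the per-stage figures above can be summed directly. The stage values are the ones listed in this section; action execution is variable and passed in separately, and the structure itself is only a sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    min_ms: int
    max_ms: int


# Stages shared by both TTS paths, using the estimates quoted above.
COMMON_STAGES = [
    Stage("audio capture", 0, 0),
    Stage("VAD silence detection", 200, 200),
    Stage("Whisper transcription", 1000, 2000),
    Stage("router classification", 50, 50),
    Stage("dispatch", 100, 100),
    Stage("audio playback", 0, 0),
]

TTS_STAGES = {
    "say": Stage("TTS (say)", 0, 0),
    "elevenlabs": Stage("TTS (ElevenLabs first byte)", 500, 500),
}


def budget_ms(tts: str, action_ms: int = 0) -> tuple[int, int]:
    """Return (best case, worst case) end-to-end latency in milliseconds."""
    stages = COMMON_STAGES + [TTS_STAGES[tts]]
    best = sum(s.min_ms for s in stages) + action_ms
    worst = sum(s.max_ms for s in stages) + action_ms
    return best, worst
```

Before action execution, the sums are 1.35-2.35 s for `say` and 1.85-2.85 s for ElevenLabs, consistent with the rounded totals above once a small allowance for the action itself is added.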

2. Error handling for voice commands.
What happens when a voice command is misclassified? Classifying "Spawn Spore" as "kill Spore" would be catastrophic.

Solution: Destructive commands (kill, inject, diverge) always require spoken confirmation:

User: "Kill the Spore pane"
Mesh: "Confirm: kill Spore pane? Say 'yes' to proceed."
User: "Yes" → execute
User: [anything else or silence] → cancel

FleetVoiceRouter already has `isDestructive` flags — port these to the unified router.
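The confirmation exchange can be modeled as a small state holder. The sketch below is one possible shape, not the FleetVoiceRouter implementation: the class and method names are hypothetical, and silence is represented as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PendingAction:
    action: str
    target: Optional[str]


class ConfirmationGate:
    """Holds one destructive action until the user says "yes"."""

    def __init__(self) -> None:
        self.pending: Optional[PendingAction] = None

    def submit(self, action: str, target: Optional[str], is_destructive: bool) -> str:
        """Return "execute" for safe actions, "confirm" when a prompt is needed."""
        if is_destructive:
            self.pending = PendingAction(action, target)
            return "confirm"
        return "execute"

    def answer(self, transcript: Optional[str]) -> Optional[PendingAction]:
        """Resolve the pending action.

        Returns the action only for an explicit "yes". Anything else,
        including silence (None), cancels. The pending slot is always cleared.
        """
        pending, self.pending = self.pending, None
        if pending is None or transcript is None:
            return None
        normalized = transcript.strip().lower().strip(".,!? ")
        return pending if normalized == "yes" else None
```

Clearing the pending slot on every answer means a stale "yes" spoken later cannot trigger an old destructive command.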

3. Offline fallback.
If the server-side unified router is down, voice commands fail.

Solution: Ship a minimal local fallback classifier in both the Mac Ear daemon and iOS apps. It handles: mute/unmute, status, help, and "route to Clawdbot" for everything else. Full classification resumes when the router comes back.
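A minimal local fallback could be as small as the sketch below. The pattern list and the `system.help` action name are assumptions; `remote` stands in for whatever function calls the unified router.

```python
import re
from typing import Callable

# Checked in order; "unmute" is listed before "mute" for readability,
# although the word-boundary match already keeps them apart.
LOCAL_PATTERNS = [
    ("system.unmute", r"\b(unmute|resume voice)\b"),
    ("system.mute", r"\b(mute|quiet|silence)\b"),
    ("fleet.status", r"\bstatus\b"),
    ("system.help", r"\bhelp\b"),  # hypothetical action name
]


def local_fallback_classify(transcript: str) -> dict:
    """Classify the handful of commands that must work offline."""
    text = transcript.lower().strip()
    for action, pattern in LOCAL_PATTERNS:
        if re.search(pattern, text):
            return {"action": action, "source": "local"}
    # Everything else is routed to Clawdbot as a free-form query.
    return {
        "action": "clawdbot.query",
        "source": "local",
        "params": {"raw_command": transcript},
    }


def classify_with_fallback(remote: Callable[[str], dict], transcript: str) -> dict:
    """Try the unified router first; fall back locally if the call fails."""
    try:
        return remote(transcript)
    except Exception:  # a real client would catch specific network errors
        return local_fallback_classify(transcript)
```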

3b. EXPAND — Implementation Specs

Subsystem 1: Unified Voice Router Service

File: `[home-path]` (~250 lines)

```python
import re

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()


class VoiceCommand(BaseModel):
    transcript: str
    session_id: str = ""
    device: str = "mac"
    speaker_embedding: list[float] | None = None


class Intent(BaseModel):
    action: str  # "fleet.status", "project.task", "system.mute", etc.
    target: str | None = None  # system and fallback intents carry no target
    params: dict = {}
    is_destructive: bool = False
    spoken_response: str
    confidence: float


# Intent patterns (merged from VoiceRouter + FleetVoiceRouter + SpeakFlow).
# Each entry is a regular expression matched on word boundaries.
FLEET_PATTERNS = {
    "fleet.status": ["status", "what's running", "fleet health", "how are things"],
    "fleet.spawn": ["spawn", "start a pane", "open", "create"],
    "fleet.kill": ["kill", "stop", "close", "terminate", "shut down"],
    "fleet.inject": ["inject", "send to", "tell .* to"],
    "fleet.converge": ["converge", "sync up", "bring together"],
    "fleet.focus": ["focus on", "switch to", "go to"],
    "fleet.unstick": ["unstick", "unblock", "fix absorbing"],
}

SYSTEM_PATTERNS = {
    "system.mute": ["quiet", "mute", "shut up", "silence"],
    "system.unmute": ["unmute", "speak", "resume voice"],
    "system.repeat": ["repeat", "say that again", "what did you say"],
    "system.summary": ["what did i miss", "summary", "catch me up"],
}

# Project keywords (from VoiceRouter's 35-intent map)
PROJECT_KEYWORDS = {
    "spore": "Spore", "koji": "Koatji", "securiclaw": "SecuriClaw",
    "creative director": "CreativeDirector", "serenity": "Serenity Soother",
    "nko": "LearnNKo", "speakflow": "SpeakFlow", "nexus": "Nexus Portal",
    # ... remaining projects
}

# Note: "fleet.diverge" has no pattern in this excerpt.
DESTRUCTIVE_INTENTS = {"fleet.kill", "fleet.inject", "fleet.diverge", "fleet.spawn"}


def matches(pattern: str, transcript: str) -> bool:
    """Whole-word regex match, so "mute" does not fire on "unmute",
    "speak" does not fire on "speakflow", and "tell .* to" works as a regex."""
    return re.search(rf"\b{pattern}\b", transcript) is not None


def extract_target(transcript: str, pattern: str) -> str | None:
    """Placeholder helper: return the first known project named in the
    transcript. `pattern` is kept for the call signature but unused here."""
    for keyword, project in PROJECT_KEYWORDS.items():
        if matches(re.escape(keyword), transcript):
            return project
    return None


def generate_response(intent: str, target: str | None = None) -> str:
    """Placeholder helper: a generic spoken acknowledgement per intent."""
    verb = intent.split(".", 1)[-1]
    return f"OK, {verb} {target}." if target else f"OK, {verb}."


@app.post("/classify")
async def classify_voice(cmd: VoiceCommand) -> Intent:
    transcript = cmd.transcript.lower().strip()

    # 1. Fleet pattern matching
    for intent, patterns in FLEET_PATTERNS.items():
        for pattern in patterns:
            if matches(pattern, transcript):
                target = extract_target(transcript, pattern)
                return Intent(
                    action=intent, target=target,
                    is_destructive=intent in DESTRUCTIVE_INTENTS,
                    spoken_response=generate_response(intent, target),
                    confidence=0.9,
                )

    # 2. System commands
    for intent, patterns in SYSTEM_PATTERNS.items():
        for pattern in patterns:
            if matches(pattern, transcript):
                return Intent(action=intent, spoken_response=generate_response(intent), confidence=0.95)

    # 3. Project routing
    for keyword, project in PROJECT_KEYWORDS.items():
        if matches(re.escape(keyword), transcript):
            return Intent(
                action="project.task", target=project,
                params={"raw_command": cmd.transcript},
                spoken_response=f"Routing to {project}.",
                confidence=0.8,
            )

    # 4. Fallback: route to Clawdbot as free-form question
    return Intent(
        action="clawdbot.query",
        params={"raw_command": cmd.transcript},
        spoken_response="Let me think about that.",
        confidence=0.5,
    )
```

Subsystem 2: Mac Ear Daemon

File: `[home-path]` (~200 lines)

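```python
import queue
import subprocess
import threading

import numpy as np
import requests
import sounddevice as sd
import whisper

WAKE_PHRASES = ["hey claw", "hey claude"]
SAMPLE_RATE = 16000
SILENCE_THRESHOLD = 0.01  # RMS energy below this counts as silence
# Seconds of trailing silence that end an utterance. Note that the latency
# budget in 3a assumes ~200 ms here, so this value is a tuning target.
SILENCE_DURATION = 1.5
ROUTER_URL = "http://localhost:8650/classify"


class MacEarDaemon:
    def __init__(self):
        self.model = whisper.load_model("base")  # or mlx_whisper
        self.buffer = []
        self.is_speaking = False
        self.silence_frames = 0
        # Finished utterances go to a worker thread so that Whisper never
        # runs inside the real-time audio callback.
        self.utterances = queue.Queue()

    def audio_callback(self, indata, frames, time, status):
        """Called by sounddevice for each audio chunk."""
        energy = np.sqrt(np.mean(indata**2))
        if energy > SILENCE_THRESHOLD:
            self.is_speaking = True
            self.silence_frames = 0
            self.buffer.append(indata.copy())
        elif self.is_speaking:
            self.silence_frames += 1
            self.buffer.append(indata.copy())
            if self.silence_frames > int(SILENCE_DURATION * SAMPLE_RATE / frames):
                self.utterances.put(np.concatenate(self.buffer).flatten())
                self.buffer = []
                self.is_speaking = False
                self.silence_frames = 0

    def process_buffer(self, audio):
        """Transcribe one buffered utterance and check for wake phrase."""
        result = self.model.transcribe(audio.astype(np.float32), language="en")
        transcript = result["text"].strip().lower()

        for phrase in WAKE_PHRASES:
            if phrase in transcript:
                # Whisper adds punctuation ("hey claw, status."), so trim it.
                command = transcript.split(phrase, 1)[-1].strip(" ,.!?")
                if command:
                    self.dispatch_command(command)
                return

    def dispatch_command(self, command: str):
        """Send to unified voice router."""
        resp = requests.post(ROUTER_URL, json={
            "transcript": command,
            "device": "mac",
            "session_id": "mac-ear",
        }, timeout=5)
        resp.raise_for_status()
        intent = resp.json()

        # Speak response
        subprocess.run(["say", intent["spoken_response"]])

        # Dispatch action via NUMU.
        # Intents with is_destructive=True should wait for spoken
        # confirmation before this step (see 3a, error handling).
        self.emit_numu("voice.command", {
            "intent": intent["action"],
            "target": intent.get("target"),
            "transcript": command,
            "machine": "mac1",
        })

    def emit_numu(self, event_type: str, payload: dict):
        """Placeholder: the NUMU client API is not shown in this excerpt."""
        print(f"[numu] {event_type}: {payload}")

    def _worker(self):
        """Transcribe queued utterances one at a time."""
        while True:
            self.process_buffer(self.utterances.get())

    def run(self):
        """Capture from the default input device until interrupted."""
        threading.Thread(target=self._worker, daemon=True).start()
        with sd.InputStream(samplerate=SAMPLE_RATE, channels=1,
                            dtype="float32", callback=self.audio_callback):
            threading.Event().wait()
```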
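The silence-detection logic in `audio_callback` can be exercised without a microphone by feeding it per-chunk energies. The sketch below mirrors that state machine in isolation; `EnergyVAD` and `silent_chunks_for` are hypothetical names, and the 1600-frame chunk size in the usage note is an assumed value.

```python
from typing import Optional


def silent_chunks_for(duration_s: float, sample_rate: int, frames_per_chunk: int) -> int:
    """Number of consecutive silent chunks that span `duration_s` seconds."""
    return int(duration_s * sample_rate / frames_per_chunk)


class EnergyVAD:
    """Energy-threshold utterance segmenter, independent of any audio device."""

    def __init__(self, threshold: float = 0.01, max_silent_chunks: int = 15) -> None:
        self.threshold = threshold
        self.max_silent_chunks = max_silent_chunks
        self.is_speaking = False
        self.silent_chunks = 0
        self.chunk_count = 0

    def feed(self, energy: float) -> Optional[int]:
        """Feed one chunk's RMS energy.

        Returns the number of chunks in the utterance when it ends
        (trailing silence included), otherwise None.
        """
        if energy > self.threshold:
            self.is_speaking = True
            self.silent_chunks = 0
            self.chunk_count += 1
        elif self.is_speaking:
            self.silent_chunks += 1
            self.chunk_count += 1
            if self.silent_chunks > self.max_silent_chunks:
                finished = self.chunk_count
                self.is_speaking = False
                self.silent_chunks = 0
                self.chunk_count = 0
                return finished
        return None
```

With 1600-frame chunks at 16 kHz, `silent_chunks_for(1.5, 16000, 1600)` gives 15 chunks, so an utterance closes on the 16th consecutive silent chunk.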

Subsystem 3: Voice Memory Table

```sql
-- The vector column type and ivfflat index require the pgvector extension.
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE voice_transcripts (
    id UUID DEFAULT gen_random_uuid() PRIMARY KEY,
    session_id TEXT NOT NULL,
    speaker TEXT DEFAULT 'mohamed',
    role TEXT CHECK (role IN ('user', 'agent', 'system')),
    content TEXT NOT NULL,
    intent TEXT,
    device TEXT,
    embedding vector(768),
    created_at TIMESTAMPTZ DEFAULT now(),
    metadata JSONB DEFAULT '{}'
);

CREATE INDEX idx_vt_session ON voice_transcripts(session_id);
CREATE INDEX idx_vt_created ON voice_transcripts(created_at DESC);
-- ivfflat clusters are built from existing rows, so recall is better
-- if this index is created (or rebuilt) after some data has been loaded.
CREATE INDEX idx_vt_embedding ON voice_transcripts USING ivfflat (embedding vector_cosine_ops) WITH (lists = 10);
```
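On the writing side, a daemon might validate each row before sending it to Supabase. The sketch below only builds and checks the record; the class name is hypothetical, the insert call is omitted, and `id`, `speaker`, and `created_at` are left to the column defaults in the schema above.

```python
from dataclasses import dataclass, field
from typing import Optional

ALLOWED_ROLES = {"user", "agent", "system"}  # mirrors the CHECK constraint
EMBEDDING_DIM = 768  # mirrors vector(768)


@dataclass
class VoiceTranscript:
    session_id: str
    role: str
    content: str
    intent: Optional[str] = None
    device: Optional[str] = None
    embedding: Optional[list] = None
    metadata: dict = field(default_factory=dict)

    def to_record(self) -> dict:
        """Validate against the table constraints and return an insertable dict."""
        if self.role not in ALLOWED_ROLES:
            raise ValueError(f"role must be one of {sorted(ALLOWED_ROLES)}")
        if not self.content:
            raise ValueError("content must not be empty")
        if self.embedding is not None and len(self.embedding) != EMBEDDING_DIM:
            raise ValueError(f"embedding must have {EMBEDDING_DIM} dimensions")
        return {
            "session_id": self.session_id,
            "role": self.role,
            "content": self.content,
            "intent": self.intent,
            "device": self.device,
            "embedding": self.embedding,
            "metadata": self.metadata,
        }
```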

Subsystem 4: Mesh Voice Output

File: `[home-path]` (~150 lines)

Subscribes to NUMU events, speaks critical/normal events via TTS. Supports mute/unmute. Queues events during mute for summary on unmute.
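The mute and queue behavior can be sketched independently of NUMU and of any TTS backend. The class below takes a `speak` callable as an injected dependency. The priority labels "critical" and "normal" come from the description above; dropping other priorities and the wording of the summary are assumptions.

```python
from typing import Callable


class MeshVoiceOutput:
    """Speaks critical/normal events; queues them while muted."""

    SPOKEN_PRIORITIES = {"critical", "normal"}

    def __init__(self, speak: Callable[[str], None]) -> None:
        self._speak = speak
        self._muted = False
        self._queued = []

    def on_event(self, priority: str, text: str) -> None:
        """Handle one event from the bus."""
        if priority not in self.SPOKEN_PRIORITIES:
            return  # assumption: other priorities are never spoken
        if self._muted:
            self._queued.append(text)
        else:
            self._speak(text)

    def mute(self) -> None:
        self._muted = True

    def unmute(self) -> None:
        """Resume speaking and summarize anything queued during mute."""
        self._muted = False
        if self._queued:
            summary = f"While muted, {len(self._queued)} event(s): " + "; ".join(self._queued)
            self._queued.clear()
            self._speak(summary)
```

Passing `speak` in as a parameter lets the same class drive `say`, ElevenLabs, or a test list.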

3c. MASTER PLAN

### Phase 1: Foundation (Days 1-7)
| # | Task | Automatable |
|---|------|-------------|
| 1 | Build Unified Voice Router (FastAPI, ~55 intents merged) | Yes |
| 2 | Deploy router on Mac1 (:8650) as LaunchAgent | Yes |
| 3 | Build Mac Ear Daemon (sounddevice + Whisper + VAD) | Yes |
| 4 | Test wake phrase detection + command extraction | Yes |
| 5 | Wire Mac Ear → Router → NUMU dispatch | Yes |
| 6 | Build Mesh Voice Output daemon (NUMU subscriber + TTS) | Yes |
| 7 | Define voice NUMU event types (7 new event types) | Yes |

Gate: "Hey Claw, status" from Mac1 mic → spoken fleet status response.

### Phase 2: Memory + Persistence (Days 7-14)
| # | Task | Automatable |
|---|------|-------------|
| 8 | Create `voice_transcripts` Supabase table + indexes | Yes |
| 9 | Store all voice transcripts (command + response) in Supabase | Yes |
| 10 | Generate Gemini embeddings for voice transcript segments | Yes |
| 11 | Ingest voice transcripts into RAG++ (project: "voice-sessions") | Yes |
| 12 | Add voice context to session_start_hook smart gateway query | Yes |
| 13 | Build voice session summarizer (end-of-session → Obsidian note) | Yes |
| 14 | Add 6 Prometheus metrics for voice system | Yes |

Gate: Voice transcripts searchable in RAG++. Terminal sessions see "recent voice context."

### Phase 3: iOS Migration (Days 14-21)
| # | Task | Automatable |
|---|------|-------------|
| 15 | OpenClawHub: Replace VoiceRouter with HTTP POST to unified router | Yes |
| 16 | OpenClawHub: Replace FleetVoiceRouter with HTTP POST to unified router | Yes |
| 17 | Add offline fallback classifier to OpenClawHub | Yes |
| 18 | SpeakFlow: Replace VoiceCommandService with HTTP POST + fallback | Yes |
| 19 | Wire Gemini Live hints to NUMU (gemini.hint events) | Yes |
| 20 | Test cross-device voice continuity (phone → mac → phone) | Semi |

Gate: All devices use unified router. Intent classification is consistent everywhere.

### Phase 4: Polish + Safety (Days 21-28)
| # | Task | Automatable |
|---|------|-------------|
| 21 | Implement destructive command confirmation flow (spoken yes/no) | Yes |
| 22 | Port SpeakerIDService to Python for Mac Ear authentication | Yes |
| 23 | Build "what did I miss" summary command | Yes |
| 24 | Build mute/unmute/quiet voice commands | Yes |
| 25 | Add Nexus Portal `/voice` dashboard (session history, intent distribution) | Yes |
| 26 | Tune VAD sensitivity, silence thresholds, wake phrase accuracy | Semi |
| 27 | Load test: 50 voice commands in 10 minutes, measure latency p95 | Yes |

Gate: Destructive commands require confirmation. Speaker auth works. Latency p95 < 3s.
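For the load test in task 27, the p95 check could be computed as below. This is a sketch using the nearest-rank percentile definition, which is one of several in common use. The function names are hypothetical and the 3 s limit is the gate value stated above.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


def passes_latency_gate(latencies_s: list[float], limit_s: float = 3.0) -> bool:
    """True when the p95 latency is under the gate limit."""
    return percentile_nearest_rank(latencies_s, 95) < limit_s
```

With 50 samples, the nearest-rank p95 is the 48th fastest command, so up to two slow outliers are tolerated before the gate fails.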

Summary

| Metric | Current | Target |
|--------|---------|--------|
| Voice input devices | 2 (iPhone, Mac via SpeakFlow) | 3+ (Mac Ear, iPhone, SpeakFlow) |
| Intent classifiers | 3 (incompatible) | 1 (unified, server-side) |
| Voice memory persistence | 0 (ephemeral) | 100 |
| Mesh TTS output | 0 events spoken | Critical + normal events |
| Voice-to-fleet commands | iOS only | Any device |
| End-to-end latency (system TTS) | N/A | <2.5s p95 |
| End-to-end latency (ElevenLabs) | N/A | <3.5s p95 |
| Speaker authentication | iOS only | Mac + iOS |
| NUMU voice event types | 0 | 7 |

Total tasks: 27 | Automatable: 25 (tasks 20 and 26 are semi-automatable) | Duration: 28 days | New code: ~780 lines

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage3-expand-master-plan.md
