Grand Diomande Research · Full HTML Reader

Stage 2: Compound Architecture — Voice-First Agent Architecture

> Sequential synthesis: each step inherits all prior context.

Step 1: Mac Ear Daemon + Unified Router (inherits Stage 0 + Paths A + C)

Decision: Path A's Mac Ear is the entry point — you need to be able to talk to the mesh from the desk. But Path C's unified router is needed immediately because we don't want a third intent classifier (Mac) alongside iOS VoiceRouter and FleetVoiceRouter. Build the server-side router first, then point the Mac Ear at it.

Implementation:

Server-side Unified Voice Router (Python, on Mac1 or cloud-vm):

```python
# [home-path]
class UnifiedVoiceRouter:
    # Merge all 3 intent taxonomies:
    # - VoiceRouter: 35 project intents
    # - FleetVoiceRouter: fleet control intents
    # - SpeakFlow: system commands
    # Total: ~55 unique intents

    def classify(self, transcript: str, context: "VoiceContext") -> "Intent":
        # 1. Exact keyword match (from FleetVoiceRouter patterns)
        # 2. Fuzzy keyword match (from VoiceRouter patterns)
        # 3. Embedding similarity fallback (Gemini embedding vs intent descriptions)
        # 4. Pronoun resolution from context
        # 5. Destructive command flagging
        raise NotImplementedError  # skeleton only
```
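
The five-step cascade above can be sketched in a few lines. This is a minimal illustration, not the planned router: the `VoiceContext` and `Intent` shapes, the phrase tables, and the 0.75 threshold are all assumptions, `difflib` stands in for whatever fuzzy matcher the real patterns use, and the embedding fallback (step 3) is left as a stub that returns `unknown`.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class VoiceContext:
    """Hypothetical context: the project most recently mentioned, if any."""
    last_project: Optional[str] = None


@dataclass
class Intent:
    name: str
    confidence: float
    destructive: bool = False
    slots: dict = field(default_factory=dict)


# Illustrative phrases only; the real taxonomy has ~55 intents.
EXACT = {"fleet status": "fleet.status", "kill all panes": "fleet.kill_all"}
FUZZY = {"run the build": "project.build", "fix it": "project.fix"}
DESTRUCTIVE = {"fleet.kill_all"}
PRONOUNS = {"it", "that", "this"}


def classify(transcript: str, context: VoiceContext, threshold: float = 0.75) -> Intent:
    text = " ".join(transcript.lower().split()).strip(" .!?")
    # 1. Exact keyword match.
    if text in EXACT:
        intent = Intent(EXACT[text], 1.0)
    else:
        # 2. Fuzzy keyword match: best similarity ratio over known phrases.
        phrase, score = max(
            ((p, SequenceMatcher(None, text, p).ratio()) for p in FUZZY),
            key=lambda pair: pair[1],
        )
        if score >= threshold:
            intent = Intent(FUZZY[phrase], score)
        else:
            # 3. The embedding fallback would go here; this sketch gives up instead.
            intent = Intent("unknown", 0.0)
    # 4. Pronoun resolution: bind "it"/"that"/"this" to the last project in context.
    if PRONOUNS & set(text.split()) and context.last_project:
        intent.slots["target"] = context.last_project
    # 5. Destructive command flagging, so a caller can ask for confirmation.
    intent.destructive = intent.name in DESTRUCTIVE
    return intent
```

Running the cheap deterministic checks first means the embedding call is only paid for transcripts that neither table recognizes.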

Mac Ear Daemon (Python, LaunchAgent on Mac1):

```python
# [home-path]
# - sounddevice for audio capture (16kHz mono)
# - WebRTC VAD for speech detection
# - mlx-whisper for transcription (local, no API call)
# - POST to voice-router /classify
# - Dispatch result to NUMU event bus
```
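
One detail worth pinning down early is framing: WebRTC VAD only accepts 16-bit mono PCM in frames of 10, 20, or 30 ms, so the capture stream has to be cut into fixed-size frames before speech detection. The sketch below shows that framing plus a simple utterance segmenter. The `is_speech` callable is a stand-in for the real detector, and the 10-frame hangover is an arbitrary illustrative value, not a figure from the design.

```python
from typing import Callable, Iterator, List

SAMPLE_RATE = 16_000      # Hz, mono, as in the daemon notes above
FRAME_MS = 30             # WebRTC VAD accepts 10, 20 or 30 ms frames
BYTES_PER_SAMPLE = 2      # 16-bit PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * BYTES_PER_SAMPLE  # 960 bytes


def frames(pcm: bytes) -> Iterator[bytes]:
    """Yield fixed-size frames; a trailing partial frame is dropped."""
    for start in range(0, len(pcm) - FRAME_BYTES + 1, FRAME_BYTES):
        yield pcm[start:start + FRAME_BYTES]


def segments(pcm: bytes, is_speech: Callable[[bytes], bool],
             hangover_frames: int = 10) -> List[bytes]:
    """Group speech frames into utterances.

    An utterance closes after `hangover_frames` silent frames in a row
    (10 frames is 300 ms at this frame size). Silent frames are not kept.
    """
    out: List[bytes] = []
    current: List[bytes] = []
    silent = 0
    for frame in frames(pcm):
        if is_speech(frame):
            current.append(frame)
            silent = 0
        elif current:
            silent += 1
            if silent >= hangover_frames:
                out.append(b"".join(current))
                current, silent = [], 0
    if current:
        out.append(b"".join(current))
    return out
```

In the daemon, `is_speech` would presumably wrap `webrtcvad.Vad(...).is_speech(frame, SAMPLE_RATE)`, and each returned segment would be handed to the transcriber before the POST to `/classify`.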

Client migration path:
- Phase 1: Mac Ear uses unified router directly
- Phase 2: OpenClawHub switches from local VoiceRouter/FleetVoiceRouter to HTTP POST
- Phase 3: SpeakFlow switches from local VoiceCommandService to HTTP POST
- Phase 4: Decommission client-side classifiers (keep as offline fallback)
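
Phase 4 keeps the client-side classifiers as an offline fallback, which implies each client wraps the HTTP call in a try-remote-then-local pattern. A minimal sketch of that wrapper follows. The router URL, the JSON request shape, and the `source` field are assumptions made for illustration; only the `/classify` path comes from the notes above.

```python
import json
import urllib.request
from typing import Callable

ROUTER_URL = "http://localhost:8000/classify"  # assumed address, not from the source


def post_classify(transcript: str, timeout: float = 1.5) -> dict:
    """POST the transcript as JSON and decode the JSON reply."""
    request = urllib.request.Request(
        ROUTER_URL,
        data=json.dumps({"transcript": transcript}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def classify_with_fallback(
    transcript: str,
    remote: Callable[[str], dict],
    local: Callable[[str], dict],
) -> dict:
    """Prefer the unified router; fall back to the on-device classifier."""
    try:
        result = remote(transcript)
        result["source"] = "router"
    except (OSError, ValueError):
        # URLError and socket timeouts are OSError subclasses;
        # ValueError covers a malformed JSON reply.
        result = local(transcript)
        result["source"] = "local"
    return result
```

A client would call `classify_with_fallback(text, post_classify, legacy_classifier)`, where `legacy_classifier` is a hypothetical adapter around the existing local router. Tagging the result with its source makes it possible to measure how often the fallback actually fires before decommissioning anything.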

Step 2: Mesh Voice Output (inherits Step 1 + Path B)

Decision: Path B's spoken mesh is essential for bidirectional voice. After Step 1 gives the mesh an ear, it needs a mouth. But Path B's approach of subscribing to ALL events is too noisy. Use a priority filter.

Implementation:

Voice Output Service (Python, LaunchAgent on Mac1):

```python
# [home-path]
class MeshVoice:
    # event type -> (priority, spoken template)
    SPOKEN_EVENTS = {
        # Critical: always spoken (ElevenLabs)
        "security.alert": ("critical", "Security alert: {message}"),
        "build.failure": ("critical", "{app} build failed."),
        "pane.absorbing": ("critical", "Pane {pane} is absorbing. Intervention needed."),

        # Normal: spoken unless muted (system TTS)
        "pulse.complete": ("normal", "Pulse complete on {project}."),
        "build.success": ("normal", "{app} build succeeded."),
        "flow.complete": ("normal", "Flow {name} completed."),

        # Low: only spoken when explicitly asked for status
        "pane.spawn": ("low", "Pane spawned for {project}."),
        "ew.mutation": ("low", "Evolution mutation on {target}."),
    }

    def __init__(self):
        self.numu = NumuClient("ws://localhost:7890")
        self.tts = DualTTS(
            critical=ElevenLabsTTS(voice_id="TmSgyk1vGAD9YzdtJV3V"),
            routine=SystemTTS(voice="Samantha"),
        )
        self.muted = False
        self.focus_mode = False
```

Mute/unmute via voice:
- "Quiet" / "mute" → suppresses normal+low events
- "Unmute" / "speak" → resumes
- Critical events always spoken regardless of mute state
- "What did I miss?" → summarize suppressed events
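
The mute rules above reduce to a small decision function plus a buffer of held-back sentences. The sketch below is one possible reading, using a subset of the `SPOKEN_EVENTS` table. It assumes that low-priority events are buffered together with muted normal events and that "What did I miss?" simply reads the buffer back; the design itself calls for a summary.

```python
from typing import List, Optional

# A subset of the SPOKEN_EVENTS table above.
SPOKEN_EVENTS = {
    "build.failure": ("critical", "{app} build failed."),
    "build.success": ("normal", "{app} build succeeded."),
    "pane.spawn": ("low", "Pane spawned for {project}."),
}


class VoiceFilter:
    def __init__(self) -> None:
        self.muted = False
        self.missed: List[str] = []

    def handle(self, event_type: str, payload: dict) -> Optional[str]:
        """Return the sentence to speak now, or None if it is held back."""
        if event_type not in SPOKEN_EVENTS:
            return None
        priority, template = SPOKEN_EVENTS[event_type]
        sentence = template.format(**payload)
        if priority == "critical":
            return sentence  # always spoken, regardless of mute state
        if priority == "normal" and not self.muted:
            return sentence
        self.missed.append(sentence)  # low priority, or normal while muted
        return None

    def command(self, phrase: str) -> Optional[str]:
        """Apply a mute-related voice command; return any spoken reply."""
        phrase = phrase.lower().strip(" ?!.")
        if phrase in ("quiet", "mute"):
            self.muted = True
        elif phrase in ("unmute", "speak"):
            self.muted = False
        elif phrase == "what did i miss":
            reply = " ".join(self.missed) or "Nothing."
            self.missed.clear()
            return reply
        return None
```

Keeping the filter separate from the TTS backends means the critical/routine voice split can be chosen after the speak-or-hold decision has been made.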

Step 3: Voice Memory (inherits Steps 1-2 + Path D)

Decision: Path D's voice transcript persistence is critical for the mesh's memory model. Voice is currently the only interaction channel that isn't persisted. This creates a gap in the knowledge graph.

Implementation:

Transcript persistence (automatic, every voice session):
1. Mac Ear daemon captures transcript segments
2. Each segment stored in Supabase `voice_transcripts` table
3. Gemini embedding generated for each segment
4. Segment ingested into RAG++ (project: "voice-sessions")
5. Graph Kernel entity extraction runs on accumulated transcripts (hourly batch)
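
Steps 2 to 4 can be sketched as one function per segment, with the storage, embedding, and RAG++ calls injected as plain callables so that no client API is assumed. The row fields below are hypothetical (the table schema is not given in the source), and the sketch embeds before storing so the vector can travel in the same row, which is a design choice rather than something the steps above require.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Callable, List, Sequence
import uuid


@dataclass
class TranscriptSegment:
    text: str
    session_id: str
    machine: str = "mac1"
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    embedding: List[float] = field(default_factory=list)


def persist_segment(
    segment: TranscriptSegment,
    embed: Callable[[str], Sequence[float]],
    store_row: Callable[[dict], None],
    ingest: Callable[[str, str], None],
) -> dict:
    """Embed one segment, store its row, and ingest the text into RAG++."""
    segment.embedding = list(embed(segment.text))
    row = {
        "id": segment.id,
        "session_id": segment.session_id,
        "machine": segment.machine,
        "text": segment.text,
        "embedding": segment.embedding,
        "created_at": segment.created_at.isoformat(),
    }
    store_row(row)                          # e.g. an insert into voice_transcripts
    ingest("voice-sessions", segment.text)  # project name from step 4
    return row
```

The hourly entity-extraction batch (step 5) is left out here because it operates on accumulated transcripts rather than on single segments.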

Cross-modal continuity:
- When a terminal Claude session starts, the session_start_hook queries voice_transcripts from the last 24h
- Relevant voice context appears in the smart gateway injection
- "We discussed this earlier" works across voice and text
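
The 24-hour lookup in the session start hook amounts to a time-window filter followed by a cap on how much context is injected. A minimal sketch, assuming segments arrive as `(created_at, text)` pairs and that recency alone decides what is kept; the design says "relevant" context, so a real hook would presumably rank by similarity as well.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Segment = Tuple[datetime, str]  # (created_at, text); a simplified row shape


def recent_voice_context(
    segments: List[Segment],
    now: datetime,
    window: timedelta = timedelta(hours=24),
    limit: int = 5,
) -> List[str]:
    """Return the newest `limit` segment texts from the window, oldest first."""
    cutoff = now - window
    fresh = sorted((s for s in segments if cutoff <= s[0] <= now), key=lambda s: s[0])
    return [text for _, text in fresh[-limit:]]
```

Returning the texts oldest first keeps the injected context in conversational order.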

Voice session summaries:
- At end of voice session (5 min silence or explicit "end session"):
  - Claude Haiku summarizes the session
  - Summary posted to Obsidian vault (Daily/voice-session-{timestamp}.md)
  - Action items extracted and appended to active-tasks.md
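
The session boundary rule (5 minutes of silence or an explicit "end session") and the note path can be expressed directly. This is a sketch under stated assumptions: the timestamp format in the filename is invented for illustration, since the source only gives the `{timestamp}` placeholder.

```python
from datetime import datetime, timedelta

SILENCE_LIMIT = timedelta(minutes=5)


def session_ended(last_speech: datetime, now: datetime, last_utterance: str) -> bool:
    """True after 5 minutes of silence or an explicit "end session"."""
    explicit = last_utterance.lower().strip(" .!?") == "end session"
    return explicit or (now - last_speech) >= SILENCE_LIMIT


def summary_path(ended_at: datetime) -> str:
    """Vault-relative note path; the timestamp format is an assumption."""
    return f"Daily/voice-session-{ended_at:%Y%m%d-%H%M%S}.md"
```

The daemon would check `session_ended` on a timer as well as after each utterance, since silence by definition produces no utterance to trigger the check.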

Step 4: Ambient Intelligence Integration (inherits Steps 1-3 + Path E)

Decision: Path E's Gemini Live integration is the most forward-looking component. Wire it in, but with strong privacy controls and a conservative default.

Implementation:

Gemini Live → NUMU bridge (in OpenClawHub):
- GeminiObserverFeature already produces context hints
- New: emit hints to NUMU event bus as `gemini.hint` events
- Mac Ear daemon receives hints and uses them to enrich voice responses

Conservative defaults:
- Gemini Live observer is opt-in (default OFF)
- Only activates when user says "observe" or "ambient mode"
- Auto-deactivates after 30 minutes (battery/privacy)
- Hints are used for context enrichment only — never spoken aloud unprompted
- No audio/video stored — only text hints persisted
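
These defaults describe a small state machine: off until an activation phrase is heard, on for at most 30 minutes, and hints accepted only while on. A minimal sketch follows. The class name, the hint dictionary shape, and the `speak` flag are illustrative assumptions; the phrases and the timeout come from the list above.

```python
from datetime import datetime, timedelta
from typing import Optional

AMBIENT_TIMEOUT = timedelta(minutes=30)
ACTIVATION_PHRASES = {"observe", "ambient mode"}


class AmbientMode:
    def __init__(self) -> None:
        self.activated_at: Optional[datetime] = None  # opt-in: default OFF

    def hear(self, utterance: str, now: datetime) -> None:
        """Turn the observer on when an activation phrase is spoken."""
        if utterance.lower().strip(" .!?") in ACTIVATION_PHRASES:
            self.activated_at = now

    def is_active(self, now: datetime) -> bool:
        if self.activated_at is None:
            return False
        if now - self.activated_at >= AMBIENT_TIMEOUT:
            self.activated_at = None  # auto-deactivate after 30 minutes
            return False
        return True

    def accept_hint(self, hint: str, now: datetime) -> Optional[dict]:
        """Keep a text-only hint while active; never mark it for speech."""
        if not self.is_active(now):
            return None
        return {"type": "gemini.hint", "text": hint, "speak": False}
```

Dropping hints at the point of receipt while the mode is off, rather than filtering them later, keeps the "only text hints persisted" rule easy to audit.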

Integration with voice router:
When a voice command is ambiguous, Gemini context disambiguates:

```
Voice: "Fix that" (ambiguous: fix what?)
Gemini hint: "User is looking at Xcode with SecuriClaw open, build error visible"
Router: intent = project.fix(target="SecuriClaw", context="build error")
```
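
One way to implement this lookup is to scan the latest hint for known project names whenever the transcript contains an unresolved pronoun. The sketch below assumes hints are free text and that a list of project names is available; both the list and the "exactly one match" rule are illustrative choices, not part of the source design.

```python
from typing import Optional, Sequence

KNOWN_PROJECTS = ("SecuriClaw", "OpenClawHub", "SpeakFlow")  # illustrative list
AMBIGUOUS_WORDS = {"that", "it", "this"}


def resolve_target(transcript: str, hint: Optional[str],
                   projects: Sequence[str] = KNOWN_PROJECTS) -> Optional[str]:
    """Pick a target for an ambiguous command from the latest hint, if any."""
    words = set(transcript.lower().strip(" .!?").split())
    if not (words & AMBIGUOUS_WORDS) or not hint:
        return None
    mentioned = [p for p in projects if p.lower() in hint.lower()]
    # Only resolve when the hint names exactly one project.
    return mentioned[0] if len(mentioned) == 1 else None
```

Refusing to guess when the hint names several projects leaves room for the router to ask a clarifying question instead of acting on the wrong target.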

Step 5: Multi-Node Voice (inherits Steps 1-4 + Path F, deferred)

Decision: Path F's distributed voice mesh is architecturally correct but premature. Mac4 and Mac5 are headless compute nodes without microphones. Cloud-vm is remote. Only Mac1 needs voice hardware today. But the NUMU event types from Path F should be defined now for future expansion.

Deferred implementation, but define the protocol:

```python
# NUMU voice event types (reserve now, implement when needed)
VOICE_EVENTS = [
    "voice.command",      # Voice command detected
    "voice.speak",        # Request TTS on a node
    "voice.relay",        # Forward voice between nodes
    "voice.mute",         # Suppress voice on a node
    "voice.unmute",       # Resume voice on a node
    "voice.transcript",   # New transcript segment
    "voice.status",       # Voice daemon health
    "gemini.hint",        # Ambient intelligence hint
]
```

Current scope: Mac1 only. Voice events emitted to NUMU with `machine: "mac1"`. When Mac4/Mac5 get audio hardware (USB mic), adding them is just deploying the daemon and changing the machine ID.
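
Since the only per-node difference is the machine ID, an event envelope can carry it as a field from day one. The sketch below shows one possible envelope. Apart from `machine` and the reserved type names, the field names (`type`, `ts`, `payload`) and the JSON encoding are assumptions, as the NUMU wire format is not described in the source.

```python
import json
import time
from typing import Optional

VOICE_EVENTS = frozenset({
    "voice.command", "voice.speak", "voice.relay", "voice.mute",
    "voice.unmute", "voice.transcript", "voice.status", "gemini.hint",
})


def make_event(event_type: str, payload: dict, machine: str = "mac1",
               ts: Optional[float] = None) -> str:
    """Serialize one event; reject types outside the reserved list."""
    if event_type not in VOICE_EVENTS:
        raise ValueError(f"unknown voice event type: {event_type}")
    envelope = {
        "type": event_type,
        "machine": machine,
        "ts": time.time() if ts is None else ts,
        "payload": payload,
    }
    return json.dumps(envelope, sort_keys=True)
```

Bringing a second node online would then only mean passing a different `machine` value, for example `make_event("voice.status", {"ok": True}, machine="mac4")`.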

Final Compound Architecture

┌─────────────────────────────────────────────────────┐
│                    VOICE LAYER                       │
│                                                      │
│  ┌──────────┐  ┌──────────┐  ┌────────────────┐    │
│  │ Mac Ear  │  │ iPhone   │  │ SpeakFlow Mac  │    │
│  │ (daemon) │  │ (OCH)    │  │ (hotkey)       │    │
│  │ Whisper  │  │ Apple STT│  │ Apple STT      │    │
│  └────┬─────┘  └────┬─────┘  └───────┬────────┘    │
│       │              │                │              │
│       └──────────────┼────────────────┘              │
│                      │                               │
│              ┌───────▼────────┐                      │
│              │ Unified Voice  │  HTTP POST /classify  │
│              │ Router (55     │                      │
│              │ intents)       │                      │
│              └───────┬────────┘                      │
│                      │                               │
│         ┌────────────┼────────────┐                  │
│         │            │            │                  │
│  ┌──────▼──┐  ┌──────▼──┐  ┌─────▼──────┐          │
│  │ Fleet   │  │ Project │  │ System     │          │
│  │ Control │  │ Tasks   │  │ Commands   │          │
│  └────┬────┘  └────┬────┘  └─────┬──────┘          │
│       │            │              │                  │
│       └────────────┼──────────────┘                  │
│                    │                                 │
│            ┌───────▼────────┐                        │
│            │ NUMU Event Bus │  voice.command events   │
│            └───────┬────────┘                        │
│                    │                                 │
│     ┌──────────────┼──────────────┐                  │
│     │              │              │                  │
│ ┌───▼────┐  ┌──────▼──┐  ┌───────▼──────┐          │
│ │ Pane   │  │ Voice   │  │ Voice        │          │
│ │Injector│  │ Memory  │  │ Output (TTS) │          │
│ │(TTY)   │  │(Supa+   │  │ ElevenLabs + │          │
│ │        │  │ RAG++)  │  │ system say   │          │
│ └────────┘  └─────────┘  └──────────────┘          │
│                                                      │
│  ┌──────────────────────────────────────────┐       │
│  │ Ambient Intelligence (opt-in)            │       │
│  │ Gemini Live → gemini.hint → context      │       │
│  └──────────────────────────────────────────┘       │
└─────────────────────────────────────────────────────┘

Code changes estimate:
- Unified Voice Router: ~250 lines (Python, server-side)
- Mac Ear Daemon: ~200 lines (Python, LaunchAgent)
- Mesh Voice Output: ~150 lines (Python, NUMU subscriber + TTS)
- Voice Memory: ~100 lines (Supabase table + RAG++ ingestion)
- Gemini Live NUMU bridge: ~50 lines (Swift, iOS)
- NUMU voice event types: ~30 lines (protocol definition)
- Total: ~780 lines of new code

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage2-compound.md
