Grand Diomande Research · Full HTML Reader

Stage 1 Path C: The Unified Voice Router — One Intent Classifier for All Devices

> Grounded in: Stage 0 finding that VoiceRouter (35 intents, iOS), FleetVoiceRouter (fleet intents, iOS), and SpeakFlow VoiceCommandService (15 commands, macOS) are three separate intent classifiers with incompatible taxonomies. Voice commands work differently depending on which device you're on.

Agents That Account for Themselves architecture technical paper candidate score 20 .md

Core Thesis

Port intent classification to the server. One classifier, shared across all clients. The iPhone, Mac, glasses, and any future device all send audio/text to the same endpoint. The server classifies the intent, resolves the target, and dispatches. Device-specific logic is limited to audio capture and playback.

The Mechanism

1. Server-Side Voice Router:

python
# [home-path]
class UnifiedVoiceRouter:
    """Single intent classifier for all voice inputs."""

    INTENTS = {
        # Fleet control (from FleetVoiceRouter)
        "fleet.status": ["status", "what's running", "fleet health"],
        "fleet.spawn": ["spawn", "start", "open", "create pane"],
        "fleet.kill": ["kill", "stop", "close", "terminate"],
        "fleet.inject": ["inject", "send to", "tell"],
        "fleet.converge": ["converge", "sync up", "bring together"],
        "fleet.diverge": ["diverge", "branch out", "explore"],
        "fleet.focus": ["focus on", "switch to", "go to"],

        # Project routing (from VoiceRouter)
        "project.query": ["ask about", "what's happening with"],
        "project.task": ["work on", "build", "deploy", "fix"],

        # System commands (from SpeakFlow)
        "system.mute": ["quiet", "mute", "shut up"],
        "system.unmute": ["unmute", "speak", "resume voice"],
        "system.dictate": ["type this", "dictate"],

        # Meta
        "meta.help": ["help", "what can you do", "commands"],
        "meta.repeat": ["repeat", "say that again"],
    }

    def classify(self, transcript: str, context: "VoiceContext") -> "Intent":
        """Classify a transcript into an Intent (outline only).

        1. Keyword matching (fast, deterministic).
        2. If ambiguous: embedding similarity against intent descriptions.
        3. Pronoun resolution from context.last_target.
        4. Return an Intent with target, action, params, is_destructive.
        """
        raise NotImplementedError
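
The keyword-matching step (step 1 of `classify`) could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the helper name `keyword_classify`, the whole-word matching rule, and the longest-phrase tie-break are all hypothetical, and the intent table is a trimmed copy of the one above.

```python
import re

# Trimmed copy of the intent table above, for illustration only.
INTENTS = {
    "fleet.status": ["status", "what's running", "fleet health"],
    "fleet.spawn": ["spawn", "start", "open", "create pane"],
    "fleet.kill": ["kill", "stop", "close", "terminate"],
    "system.mute": ["quiet", "mute", "shut up"],
    "system.unmute": ["unmute", "speak", "resume voice"],
}


def keyword_classify(transcript: str) -> list[str]:
    """Return the intent(s) matched by the longest keyword phrase.

    Hypothetical helper. Whole-word matching keeps "mute" from firing
    inside "unmute". An empty result means no keyword matched; more than
    one result means the transcript is ambiguous and would be handed to
    the embedding-similarity step.
    """
    text = transcript.lower()
    best_len = 0
    best: list[str] = []
    for intent, phrases in INTENTS.items():
        for phrase in phrases:
            # Assumed rule: the phrase must not be embedded in a longer word.
            pattern = r"(?<!\w)" + re.escape(phrase) + r"(?!\w)"
            if not re.search(pattern, text):
                continue
            if len(phrase) > best_len:
                best_len, best = len(phrase), [intent]
            elif len(phrase) == best_len and intent not in best:
                best.append(intent)
    return best
```

Preferring the longest matching phrase is one possible way to keep the fast path deterministic; any transcript that still yields two or more intents is exactly the "ambiguous" case that step 2 is meant to handle.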

2. Voice API Endpoint:

POST /voice/command
{
    "transcript": "spawn a pane for Spore work",
    "session_id": "abc123",
    "device": "iphone",
    "speaker_embedding": [0.12, -0.34, ...]  // optional speaker ID
}

Response:
{
    "intent": "fleet.spawn",
    "target": "Spore",
    "action": "spawn_pane",
    "spoken_response": "Spawning a pane for Spore.",
    "requires_confirmation": false
}
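
A server-side handler for this endpoint might validate the request and assemble the response along these lines. The sketch is framework-agnostic and makes several assumptions not stated in the source: the names `Intent`, `validate_request`, and `build_response` are hypothetical, `transcript`, `session_id`, and `device` are treated as required, and `requires_confirmation` is assumed to mirror the intent's `is_destructive` flag.

```python
import json
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Intent:
    """Hypothetical shape for the classifier's result."""
    name: str
    target: Optional[str]
    action: str
    params: dict = field(default_factory=dict)
    is_destructive: bool = False


# Assumption: speaker_embedding is optional, the rest are required.
REQUIRED_FIELDS = ("transcript", "session_id", "device")


def validate_request(body: str) -> dict:
    """Parse the JSON body and reject requests missing required fields."""
    payload = json.loads(body)
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError("missing fields: " + ", ".join(missing))
    return payload


def build_response(intent: Intent, spoken_response: str) -> dict:
    """Map a classified Intent onto the response schema shown above."""
    return {
        "intent": intent.name,
        "target": intent.target,
        "action": intent.action,
        "spoken_response": spoken_response,
        # Assumption: destructive intents always ask for confirmation.
        "requires_confirmation": intent.is_destructive,
    }
```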

3. Client Updates:
- iOS OpenClawHub: Replace VoiceRouter.classify() + FleetVoiceRouter.classify() with POST to /voice/command
- SpeakFlow macOS: Replace VoiceCommandService local classification with POST
- Mac Ear Daemon (Path A): Use /voice/command directly
- Benefit: fix a classification bug once and every device gets the update

4. Context Tracking:

python
from dataclasses import dataclass


@dataclass
class VoiceContext:
    session_id: str
    last_target: str | None  # Pronoun resolution: "it" → last_target
    last_absorbing: list[str]  # "unstick it" → most recent absorbing pane
    active_conversation: bool  # Are we in a multi-turn voice dialog?
    device: str  # "iphone" | "mac" | "glasses"
    speaker_verified: bool  # Has speaker ID passed?
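
Pronoun resolution against this context could look something like the sketch below. It uses a trimmed copy of `VoiceContext` with only the two fields it needs. The `resolve_target` helper, the pronoun set, the `"unstick"` action name, and the assumption that `last_absorbing` is ordered oldest-first are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

# Assumed pronoun set; a real router would likely need more forms.
PRONOUNS = {"it", "that", "this"}


@dataclass
class VoiceContext:
    """Trimmed copy of the context above, for illustration only."""
    last_target: Optional[str] = None
    last_absorbing: list[str] = field(default_factory=list)


def resolve_target(spoken: Optional[str], action: str,
                   ctx: VoiceContext) -> Optional[str]:
    """Resolve an explicit name or a pronoun to a concrete target.

    Explicit names win. For a pronoun (or no target at all), "unstick"
    prefers the most recent absorbing pane; everything else falls back
    to ctx.last_target. The resolved target is remembered for next turn.
    """
    if spoken is not None and spoken.lower() not in PRONOUNS:
        resolved: Optional[str] = spoken
    elif action == "unstick" and ctx.last_absorbing:
        # Assumption: the list is ordered oldest-first.
        resolved = ctx.last_absorbing[-1]
    else:
        resolved = ctx.last_target
    if resolved is not None:
        ctx.last_target = resolved
    return resolved
```

Because the context lives on the server and is keyed by session rather than by device, a target named on the iPhone can still be resolved when "it" is spoken on the Mac in the same session.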

What This Solves

  • One intent classifier for all devices — consistent behavior everywhere
  • Pronoun resolution and context tracking work across device switches
  • New intents added once, available everywhere
  • Server-side classification can use more powerful models (embedding similarity, LLM fallback)
  • Device-agnostic: any device that can capture audio and play TTS can be a voice client

What This Risks

  • Network dependency: voice classification fails when the server is unreachable (mitigate: local fallback on each device)
  • Latency: network round-trip adds ~50-200ms to classification (acceptable for voice UX)
  • Single point of failure (mitigate: local fallback + NUMU event for "voice router offline")
  • Migrating iOS code to the server means maintaining two codebases during the transition
  • Destructive command confirmation needs haptic feedback on device, not just server response
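
The local-fallback mitigation named in the first and third risks can be illustrated with a small client wrapper. The real clients are iOS and macOS apps, so this Python version only shows the control flow. The names `post_json` and `classify_with_fallback`, the 0.5-second timeout, and the added `source` field are assumptions made for the sketch.

```python
import json
import urllib.request
from typing import Callable


def post_json(url: str, payload: dict, timeout: float) -> dict:
    """POST a JSON payload and decode the JSON reply (stdlib only)."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def classify_with_fallback(
    payload: dict,
    url: str,
    local_fallback: Callable[[str], dict],
    post: Callable[[str, dict, float], dict] = post_json,
    timeout: float = 0.5,  # assumed budget, not a figure from the source
) -> dict:
    """Try the server first; fall back to an on-device classifier.

    URLError, HTTPError, and socket timeouts are all OSError subclasses;
    a malformed JSON reply raises ValueError. Either triggers the fallback.
    """
    try:
        result = post(url, payload, timeout)
        result["source"] = "server"  # hypothetical field for diagnostics
    except (OSError, ValueError):
        result = local_fallback(payload["transcript"])
        result["source"] = "local"
    return result
```

Tagging each result with where it was classified would let a client announce that the voice router is offline, or skip destructive intents when only the local fallback is available.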

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage1-path-c.md
