Grand Diomande Research · Full HTML Reader

Stage 1 Path F: The Voice Mesh Protocol — Every Node Gets a Mouth and an Ear

> Grounded in: Stage 0 finding that 4 Mac machines exist in the mesh but only phones have voice I/O. The mesh nodes are deaf and mute.

Agents That Account for Themselves architecture technical paper candidate score 22 .md

Core Thesis

Don't centralize voice. Distribute it. Every node in the mesh (Mac1, Mac4, Mac5, cloud-vm) that has audio hardware gets a microphone daemon (ear) and a TTS daemon (mouth). Inter-node voice travels over the existing NUMU/Mesh Event Bus. The mesh doesn't need a voice server; it needs voice as a native capability of every node, like networking.

The Mechanism

1. Voice Node Daemon (per-machine):

python
# Runs on every Mac with audio hardware.
# load_local_whisper, LocalTTS, NumuClient, and NumuEvent are project-level
# names assumed to be defined elsewhere in the mesh codebase.
class VoiceNode:
    def __init__(self, machine_id: str):
        self.machine_id = machine_id  # "mac1", "mac4", "mac5"
        self.whisper = load_local_whisper()  # mlx-whisper or whisper.cpp
        self.tts = LocalTTS()  # macOS say + ElevenLabs for critical
        self.numu = NumuClient()  # NUMU WebSocket for events

    async def listen(self):
        """Always-on mic with VAD + wake word."""
        while True:
            # Audio capture → VAD → wake detection → Whisper transcription.
            # capture_command() is a placeholder for that pipeline; it is
            # not defined in this note.
            transcript, speaker_embedding = await self.capture_command()
            # On command: emit to NUMU
            await self.numu.emit("voice.command", {
                "machine": self.machine_id,
                "transcript": transcript,
                "speaker_id": speaker_embedding,
            })

    async def speak(self, event: NumuEvent):
        """Receive voice.speak events and play audio."""
        # Speak only when addressed directly or via the "all" broadcast target.
        if event.data["target"] in (self.machine_id, "all"):
            await self.tts.speak(event.data["text"], priority=event.data.get("priority"))

2. Cross-Node Voice Routing:

Mac1 mic → "Hey Claw, what's Mac4 doing?"
    → voice.command event on NUMU
    → voice_router (cloud-vm or Mac1) classifies intent: fleet.status(mac4)
    → query mac4 status
    → voice.speak event to Mac1: "Mac4 is running 3 exo models with 78% GPU utilization"
    → Mac1 TTS speaks the response
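
The routing flow above can be sketched as a small classifier plus a reply builder. This is a minimal sketch under stated assumptions, not the project's router: the regex, the `Intent` dataclass, `KNOWN_NODES`, `classify`, `route`, and the `status_lookup` callback are hypothetical names introduced for illustration. Only the `voice.command` and `voice.speak` event names and the `fleet.status` intent come from the text.

```python
import re
from dataclasses import dataclass
from typing import Callable, Optional

# Node names from the note; treating them as a fixed tuple is an assumption.
KNOWN_NODES = ("mac1", "mac4", "mac5", "cloud-vm")


@dataclass(frozen=True)
class Intent:
    name: str
    target: Optional[str] = None


# Hypothetical pattern covering only "what's <node> doing" style questions.
_STATUS_PATTERN = re.compile(
    r"what(?:'s| is)\s+(?P<node>[\w-]+)\s+doing", re.IGNORECASE
)


def classify(transcript: str) -> Intent:
    """Map a transcript to an intent; anything unrecognized is 'unknown'."""
    match = _STATUS_PATTERN.search(transcript)
    if match and match.group("node").lower() in KNOWN_NODES:
        return Intent("fleet.status", match.group("node").lower())
    return Intent("unknown")


def route(command: dict, status_lookup: Callable[[str], str]) -> dict:
    """Turn a voice.command payload into a voice.speak event for the asker."""
    intent = classify(command["transcript"])
    if intent.name == "fleet.status":
        text = status_lookup(intent.target)
    else:
        text = "Sorry, I did not catch that."
    return {
        "type": "voice.speak",
        "data": {"target": command["machine"], "text": text},
    }
```

The reply is addressed to `command["machine"]`, so the answer is spoken on the node where the question was asked. A production router would likely replace the regex with a proper intent model.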

3. Voice Relay Between Machines:

Mac1: "Tell Mac4 to start fine-tuning the N'Ko model"
    → voice.command → router → voice.relay event to Mac4
    → Mac4 receives relay, speaks it locally: "Mac1 requests: start N'Ko fine-tune"
    → Mac4 executes → voice.speak to Mac1: "Fine-tune started on Mac4"
    → Mac1 TTS: "Fine-tune started on Mac4"
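
The relay hop above can be simulated with an in-memory stand-in for the event bus. This is a sketch only: `InMemoryBus`, `make_node`, the `source` and `message` payload fields, and the acknowledgement wording are assumptions made for illustration. The real NUMU client and the `voice.relay` payload shape are not specified in this note.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class InMemoryBus:
    """Synchronous stand-in for the NUMU event bus (hypothetical)."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Callable[[dict], None]]] = defaultdict(list)

    def on(self, event_type: str, handler: Callable[[dict], None]) -> None:
        self._handlers[event_type].append(handler)

    def emit(self, event_type: str, data: dict) -> None:
        # Copy the handler list so handlers may register others while we iterate.
        for handler in list(self._handlers[event_type]):
            handler(data)


def make_node(bus: InMemoryBus, machine_id: str, spoken: List[Tuple[str, str]]) -> None:
    """Register relay and speak handlers; 'spoken' records what TTS would say."""

    def on_relay(data: dict) -> None:
        if data["target"] != machine_id:
            return
        # Announce the relayed request locally.
        spoken.append((machine_id, f"{data['source'].capitalize()} requests: {data['message']}"))
        # Executing the request is out of scope; just acknowledge to the sender.
        bus.emit("voice.speak", {
            "target": data["source"],
            "text": f"Relay delivered on {machine_id.capitalize()}",
        })

    def on_speak(data: dict) -> None:
        if data["target"] in (machine_id, "all"):
            spoken.append((machine_id, data["text"]))

    bus.on("voice.relay", on_relay)
    bus.on("voice.speak", on_speak)
```

Wiring two nodes with `make_node(bus, "mac1", spoken)` and `make_node(bus, "mac4", spoken)` and emitting one `voice.relay` event from Mac1 to Mac4 produces the local announcement on Mac4 followed by the acknowledgement on Mac1.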

4. Spatial Voice (multi-room):
If machines are in different rooms/locations:
- Mac1 (office desk) = primary interaction point
- Mac4 (server rack area) = status announcements only
- Mac5 (beside Mac4) = paired with Mac4 for compute status
- Each machine's TTS volume and announcement frequency are tuned to its physical context
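
The placement rules above could be captured as per-node profiles. The following is a minimal sketch: `VoiceProfile`, `PROFILES`, `should_speak`, the role names, and the volume values are placeholders chosen for illustration, not measured or configured settings from the mesh.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    location: str
    role: str      # "interactive" or "status_only" (hypothetical role names)
    volume: float  # 0.0 to 1.0; placeholder values, not tuned settings


PROFILES = {
    "mac1": VoiceProfile("office desk", "interactive", 0.6),
    "mac4": VoiceProfile("server rack area", "status_only", 0.3),
    "mac5": VoiceProfile("beside Mac4", "status_only", 0.3),
}


def should_speak(machine_id: str, kind: str) -> bool:
    """Decide whether a node should voice a message of the given kind.

    'kind' is an assumed label such as "status" or "reply".
    Nodes without a profile stay silent.
    """
    profile = PROFILES.get(machine_id)
    if profile is None:
        return False
    if profile.role == "interactive":
        return True
    return kind == "status"
```

With these profiles, `should_speak("mac4", "reply")` is false while `should_speak("mac4", "status")` is true, which matches the "status announcements only" rule for the rack-area machines.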

5. Voice Mesh Events (new NUMU event types):

voice.command    — a voice command was detected on a node
voice.speak      — request a node to speak text aloud
voice.relay      — forward a voice message between nodes
voice.mute       — suppress voice output on a node
voice.unmute     — resume voice output
voice.status     — report voice daemon health
voice.transcript — a new transcript segment (for persistence)
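
One way to keep these event names consistent across nodes is a shared enum with a light payload check. This is a sketch under assumptions: `VoiceEvent`, `REQUIRED_FIELDS`, and `missing_fields` are hypothetical names, and the required fields are inferred only from the `voice.command` and `voice.speak` payloads shown in the daemon code. The other event payloads are not specified in this note and are left unchecked here.

```python
from enum import Enum
from typing import Dict, List, Set


class VoiceEvent(str, Enum):
    COMMAND = "voice.command"
    SPEAK = "voice.speak"
    RELAY = "voice.relay"
    MUTE = "voice.mute"
    UNMUTE = "voice.unmute"
    STATUS = "voice.status"
    TRANSCRIPT = "voice.transcript"


# Only the two payloads shown in the daemon code are constrained (assumption).
REQUIRED_FIELDS: Dict[VoiceEvent, Set[str]] = {
    VoiceEvent.COMMAND: {"machine", "transcript"},
    VoiceEvent.SPEAK: {"target", "text"},
}


def missing_fields(event_type: str, data: dict) -> List[str]:
    """Return required fields absent from the payload.

    Raises ValueError if event_type is not one of the voice mesh events.
    """
    event = VoiceEvent(event_type)
    return sorted(REQUIRED_FIELDS.get(event, set()) - set(data))
```

Rejecting unknown event names at the edge means a typo such as `voice.speek` fails loudly on the sender instead of being silently ignored by every listener.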

What This Solves

  • Every machine becomes voice-interactive, not just the phone
  • Cross-machine voice relay enables "talk to Mac4 through Mac1"
  • Distributed voice means no single point of failure for capture and playback (cross-node commands still depend on wherever the voice_router runs)
  • Each node's voice daemon is independent — works offline for local commands
  • Builds on existing NUMU event bus and mesh architecture
  • Physical presence at any machine = voice access to entire mesh

What This Risks

  • Mac4 and Mac5 may not have microphones (headless compute nodes)
  • Audio hardware management across 4+ machines is complex
  • Echo/feedback loops if two machines are near each other (Mac4+Mac5 side by side)
  • Whisper model on every node = memory overhead per machine
  • NUMU event bus latency for voice relay (~50-200ms) adds to response time
  • No authentication for voice commands — anyone near a mic can issue commands (mitigate: speaker ID)
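
The speaker ID mitigation named in the last risk could be sketched as an embedding comparison against enrolled voices. This is an illustrative sketch, not a vetted security control: `cosine_similarity`, `is_authorized`, the `enrolled` mapping, and the 0.75 threshold are assumptions, and how the daemon produces `speaker_embedding` is not specified in this note.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_authorized(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.75,  # placeholder; a real value needs calibration
) -> bool:
    """Accept a command only if the speaker is close to an enrolled voice."""
    return any(
        cosine_similarity(embedding, reference) >= threshold
        for reference in enrolled.values()
    )
```

A router could call `is_authorized(event_data["speaker_id"], enrolled)` before acting on a `voice.command` and drop or challenge commands that fail the check. Speaker embeddings can be spoofed by recordings, so this narrows the risk without removing it.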

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage1-path-f.md
