Stage 1 Path F: The Voice Mesh Protocol — Every Node Gets a Mouth and an Ear
> Grounded in: Stage 0 finding that 4 Mac machines exist in the mesh but only phones have voice I/O. The mesh nodes are deaf and mute.
Core Thesis
Don't centralize voice. Distribute it. Every node in the mesh (Mac1, Mac4, Mac5, cloud-vm) gets a microphone daemon (ear) and a TTS daemon (mouth). Inter-node voice happens over the existing NUMU/Mesh Event Bus. The mesh doesn't need a voice server — it needs voice as a native capability of every node, like networking.
The Mechanism
1. Voice Node Daemon (per-machine):

```python
# Runs on every Mac with audio hardware.
# load_local_whisper, LocalTTS, NumuClient, and NumuEvent are project
# components that are not defined in this note.
class VoiceNode:
    def __init__(self, machine_id: str):
        self.machine_id = machine_id         # "mac1", "mac4", "mac5"
        self.whisper = load_local_whisper()  # mlx-whisper or whisper.cpp
        self.tts = LocalTTS()                # macOS say + ElevenLabs for critical
        self.numu = NumuClient()             # NUMU WebSocket for events

    async def listen(self):
        """Always-on mic with VAD + wake word."""
        # Audio capture → VAD → wake detection → Whisper transcription.
        # _next_command is a hypothetical helper wrapping that pipeline.
        transcript, speaker_embedding = await self._next_command()
        # On command: emit to NUMU
        await self.numu.emit("voice.command", {
            "machine": self.machine_id,
            "transcript": transcript,
            "speaker_id": speaker_embedding,
        })

    async def speak(self, event: NumuEvent):
        """Receive voice.speak events and play audio."""
        if event.data["target"] in (self.machine_id, "all"):
            await self.tts.speak(event.data["text"], priority=event.data.get("priority"))
```

2. Cross-Node Voice Routing:

```
Mac1 mic → "Hey Claw, what's Mac4 doing?"
  → voice.command event on NUMU
  → voice_router (cloud-vm or Mac1) classifies intent: fleet.status(mac4)
  → query mac4 status
  → voice.speak event to Mac1: "Mac4 is running 3 exo models with 78% GPU utilization"
  → Mac1 TTS speaks the response
```

3. Voice Relay Between Machines:

```
Mac1: "Tell Mac4 to start fine-tuning the N'Ko model"
  → voice.command → router → voice.relay event to Mac4
  → Mac4 receives relay, speaks it locally: "Mac1 requests: start N'Ko fine-tune"
  → Mac4 executes → voice.speak to Mac1: "Fine-tune started on Mac4"
  → Mac1 TTS: "Fine-tune started on Mac4"
```

4. Spatial Voice (multi-room):
If machines are in different rooms/locations:
- Mac1 (office desk) = primary interaction point
- Mac4 (server rack area) = status announcements only
- Mac5 (beside Mac4) = paired with Mac4 for compute status
- Each machine's TTS volume and frequency tuned to its physical context
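
The placement list above could be captured as a per-node profile that a voice daemon consults before speaking. This is a minimal sketch, not part of the proposal itself: the `VoiceProfile` fields, the role names, the volume levels, and the 60-second announcement gap are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceProfile:
    """Hypothetical per-node settings; field names are assumptions."""
    location: str
    role: str         # "interactive" or "status_only"
    volume: float     # 0.0 to 1.0, placeholder TTS gain
    min_gap_s: float  # minimum seconds between spoken announcements


# Placeholder values chosen only to illustrate the idea.
PROFILES = {
    "mac1": VoiceProfile("office desk", "interactive", 0.6, 0.0),
    "mac4": VoiceProfile("server rack area", "status_only", 0.9, 60.0),
    "mac5": VoiceProfile("beside mac4", "status_only", 0.9, 60.0),
}


def should_speak(machine_id: str, kind: str, seconds_since_last: float) -> bool:
    """Decide whether a node may speak a message of the given kind right now."""
    profile = PROFILES[machine_id]
    # Status-only nodes never speak conversational replies.
    if profile.role == "status_only" and kind != "status":
        return False
    # Rate-limit announcements according to the node's physical context.
    return seconds_since_last >= profile.min_gap_s
```

With these placeholder profiles, Mac1 answers conversational replies immediately, while Mac4 and Mac5 stay silent except for status messages spaced at least a minute apart.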
5. Voice Mesh Events (new NUMU event types):
- voice.command: a voice command was detected on a node
- voice.speak: request a node to speak text aloud
- voice.relay: forward a voice message between nodes
- voice.mute: suppress voice output on a node
- voice.unmute: resume voice output
- voice.status: report voice daemon health
- voice.transcript: a new transcript segment (for persistence)

What This Solves
- Every machine becomes voice-interactive, not just the phone
- Cross-machine voice relay enables "talk to Mac4 through Mac1"
- Distributed voice means no single point of failure
- Each node's voice daemon is independent — works offline for local commands
- Builds on existing NUMU event bus and mesh architecture
- Physical presence at any machine = voice access to entire mesh
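
The routing and relay flows in the Mechanism section can be sketched as a function that maps a `voice.command` payload to a follow-up event. The note does not say how the router classifies intent, so the regular expressions below are a deliberately naive stand-in, and the `reply_to` field and the `KNOWN_NODES` set are assumptions made for illustration.

```python
import re
from typing import Optional

# Assumed set of addressable nodes, taken from the machine names in this note.
KNOWN_NODES = {"mac1", "mac4", "mac5"}

# Naive patterns standing in for a real intent classifier.
STATUS_RE = re.compile(r"what(?:'s| is) (\w+) doing", re.IGNORECASE)
RELAY_RE = re.compile(r"tell (\w+) to (.+)", re.IGNORECASE)


def route(command: dict) -> Optional[dict]:
    """Map a voice.command payload to a follow-up event, or None if unrecognized."""
    source = command["machine"]
    text = command["transcript"]

    match = STATUS_RE.search(text)
    if match and match.group(1).lower() in KNOWN_NODES:
        # Status query: the answer should be spoken back on the source node.
        return {
            "type": "fleet.status",
            "target": match.group(1).lower(),
            "reply_to": source,
        }

    match = RELAY_RE.search(text)
    if match and match.group(1).lower() in KNOWN_NODES:
        # Relay: the target node announces the request locally.
        return {
            "type": "voice.relay",
            "target": match.group(1).lower(),
            "text": f"{source.capitalize()} requests: {match.group(2).strip()}",
            "reply_to": source,
        }

    return None
```

Keeping the originating node in the event lets the target send its `voice.speak` confirmation back to wherever the command was spoken, which is what makes "talk to Mac4 through Mac1" work.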
What This Risks
- Mac4 and Mac5 may not have microphones (headless compute nodes)
- Audio hardware management across 4+ machines is complex
- Echo/feedback loops if two machines are near each other (Mac4+Mac5 side by side)
- Whisper model on every node = memory overhead per machine
- NUMU event bus latency for voice relay (~50-200ms) adds to response time
- No authentication for voice commands — anyone near a mic can issue commands (mitigate: speaker ID)
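
The speaker ID mitigation mentioned in the last risk could take the form of a similarity check between the `speaker_id` embedding on a `voice.command` event and a set of enrolled embeddings. This is a sketch under stated assumptions: the note specifies neither the embedding model nor a threshold, so the 0.75 cutoff and the `enrolled` mapping are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_authorized(
    embedding: Sequence[float],
    enrolled: Dict[str, Sequence[float]],
    threshold: float = 0.75,  # assumed cutoff; would need tuning on real data
) -> bool:
    """Accept a command only if the speaker resembles an enrolled voice."""
    return any(
        cosine_similarity(embedding, reference) >= threshold
        for reference in enrolled.values()
    )
```

A router could run this check before acting on any `voice.command` and drop or escalate events that fail it.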
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
evo-cube-output/voice-first-agent-architecture/stage1-path-f.md