Stage 1 Path B: The Spoken Mesh — Agents That Talk Back
> Grounded in: Stage 0 finding that agents return results as text to Discord/terminal. No mesh-level TTS exists. The mesh is mute.
Core Thesis
A voice-first architecture isn't just about listening — it's about speaking. When a Prefect flow completes, when a pane finishes a pulse task, when Evolution World triggers a mutation, when a security alert fires — these events should be spoken aloud. Not all of them. The critical ones. The mesh should have a voice.
The Mechanism
1. Event-to-Speech Bridge:
    import asyncio

    # NumuEvent, elevenlabs_tts, and play_audio are assumed to be provided
    # elsewhere in the mesh codebase.

    class MeshVoice:
        """Subscribes to NUMU events and speaks important ones."""

        # Event type -> spoken template; placeholders are filled from event.data.
        VOICE_EVENTS = {
            "pulse.complete": "Pulse {task} completed on {pane}.",
            "pane.spawn": "New pane spawned for {project}.",
            "pane.absorbing": "Warning: {pane} is absorbing. Unstick recommended.",
            "security.alert": "Security alert: {message}.",
            "build.success": "{app} build succeeded. Ready for upload.",
            "build.failure": "{app} build failed: {error}.",
            "ew.mutation": "Evolution World mutated {target}.",
            "flow.error": "Prefect flow {name} failed: {error}.",
            "creator_shield.escalation": "Creator Shield escalation: {severity}.",
        }

        async def on_event(self, event: NumuEvent):
            template = self.VOICE_EVENTS.get(event.type)
            if template:
                text = template.format(**event.data)
                await self.speak(text, priority=event.priority)

        async def speak(self, text: str, priority: str = "normal"):
            if priority == "critical":
                # ElevenLabs for critical (higher quality, slight latency)
                audio = await elevenlabs_tts(text)
                play_audio(audio)
            else:
                # macOS say for routine (instant, no API call).
                # Run it as an async subprocess so the event loop is not blocked.
                proc = await asyncio.create_subprocess_exec("say", "-v", "Samantha", text)
                await proc.wait()

2. Priority and Suppression:
- Critical events (security, build failure, absorbing panes) always spoken
- Normal events spoken only when no active voice conversation
- Suppress during "focus mode" (user says "quiet" or "mute mesh")
- Rate limiting: max 1 spoken event per 30 seconds for non-critical
- Queue events during suppression, summarize when un-muted: "While you were focused, 3 builds completed and 1 flow failed"
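The rules above can be sketched as a small gate that sits in front of `speak`. This is a minimal illustration, not the mesh's actual implementation: the `SpeechGate` class, its attribute names, and the injectable clock are all hypothetical, and it assumes that rate-limited events are queued for the summary rather than dropped.

```python
import time
from collections import Counter
from typing import Callable, Optional


class SpeechGate:
    """Decides whether an event is spoken now or queued for a later summary."""

    def __init__(self, min_gap_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic):
        self.min_gap_s = min_gap_s      # rate limit for non-critical events
        self.clock = clock              # injectable so the gate can be tested
        self.muted = False              # focus mode ("quiet" / "mute mesh")
        self.in_conversation = False    # an active voice conversation
        self._last_spoken: Optional[float] = None
        self._queued: Counter = Counter()

    def admit(self, event_type: str, priority: str = "normal") -> bool:
        """Return True if the event should be spoken immediately."""
        if priority == "critical":
            return True                 # critical events bypass every rule
        now = self.clock()
        rate_limited = (self._last_spoken is not None
                        and now - self._last_spoken < self.min_gap_s)
        if self.muted or self.in_conversation or rate_limited:
            self._queued[event_type] += 1   # assumption: queue, do not drop
            return False
        self._last_spoken = now
        return True

    def unmute(self) -> Optional[str]:
        """Leave focus mode and return a summary of queued events, if any."""
        self.muted = False
        if not self._queued:
            return None
        parts = [f"{n} {etype}" for etype, n in sorted(self._queued.items())]
        self._queued.clear()
        return "While you were focused: " + ", ".join(parts) + "."
```

In this sketch `on_event` would call `gate.admit(event.type, event.priority)` before `speak`, and the string returned by `unmute()` would itself be spoken. Turning raw event types into friendlier phrases ("3 builds completed") is left out.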
3. Voice Personality:
- ElevenLabs voice ID already configured: `TmSgyk1vGAD9YzdtJV3V`
- Consistent voice across all mesh announcements
- Tone varies by severity: neutral for status, urgent cadence for alerts
- Optional: different voices for different subsystems (infra = deep voice, creative = lighter)
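For the local `say` path, one way to vary cadence by severity is the `-r` (words per minute) option. The sketch below only builds the command line; the `say_command` helper, the severity names, and the rate values are illustrative assumptions to be tuned by ear, not settings taken from the mesh.

```python
from typing import List

# Assumed speech rates in words per minute; faster delivery for alerts.
SAY_RATE_WPM = {"info": 175, "warning": 190, "critical": 210}


def say_command(text: str, severity: str = "info",
                voice: str = "Samantha") -> List[str]:
    """Build the argv for macOS `say` with a severity-dependent rate."""
    # Unknown severities fall back to the neutral "info" rate.
    rate = SAY_RATE_WPM.get(severity, SAY_RATE_WPM["info"])
    return ["say", "-v", voice, "-r", str(rate), text]
```

Per-subsystem voices would fit the same shape: a second lookup table from subsystem name to voice name, passed in as `voice`.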
4. Spatial Audio (future):
- Mac speakers can do stereo positioning
- Left speaker = build/deploy events, right speaker = creative/content events
- Or: volume correlates with priority (whisper for info, full volume for critical)
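Both ideas reduce to choosing a left and right channel gain per announcement. The following is a rough sketch using a constant-power pan law; the category and priority tables are hypothetical, and combining position with volume (the text presents them as alternatives) is an assumption made here for illustration.

```python
import math
from typing import Tuple

# Hypothetical mapping: -1.0 is hard left, +1.0 is hard right, 0.0 is centre.
PAN_BY_CATEGORY = {"build": -1.0, "deploy": -1.0, "creative": 1.0, "content": 1.0}
# Hypothetical volume scale: whisper for info, full volume for critical.
VOLUME_BY_PRIORITY = {"info": 0.2, "normal": 0.6, "critical": 1.0}


def stereo_gains(category: str, priority: str) -> Tuple[float, float]:
    """Return (left_gain, right_gain) for one announcement."""
    pan = PAN_BY_CATEGORY.get(category, 0.0)        # unknown categories: centre
    volume = VOLUME_BY_PRIORITY.get(priority, 0.6)  # unknown priorities: normal
    # Constant-power pan: map pan in [-1, 1] to an angle in [0, pi/2] so that
    # left^2 + right^2 stays equal to volume^2 at every position.
    angle = (pan + 1.0) * math.pi / 4.0
    return volume * math.cos(angle), volume * math.sin(angle)
```

The resulting gains would be applied to the mono TTS samples when writing a stereo buffer; the playback layer itself is out of scope for this sketch.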
What This Solves
- Ambient awareness without checking Discord/terminal
- Critical alerts reach you even when not looking at a screen
- Summarized status updates reduce context-switching
- The mesh "personality" becomes tangible — it has a voice
- Builds on existing ElevenLabs integration (voice ID, API key configured)
What This Risks
- Annoying if poorly tuned — constant spoken interruptions are worse than silence
- ElevenLabs API calls add cost (~$0.30 per 1,000 characters; restricting ElevenLabs to critical events mitigates this)
- System `say` voice quality is mediocre for non-English text
- Audio output conflicts with music, calls, video editing
- Privacy: spoken agent output audible to anyone nearby
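The cost risk can be bounded with simple arithmetic on the approximate per-character rate quoted above. The helper below is a back-of-the-envelope sketch; the function name is hypothetical, and real billing may differ by plan and model.

```python
from typing import Iterable

COST_PER_1000_CHARS_USD = 0.30  # approximate rate quoted in the risk list


def estimate_tts_cost(messages: Iterable[str],
                      rate_per_1000: float = COST_PER_1000_CHARS_USD) -> float:
    """Estimate the ElevenLabs spend in USD for a batch of spoken messages."""
    total_chars = sum(len(m) for m in messages)
    return total_chars / 1000 * rate_per_1000
```

As an illustration with made-up volumes: 20 critical alerts of 60 characters each is 1,200 characters, or about $0.36 at the quoted rate.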
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
evo-cube-output/voice-first-agent-architecture/stage1-path-b.md