Grand Diomande Research · Full HTML Reader

Stage 1 Path A: The Mac Ear — Always-On Desktop Microphone Daemon

> Grounded in: Stage 0 finding that the Mac has no microphone listener. You must pick up the phone to talk to the mesh.

Core Thesis

The single highest-impact addition is a daemon on Mac1 that listens to the built-in microphone, detects a wake phrase, transcribes the command, and injects it into the mesh. This eliminates the phone dependency for voice interaction. The Mac is always on, always in front of you, always connected to the mesh.

The Mechanism

Architecture:

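```python
# [home-path] (LaunchAgent)
import asyncio

import numpy as np
import whisper  # openai-whisper; assumed backend for the original load_whisper()


def classify_intent(transcript: str):
    """Placeholder: port FleetVoiceRouter logic to Python."""
    raise NotImplementedError


async def dispatch(intent) -> None:
    """Placeholder: NUMU event OR direct pane injection OR Clawdbot."""
    raise NotImplementedError


class MacEarDaemon:
    def __init__(self):
        self.wake_phrases = ["hey claw", "hey claude", "securiclaw"]
        self.whisper_model = whisper.load_model("base.en")  # or mlx-whisper on Mac4
        self.state = "listening"  # listening | processing | speaking

    async def listen_loop(self):
        """Continuous audio capture with VAD."""
        # sounddevice or pyaudio → 16 kHz mono PCM
        # VAD: WebRTC VAD or simple energy threshold
        # On speech detected: buffer audio
        # On silence (1.5 s): check for wake phrase in buffer
        # If wake phrase found: extract command portion
        # Send command to voice_task_daemon pipeline

    async def process_command(self, audio: bytes):
        """Whisper transcription → intent classification → dispatch."""
        # Whisper expects float32 samples in [-1, 1]; this assumes 16-bit PCM input.
        samples = np.frombuffer(audio, dtype=np.int16).astype(np.float32) / 32768.0
        # transcribe() blocks and returns a dict, so run it off the event loop.
        result = await asyncio.to_thread(self.whisper_model.transcribe, samples)
        intent = classify_intent(result["text"].strip())  # Port FleetVoiceRouter logic to Python
        await dispatch(intent)  # → NUMU event OR direct pane injection OR Clawdbot

    async def speak_response(self, text: str):
        """ElevenLabs or system say for response."""
        # elevenlabs.generate(text, voice="TmSgyk1vGAD9YzdtJV3V") → play
        # OR: subprocess.run(["say", text])
```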
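
The buffering steps described in `listen_loop` (energy-threshold VAD, close the segment after 1.5 s of silence) can be sketched with the standard library alone. This is a minimal illustration, not the daemon's actual capture code: the 30 ms frame size, the RMS threshold of 500, and the names `frame_rms` and `segment_utterances` are assumptions, and the input is assumed to be 16 kHz mono 16-bit PCM frames from whatever capture library is chosen.

```python
import array
import math
from typing import Iterable, Iterator

SAMPLE_RATE = 16_000                              # 16 kHz mono, as in listen_loop
FRAME_MS = 30                                     # assumed frame length
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000    # 480 samples per frame
SILENCE_MS = 1500                                 # 1.5 s of silence ends a segment


def frame_rms(frame: bytes) -> float:
    """Root-mean-square energy of one frame of native-endian int16 PCM."""
    samples = array.array("h", frame)
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def segment_utterances(
    frames: Iterable[bytes],
    threshold: float = 500.0,                     # assumed; tune per microphone
    silence_ms: int = SILENCE_MS,
    frame_ms: int = FRAME_MS,
) -> Iterator[bytes]:
    """Yield one PCM buffer per utterance, closed after `silence_ms` of quiet."""
    max_silent_frames = silence_ms // frame_ms
    buffer = bytearray()
    silent_frames = 0
    in_speech = False
    for frame in frames:
        if frame_rms(frame) >= threshold:
            # Speech: start or continue buffering and reset the silence counter.
            in_speech = True
            silent_frames = 0
            buffer.extend(frame)
        elif in_speech:
            # Quiet frame inside an utterance: keep it, count toward the timeout.
            silent_frames += 1
            buffer.extend(frame)
            if silent_frames >= max_silent_frames:
                yield bytes(buffer)
                buffer.clear()
                silent_frames = 0
                in_speech = False
    if in_speech and buffer:
        # Stream ended mid-utterance: flush what was captured.
        yield bytes(buffer)
```

Each yielded buffer is a candidate segment for the wake-phrase check. A WebRTC VAD would replace only the `frame_rms(frame) >= threshold` test; the buffering logic stays the same.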

Wake word detection has two options:
1. Whisper-based (accurate but heavier): Run Whisper on every VAD-triggered segment, check transcript for wake phrase
2. Porcupine/OpenWakeWord (lightweight): Dedicated wake word model runs continuously, Whisper runs only after wake detection
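
For the Whisper-based option, the wake-phrase check and the "extract command portion" step reduce to string matching on the transcript. The sketch below is one way to do it under stated assumptions: the function name `extract_command` and the normalization rules (lowercase, punctuation stripped) are illustrative, and real transcripts may need fuzzy matching because Whisper can misspell phrases such as "securiclaw".

```python
import re
from typing import Optional, Sequence

WAKE_PHRASES = ("hey claw", "hey claude", "securiclaw")


def _normalize(text: str) -> str:
    """Lowercase, drop punctuation (apostrophes kept), collapse whitespace."""
    text = re.sub(r"[^a-z0-9'\s]", " ", text.lower())
    return " ".join(text.split())


def extract_command(
    transcript: str, wake_phrases: Sequence[str] = WAKE_PHRASES
) -> Optional[str]:
    """Return the text after the earliest wake phrase, or None if none is present.

    The returned command is normalized (lowercase, no punctuation). An empty
    string means the wake phrase was spoken with no command after it.
    """
    words = _normalize(transcript)
    earliest = None
    for phrase in wake_phrases:
        match = re.search(rf"\b{re.escape(phrase)}\b", words)
        if match and (earliest is None or match.start() < earliest.start()):
            earliest = match
    if earliest is None:
        return None
    return words[earliest.end():].strip()
```

Returning `None` versus an empty string lets the caller distinguish "ignore this segment" from "wake phrase heard, keep listening for the command".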

STT options on Mac1:
- `whisper.cpp` via CLI (~1s for 5s audio on M2)
- `mlx-whisper` (Apple Silicon optimized)
- Apple SFSpeechRecognizer via `speech_recognition` Python bridge
- Mac4/Mac5 remote Whisper (over exo cluster API)
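
As an illustration of the `whisper.cpp` CLI route, a captured PCM buffer has to be written to a 16 kHz mono WAV file before the binary can read it. The sketch below assumes 16-bit PCM input; the binary path and model path are placeholders supplied by the caller, and the flags (`-m` model, `-f` input file, `-nt` no timestamps) should be checked against the installed build.

```python
import subprocess
import wave
from pathlib import Path


def write_wav(pcm: bytes, path: Path, sample_rate: int = 16_000) -> None:
    """Write mono 16-bit PCM to a WAV file."""
    with wave.open(str(path), "wb") as wav:
        wav.setnchannels(1)        # mono
        wav.setsampwidth(2)        # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)


def whisper_cpp_command(binary: str, model: str, wav_path: Path) -> list[str]:
    """Build the argument list; binary and model locations are caller-supplied."""
    return [binary, "-m", model, "-f", str(wav_path), "-nt"]


def transcribe_with_whisper_cpp(
    pcm: bytes, binary: str, model: str, workdir: Path
) -> str:
    """Run the CLI on one utterance and return its stdout as the transcript."""
    wav_path = workdir / "utterance.wav"
    write_wav(pcm, wav_path)
    result = subprocess.run(
        whisper_cpp_command(binary, model, wav_path),
        capture_output=True,
        text=True,
        check=True,                # raise if the CLI exits non-zero
    )
    return result.stdout.strip()
```

Spawning a process per utterance keeps the daemon free of model state, at the cost of reloading the model on every command; a long-lived `mlx-whisper` or remote Whisper backend avoids that reload.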

What This Solves

  • Voice interaction from the desk without phone
  • Always-on while the user is logged in (a LaunchAgent starts at login and can be restarted automatically; it does not run after logout)
  • Routes into the existing voice_task_daemon → TTY injection pipeline
  • Can trigger any fleet command: status, spawn, kill, converge
  • Uses existing Whisper infrastructure on Mac4/Mac5
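
The LaunchAgent itself is a small property list. A minimal sketch of generating one with the standard library follows; the label, program arguments, and log path are hypothetical placeholders, since the source does not give the daemon's install path. `RunAtLoad` and `KeepAlive` are standard launchd keys: the first starts the job when the agent is loaded at login, the second restarts it if it exits.

```python
import plistlib


def build_launch_agent(label: str, program_args: list[str], log_path: str) -> bytes:
    """Return the XML bytes of a LaunchAgent plist for an always-running job."""
    job = {
        "Label": label,                     # unique job identifier
        "ProgramArguments": program_args,   # executable followed by its arguments
        "RunAtLoad": True,                  # start when the agent is loaded
        "KeepAlive": True,                  # restart the process if it exits
        "StandardOutPath": log_path,
        "StandardErrorPath": log_path,
    }
    return plistlib.dumps(job)


if __name__ == "__main__":
    # All three values below are placeholders, not the project's real paths.
    xml = build_launch_agent(
        "com.example.mac-ear",
        ["/usr/bin/python3", "/path/to/mac_ear_daemon.py"],
        "/tmp/mac-ear.log",
    )
    print(xml.decode())
```

The output would be saved under `~/Library/LaunchAgents/` and loaded with `launchctl`. Microphone access still has to be granted to the process through the macOS privacy settings.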

What This Risks

  • Mac1 microphone is always hot, which is a privacy concern (mitigated by local-only processing)
  • Apple SFSpeechRecognizer limits each recognition request to about one minute of audio (mitigated by the Whisper alternative)
  • Audio quality from the built-in Mac mic in a multi-display desk setup may be too low for reliable wake-phrase detection
  • CPU usage for continuous audio monitoring (VAD is cheap, Whisper is expensive, so run Whisper only after a wake detection)
  • Conflict with other audio apps that use the microphone (mitigated by opening the input device in shared, non-exclusive mode)
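
The CPU mitigation amounts to a two-stage gate: a cheap detector sees every segment, and the expensive transcriber runs only when the detector fires. The sketch below shows that control flow with injected callables. `WakeGate` and its attribute names are hypothetical, and the detector stands in for Porcupine, openWakeWord, or any other lightweight model.

```python
from typing import Callable, Optional


class WakeGate:
    """Run the expensive transcriber only on segments the cheap detector accepts."""

    def __init__(
        self,
        detect_wake: Callable[[bytes], bool],
        transcribe: Callable[[bytes], str],
    ) -> None:
        self.detect_wake = detect_wake
        self.transcribe = transcribe
        self.transcribe_calls = 0          # exposed so the cost can be monitored

    def handle(self, segment: bytes) -> Optional[str]:
        """Return a transcript for wake segments, None for everything else."""
        if not self.detect_wake(segment):
            return None                    # cheap path: the transcriber is never touched
        self.transcribe_calls += 1
        return self.transcribe(segment)
```

Counting transcriber invocations gives a direct measure of how often the expensive path is taken, which is the quantity this risk is about.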

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

evo-cube-output/voice-first-agent-architecture/stage1-path-a.md
