Grand Diomande Research · Full HTML Reader

Gemini Live Voice Control - High Accuracy DJ Commands

This uses **Google's Gemini Live API** for superior speech recognition accuracy compared to standard speech recognition libraries.

🎯 Why Gemini Live API?

Much Better Accuracy: Uses Google's advanced AI models
Real-time Streaming: Low-latency voice recognition
Context Understanding: Understands natural speech patterns
Voice Activity Detection: Built-in VAD to filter out background noise

📋 Requirements

1. Gemini API Key: Get one from [https://ai.google.dev/](https://ai.google.dev/)
2. Python Dependencies: Install with `pip install -r requirements.txt`

🚀 Quick Start

1. Get Your API Key

Visit [https://ai.google.dev/](https://ai.google.dev/) and create an API key.

2. Create .env File

Create a `.env` file in the project root:

```bash
# Copy the example file
cp .env.example .env

# Or create it manually
echo "GEMINI_API_KEY=your-api-key-here" > .env
```

Then edit `.env` and add your actual API key:

```bash
GEMINI_API_KEY=your-actual-api-key-here
```

3. Run Voice Control

```bash
./START_VOICE_CONTROL_GEMINI.sh
```

Or directly:

```bash
python3 dj_agent/run_voice_control_gemini.py
```

Note: The script automatically loads the API key from the `.env` file.

📝 Available Commands

List all commands:

```bash
python3 dj_agent/run_voice_control_gemini.py --commands
```

Example Commands:

Left Deck:
- "play left" / "pause left" / "stop left"
- "cue 1 left" / "cue 2 left" / ... / "cue 8 left"
- "censor left" / "filter left" / "echo left"
- "tempo up left" / "faster left"

Right Deck:
- "play right" / "pause right" / "stop right"
- "cue 1 right" / "cue 2 right" / ... / "cue 4 right"
- "censor right" / "filter right" / "echo right"
- "tempo up right" / "faster right"

Quick Actions:
- "drop it" / "kill it" / "bring it back"
- "go" / "stop" / "next" / "back"
- "zoom in" / "zoom out"
- "record" / "search"
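Most deck commands above follow an "action, then deck" shape, while the quick actions name no deck. As a rough illustration only, a recognized phrase could be split like this; `parse_deck_command` and `DECKS` are hypothetical names, not taken from the project's code:

```python
# Hypothetical helper: the real script may route commands differently.
DECKS = ("left", "right")

def parse_deck_command(phrase):
    """Split a phrase such as 'cue 1 left' into (action, deck).

    Quick actions without a deck (e.g. 'drop it') return (action, None).
    """
    words = phrase.lower().strip().split()
    if words and words[-1] in DECKS:
        # The last word names the deck; everything before it is the action.
        return " ".join(words[:-1]), words[-1]
    return " ".join(words), None
```

For example, `parse_deck_command("tempo up right")` returns `("tempo up", "right")`.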

⚙️ Features

### High Accuracy Recognition
- Uses the Gemini 2.5 Flash Native Audio model
- Understands natural speech patterns
- Handles variations in pronunciation

### Fuzzy Matching
- If no exact match is found, falls back to fuzzy matching (60% match threshold)
- Example: "play lef" → "play left"
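The fallback described above could look roughly like the sketch below. It uses `difflib.get_close_matches` from the Python standard library as a stand-in; the actual script may use a different matcher or scoring. The `COMMANDS` list, the `match_command` name, and the reading of "60" as a 0.6 similarity cutoff are assumptions.

```python
import difflib

# Illustrative subset of the command list (assumption, not the full set).
COMMANDS = [
    "play left", "pause left", "stop left",
    "play right", "pause right", "stop right",
]

def match_command(heard, commands=COMMANDS, cutoff=0.6):
    """Return the matching command, or None if nothing is close enough."""
    heard = heard.lower().strip()
    if heard in commands:
        return heard  # exact match wins
    # Fall back to the single closest command at or above the cutoff.
    close = difflib.get_close_matches(heard, commands, n=1, cutoff=cutoff)
    return close[0] if close else None
```

With this sketch, `match_command("play lef")` resolves to `"play left"`, while an unrelated word yields `None`.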

### Command Cooldown
- Prevents duplicate commands from firing rapidly
- 1.5-second cooldown between repeats of the same command
- Shows countdown: "⏸️ Cooldown (0.8s remaining)"
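A per-command cooldown like the one described above can be sketched as follows. This is a minimal illustration, not the project's implementation; the `CommandCooldown` class and its method names are hypothetical, and the injectable `clock` parameter exists only to make the behavior easy to test.

```python
import time

class CommandCooldown:
    """Reject repeats of the same command inside a cooldown window."""

    def __init__(self, seconds=1.5, clock=time.monotonic):
        self.seconds = seconds
        self.clock = clock
        self._last_fired = {}  # command -> time it last fired

    def remaining(self, command):
        """Seconds left before `command` may fire again (0.0 if ready)."""
        last = self._last_fired.get(command)
        if last is None:
            return 0.0
        return max(0.0, self.seconds - (self.clock() - last))

    def try_fire(self, command):
        """Record and allow the command, or return False during cooldown."""
        if self.remaining(command) > 0:
            return False
        self._last_fired[command] = self.clock()
        return True
```

Tracking the last fire time per command means "play left" followed immediately by "play right" is still allowed; only a repeat of the same command is held back.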

### Real-time Streaming
- Continuous audio streaming to Gemini Live API
- Low latency response
- Automatic voice activity detection

🔧 Configuration

API Key Options

1. `.env` File (Recommended):

   ```bash
   # Create .env in project root
   GEMINI_API_KEY=your-api-key-here
   ```

2. Environment Variable:

   ```bash
   export GEMINI_API_KEY='your-key'
   ```

3. Command Line:

   ```bash
   python3 dj_agent/run_voice_control_gemini.py --api-key 'your-key'
   ```

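The options above are listed by recommendation, not by precedence. When more than one is set, the script checks them in this order, and the first one found wins: command line → environment variable → `.env` file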
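The lookup order can be sketched as below. This is an assumption-laden illustration rather than the script's actual code: `resolve_api_key` and `read_env_file` are hypothetical names, and the `.env` parsing is a simplified stand-in for whatever loader the project uses.

```python
import os
from pathlib import Path

def read_env_file(path):
    """Return KEY=VALUE pairs from a .env-style file, or {} if it is missing."""
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for line in env_path.read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values

def resolve_api_key(cli_key=None, environ=None, env_path=".env"):
    """Command line first, then environment variable, then .env file."""
    environ = os.environ if environ is None else environ
    if cli_key:
        return cli_key
    if environ.get("GEMINI_API_KEY"):
        return environ["GEMINI_API_KEY"]
    return read_env_file(env_path).get("GEMINI_API_KEY")
```

Returning `None` when no source supplies a key lets the caller decide how to report the "Gemini API key required" error.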

Audio Settings

Default settings (optimized for Gemini Live API):
- Sample Rate: 16 kHz (the Live API's native input rate)
- Channels: Mono
- Format: 16-bit PCM
- Chunk Size: 1024 samples

These are automatically configured - no changes needed!
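To make the numbers concrete, the short sketch below works out what one chunk represents under the default settings. The constant and function names are illustrative, not taken from the script.

```python
# Default audio settings from the list above.
SAMPLE_RATE_HZ = 16_000   # 16 kHz
CHANNELS = 1              # mono
BYTES_PER_SAMPLE = 2      # 16-bit PCM
CHUNK_SAMPLES = 1024      # samples per chunk

def chunk_duration_ms():
    """Length of audio covered by one chunk, in milliseconds."""
    return 1000 * CHUNK_SAMPLES / SAMPLE_RATE_HZ

def chunk_size_bytes():
    """Raw size of one chunk as sent over the stream."""
    return CHUNK_SAMPLES * CHANNELS * BYTES_PER_SAMPLE

def bytes_per_second():
    """Raw PCM data rate for continuous streaming."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE
```

Each chunk therefore carries 64 ms of audio in 2048 bytes, and continuous streaming produces 32,000 bytes of raw PCM per second before any transport encoding.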

🐛 Troubleshooting

### ".env file not found" or "Gemini API key required"
- Create a `.env` file in the project root with: `GEMINI_API_KEY=your-key`
- Or set the environment variable: `export GEMINI_API_KEY='your-key'`
- Or pass via command line: `--api-key 'your-key'`

### "Failed to connect to Gemini Live API"
- Check your API key is valid
- Check internet connection
- Verify API key has Live API access enabled

### "Failed to open audio stream"
- macOS: Grant microphone permission to Terminal/Python (System Preferences → Security & Privacy → Microphone)
- Linux: Install `portaudio`: `sudo apt-get install portaudio19-dev`
- Windows: Install PyAudio: `pip install pyaudio`

### "No command matched"
- Speak clearly and naturally
- Use commands from the `--commands` list
- Try variations: "play left" instead of "play the left deck"

### Low Recognition Accuracy
- Speak in a quiet environment
- Position microphone closer
- Speak at normal volume (not too quiet/loud)
- Gemini handles natural speech well - don't over-enunciate

💰 Pricing

Gemini Live API has usage-based pricing. Check current rates at:
[https://ai.google.dev/pricing](https://ai.google.dev/pricing)

Note: Real-time voice recognition uses API credits. Monitor your usage!

🔄 Comparison: Standard vs Gemini

| Feature | Standard (`run_voice_control.py`) | Gemini (`run_voice_control_gemini.py`) |
| --- | --- | --- |
| Accuracy | Good | Excellent |
| Cost | Free | Pay-per-use |
| Setup | Simple | Requires API key |
| Latency | ~1-2 s | ~500 ms-1 s |
| Context | Basic | Advanced |
| Background Noise | Sensitive | Better filtering |

📚 References

[Gemini Live API Docs](https://ai.google.dev/gemini-api/docs/live)
[Get API Key](https://ai.google.dev/)
[Pricing Information](https://ai.google.dev/pricing)

🎤 Tips for Best Results

1. Speak Naturally: Gemini understands context - don't over-enunciate
2. Use Short Phrases: "play left" is better than "please play the left deck"
3. Wait for Cooldown: Repeating the same command requires a 1.5 s gap
4. Quiet Environment: Reduces false positives
5. Clear Microphone: Position mic properly for best pickup

---

Enjoy high-accuracy voice control! 🎧

Source Anchor

projects/Documentation/02-projects/dj-agent/studio/GEMINI_VOICE_CONTROL_README.md
