Rekordbox Voice Control – Evaluation Plan
Systematic testing and comparison of Gemini Live vs Wav2Vec2 for voice-controlled DJing with Rekordbox.
Overview
This evaluation compares two ASR (Automatic Speech Recognition) paths:
- Path A – Gemini Live + Embedding Gemma
- Mic → Gemini Live (cloud ASR) → text
- Text → Embedding Gemma → Rekordbox orbiter → keyboard bridge
- Path B – Wav2Vec2 + Embedding Gemma
- Mic → Wav2Vec2 (local ASR) → text
- Text → Embedding Gemma → Rekordbox orbiter → keyboard bridge
Both paths share:
- Same command catalog: `Mapping/commands.yaml`
- Same retrieval logic (hard overrides + Embedding Gemma + Rekordbox index)
- Same constraints and safety layer
- Same Rekordbox keyboard bridge
The only difference: ASR front-end (cloud vs local)
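Because everything after the ASR step is shared, the two paths can be modelled as one pipeline with a swappable front-end. The sketch below illustrates that idea only; `ASRFrontend`, `VoicePipeline`, and `FakeASR` are hypothetical names that do not come from the project code, and the mapping and bridge are reduced to plain callables.

```python
from dataclasses import dataclass
from typing import Callable, Protocol


class ASRFrontend(Protocol):
    """Anything that turns raw audio into text (hypothetical interface)."""

    def transcribe(self, audio: bytes) -> str: ...


@dataclass
class VoicePipeline:
    asr: ASRFrontend                   # Path A (Gemini Live) or Path B (Wav2Vec2)
    map_command: Callable[[str], str]  # shared: text -> command ID
    execute: Callable[[str], bool]     # shared: command ID -> keyboard bridge

    def handle(self, audio: bytes) -> tuple[str, str, bool]:
        # Normalize the transcript the same way for both paths.
        text = self.asr.transcribe(audio).strip().lower()
        command_id = self.map_command(text)
        return text, command_id, self.execute(command_id)


class FakeASR:
    """Stand-in front-end so the sketch runs without a model or an API key."""

    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text

    def transcribe(self, audio: bytes) -> str:
        return self.canned_text


pipeline = VoicePipeline(
    asr=FakeASR("Play Left"),
    map_command=lambda text: {"play left": "PLAY_A"}.get(text, "UNKNOWN"),
    execute=lambda command_id: command_id != "UNKNOWN",
)
print(pipeline.handle(b""))
```

Keeping the front-end behind one method means the evaluation scripts for both paths can reuse the same mapping and scoring code, so any accuracy gap is attributable to the ASR step.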
---
1. Goals
### 1.1. ASR Quality
- How often does each path convert spoken commands into the expected text?
- Performance in quiet vs noisy environments
### 1.2. Command Mapping Quality
- Given ASR text, how often does the system choose the correct Rekordbox command?
- Accuracy for critical commands (play, loop, hot cues, sync)
### 1.3. Latency
For each path:
- ASR latency: audio → text
- Mapping latency: text → command decision
- End-to-end latency: speech onset → Rekordbox action
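One way to collect the three latencies is to time each stage separately and summarize afterwards. The following is a minimal sketch, not the project's instrumentation: `StageTimer` is a hypothetical helper, and the p95 uses a simple nearest-rank rule. For file-based evaluation, "speech onset" is approximated by the moment the clip is handed to the ASR.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Collects per-stage wall-clock durations in milliseconds."""

    def __init__(self) -> None:
        self.samples_ms: dict[str, list[float]] = defaultdict(list)

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the stage raised.
            self.samples_ms[name].append((time.perf_counter() - start) * 1000.0)

    def summary(self, name: str) -> dict[str, float]:
        values = sorted(self.samples_ms[name])
        if not values:
            raise ValueError(f"no samples recorded for stage {name!r}")
        # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
        rank = max(1, (95 * len(values) + 99) // 100)
        return {"mean": sum(values) / len(values), "p95": values[rank - 1]}


timer = StageTimer()
for _ in range(3):
    with timer.stage("end_to_end"):
        with timer.stage("asr"):
            time.sleep(0.002)      # placeholder for the ASR call
        with timer.stage("mapping"):
            time.sleep(0.001)      # placeholder for embedding + retrieval
print(timer.summary("asr"))
```

Timing end-to-end as its own stage, rather than summing the parts, also captures overhead between stages such as window focusing in the bridge.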
### 1.4. Behavioral Sanity
- "Play left/right" reliably hits Deck 1/2 Play/Pause
- "Loop left/right" reliably triggers 4-beat loops for correct deck
- Hot cue and other key actions behave as intended
---
2. Prerequisites
2.1. Dependencies
From `computational-studio/studio`, in your venv:
pip install -r requirements.txt
Required packages:
- `torch`, `torchaudio`, `transformers` (Wav2Vec2)
- `huggingface_hub` (Embedding Gemma)
- `google-genai` (Gemini Live)
- `pyaudio`, `pynput`, `mido`, `python-rtmidi`
2.2. Environment Variables
Create `.env` file:
GEMINI_API_KEY=your_gemini_api_key_here
HF_TOKEN=your_huggingface_token_here
- GEMINI_API_KEY: For Gemini Live (Path A)
- HF_TOKEN: For Embedding Gemma via HF Inference (both paths)
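A quick pre-flight check can catch a missing key before a test run starts. This is a stdlib-only sketch under stated assumptions: `load_env_file` and `missing_vars` are hypothetical helpers, and the parser handles only simple `KEY=value` lines (the project may load `.env` differently).

```python
import os
from pathlib import Path

# Which variables each ASR path needs, per the list above.
REQUIRED_VARS = {
    "gemini": ("GEMINI_API_KEY", "HF_TOKEN"),  # Path A
    "wav2vec": ("HF_TOKEN",),                  # Path B
}


def load_env_file(path: Path) -> dict[str, str]:
    """Parse simple KEY=value lines; blank lines and # comments are skipped."""
    values: dict[str, str] = {}
    if not path.exists():
        return values
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_vars(asr_path: str, env: dict[str, str]) -> list[str]:
    """Return the required variable names that are absent or empty."""
    return [name for name in REQUIRED_VARS[asr_path] if not env.get(name)]


# Real environment variables take precedence over the .env file.
env = {**load_env_file(Path(".env")), **os.environ}
for asr_path in REQUIRED_VARS:
    print(asr_path, "missing:", missing_vars(asr_path, env) or "none")
```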
2.3. Rekordbox Setup
1. Install Rekordbox (version 6 or 7)
2. Export your library (optional):
- FILE → EXPORT COLLECTION IN XML FORMAT
3. Configure keyboard shortcuts:
- Preferences → Keyboard → verify default shortcuts
4. Enable Performance mode:
- Load tracks to both decks
- Keep Rekordbox as active window during tests
---
3. Build Test Dataset
3.1. Record Command Audio Clips
Create folder structure:
studio/tests/rekordbox_commands/
├── play_left_01.wav
├── play_left_02.wav
├── play_right_01.wav
├── loop_left_01.wav
├── loop_right_01.wav
├── cue_a_left_01.wav
├── cue_a_right_01.wav
├── beat_sync_left_01.wav
├── beat_sync_right_01.wav
└── start_recording_01.wav
Recording guidelines:
- Mono audio (preferred)
- Any sample rate (eval script will resample)
- Clear speech in realistic DJ booth conditions
- Include ambient noise if testing robustness
- At least 1-3 clips per command; more clips (and more speakers) give more reliable accuracy estimates
3.2. Create Manifest File
Create `studio/tests/rekordbox_manifest.jsonl`:
{"path": "rekordbox_commands/play_left_01.wav", "expected_text": "play left", "expected_command_id": "PLAY_A"}
{"path": "rekordbox_commands/play_right_01.wav", "expected_text": "play right", "expected_command_id": "PLAY_B"}
{"path": "rekordbox_commands/loop_left_01.wav", "expected_text": "loop left", "expected_command_id": "LOOP_4_A"}
{"path": "rekordbox_commands/loop_right_01.wav", "expected_text": "loop right", "expected_command_id": "LOOP_4_B"}
{"path": "rekordbox_commands/cue_a_left_01.wav", "expected_text": "set hot cue a left deck", "expected_command_id": "SET_CUE_1_A"}
{"path": "rekordbox_commands/cue_a_right_01.wav", "expected_text": "set hot cue a right deck", "expected_command_id": "SET_CUE_1_B"}
{"path": "rekordbox_commands/beat_sync_left_01.wav", "expected_text": "beat sync left", "expected_command_id": "SYNC_A"}
{"path": "rekordbox_commands/beat_sync_right_01.wav", "expected_text": "beat sync right", "expected_command_id": "SYNC_B"}
Notes:
- `path`: relative to manifest directory
- `expected_text`: what ASR should produce (lowercase)
- `expected_command_id`: matches action names in `configs/rekordbox.yaml`
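Validating the manifest before an evaluation run avoids failures halfway through a batch. The sketch below reads the JSONL format described above and resolves each `path` relative to the manifest directory; `ManifestEntry`, `load_manifest`, and `missing_audio` are hypothetical names, not functions from the eval script.

```python
import json
from dataclasses import dataclass
from pathlib import Path

REQUIRED_KEYS = ("path", "expected_text", "expected_command_id")


@dataclass(frozen=True)
class ManifestEntry:
    audio_path: Path
    expected_text: str
    expected_command_id: str


def load_manifest(manifest_path: Path) -> list[ManifestEntry]:
    """Parse one JSON object per line and check that all required keys exist."""
    entries: list[ManifestEntry] = []
    base_dir = manifest_path.parent
    with manifest_path.open(encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = [key for key in REQUIRED_KEYS if key not in record]
            if missing:
                raise ValueError(f"{manifest_path}:{line_no}: missing keys {missing}")
            entries.append(
                ManifestEntry(
                    audio_path=base_dir / record["path"],
                    expected_text=record["expected_text"].strip().lower(),
                    expected_command_id=record["expected_command_id"],
                )
            )
    return entries


def missing_audio(entries: list[ManifestEntry]) -> list[Path]:
    """Return audio paths listed in the manifest that do not exist on disk."""
    return [entry.audio_path for entry in entries if not entry.audio_path.is_file()]
```

Calling `missing_audio(load_manifest(Path("tests/rekordbox_manifest.jsonl")))` from the `studio` directory should return an empty list when every clip named in the manifest has been recorded.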
---
4. Evaluate Wav2Vec2 Path
4.1. Integration with Rekordbox Bridge
The Wav2Vec2 evaluation uses:
1. Wav2Vec2 ASR → transcribe audio to text
2. Embedding Gemma → embed text for semantic search
3. Rekordbox Orbiter → map to command ID
4. Rekordbox Bridge → execute keyboard shortcut
4.2. Run Evaluation
cd computational-studio/studio
python dj_agent/scripts/eval_rekordbox_voice_wav2vec.py \
tests/rekordbox_manifest.jsonl \
--config configs/rekordbox.yaml
What it measures:
1. ASR Accuracy:
- Exact match: `asr_text == expected_text`
- Word Error Rate (WER)
- Character Error Rate (CER)
2. Command Accuracy:
- Correct command ID predicted: `predicted_id == expected_id`
- Top-3 accuracy
- Per-category accuracy (transport, loops, cues)
3. Latencies:
- ASR latency (ms)
- Mapping latency (ms)
- End-to-end latency (ms)
4. Error Analysis:
- Confusion matrix for common errors
- Failed examples logged for inspection
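For reference, WER and CER are both edit distances normalized by the reference length, computed over words and characters respectively. The sketch below shows the standard Levenshtein formulation; it is not taken from the eval script, which may use a library or different text normalization.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.lower().split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edit distance / reference length."""
    ref_chars = list(reference.lower())
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, list(hypothesis.lower())) / len(ref_chars)


print(wer("play left", "play love"))   # one of two words is wrong -> 0.5
print(cer("play left", "play love"))
```

Note that WER can exceed 100% when the hypothesis contains many inserted words, so it should not be read as "100% minus accuracy".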
Output:
=== Wav2Vec2 Evaluation Results ===
ASR Metrics:
Exact Match Accuracy: 85.5%
Word Error Rate: 12.3%
Character Error Rate: 5.8%
Command Mapping Metrics:
Top-1 Accuracy: 92.0%
Top-3 Accuracy: 98.5%
Per-category:
Transport: 95.0%
Loops: 88.0%
Hot Cues: 90.0%
Latency (ms):
ASR (mean): 245ms, p95: 380ms
Mapping (mean): 45ms, p95: 78ms
End-to-end (mean): 290ms, p95: 458ms
Failed Examples:
1. play_left_02.wav: "play love" → INCORRECT
2. loop_right_01.wav: "blue right" → INCORRECT
---
5. Evaluate Gemini Path
5.1. Create Gemini Eval Script
Create `dj_agent/scripts/eval_rekordbox_voice_gemini.py`:
#!/usr/bin/env python3
"""
Evaluate Gemini Live ASR + Rekordbox mapping.
Mirrors the Wav2Vec2 evaluation for fair comparison.
"""
import json
import time
from pathlib import Path

import torchaudio
from google import genai

# Your existing modules
from dj_agent.voice_control.core.rekordbox_library import RekordboxLibraryReader
from dj_agent.core.rekordbox_bridge import RekordboxBridge


# TODO: Adapt GeminiVoiceListener to accept audio files instead of mic input
class GeminiFileASR:
    """Gemini Live ASR that processes audio files."""

    def __init__(self, api_key: str):
        # The key comes from GEMINI_API_KEY in your .env file
        self.client = genai.Client(api_key=api_key)

    def transcribe(self, audio_path: Path) -> tuple[str, float]:
        """
        Transcribe audio file using Gemini Live.

        Returns:
            (transcribed_text, latency_ms)
        """
        start = time.perf_counter()
        # Load audio
        waveform, sr = torchaudio.load(audio_path)
        # TODO: Stream audio to Gemini Live session
        # For now, use a simpler Gemini audio API if available
        # Or adapt the streaming logic from run_voice_control_gemini.py
        # Placeholder:
        # response = self.client.models.generate_content(...)
        transcribed_text = ""  # Placeholder until the Gemini call above is implemented
        latency = (time.perf_counter() - start) * 1000
        return transcribed_text, latency


def main():
    # Similar structure to eval_rekordbox_voice_wav2vec.py
    # Load manifest, run Gemini ASR, measure accuracy and latency
    pass


if __name__ == '__main__':
    main()
5.2. Run Gemini Evaluation
python dj_agent/scripts/eval_rekordbox_voice_gemini.py \
tests/rekordbox_manifest.jsonl \
--config configs/rekordbox.yaml
Expected output format: Same as Wav2Vec2 eval for direct comparison
---
6. Side-by-Side Comparison
6.1. Metrics Table
| Metric | Gemini Live | Wav2Vec2 | Winner |
|---|---|---|---|
| ASR Accuracy | | | |
| Exact Match | ? | ? | ? |
| WER | ? | ? | ? |
| Command Accuracy | | | |
| Top-1 | ? | ? | ? |
| Top-3 | ? | ? | ? |
| Latency (mean) | | | |
| ASR | ?ms | 245ms | ? |
| Mapping | ?ms | 45ms | ? |
| End-to-end | ?ms | 290ms | ? |
| Robustness | | | |
| Noisy environment | ? | ? | ? |
| Accent handling | ? | ? | ? |
| Other | | | |
| Requires internet | Yes | No | Wav2Vec2 |
| Cost | API usage | Free | Wav2Vec2 |
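Once both evaluations have run, the Winner column of the table above can be filled in mechanically, as long as each metric's direction is stated explicitly. This is a small sketch with hypothetical names (`HIGHER_IS_BETTER`, `pick_winner`); the Wav2Vec2 numbers are the sample values shown in section 4, and the Gemini values are left as `None` because they have not been measured yet.

```python
from typing import Optional

# Direction of each metric: accuracy should go up, error rates and latency down.
HIGHER_IS_BETTER = {
    "exact_match": True,
    "top1": True,
    "wer": False,
    "asr_latency_ms": False,
}


def pick_winner(metric: str, gemini: Optional[float], wav2vec: Optional[float],
                tolerance: float = 0.0) -> str:
    """Return the table's Winner cell for one metric, or '?' if a value is missing."""
    if gemini is None or wav2vec is None:
        return "?"
    if abs(gemini - wav2vec) <= tolerance:
        return "Tie"
    gemini_wins = (gemini > wav2vec) == HIGHER_IS_BETTER[metric]
    return "Gemini Live" if gemini_wins else "Wav2Vec2"


wav2vec_results = {"exact_match": 85.5, "wer": 12.3, "top1": 92.0, "asr_latency_ms": 245.0}
gemini_results = {name: None for name in wav2vec_results}  # fill in after the Gemini run

for name, wav2vec_value in wav2vec_results.items():
    print(name, pick_winner(name, gemini_results[name], wav2vec_value))
```

A nonzero `tolerance` is worth considering with a small test set, since a difference of one or two clips is unlikely to be meaningful.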
6.2. Decision Criteria
Choose Gemini Live if:
- ✅ Higher ASR accuracy (especially in noisy environments)
- ✅ Better handling of natural language variations
- ✅ You have reliable internet in DJ booth
- ✅ API costs are acceptable
Choose Wav2Vec2 if:
- ✅ Lower latency required (< 200ms ASR)
- ✅ Offline operation essential
- ✅ Zero API costs preferred
- ✅ Accuracy is "good enough" (>85%)
Hybrid approach:
- Use Gemini when online (better accuracy)
- Fall back to Wav2Vec2 when offline
- Implement auto-detection in your code
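The auto-detection step could look roughly like the sketch below: probe connectivity once at startup to pick a path, and wrap each transcription so a cloud failure mid-set falls back to the local model. All names here are hypothetical, the probe host is an assumption about where the Gemini API is served, and the broad `except Exception` should be narrowed to whatever errors your Gemini client actually raises.

```python
import socket
from typing import Callable


def is_online(host: str = "generativelanguage.googleapis.com",  # assumed API host
              port: int = 443, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to the host succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:  # covers DNS failures, refused connections, and timeouts
        return False


def choose_asr_path(probe: Callable[[], bool] = is_online) -> str:
    """Prefer the cloud path when reachable, otherwise use the local model."""
    return "gemini" if probe() else "wav2vec"


def transcribe_with_fallback(audio: bytes,
                             primary: Callable[[bytes], str],
                             fallback: Callable[[bytes], str]) -> str:
    """Try the primary ASR; on any failure, use the fallback for this utterance."""
    try:
        return primary(audio)
    except Exception:  # assumption: narrow this to your client's error types
        return fallback(audio)
```

A short probe timeout matters here: a slow or flaky connection in the booth is arguably worse than no connection, because every command would wait for the timeout before falling back.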
---
7. Real-Time Manual Testing
Beyond scripted evals, validate behavior live.
7.1. Test Gemini Path
./START_REKORDBOX_VOICE_GEMINI.sh
Or manually:
cd studio
python dj_agent/scripts/run_voice_control_gemini.py \
--config configs/rekordbox.yaml
7.2. Test Wav2Vec2 Path
./START_REKORDBOX_VOICE_WAV2VEC.sh
Or manually:
cd studio
python dj_agent/scripts/run_voice_control_wav2vec.py \
--config configs/rekordbox.yaml
7.3. Test Sequence
For both paths, test these critical commands:
#### Transport
- [ ] "play left" → Deck 1 plays/pauses
- [ ] "play right" → Deck 2 plays/pauses
- [ ] "play both" → Both decks play simultaneously
#### Sync
- [ ] "sync left" → Deck 1 syncs to Deck 2
- [ ] "sync right" → Deck 2 syncs to Deck 1
- [ ] "beat sync left" → Same result as "sync left" (Deck 1 syncs to Deck 2)
#### Loops
- [ ] "loop left" → 4-beat loop on Deck 1
- [ ] "loop right" → 4-beat loop on Deck 2
- [ ] "double loop left" → Loop length doubles
- [ ] "exit loop left" → Exit loop
#### Hot Cues
- [ ] "set hot cue A left deck" → Sets hot cue 1 on Deck 1
- [ ] "set hot cue A right deck" → Sets hot cue 1 on Deck 2
- [ ] "jump to hot cue A left" → Jumps to cue 1 on Deck 1
- [ ] "clear hot cue A left deck" → Clears cue 1 on Deck 1
#### Effects
- [ ] "effects left" → Toggles FX on Deck 1
- [ ] "echo left" → Echo tap on Deck 1
#### Recording
- [ ] "start recording" → Starts Rekordbox recording
- [ ] "stop recording" → Stops recording
7.4. Validation Checklist
For each command:
- ✅ Rekordbox performs correct action on correct deck
- ✅ Logged `commandId` matches `Mapping/commands.yaml`
- ✅ Keyboard shortcut logged matches `configs/rekordbox.yaml`
- ✅ Latency feels acceptable (< 500ms ideal)
- ✅ No false positives from background music/noise
---
8. Integration with Rekordbox Bridge
8.1. Architecture
┌─────────────────┐
│ Microphone │
└────────┬────────┘
│ Audio
↓
┌─────────────────┐
│ ASR Frontend │ ← Gemini Live OR Wav2Vec2
└────────┬────────┘
│ Text
↓
┌─────────────────┐
│ Embedding Gemma │
└────────┬────────┘
│ Embedding
↓
┌─────────────────┐
│ Rekordbox Index │ ← commands.yaml
│ + Orbiter │
└────────┬────────┘
│ Command ID
↓
┌─────────────────┐
│ Rekordbox Bridge│ ← rekordbox_bridge.py
└────────┬────────┘
│ Keyboard/MIDI
↓
┌─────────────────┐
│ Rekordbox │
└─────────────────┘
8.2. Configuration
Update `configs/rekordbox.yaml`:
dj:
  software: "rekordbox"
  rekordbox:
    mode: "keyboard"  # or "midi"
    auto_focus: true  # Bring Rekordbox to foreground before commands
    # Keyboard shortcuts (match your Rekordbox Preferences → Keyboard)
    map:
      PLAY_A:
        type: key
        key: "z"
      PLAY_B:
        type: key
        key: "n"
      SYNC_A:
        type: key
        key: "q"
      LOOP_4_A:
        type: key
        key: "7"
      # ... (see configs/rekordbox.yaml for full mapping)
8.3. Voice Control Integration
Your voice control scripts should use the bridge like this:
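import yaml  # PyYAML, assumed to be available for reading the config

from dj_agent.core.rekordbox_bridge import RekordboxBridge


def load_config(path: str) -> dict:
    """Read the YAML config (minimal stand-in; use your project's loader if it has one)."""
    with open(path, encoding="utf-8") as handle:
        return yaml.safe_load(handle)


# Initialize bridge
config = load_config('configs/rekordbox.yaml')
bridge = RekordboxBridge(config['dj']['rekordbox'])


# When a voice command is recognized:
def execute_command(command_id: str):
    """Execute Rekordbox command from voice input."""
    # Get keyboard mapping
    mapping = config['dj']['rekordbox']['map']
    if command_id not in mapping:
        print(f"Unknown command: {command_id}")
        return
    # Build message
    action_config = mapping[command_id]
    if action_config['type'] != 'key':
        # Only keyboard actions are handled here; without this guard `message` would be undefined
        print(f"Unsupported action type for {command_id}: {action_config['type']}")
        return
    message = {
        'type': 'keyboard',
        'key': action_config['key'],
        'modifiers': action_config.get('modifiers', [])
    }
    # Send to Rekordbox
    success = bridge.send(message)
    if success:
        print(f"✅ Executed: {command_id}")
    else:
        print(f"❌ Failed: {command_id}")
---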
9. Troubleshooting
9.1. ASR Issues
Problem: Low ASR accuracy
Solutions:
- Record test clips in actual DJ booth environment
- Adjust microphone positioning (closer to mouth, away from speakers)
- Use noise-cancelling microphone
- Fine-tune Wav2Vec2 on your voice (if using local model)
- Switch to Gemini Live for better accuracy
9.2. Command Mapping Issues
Problem: Wrong commands triggered
Solutions:
- Check hard-coded overrides in your voice control script
- Verify `Mapping/commands.yaml` has correct command IDs
- Increase top-k in semantic search (default: 5)
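- Add more example phrasings per command to the Rekordbox index (Embedding Gemma is called via HF Inference, so new examples go into the index rather than into the model)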
- Use explicit deck prefixes ("left" vs "deck one")
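When debugging a wrong command, it helps to keep the two mapping stages separate: an exact-match override table first, then a top-k similarity search over the command index. The sketch below illustrates that order with a toy bag-of-words embedding standing in for Embedding Gemma; `HARD_OVERRIDES`, `toy_embed`, and the index contents are hypothetical and do not reflect the project's actual tables.

```python
import math
from typing import Callable, Sequence

# Hypothetical override table: exact phrases that bypass semantic search.
HARD_OVERRIDES = {"play left": "PLAY_A", "play right": "PLAY_B"}


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_commands(query_vec, index, top_k=5):
    """Return the top_k (command_id, score) pairs, best first."""
    scored = [(command_id, cosine(query_vec, vec)) for command_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


def map_text(text: str, embed: Callable[[str], Sequence[float]], index, top_k=5):
    normalized = " ".join(text.lower().split())
    if normalized in HARD_OVERRIDES:          # stage 1: exact override
        return [(HARD_OVERRIDES[normalized], 1.0)]
    return rank_commands(embed(normalized), index, top_k)  # stage 2: retrieval


VOCAB = ("loop", "sync", "left", "right")


def toy_embed(text: str) -> list[float]:
    """Stand-in for Embedding Gemma: counts of a few known words."""
    words = text.split()
    return [float(words.count(word)) for word in VOCAB]


INDEX = {
    "LOOP_4_A": toy_embed("loop left"),
    "LOOP_4_B": toy_embed("loop right"),
    "SYNC_A": toy_embed("sync left"),
}

print(map_text("Play Left", toy_embed, INDEX))
print(map_text("loop the left deck", toy_embed, INDEX, top_k=3))
```

Logging the full ranked list rather than only the winner makes it easier to tell whether a failure is an override miss, a near-tie between decks, or a transcript that matches nothing well.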
9.3. Latency Issues
Problem: Commands feel slow
Solutions:
- Profile each stage (ASR, embedding, mapping, bridge)
- Use local Wav2Vec2 instead of Gemini for lower ASR latency
- Cache embeddings for common commands
- Reduce `auto_focus` delay in bridge
- Use MIDI mode instead of keyboard (faster)
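Caching embeddings is straightforward because the same short phrases recur throughout a set. The sketch below wraps any embedding function in an in-memory LRU cache keyed on the normalized text; `make_cached_embedder` is a hypothetical helper, and `fake_embed` stands in for the real (slow) Embedding Gemma call.

```python
from functools import lru_cache
from typing import Callable, Sequence


def make_cached_embedder(embed: Callable[[str], Sequence[float]], maxsize: int = 256):
    """Wrap an embedding function so repeated phrases skip the model call."""

    @lru_cache(maxsize=maxsize)
    def cached(normalized_text: str) -> tuple[float, ...]:
        # Tuples are immutable, so cached vectors cannot be modified by callers.
        return tuple(embed(normalized_text))

    def lookup(text: str) -> tuple[float, ...]:
        # Normalize case and whitespace so "Play  Left" and "play left" share an entry.
        return cached(" ".join(text.lower().split()))

    lookup.cache_info = cached.cache_info  # expose hit/miss counters
    return lookup


def fake_embed(text: str) -> list[float]:
    """Stand-in for the real embedding call."""
    return [float(len(text))]


embedder = make_cached_embedder(fake_embed)
embedder("play left")
embedder("Play  Left")          # served from the cache
print(embedder.cache_info())
```

The cache only helps the mapping stage; if profiling shows ASR dominates end-to-end latency, this change will not be noticeable on its own.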
9.4. Rekordbox Integration Issues
Problem: Shortcuts not working
Solutions:
- Verify Rekordbox is in Performance mode (not Export mode)
- Check keyboard shortcuts in Preferences → Keyboard
- Ensure Rekordbox window is focused (or enable `auto_focus: true`)
- Grant Accessibility permissions (macOS)
- Test shortcuts manually first
---
10. Next Steps
10.1. Immediate Tasks
1. ✅ Record test dataset (10-20 commands, 1-3 clips each)
2. ✅ Create manifest JSONL
3. ✅ Run Wav2Vec2 evaluation
4. ✅ Run Gemini evaluation (after creating eval script)
5. ✅ Compare results side-by-side
6. ✅ Choose ASR path (or implement hybrid)
10.2. Production Deployment
Once evaluation is complete:
1. Optimize selected path:
- Fine-tune models if needed
- Add command aliases for better recognition
- Implement confidence thresholds
2. Add robustness:
- Confirmation mode for destructive actions
- Undo/redo support
- Safety constraints (don't load tracks while playing)
3. Extend functionality:
- Add more commands (EQ, filters, samples)
- Multi-language support
- Custom wake word ("Hey DJ, play left")
4. Monitor performance:
- Log all commands with timestamps
- Track accuracy over time
- Collect failure cases for retraining
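The confidence-threshold and confirmation ideas above can be combined into a single gate between retrieval and the bridge. The following is a sketch under assumptions: the thresholds are placeholders to tune on your own evaluation data, `Decision` and `decide` are hypothetical names, and `CLEAR_CUE_1_A` is an illustrative command ID that may not match `Mapping/commands.yaml`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Decision:
    action: str                  # "execute", "confirm", or "reject"
    command_id: Optional[str]


def decide(ranked: Sequence[tuple[str, float]],
           min_score: float = 0.6,     # placeholder threshold
           min_margin: float = 0.05,   # placeholder gap to the runner-up
           destructive: frozenset = frozenset()) -> Decision:
    """Gate a ranked list of (command_id, similarity) pairs, best first."""
    if not ranked:
        return Decision("reject", None)
    best_id, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    # Reject weak matches and near-ties (e.g. left vs right deck).
    if best_score < min_score or best_score - runner_up < min_margin:
        return Decision("reject", None)
    if best_id in destructive:
        return Decision("confirm", best_id)  # ask the DJ before acting
    return Decision("execute", best_id)


print(decide([("PLAY_A", 0.91), ("PLAY_B", 0.62)]))
print(decide([("CLEAR_CUE_1_A", 0.88)], destructive=frozenset({"CLEAR_CUE_1_A"})))
```

The margin check targets the most costly failure in this setup, acting on the wrong deck, since left and right variants of a command tend to score close to each other.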
10.3. Advanced Features
- Motion + Voice hybrid: Use gestures for continuous controls (filters, crossfader), voice for discrete actions (loops, cues)
- Context-aware commands: "play that track" after browsing library
- Beatmatching assistance: "find compatible tracks" based on current BPM/key
- Session recording: Auto-document your mixes with voice annotations
---
Resources
- Rekordbox Bridge: [dj_agent/core/rekordbox_bridge.py](../core/rekordbox_bridge.py)
- Rekordbox Config: [configs/rekordbox.yaml](../../configs/rekordbox.yaml)
- Integration Docs: [REKORDBOX_INTEGRATION.md](./REKORDBOX_INTEGRATION.md)
- Test Script: [scripts/test_rekordbox_bridge.py](../scripts/test_rekordbox_bridge.py)
---
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
projects/Documentation/02-projects/dj-agent/studio/VOICE_CONTROL_EVALUATION.md