Grand Diomande Research · Full HTML Reader

Rekordbox Voice Control – Evaluation Plan


Systematic testing and comparison of Gemini Live vs Wav2Vec2 for voice-controlled DJing with Rekordbox.

Overview

This evaluation compares two ASR (Automatic Speech Recognition) paths:

  • Path A – Gemini Live + Embedding Gemma
      - Mic → Gemini Live (cloud ASR) → text
      - Text → Embedding Gemma → Rekordbox orbiter → keyboard bridge
  • Path B – Wav2Vec2 + Embedding Gemma
      - Mic → Wav2Vec2 (local ASR) → text
      - Text → Embedding Gemma → Rekordbox orbiter → keyboard bridge

Both paths share:
- Same command catalog: `Mapping/commands.yaml`
- Same retrieval logic (hard overrides + Embedding Gemma + Rekordbox index)
- Same constraints and safety layer
- Same Rekordbox keyboard bridge

The only difference: ASR front-end (cloud vs local)
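
The shared retrieval step (hard overrides first, then embedding search) can be pictured with the minimal sketch below. It assumes hypothetical names (`HARD_OVERRIDES`, `rank_commands`, `INDEX`) and uses toy 3-dimensional vectors in place of real Embedding Gemma output; the project's actual retrieval code may be organized differently.

```python
import math

# Hypothetical hard overrides: exact phrases that bypass semantic search.
HARD_OVERRIDES = {"play left": "PLAY_A", "play right": "PLAY_B"}

# Toy index: command ID -> embedding vector (stand-ins for real embeddings).
INDEX = {
    "LOOP_4_A": [1.0, 0.0, 0.0],
    "LOOP_4_B": [0.0, 1.0, 0.0],
    "SYNC_A": [0.0, 0.0, 1.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_commands(text, query_vec, index, top_k=3):
    """Return up to top_k command IDs, best match first."""
    key = " ".join(text.lower().split())
    if key in HARD_OVERRIDES:
        # An exact override wins without consulting the embedding index.
        return [HARD_OVERRIDES[key]]
    scored = sorted(
        index.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [command_id for command_id, _ in scored[:top_k]]
```

Because both paths feed the same function, any difference in the final command can be traced back to the ASR text alone.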

---

1. Goals

### 1.1. ASR Quality
- How often does each path convert spoken commands into the expected text?
- Performance in quiet vs noisy environments

### 1.2. Command Mapping Quality
- Given ASR text, how often does the system choose the correct Rekordbox command?
- Accuracy for critical commands (play, loop, hot cues, sync)
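
As a rough sketch of how command mapping quality could be scored, the helpers below compute top-k accuracy overall and for a subset of critical commands. The function names and the `(expected_id, ranked_predictions)` result shape are assumptions for illustration, not the eval script's actual interface.

```python
def top_k_accuracy(results, k=1):
    """Fraction of results whose expected ID appears in the first k predictions.

    results: list of (expected_id, ranked_predictions) pairs.
    """
    if not results:
        return 0.0
    hits = sum(1 for expected, ranked in results if expected in ranked[:k])
    return hits / len(results)


def critical_accuracy(results, critical_ids):
    """Top-1 accuracy restricted to the commands listed in critical_ids."""
    subset = [(e, r) for e, r in results if e in critical_ids]
    return top_k_accuracy(subset, k=1)
```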

### 1.3. Latency
For each path:
- ASR latency: audio → text
- Mapping latency: text → command decision
- End-to-end latency: speech onset → Rekordbox action
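
One way to collect these per-stage numbers is a small timing helper like the sketch below. `StageTimer` is a hypothetical name, and summing stages only approximates end-to-end latency in file-based runs, since it does not capture the time from speech onset to the end of the utterance.

```python
import time
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named stage, in milliseconds."""

    def __init__(self):
        self.ms = {}

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = (time.perf_counter() - start) * 1000.0
            self.ms[name] = self.ms.get(name, 0.0) + elapsed

    def total(self):
        """Sum of all recorded stages."""
        return sum(self.ms.values())
```

Wrap each step in `with timer.stage("asr"):` and `with timer.stage("mapping"):`, then read `timer.ms` and `timer.total()` after the command has been handled.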

### 1.4. Behavioral Sanity
- "Play left/right" reliably hits Deck 1/2 Play/Pause
- "Loop left/right" reliably triggers 4-beat loops for correct deck
- Hot cue and other key actions behave as intended

---

2. Prerequisites

2.1. Dependencies

From `computational-studio/studio`, in your venv:

```bash
pip install -r requirements.txt
```

Required packages:
- `torch`, `torchaudio`, `transformers` (Wav2Vec2)
- `huggingface_hub` (Embedding Gemma)
- `google-genai` (Gemini Live)
- `pyaudio`, `pynput`, `mido`, `python-rtmidi`

2.2. Environment Variables

Create `.env` file:

```bash
GEMINI_API_KEY=your_gemini_api_key_here
HF_TOKEN=your_huggingface_token_here
```

  • GEMINI_API_KEY: For Gemini Live (Path A)
  • HF_TOKEN: For Embedding Gemma via HF Inference (both paths)
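
A quick pre-flight check can catch a missing key before a long evaluation run. The sketch below assumes the variables are already present in the process environment (it does not parse `.env` itself); `REQUIRED_VARS` and `missing_env_vars` are illustrative names.

```python
import os

# Which variables each path needs, per the list above.
REQUIRED_VARS = {
    "gemini": ["GEMINI_API_KEY", "HF_TOKEN"],  # Path A
    "wav2vec2": ["HF_TOKEN"],                  # Path B
}


def missing_env_vars(path_name, env=None):
    """Return the required variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS[path_name] if not env.get(name)]
```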

2.3. Rekordbox Setup

1. Install Rekordbox (version 6 or 7)
2. Export your library (optional):
   - FILE → EXPORT COLLECTION IN XML FORMAT
3. Configure keyboard shortcuts:
   - Preferences → Keyboard → verify default shortcuts
4. Enable Performance mode:
   - Load tracks to both decks
   - Keep Rekordbox as the active window during tests

---

3. Build Test Dataset

3.1. Record Command Audio Clips

Create folder structure:

studio/tests/rekordbox_commands/
├── play_left_01.wav
├── play_left_02.wav
├── play_right_01.wav
├── loop_left_01.wav
├── loop_right_01.wav
├── cue_a_left_01.wav
├── cue_a_right_01.wav
├── beat_sync_left_01.wav
├── beat_sync_right_01.wav
└── start_recording_01.wav

Recording guidelines:
- Mono audio (preferred)
- Any sample rate (eval script will resample)
- Clear speech in realistic DJ booth conditions
- Include ambient noise if testing robustness
- 1-3 clips per command as a starting point; that is too few for statistical confidence, so record more if the two paths score closely
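
To confirm that recorded clips follow these guidelines, something like the sketch below can report channel count, sample rate, and duration. It uses only the standard-library `wave` module, which reads PCM WAV files; `describe_wav` is a hypothetical helper, not part of the eval script.

```python
import wave


def describe_wav(path):
    """Return channel count, sample rate, and duration for a PCM WAV file."""
    with wave.open(str(path), "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_rate": rate,
            "duration_s": frames / rate if rate else 0.0,
        }
```

Clips that report more than one channel still work if the eval script downmixes, but mono recordings avoid that extra step.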

3.2. Create Manifest File

Create `studio/tests/rekordbox_manifest.jsonl`:

```jsonl
{"path": "rekordbox_commands/play_left_01.wav", "expected_text": "play left", "expected_command_id": "PLAY_A"}
{"path": "rekordbox_commands/play_right_01.wav", "expected_text": "play right", "expected_command_id": "PLAY_B"}
{"path": "rekordbox_commands/loop_left_01.wav", "expected_text": "loop left", "expected_command_id": "LOOP_4_A"}
{"path": "rekordbox_commands/loop_right_01.wav", "expected_text": "loop right", "expected_command_id": "LOOP_4_B"}
{"path": "rekordbox_commands/cue_a_left_01.wav", "expected_text": "set hot cue a left deck", "expected_command_id": "SET_CUE_1_A"}
{"path": "rekordbox_commands/cue_a_right_01.wav", "expected_text": "set hot cue a right deck", "expected_command_id": "SET_CUE_1_B"}
{"path": "rekordbox_commands/beat_sync_left_01.wav", "expected_text": "beat sync left", "expected_command_id": "SYNC_A"}
{"path": "rekordbox_commands/beat_sync_right_01.wav", "expected_text": "beat sync right", "expected_command_id": "SYNC_B"}
```

Notes:
- `path`: relative to manifest directory
- `expected_text`: what ASR should produce (lowercase)
- `expected_command_id`: matches action names in `configs/rekordbox.yaml`
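
Before running either evaluation, it can help to check that every manifest line has the three fields and points at a clip that exists. The following is a minimal sketch; `load_manifest` and `REQUIRED_KEYS` are hypothetical names, and the eval scripts may perform their own validation.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("path", "expected_text", "expected_command_id")


def load_manifest(manifest_path):
    """Return (valid_entries, problems) for a JSONL manifest.

    Audio paths are resolved relative to the manifest's directory.
    """
    manifest_path = Path(manifest_path)
    entries, problems = [], []
    lines = manifest_path.read_text(encoding="utf-8").splitlines()
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # tolerate blank lines
        entry = json.loads(line)
        missing = [key for key in REQUIRED_KEYS if key not in entry]
        if missing:
            problems.append(f"line {lineno}: missing keys {missing}")
            continue
        if not (manifest_path.parent / entry["path"]).is_file():
            problems.append(f"line {lineno}: audio file not found: {entry['path']}")
            continue
        entries.append(entry)
    return entries, problems
```

A check like this surfaces mismatches between clip file names and manifest paths before they show up as confusing ASR failures.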

---

4. Evaluate Wav2Vec2 Path

4.1. Integration with Rekordbox Bridge

The Wav2Vec2 evaluation uses:
1. Wav2Vec2 ASR → transcribe audio to text
2. Embedding Gemma → embed text for semantic search
3. Rekordbox Orbiter → map to command ID
4. Rekordbox Bridge → execute keyboard shortcut

4.2. Run Evaluation

```bash
cd computational-studio/studio
python dj_agent/scripts/eval_rekordbox_voice_wav2vec.py \
    tests/rekordbox_manifest.jsonl \
    --config configs/rekordbox.yaml
```

What it measures:

1. ASR Accuracy:
- Exact match: `asr_text == expected_text`
- Word Error Rate (WER)
- Character Error Rate (CER)

2. Command Accuracy:
- Correct command ID predicted: `predicted_id == expected_id`
- Top-3 accuracy
- Per-category accuracy (transport, loops, cues)

3. Latencies:
- ASR latency (ms)
- Mapping latency (ms)
- End-to-end latency (ms)

4. Error Analysis:
- Confusion matrix for common errors
- Failed examples logged for inspection
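
For reference, WER and CER are both edit distances normalized by the length of the expected text, counted over words and characters respectively. The sketch below is one straightforward way to compute them; it is not taken from the eval script, which may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists or strings)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def wer(expected, actual):
    """Word error rate: word-level edits / number of expected words."""
    ref = expected.split()
    return edit_distance(ref, actual.split()) / (len(ref) or 1)


def cer(expected, actual):
    """Character error rate: character-level edits / expected length."""
    return edit_distance(expected, actual) / (len(expected) or 1)
```

For short commands a single wrong word is costly: "play love" against "play left" has one substituted word out of two, so its WER is 0.5.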

Output:

=== Wav2Vec2 Evaluation Results ===

ASR Metrics:
  Exact Match Accuracy: 85.5%
  Word Error Rate: 12.3%
  Character Error Rate: 5.8%

Command Mapping Metrics:
  Top-1 Accuracy: 92.0%
  Top-3 Accuracy: 98.5%
  Per-category:
    Transport: 95.0%
    Loops: 88.0%
    Hot Cues: 90.0%

Latency (ms):
  ASR (mean): 245ms, p95: 380ms
  Mapping (mean): 45ms, p95: 78ms
  End-to-end (mean): 290ms, p95: 458ms

Failed Examples:
  1. play_left_02.wav: "play love" → INCORRECT
  2. loop_right_01.wav: "blue right" → INCORRECT
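
The mean and p95 figures in a report like this can be derived from the per-clip latencies as sketched below. The nearest-rank percentile used here is an assumption; the eval script may use an interpolating method, which gives slightly different p95 values on small datasets.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct in 0..100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms):
    """Mean and 95th percentile for a list of latencies in milliseconds."""
    return {
        "mean": statistics.fmean(latencies_ms),
        "p95": percentile(latencies_ms, 95),
    }
```

With only a handful of clips, p95 is effectively the slowest clip, so treat it as a worst-case indicator until the dataset grows.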

---

5. Evaluate Gemini Path

5.1. Create Gemini Eval Script

Create `dj_agent/scripts/eval_rekordbox_voice_gemini.py`:

```python
#!/usr/bin/env python3
"""
Evaluate Gemini Live ASR + Rekordbox mapping.
Mirrors the Wav2Vec2 evaluation for fair comparison.
"""

import json
import time
from pathlib import Path

from google import genai
from google.genai import types

# Your existing modules
from dj_agent.voice_control.core.rekordbox_library import RekordboxLibraryReader
from dj_agent.core.rekordbox_bridge import RekordboxBridge

# TODO: Adapt GeminiVoiceListener to accept audio files instead of mic input
class GeminiFileASR:
    """Gemini ASR that processes audio files."""

    def __init__(self, api_key: str, model: str):
        # `model` is whichever audio-capable Gemini model you have access to.
        self.client = genai.Client(api_key=api_key)
        self.model = model

    def transcribe(self, audio_path: Path) -> tuple[str, float]:
        """
        Transcribe an audio file using Gemini.

        Returns:
            (transcribed_text, latency_ms)
        """
        start = time.perf_counter()

        # Load audio (clips are assumed to be WAV files)
        audio_bytes = Path(audio_path).read_bytes()

        # TODO: Stream audio to a Gemini Live session by adapting the
        # streaming logic from run_voice_control_gemini.py.
        # Until then, this uses the non-streaming generate_content API, so
        # the measured latency is not representative of Gemini Live.
        response = self.client.models.generate_content(
            model=self.model,
            contents=[
                "Transcribe this audio. Reply with the spoken words only.",
                types.Part.from_bytes(data=audio_bytes, mime_type="audio/wav"),
            ],
        )
        transcribed_text = (response.text or "").strip().lower()

        latency = (time.perf_counter() - start) * 1000
        return transcribed_text, latency

def main():
    # Similar structure to eval_rekordbox_voice_wav2vec.py
    # Load manifest, run Gemini ASR, measure accuracy and latency
    pass

if __name__ == '__main__':
    main()
```

5.2. Run Gemini Evaluation

```bash
python dj_agent/scripts/eval_rekordbox_voice_gemini.py \
    tests/rekordbox_manifest.jsonl \
    --config configs/rekordbox.yaml
```

Expected output format: Same as Wav2Vec2 eval for direct comparison

---

6. Side-by-Side Comparison

6.1. Metrics Table

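| Metric | Gemini Live | Wav2Vec2 | Winner |
|---|---|---|---|
| **ASR Accuracy** | | | |
| Exact Match | ? | ? | ? |
| WER | ? | ? | ? |
| **Command Accuracy** | | | |
| Top-1 | ? | ? | ? |
| Top-3 | ? | ? | ? |
| **Latency (mean)** | | | |
| ASR | ?ms | 245ms | ? |
| Mapping | ?ms | 45ms | ? |
| End-to-end | ?ms | 290ms | ? |
| **Robustness** | | | |
| Noisy environment | ? | ? | ? |
| Accent handling | ? | ? | ? |
| **Other** | | | |
| Requires internet | Yes | No | Wav2Vec2 |
| Cost | API usage | Free | Wav2Vec2 |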

6.2. Decision Criteria

Choose Gemini Live if:
- ✅ Higher ASR accuracy (especially in noisy environments)
- ✅ Better handling of natural language variations
- ✅ You have reliable internet in DJ booth
- ✅ API costs are acceptable

Choose Wav2Vec2 if:
- ✅ Lower latency required (< 200ms ASR)
- ✅ Offline operation essential
- ✅ Zero API costs preferred
- ✅ Accuracy is "good enough" (>85%)

Hybrid approach:
- Use Gemini when online (better accuracy)
- Fall back to Wav2Vec2 when offline
- Implement auto-detection in your code
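
The auto-detection could be as simple as a connectivity probe that picks the path at startup. The sketch below is illustrative only: the probe host and port, the timeout, and the names `is_online` and `choose_asr_path` are assumptions, and a production version would likely also fall back when Gemini requests start failing mid-session.

```python
import socket


def is_online(host="8.8.8.8", port=53, timeout=1.0):
    """Best-effort connectivity probe; host and port are illustrative."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def choose_asr_path(online=None):
    """Return 'gemini' when online, otherwise 'wav2vec2'.

    Pass `online` explicitly to skip the network probe (useful in tests).
    """
    if online is None:
        online = is_online()
    return "gemini" if online else "wav2vec2"
```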

---

7. Real-Time Manual Testing

Beyond scripted evals, validate behavior live.

7.1. Test Gemini Path

```bash
./START_REKORDBOX_VOICE_GEMINI.sh
```

Or manually:

```bash
cd studio
python dj_agent/scripts/run_voice_control_gemini.py \
    --config configs/rekordbox.yaml
```

7.2. Test Wav2Vec2 Path

```bash
./START_REKORDBOX_VOICE_WAV2VEC.sh
```

Or manually:

```bash
cd studio
python dj_agent/scripts/run_voice_control_wav2vec.py \
    --config configs/rekordbox.yaml
```

7.3. Test Sequence

For both paths, test these critical commands:

#### Transport
- [ ] "play left" → Deck 1 plays/pauses
- [ ] "play right" → Deck 2 plays/pauses
- [ ] "play both" → Both decks play simultaneously

#### Sync
- [ ] "sync left" → Deck 1 syncs to Deck 2
- [ ] "sync right" → Deck 2 syncs to Deck 1
- [ ] "beat sync left" → Same as "sync left" (Deck 1 syncs to Deck 2)

#### Loops
- [ ] "loop left" → 4-beat loop on Deck 1
- [ ] "loop right" → 4-beat loop on Deck 2
- [ ] "double loop left" → Loop length doubles
- [ ] "exit loop left" → Exit loop

#### Hot Cues
- [ ] "set hot cue A left deck" → Sets hot cue 1 on Deck 1
- [ ] "set hot cue A right deck" → Sets hot cue 1 on Deck 2
- [ ] "jump to hot cue A left" → Jumps to cue 1 on Deck 1
- [ ] "clear hot cue A left deck" → Clears cue 1 on Deck 1

#### Effects
- [ ] "effects left" → Toggles FX on Deck 1
- [ ] "echo left" → Echo tap on Deck 1

#### Recording
- [ ] "start recording" → Starts Rekordbox recording
- [ ] "stop recording" → Stops recording

7.4. Validation Checklist

For each command:
- ✅ Rekordbox performs correct action on correct deck
- ✅ Logged `commandId` matches `Mapping/commands.yaml`
- ✅ Keyboard shortcut logged matches `configs/rekordbox.yaml`
- ✅ Latency feels acceptable (< 500ms ideal)
- ✅ No false positives from background music/noise

---

8. Integration with Rekordbox Bridge

8.1. Architecture

┌─────────────────┐
│   Microphone    │
└────────┬────────┘
         │ Audio
         ↓
┌─────────────────┐
│  ASR Frontend   │ ← Gemini Live OR Wav2Vec2
└────────┬────────┘
         │ Text
         ↓
┌─────────────────┐
│ Embedding Gemma │
└────────┬────────┘
         │ Embedding
         ↓
┌─────────────────┐
│ Rekordbox Index │ ← commands.yaml
│   + Orbiter     │
└────────┬────────┘
         │ Command ID
         ↓
┌─────────────────┐
│ Rekordbox Bridge│ ← rekordbox_bridge.py
└────────┬────────┘
         │ Keyboard/MIDI
         ↓
┌─────────────────┐
│   Rekordbox     │
└─────────────────┘

8.2. Configuration

Update `configs/rekordbox.yaml`:

```yaml
dj:
  software: "rekordbox"

  rekordbox:
    mode: "keyboard"  # or "midi"
    auto_focus: true  # Bring Rekordbox to foreground before commands

    # Keyboard shortcuts (match your Rekordbox Preferences → Keyboard)
    map:
      PLAY_A:
        type: key
        key: "z"
      PLAY_B:
        type: key
        key: "n"
      SYNC_A:
        type: key
        key: "q"
      LOOP_4_A:
        type: key
        key: "7"
      # ... (see configs/rekordbox.yaml for full mapping)
```
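
Since a wrong or duplicated shortcut silently triggers the wrong deck action, a small sanity check over the `map` section may be worthwhile. This sketch assumes the mapping has already been parsed into a dict with the `type`, `key`, and optional `modifiers` fields; `validate_key_map` is a hypothetical helper.

```python
def validate_key_map(mapping):
    """Return a list of problems found in keyboard-type entries."""
    problems = []
    seen = {}  # (key, modifiers) -> first command ID using that shortcut
    for command_id, action in mapping.items():
        if action.get("type") != "key":
            continue  # other action types (e.g. MIDI) are not checked here
        key = action.get("key")
        if not key:
            problems.append(f"{command_id}: missing 'key'")
            continue
        combo = (str(key), tuple(sorted(action.get("modifiers", []))))
        if combo in seen:
            problems.append(f"{command_id}: same shortcut as {seen[combo]}")
        else:
            seen[combo] = command_id
    return problems
```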

8.3. Voice Control Integration

Your voice control scripts should use the bridge like this:

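```python
import yaml  # PyYAML

from dj_agent.core.rekordbox_bridge import RekordboxBridge


def load_config(path: str) -> dict:
    """Minimal stand-in for your project's config loader (assumed to parse YAML)."""
    with open(path, encoding="utf-8") as f:
        return yaml.safe_load(f)


# Initialize bridge
config = load_config('configs/rekordbox.yaml')
bridge = RekordboxBridge(config['dj']['rekordbox'])

# When a voice command is recognized:
def execute_command(command_id: str) -> bool:
    """Execute Rekordbox command from voice input. Returns True on success."""

    # Get keyboard mapping
    mapping = config['dj']['rekordbox']['map']

    if command_id not in mapping:
        print(f"Unknown command: {command_id}")
        return False

    # Build message
    action_config = mapping[command_id]
    if action_config['type'] != 'key':
        # Only keyboard actions are handled here; without this guard,
        # `message` would be undefined for other action types.
        print(f"Unsupported action type for {command_id}: {action_config['type']}")
        return False

    message = {
        'type': 'keyboard',
        'key': action_config['key'],
        'modifiers': action_config.get('modifiers', [])
    }

    # Send to Rekordbox
    success = bridge.send(message)
    if success:
        print(f"✅ Executed: {command_id}")
    else:
        print(f"❌ Failed: {command_id}")
    return success
```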

---

9. Troubleshooting

9.1. ASR Issues

Problem: Low ASR accuracy

Solutions:
- Record test clips in actual DJ booth environment
- Adjust microphone positioning (closer to mouth, away from speakers)
- Use noise-cancelling microphone
- Fine-tune Wav2Vec2 on your voice (if using local model)
- Switch to Gemini Live for better accuracy

9.2. Command Mapping Issues

Problem: Wrong commands triggered

Solutions:
- Check hard-coded overrides in your voice control script
- Verify `Mapping/commands.yaml` has correct command IDs
- Increase top-k in semantic search (default: 5)
- Add more example phrases per command to the index searched with Embedding Gemma
- Use explicit deck prefixes ("left" vs "deck one")

9.3. Latency Issues

Problem: Commands feel slow

Solutions:
- Profile each stage (ASR, embedding, mapping, bridge)
- Use local Wav2Vec2 instead of Gemini for lower ASR latency
- Cache embeddings for common commands
- Reduce `auto_focus` delay in bridge
- Use MIDI mode instead of keyboard (faster)
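
Caching embeddings for common commands can be sketched with `functools.lru_cache`, as below. `embed_remote` is a stand-in for the real Embedding Gemma request (it returns a fake vector so the example runs offline), and the normalization step is an assumption about what counts as "the same command".

```python
from functools import lru_cache

CALLS = {"count": 0}  # counts how often the (fake) remote call happens


def embed_remote(text):
    """Stand-in for the real embedding request; returns a fake vector."""
    CALLS["count"] += 1
    return tuple(float(ord(ch)) for ch in text[:4])


@lru_cache(maxsize=256)
def _embed_cached(normalized_text):
    return embed_remote(normalized_text)


def embed(text):
    """Embed text, reusing cached vectors for repeated commands."""
    # Normalize case and whitespace so "Play  Left" and "play left" share an entry.
    return _embed_cached(" ".join(text.lower().split()))
```

Because a DJ repeats a small vocabulary of commands, most calls after warm-up should hit the cache and skip the embedding round trip entirely.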

9.4. Rekordbox Integration Issues

Problem: Shortcuts not working

Solutions:
- Verify Rekordbox is in Performance mode (not Export mode)
- Check keyboard shortcuts in Preferences → Keyboard
- Ensure Rekordbox window is focused (or enable `auto_focus: true`)
- Grant Accessibility permissions (macOS)
- Test shortcuts manually first

---

10. Next Steps

10.1. Immediate Tasks

1. ✅ Record test dataset (10-20 commands, 1-3 clips each)
2. ✅ Create manifest JSONL
3. ✅ Run Wav2Vec2 evaluation
4. ✅ Run Gemini evaluation (after creating eval script)
5. ✅ Compare results side-by-side
6. ✅ Choose ASR path (or implement hybrid)

10.2. Production Deployment

Once evaluation is complete:

1. Optimize selected path:
- Fine-tune models if needed
- Add command aliases for better recognition
- Implement confidence thresholds

2. Add robustness:
- Confirmation mode for destructive actions
- Undo/redo support
- Safety constraints (don't load tracks while playing)

3. Extend functionality:
- Add more commands (EQ, filters, samples)
- Multi-language support
- Custom wake word ("Hey DJ, play left")

4. Monitor performance:
- Log all commands with timestamps
- Track accuracy over time
- Collect failure cases for retraining
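
Confidence thresholds and confirmation mode can be combined into one gate in front of the bridge, as in this sketch. The command IDs in `DESTRUCTIVE`, the 0.6 threshold, and the `decide` function are all hypothetical; suitable values would come from the evaluation results.

```python
# Hypothetical IDs for actions that should require confirmation.
DESTRUCTIVE = {"CLEAR_CUE_1_A", "STOP_RECORDING"}


def decide(command_id, score, threshold=0.6, confirmed=False):
    """Return 'reject', 'confirm', or 'execute' for a recognized command.

    score: similarity or confidence of the top match (higher is better).
    """
    if score < threshold:
        return "reject"   # too uncertain: do nothing
    if command_id in DESTRUCTIVE and not confirmed:
        return "confirm"  # ask the DJ to repeat or confirm first
    return "execute"
```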

10.3. Advanced Features

  • Motion + Voice hybrid: Use gestures for continuous controls (filters, crossfader), voice for discrete actions (loops, cues)
  • Context-aware commands: "play that track" after browsing library
  • Beatmatching assistance: "find compatible tracks" based on current BPM/key
  • Session recording: Auto-document your mixes with voice annotations

---

Resources

  • Rekordbox Bridge: [dj_agent/core/rekordbox_bridge.py](../core/rekordbox_bridge.py)
  • Rekordbox Config: [configs/rekordbox.yaml](../../configs/rekordbox.yaml)
  • Integration Docs: [REKORDBOX_INTEGRATION.md](./REKORDBOX_INTEGRATION.md)
  • Test Script: [scripts/test_rekordbox_bridge.py](../scripts/test_rekordbox_bridge.py)

---


Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

projects/Documentation/02-projects/dj-agent/studio/VOICE_CONTROL_EVALUATION.md
