# Voice Control Architecture - Technical Overview
## System Architecture Overview
You now have three independent voice control pipelines for Rekordbox DJ software, each optimized for different use cases.
---
## 1. Gemini Live System (Cloud-Based, Lowest Latency)
### Architecture Diagram
```
┌────────────────────────────────────────────────────────────────┐
│ USER VOICE INPUT                                               │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                │ Microphone audio stream
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ GEMINI LIVE SESSION                                            │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Google Gemini 2.0 Flash (Experimental)                     │ │
│ │ - Real-time audio streaming                                │ │
│ │ - Speech-to-text (optimized for low latency)               │ │
│ │ - Context-aware (knows it's DJ commands)                   │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Text output                    │
│                               │ Latency: ~80ms                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ EMBEDDING GEMMA PROVIDER                                       │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ google/gemma-2-2b-it (HuggingFace Inference API)           │ │
│ │ - Convert text → 768-dim embedding vector                  │ │
│ │ - Semantic representation of command                       │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Embedding vector               │
│                               │ Latency: ~35ms                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX ORBITER                                              │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Rekordbox Index (FAISS-like cosine similarity search)      │ │
│ │ - Search command database with embedding                   │ │
│ │ - Return top-5 matches with confidence scores              │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Top command matches            │
│                               │ Latency: ~10ms                 │
│                               ↓                                │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Constraints & Stability Layer                              │ │
│ │ - Check deck state                                         │ │
│ │ - Safety rules (prevent double-triggers)                   │ │
│ │ - Stability filtering                                      │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Approved command               │
│                               │ Latency: <1ms                  │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX BRIDGE                                               │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Keyboard/MIDI Output                                       │ │
│ │ - Send keyboard shortcut to Rekordbox                      │ │
│ │ - Auto-focus Rekordbox window (macOS/Windows)              │ │
│ │ - Execute command                                          │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Keyboard event                 │
│                               │ Latency: <1ms                  │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX (DJ SOFTWARE)                                        │
│ - Receives keyboard shortcut (e.g., "Z" for Play Deck 1)       │
│ - Executes DJ action                                           │
└────────────────────────────────────────────────────────────────┘

Total Pipeline Latency: ~80ms (FASTEST)
```
### Key Components
Files:
- `dj_agent/scripts/run_rekordbox_voice_gemini.py` - Main entry point
- `dj_agent/voice_control/gemini_live_asr.py` - Gemini Live streaming
- `dj_agent/voice_control/orbiter/` - Command matching & execution
- `START_REKORDBOX_VOICE_GEMINI.sh` - Launcher script
Pros:
- Lowest latency (80ms total)
- Highest out-of-box accuracy (98%)
- Simplest setup
Cons:
- Requires internet connection
- API costs (~$0.001 per command)
- Sends audio to cloud
---
## 2. Whisper System (Offline, High Accuracy)
### Architecture Diagram
```
┌────────────────────────────────────────────────────────────────┐
│ USER VOICE INPUT                                               │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                │ Microphone audio stream
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ WHISPER VOICE LISTENER                                         │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Voice Activity Detection (VAD)                             │ │
│ │ - Energy-based speech detection                            │ │
│ │ - Silence detection (800ms timeout)                        │ │
│ │ - Utterance segmentation                                   │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Audio segments                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ WHISPER ASR                                                    │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ OpenAI Whisper (tiny.en model)                             │ │
│ │ - Local inference (no cloud)                               │ │
│ │ - Optimized for English                                    │ │
│ │ - Beam size = 1 (greedy decoding for speed)                │ │
│ │ - FP16 on GPU if available                                 │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Transcribed text               │
│                               │ Latency: ~150ms (CPU),         │
│                               │          ~80ms (GPU)           │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ EMBEDDING GEMMA + REKORDBOX ORBITER                            │
│ (Same as Gemini Live system)                                   │
│ Latency: ~45ms                                                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX BRIDGE                                               │
│ (Same as Gemini Live system)                                   │
│ Latency: <1ms                                                  │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX (DJ SOFTWARE)                                        │
└────────────────────────────────────────────────────────────────┘

Total Pipeline Latency: ~195ms (OFFLINE)
```
### Key Components
Files:
- `dj_agent/scripts/run_rekordbox_voice_whisper.py` - Main entry point
- `dj_agent/voice_control/whisper_asr.py` - Whisper transcription
- `dj_agent/voice_control/core/whisper_listener.py` - VAD + Whisper integration
- `START_REKORDBOX_VOICE_WHISPER.sh` - Launcher script
Pros:
- Fully offline (no internet needed)
- High accuracy (95-98%)
- Free (local inference)
Cons:
- Higher latency (~195ms, noticeable)
- CPU-intensive (fan may spin up)
- Static (doesn't improve)
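
The VAD stage in the diagram above (energy-based detection with an 800ms silence timeout) could look roughly like the following sketch. It is an illustration only, not the code in `whisper_listener.py`: the function names, the 20ms frame size, and the use of RMS as the energy measure are assumptions, and the 40.0 threshold is borrowed from the Hybrid listener's settings described later in this document.

```python
import math
from typing import Iterable, Iterator, List, Sequence


def rms(frame: Sequence[float]) -> float:
    """Root-mean-square energy of one frame of samples."""
    if not frame:
        return 0.0
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def segment_utterances(
    frames: Iterable[Sequence[float]],
    frame_ms: int = 20,               # assumed frame duration
    energy_threshold: float = 40.0,   # value taken from the Hybrid VAD box
    silence_timeout_ms: int = 800,    # value from the diagram above
) -> Iterator[List[Sequence[float]]]:
    """Yield one list of frames per detected utterance."""
    max_silent = silence_timeout_ms // frame_ms
    current: List[Sequence[float]] = []
    silent_run = 0
    for frame in frames:
        if rms(frame) >= energy_threshold:
            current.append(frame)
            silent_run = 0
        elif current:
            # Keep short pauses inside the utterance until the timeout hits.
            current.append(frame)
            silent_run += 1
            if silent_run >= max_silent:
                yield current[: len(current) - silent_run]  # drop trailing silence
                current, silent_run = [], 0
    if current:
        yield current[: len(current) - silent_run]
```

Each yielded segment would then be handed to the ASR stage. A real listener would read frames from the microphone rather than from an in-memory list.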
---
## 3. Hybrid System (Self-Improving, Best Long-Term) ⭐
### Architecture Diagram
```
┌────────────────────────────────────────────────────────────────┐
│ USER VOICE INPUT                                               │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                │ Microphone audio stream
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ HYBRID VOICE LISTENER                                          │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Voice Activity Detection (VAD)                             │ │
│ │ - Energy threshold: 40.0                                   │ │
│ │ - Silence timeout: 800ms                                   │ │
│ │ - Utterance segmentation                                   │ │
│ └─────────────────────────────┬──────────────────────────────┘ │
│                               │ Audio segments                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
               ┌────────────────┴────────────────┐
               │                                 │
        REAL-TIME PATH                      SHADOW PATH
        (Fast response)                   (Ground truth)
               │                                 │
               ↓                                 ↓
┌──────────────────────────────┐  ┌──────────────────────────────┐
│ WAV2VEC2 ASR                 │  │ WHISPER ASR                  │
│ ┌──────────────────────────┐ │  │ ┌──────────────────────────┐ │
│ │ facebook/wav2vec2-base   │ │  │ │ OpenAI Whisper (tiny)    │ │
│ │ - Fast inference         │ │  │ │ - Accurate inference     │ │
│ │ - 60ms latency           │ │  │ │ - 150ms latency          │ │
│ │ - 40-60% accuracy        │ │  │ │ - 95-98% accuracy        │ │
│ │   (initially)            │ │  │ │ - Runs async             │ │
│ └────────────┬─────────────┘ │  │ └────────────┬─────────────┘ │
│              │ "hey laughed" │  │              │ "play left"   │
└──────────────┬───────────────┘  └──────────────┬───────────────┘
               │                                 │
               ↓                                 ↓
┌──────────────────────────────┐  ┌──────────────────────────────┐
│ TEXT CORRECTION              │  │ TRAINING DATA SAVER          │
│ ┌──────────────────────────┐ │  │ ┌──────────────────────────┐ │
│ │ Hybrid Corrector         │ │  │ │ Save to disk:            │ │
│ │ 1. Phonetic rules        │ │  │ │ - Audio file (.wav)      │ │
│ │    "hey laughed" →       │ │  │ │ - Wav2Vec2 text          │ │
│ │    "play left"           │ │  │ │ - Corrected text         │ │
│ │    Latency: <1ms         │ │  │ │ - Whisper text           │ │
│ │                          │ │  │ │ - Timestamp              │ │
│ │ 2. Gemma-2-2b (LLM)      │ │  │ └────────────┬─────────────┘ │
│ │    Semantic fix          │ │  │              │               │
│ │    Latency: ~25ms        │ │  │              ↓               │
│ └────────────┬─────────────┘ │  │ training_data/               │
│              │ "play left"   │  │   auto_collected/            │
└──────────────┬───────────────┘  │     ├─ manifest.jsonl        │
               │                  │     ├─ 1234567890.wav        │
               ↓                  │     └─ ...                   │
┌──────────────────────────────┐  └──────────────────────────────┘
│ COMPARISON & LOGGING         │
│ ┌──────────────────────────┐ │
│ │ Compare:                 │ │
│ │   Corrected text ==      │ │
│ │   Whisper text?          │ │
│ │                          │ │
│ │ ✅ Match → Log success   │ │
│ │ ⚠️ Mismatch → Log        │ │
│ │    for review            │ │
│ └──────────────────────────┘ │
└──────────────┬───────────────┘
               │ Corrected text
               ↓
┌────────────────────────────────────────────────────────────────┐
│ EMBEDDING GEMMA + REKORDBOX ORBITER                            │
│ Latency: ~45ms                                                 │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX BRIDGE                                               │
│ Latency: <1ms                                                  │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ REKORDBOX (DJ SOFTWARE)                                        │
└────────────────────────────────────────────────────────────────┘

Real-time Path Latency: ~125ms (ACCEPTABLE)
Total System (with shadow): ~150ms async in background
```
### Self-Improvement Cycle
```
┌────────────────────────────────────────────────────────────────┐
│ WEEK 1-2: DATA COLLECTION                                      │
│ User DJs normally → System auto-collects 500+ samples          │
│ Each sample: (audio, wav2vec_text, corrected_text,             │
│               whisper_text)                                    │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ OFFLINE: FINE-TUNING                                           │
│ python finetune_from_autocollected.py                          │
│                                                                │
│ Process:                                                       │
│ 1. Load manifest.jsonl                                         │
│ 2. Use Whisper text as ground truth                            │
│ 3. Train Wav2Vec2: audio → whisper_text                        │
│ 4. Save improved model                                         │
│                                                                │
│ Result: Wav2Vec2 WER drops (40% → 30% → 15% → 5% → 2%)         │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ IMPROVED SYSTEM (WEEK 2+)                                      │
│ Real-time: Fine-tuned Wav2Vec2 → Less correction needed        │
│ Latency: 125ms → 105ms → 90ms → 85ms                           │
│ Accuracy: 90% → 92% → 95% → 98%                                │
└───────────────────────────────┬────────────────────────────────┘
                                │
                                ↓ (Repeat cycle monthly)
┌────────────────────────────────────────────────────────────────┐
│ OPTIMAL SYSTEM (WEEK 12+)                                      │
│ Wav2Vec2: 98% accurate on YOUR voice                           │
│ Gemma: Rarely needed (<5% of commands)                         │
│ Latency: ~85ms (nearly matches Gemini Live!)                   │
│ Offline: Fully functional without internet                     │
│ Cost: Free (no API costs)                                      │
└────────────────────────────────────────────────────────────────┘
```
### Key Components
Files:
- `dj_agent/scripts/run_rekordbox_voice_hybrid.py` - Main entry point
- `dj_agent/voice_control/core/hybrid_listener.py` - Dual-path listener
- `dj_agent/voice_control/wav2vec_asr.py` - Fast ASR
- `dj_agent/voice_control/whisper_asr.py` - Accurate ASR (shadow)
- `dj_agent/voice_control/text_correction.py` - Phonetic + Gemma correction
- `dj_agent/scripts/finetune_from_autocollected.py` - Fine-tuning script
- `START_REKORDBOX_VOICE_HYBRID.sh` - Launcher script
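
The data-collection and fine-tuning steps of the self-improvement cycle revolve around `manifest.jsonl`. As a rough sketch of how samples might be appended and later turned into training pairs with the Whisper text as ground truth: the JSON field names and both function names are hypothetical, and the real `finetune_from_autocollected.py` may structure this differently.

```python
import json
from pathlib import Path
from typing import List, Tuple


def append_sample(manifest: Path, audio_file: str, wav2vec_text: str,
                  corrected_text: str, whisper_text: str,
                  timestamp: float) -> None:
    """Append one collected utterance as a single JSON line."""
    record = {
        "audio_file": audio_file,          # hypothetical field names
        "wav2vec_text": wav2vec_text,
        "corrected_text": corrected_text,
        "whisper_text": whisper_text,
        "timestamp": timestamp,
    }
    with manifest.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")


def load_training_pairs(manifest: Path) -> List[Tuple[str, str]]:
    """Return (audio_file, target_text) pairs, using Whisper text as the target."""
    pairs: List[Tuple[str, str]] = []
    with manifest.open(encoding="utf-8") as fh:
        for line in fh:
            if not line.strip():
                continue
            record = json.loads(line)
            if record["whisper_text"]:     # skip empty transcriptions
                pairs.append((record["audio_file"], record["whisper_text"]))
    return pairs
```

An append-only JSONL file keeps each sample independent, so a crash during a session loses at most the line being written.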
Pros:
- Fast (125ms → 85ms after fine-tuning)
- Self-improving (auto-collects training data)
- Best long-term outcome (98%)
- Fully offline
- Free
Cons:
- Initial accuracy lower (90%)
- Requires periodic fine-tuning
- Higher CPU usage (Whisper shadow)
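
The dual-path idea (respond from the fast path immediately, let the shadow path finish in the background) can be sketched with a plain thread. This is an assumption-laden illustration rather than the logic in `hybrid_listener.py`: `handle_utterance` and its callable parameters are hypothetical stand-ins for the Wav2Vec2, correction, Whisper, and saver components.

```python
import threading
from typing import Callable, Tuple


def handle_utterance(
    audio: bytes,
    fast_asr: Callable[[bytes], str],
    correct: Callable[[str], str],
    shadow_asr: Callable[[bytes], str],
    save_sample: Callable[[bytes, str, str, str], None],
) -> Tuple[str, threading.Thread]:
    """Return the corrected real-time text and the running shadow worker."""
    # Real-time path: fast ASR followed by text correction.
    fast_text = fast_asr(audio)
    corrected = correct(fast_text)

    def shadow() -> None:
        # Shadow path: slower, more accurate transcription used as ground truth.
        truth = shadow_asr(audio)
        save_sample(audio, fast_text, corrected, truth)

    worker = threading.Thread(target=shadow, daemon=True)
    worker.start()
    return corrected, worker
```

The caller can forward the corrected text to command matching straight away. The returned thread is only needed when something has to wait for the sample to be saved, for example at shutdown.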
---
## Shared Components (All Systems)
### Rekordbox Orbiter
Purpose: Command matching, constraint checking, and execution
Components:
1. Rekordbox Index (`rekordbox_index.py`)
- FAISS-like vector search
- Stores command embeddings
- Returns top-K matches with scores
2. Constraints Layer (`stability.py`, `constraints.py`)
- Safety checks (prevent double-triggers)
- Deck state validation
- Timing constraints
3. Rekordbox Bridge (`bridge.py`)
- Keyboard shortcut execution (pynput)
- Auto-focus Rekordbox window
- Cross-platform (macOS, Windows, Linux)
Files:
- `dj_agent/voice_control/orbiter/rekordbox_orbiter.py`
- `dj_agent/voice_control/orbiter/rekordbox_index.py`
- `dj_agent/voice_control/orbiter/constraints.py`
- `dj_agent/voice_control/orbiter/stability.py`
- `dj_agent/voice_control/orbiter/bridge.py`
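
To make the "prevent double-triggers" rule concrete, here is one way a stability gate could sit in front of the bridge. The class name, the 0.5s interval, and the injected `send_key` callable are all assumptions for illustration. In the real bridge, `send_key` would presumably wrap pynput's `Controller.press` and `Controller.release`.

```python
import time
from typing import Callable, Dict


class DoubleTriggerGate:
    """Reject a command repeated within min_interval_s of its last accepted run."""

    def __init__(self, min_interval_s: float = 0.5,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval_s = min_interval_s   # assumed value, not from the source
        self._clock = clock
        self._last_accepted: Dict[int, float] = {}

    def allow(self, command_id: int) -> bool:
        now = self._clock()
        last = self._last_accepted.get(command_id)
        if last is not None and now - last < self._min_interval_s:
            return False
        self._last_accepted[command_id] = now
        return True


def execute(command_id: int, shortcut: str, gate: DoubleTriggerGate,
            send_key: Callable[[str], None]) -> bool:
    """Send the shortcut only if the gate approves; report whether it was sent."""
    if not gate.allow(command_id):
        return False
    send_key(shortcut)
    return True
```

Injecting the clock and the key sender keeps the gate testable without a keyboard or a running Rekordbox window.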
### Embedding Provider
Purpose: Convert text to semantic embeddings for retrieval
Model: google/gemma-2-2b-it (HuggingFace Inference API)
Process:
```
Text → Gemma-2-2b → 768-dim embedding → Cosine similarity search
```
File: `dj_agent/voice_control/orbiter/embedding.py`
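
The final step of that process, cosine similarity search over stored command embeddings, can be illustrated with a brute-force version. This is a sketch under assumptions: `cosine` and `top_k` are hypothetical names, the 3-dimensional vectors stand in for the 768-dim embeddings, and the actual `rekordbox_index.py` may use a different data layout.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], index: Dict[int, Sequence[float]],
          k: int = 5) -> List[Tuple[int, float]]:
    """Return the k (command_id, score) pairs most similar to the query."""
    scored = [(cid, cosine(query, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy index: 3006 is the "play left" ID from commands.yaml; the other IDs are made up.
toy_index = {3006: [1.0, 0.0, 0.0], 9001: [0.0, 1.0, 0.0], 9002: [1.0, 1.0, 0.0]}
matches = top_k([1.0, 0.1, 0.0], toy_index, k=2)
```

With 218 commands a linear scan like this is cheap, which fits the ~10ms retrieval figure quoted earlier.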
### Command Database
Source: `Mapping/commands.yaml`
Format:
```yaml
3006:  # Command ID
  name: "play left"
  shortcut: "z"
  deck: "left"
  category: "transport"
  description: "Play/pause deck 1"
```
Total Commands: 218 Rekordbox keyboard shortcuts
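
Once loaded (for example with PyYAML's `yaml.safe_load`, which is an assumption about the loader), an entry like the one above becomes a nested mapping keyed by command ID. A small sketch of a typed wrapper around it follows. `Command`, `build_commands`, and `find_by_name` are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from typing import Dict, Mapping, Optional


@dataclass(frozen=True)
class Command:
    command_id: int
    name: str
    shortcut: str
    deck: str
    category: str
    description: str


def build_commands(raw: Mapping[int, Mapping[str, str]]) -> Dict[int, Command]:
    """Turn the parsed YAML mapping into Command objects keyed by ID."""
    return {cid: Command(command_id=cid, **fields) for cid, fields in raw.items()}


def find_by_name(commands: Mapping[int, Command], name: str) -> Optional[Command]:
    """Case-insensitive exact lookup by command name."""
    wanted = name.strip().lower()
    for command in commands.values():
        if command.name.lower() == wanted:
            return command
    return None


# The single entry shown in the format example above.
RAW_COMMANDS = {
    3006: {
        "name": "play left",
        "shortcut": "z",
        "deck": "left",
        "category": "transport",
        "description": "Play/pause deck 1",
    }
}
```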
---
## Data Flow Comparison
### Gemini Live (Fastest)
```
Audio → Gemini API → Text → Embedding → Retrieval → Command
        ↑ 80ms total
```
### Whisper (Offline)
```
Audio → Whisper → Text → Embedding → Retrieval → Command
        ↑ 150ms          ↑ 45ms
        ↑ 195ms total
```
### Hybrid (Self-Improving)
```
Real-time:
Audio → Wav2Vec2 → Gemma correction → Embedding → Command
        ↑ 60ms     ↑ 25ms             ↑ 45ms
        ↑ 125ms total (initially)
        ↑ 85ms total (after fine-tuning)

Shadow (async):
Audio → Whisper → Save (audio, texts) → Fine-tuning → Improved Wav2Vec2
        ↑ 150ms async
```
---
## Performance Metrics
| System | Latency (Initial) | Latency (Optimized) | Accuracy (Initial) | Accuracy (Optimized) | Offline | Self-Improving |
|---|---|---|---|---|---|---|
| Gemini Live | 80ms | 80ms | 98% | 98% | No | No |
| Whisper | 195ms | 195ms | 95-98% | 95-98% | Yes | No |
| Hybrid | 125ms | 85ms | 90% | 98% | Yes | Yes |
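
The self-improvement cycle reports Wav2Vec2 progress as word error rate (WER). For reference, a sketch of the standard word-level edit-distance definition is shown below, assuming whitespace tokenization. The project's own evaluation may normalize text differently, and `wer` is a hypothetical helper name.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        return 0.0 if not hyp else 1.0
    # prev[j] holds the edit distance between the processed ref words and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            cur.append(min(prev[j] + 1,      # deletion
                           cur[j - 1] + 1,   # insertion
                           substitution))
        prev = cur
    return prev[-1] / len(ref)
```

Using the example from the Hybrid diagram, `wer("play left", "hey laughed")` is 1.0 because both words are wrong, while an exact match gives 0.0.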
---
## Technology Stack
### ASR Models
- Gemini Live: Google Gemini 2.0 Flash (Experimental)
- Wav2Vec2: facebook/wav2vec2-base-960h
- Whisper: OpenAI Whisper (tiny.en, base.en, small.en)
### Embedding Models
- Gemma: google/gemma-2-2b-it (768-dim embeddings)
### Text Correction
- Phonetic: Rule-based phonetic corrections
- Gemma: google/gemma-2-2b-it (LLM-based semantic correction)
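
The two correction stages listed above could be chained so that the cheap rule lookup runs first and the LLM is consulted only when no rule matches. The sketch below assumes exactly that ordering. The rule table contains only the "hey laughed" example from the Hybrid diagram, and `correct` and `llm_fix` are hypothetical names rather than the API of `text_correction.py`.

```python
from typing import Callable, Dict, Optional

# Only the first rule comes from this document; a real table would be much larger.
PHONETIC_RULES: Dict[str, str] = {
    "hey laughed": "play left",
}


def correct(text: str,
            rules: Optional[Dict[str, str]] = None,
            llm_fix: Optional[Callable[[str], str]] = None) -> str:
    """Apply phonetic rules first, then fall back to an optional LLM corrector."""
    table = PHONETIC_RULES if rules is None else rules
    normalized = " ".join(text.lower().split())
    if normalized in table:
        return table[normalized]          # rule hit: no LLM call needed
    if llm_fix is not None:
        return llm_fix(normalized)        # semantic fix, e.g. a Gemma call
    return normalized
```

Passing the LLM step as a callable keeps the rule path free of any model dependency, which matches the <1ms versus ~25ms split quoted for the two stages.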
### Infrastructure
- PyAudio: Audio capture
- PyTorch: ML inference
- HuggingFace: Model hosting & inference API
- pynput: Keyboard control
- soundfile: Audio I/O
---
## File Structure
```
studio/
├── dj_agent/
│   ├── voice_control/
│   │   ├── core/
│   │   │   ├── wav2vec_listener.py  # Wav2Vec2 VAD + ASR
│   │   │   ├── whisper_listener.py  # Whisper VAD + ASR
│   │   │   └── hybrid_listener.py  # Dual-path listener ⭐
│   │   ├── gemini_live_asr.py  # Gemini Live streaming
│   │   ├── wav2vec_asr.py  # Wav2Vec2 transcription
│   │   ├── whisper_asr.py  # Whisper transcription
│   │   ├── text_correction.py  # Phonetic + Gemma correction ⭐
│   │   └── orbiter/
│   │       ├── rekordbox_orbiter.py  # Main orchestrator
│   │       ├── rekordbox_index.py  # Vector search
│   │       ├── embedding.py  # Gemma embeddings
│   │       ├── constraints.py  # Safety checks
│   │       ├── stability.py  # Anti-flicker
│   │       └── bridge.py  # Keyboard output
│   └── scripts/
│       ├── run_rekordbox_voice_gemini.py
│       ├── run_rekordbox_voice_whisper.py
│       ├── run_rekordbox_voice_hybrid.py  ⭐
│       ├── finetune_wav2vec.py
│       ├── finetune_from_autocollected.py  ⭐
│       └── record_training_data_ui.py
├── Mapping/
│   └── commands.yaml  # 218 Rekordbox shortcuts
├── training_data/
│   └── auto_collected/  ⭐ Auto-saved by hybrid system
│       ├── manifest.jsonl
│       └── *.wav
├── models/
│   └── wav2vec2-dj-autocollected/  ⭐ Fine-tuned models
├── START_REKORDBOX_VOICE_GEMINI.sh
├── START_REKORDBOX_VOICE_WHISPER.sh
├── START_REKORDBOX_VOICE_HYBRID.sh  ⭐ Recommended
└── Documentation:
    ├── ARCHITECTURE.md  ← You are here
    ├── VOICE_CONTROL_SYSTEMS_GUIDE.md
    ├── VOICE_SYSTEMS_COMPARISON.md
    ├── FINE_TUNE_GUIDE.md
    └── QUICK_START.md
```
---
## Recommended System: Hybrid ⭐
Why?
- Starts good (90%)
- Gets excellent (98%)
- Automatic data collection (no manual labeling)
- Best long-term outcome
Path to Excellence:
```
Week 1:  Install → Use normally → 90% accuracy @ 125ms
Week 2:  Fine-tune (1hr) → 92% accuracy @ 110ms
Week 4:  Fine-tune (1hr) → 95% accuracy @ 100ms
Week 8:  Fine-tune (1hr) → 97% accuracy @ 90ms
Week 12: Fine-tune (1hr) → 98% accuracy @ 85ms ✨
```
---
## Next Steps
1. Choose your system (Hybrid recommended ⭐)
2. Launch it: `./START_REKORDBOX_VOICE_HYBRID.sh`
3. DJ normally (system auto-improves)
4. Fine-tune monthly (after collecting data)
5. Enjoy optimal performance! 🎉
---
For detailed guides, see:
- [QUICK_START.md](QUICK_START.md) - Get started in 5 minutes
- [VOICE_CONTROL_SYSTEMS_GUIDE.md](VOICE_CONTROL_SYSTEMS_GUIDE.md) - Complete guide
- [FINE_TUNE_GUIDE.md](FINE_TUNE_GUIDE.md) - Fine-tuning deep dive