Fine-tune Wav2Vec2 for DJ Commands - Complete Guide
🎯 Goal
Train Wav2Vec2 to recognize your voice saying DJ commands with >95% accuracy.
📊 Architecture Decision
What We're Doing ✅
```
Voice → Wav2Vec2 (fine-tuned) → Text → Embedding Gemma → Retrieval → Command
        ↑ 60ms                         ↑ 45ms
Total: ~105ms latency
```
What We're NOT Doing ❌
```
Voice → Direct Audio Embedding → Command (Pure S2O)
        ↑ 20ms but requires custom training
```
**Why this approach?**
- ✅ Speed: ~105ms total latency (acceptable for DJing)
- ✅ Accuracy: Fine-tuned ASR + semantic retrieval = best of both worlds
- ✅ Flexibility: Can add new commands without retraining audio model
- ✅ Debugging: Can see transcribed text
- ✅ Practical: Uses pre-trained models with fine-tuning
vs Pure S2O (Audio→Command):
- Faster (~20ms) but:
  - Requires large audio-command dataset
  - Less flexible (fixed command vocabulary)
  - Harder to debug
  - More complex training pipeline
🚀 Quick Start (3 Steps)
Step 1: Record Training Data (30-45 min)
```bash
# Install UI dependencies
pip install sounddevice soundfile
# Launch recording UI
python dj_agent/scripts/record_training_data_ui.py
```
What you'll do:
1. UI shows a command (e.g., "play left")
2. Click "RECORD" button
3. Speak the command clearly after countdown
4. Repeat 3 times per command (for robustness)
5. ~40 commands × 3 variations = 120 recordings
Tips:
- Speak naturally (as you would while DJing)
- Use consistent pronunciation
- Record in similar environment to where you'll DJ
- Take breaks every 20 commands
Output:
- `training_data/recordings/*.wav` (audio files)
- `training_data/recordings/manifest.jsonl` (metadata)
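Before training, it can help to check that the manifest is complete. The following is a minimal sketch, not the project's own tooling. The field names `audio_path` and `text` are assumptions, since the schema written by `record_training_data_ui.py` is not shown in this guide; adjust them to match your manifest.

```python
import json
from collections import Counter
from pathlib import Path


def load_manifest(manifest_path):
    """Read a JSON Lines manifest: one JSON object per non-empty line."""
    entries = []
    with open(manifest_path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                entries.append(json.loads(line))
    return entries


def summarize(entries, text_key="text", audio_key="audio_path"):
    """Count recordings per command and list audio files that are missing.

    text_key and audio_key are hypothetical field names.
    """
    counts = Counter(entry[text_key] for entry in entries)
    missing = [
        entry[audio_key]
        for entry in entries
        if not Path(entry[audio_key]).exists()
    ]
    return counts, missing
```

Commands with fewer than three recordings, or any missing audio files, are worth fixing before starting Step 2.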
---
Step 2: Fine-tune Wav2Vec2 (1-2 hours)
```bash
# Install training dependencies
pip install datasets evaluate jiwer accelerate
# Fine-tune the model
python dj_agent/scripts/finetune_wav2vec.py \
  --data training_data/recordings/manifest.jsonl \
  --output models/wav2vec2-dj-finetuned \
  --epochs 30 \
  --batch-size 4
```
What happens:
1. Loads facebook/wav2vec2-base-960h (base model)
2. Trains on your recordings for 30 epochs
3. Saves fine-tuned model to `models/wav2vec2-dj-finetuned/`
Hardware:
- CPU: 1-2 hours (M1 Mac: ~45 min)
- GPU: 10-15 minutes
Expected Results:
- Before: WER ~30-40%
- After: WER <5%
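WER (word error rate) is the number of word substitutions, deletions, and insertions needed to turn the transcription into the reference, divided by the number of reference words. The training dependencies include `jiwer` for this; the sketch below is a dependency-free illustration for a single utterance, not the evaluation code used by `finetune_wav2vec.py`.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("play left", "play left"))    # 0.0
print(word_error_rate("play left", "they left"))    # 0.5
print(word_error_rate("play left", "hey laughed"))  # 1.0
```

With two-word commands, a single wrong word already gives 50% WER on that utterance, so the overall figure is sensitive to a few bad transcriptions.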
---
Step 3: Use Fine-tuned Model
```python
# Edit: dj_agent/voice_control/wav2vec_asr.py
# Line 38, change:
# From:
model_name = "facebook/wav2vec2-base-960h"
# To:
model_name = "models/wav2vec2-dj-finetuned"
```
Then test:
```bash
./START_REKORDBOX_VOICE_WAV2VEC.sh
```
Expected output:
```
📝 Wav2Vec2 (complete): "play left" ← Accurate!
⏱ ASR latency: 58.3 ms ← Fast!
📝 ASR (text): "play left"
🔎 Rekordbox: execute 3006 (Z) reason=approved
✓ Pressed Rekordbox shortcut: Z
⏱ Rekordbox latency: 42.1 ms
```
---
📈 Performance Comparison
### Before Fine-tuning (Base Wav2Vec2)
| Metric | Value | Quality |
|--------|-------|---------|
| ASR Accuracy | 60-70% | |
| WER | 30-40% | |
| "play left" → | "hey laughed", "they left" | ❌ Wrong |
| Latency | 60ms | ✅ Fast |
### After Fine-tuning (Your Voice)
| Metric | Value | Quality |
|--------|-------|---------|
| ASR Accuracy | 95-98% | |
| WER | 2-5% | |
| "play left" → | "play left" | ✅ Correct! |
| Latency | 60ms | ✅ Fast |
### vs Whisper
| Metric | Wav2Vec2 (Fine-tuned) | Whisper (tiny) | Whisper (base) |
|--------|----------------------|----------------|----------------|
| Accuracy | 95-98% | | |
| Latency | 60ms ✅ | 150ms | 300ms ❌ |
| Setup | 2 hours | 5 min | 5 min |
| Offline | ✅ Yes | ✅ Yes | ✅ Yes |
Conclusion: Fine-tuned Wav2Vec2 = best speed + accuracy balance
---
🎛️ Advanced: Architecture Deep Dive
Full Pipeline Latency Breakdown
```
┌─────────────────────────────────────────────────────────┐
│  Voice Input (microphone)                               │
└──────────────┬──────────────────────────────────────────┘
               │ Audio buffer (16kHz, mono)
               ↓
┌─────────────────────────────────────────────────────────┐
│  Wav2Vec2 ASR (fine-tuned on your voice)                │
│  - Model: wav2vec2-dj-finetuned                         │
│  - Latency: ~60ms (GPU) / ~80ms (CPU)                   │
│  - Output: "play left"                                  │
└──────────────┬──────────────────────────────────────────┘
               │ Text string
               ↓
┌─────────────────────────────────────────────────────────┐
│  Phonetic Normalization (optional, +5ms)                │
│  - "hey laughed" → "play left"                          │
│  - Catches ASR errors                                   │
└──────────────┬──────────────────────────────────────────┘
               │ Normalized text
               ↓
┌─────────────────────────────────────────────────────────┐
│  Embedding Gemma (semantic embedding)                   │
│  - Model: google/gemma-2-2b-it                          │
│  - Latency: ~35ms                                       │
│  - Output: 768-dim vector                               │
└──────────────┬──────────────────────────────────────────┘
               │ Embedding vector
               ↓
┌─────────────────────────────────────────────────────────┐
│  Rekordbox Index (FAISS/cosine similarity)              │
│  - Search command database                              │
│  - Latency: ~10ms                                       │
│  - Output: Top-5 command IDs                            │
└──────────────┬──────────────────────────────────────────┘
               │ Command ID + confidence
               ↓
┌─────────────────────────────────────────────────────────┐
│  Constraints & Safety Layer                             │
│  - Check deck state, timing, safety rules               │
│  - Latency: <1ms                                        │
└──────────────┬──────────────────────────────────────────┘
               │ Approved command
               ↓
┌─────────────────────────────────────────────────────────┐
│  Rekordbox Bridge (keyboard/MIDI)                       │
│  - Send shortcut to Rekordbox                           │
│  - Latency: <1ms                                        │
└──────────────┬──────────────────────────────────────────┘
               │ Keyboard event
               ↓
┌─────────────────────────────────────────────────────────┐
│  Rekordbox (receives command)                           │
└─────────────────────────────────────────────────────────┘
```
Total Latency: ~105ms (acceptable for live DJing)

Why Not Pure Audio→Command (S2O)?
S2O Model (like Speech2Object) directly embeds audio into command space:
```
Voice Audio → Audio Encoder → 768-dim embedding
                    ↓
    Cosine similarity with command embeddings
                    ↓
    Closest command (20ms total)
```
Pros:
- ⚡ Super fast (~20ms)
- No ASR errors
Cons:
- ❌ Requires training on paired audio-command data (thousands of samples)
- ❌ Fixed vocabulary (can't easily add new commands)
- ❌ Harder to debug (no text intermediate)
- ❌ More complex training pipeline
- ❌ Less research/tooling available
Our approach (ASR + Retrieval):
- ✅ Leverage pre-trained ASR (fine-tune with 100 samples)
- ✅ Flexible vocabulary (add commands via text)
- ✅ Debuggable (see transcription)
- ✅ Fast enough (~105ms)
Conclusion: ASR + Retrieval is the pragmatic choice for DJ commands.
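The retrieval step, and the reason new commands need no audio retraining, can be illustrated with a small sketch. It uses plain cosine similarity over a dictionary rather than FAISS, and hand-written 3-dimensional vectors stand in for the 768-dimensional embeddings; the command names and vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query_vec, index, top_k=5):
    """Return the top_k (score, command) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), cmd) for cmd, vec in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy index: command text -> embedding (hypothetical values).
index = {
    "play left": [1.0, 0.0, 0.0],
    "play right": [0.0, 1.0, 0.0],
    "sync": [0.0, 0.0, 1.0],
}

# Adding a command only requires embedding its text and storing the vector.
index["loop 4 beats"] = [0.7, 0.0, 0.7]

# Pretend this is the embedding of the transcription "play left".
score, command = retrieve([0.9, 0.1, 0.0], index, top_k=1)[0]
print(command)  # play left
```

In the real pipeline the query vector would come from embedding the ASR transcription, and the top results would then pass through the constraints and safety layer.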
---
🔧 Troubleshooting
Issue: WER still high after fine-tuning (>10%)
Solutions:
1. Record more variations per command (5 instead of 3)
2. Increase training epochs (50 instead of 30)
3. Check for inconsistent pronunciation
4. Review and delete bad recordings
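One way to find candidates for review (solution 4) is to flag recordings that are unusually short or long, which often indicates a clipped or accidental take. This is a sketch using only the standard library; it assumes the files are uncompressed PCM WAV, and the duration thresholds are arbitrary starting points rather than values taken from the project.

```python
import wave
from pathlib import Path


def recording_duration(path):
    """Duration of a PCM WAV file in seconds."""
    with wave.open(str(path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()


def find_suspect_recordings(directory, min_seconds=0.3, max_seconds=5.0):
    """Return (filename, duration) for recordings outside the expected range."""
    suspects = []
    for path in sorted(Path(directory).glob("*.wav")):
        duration = recording_duration(path)
        if not (min_seconds <= duration <= max_seconds):
            suspects.append((path.name, duration))
    return suspects
```

Calling `find_suspect_recordings("training_data/recordings")` returns a list to listen to by hand; it does not delete anything.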
Issue: Fine-tuning takes too long (>3 hours)
Solutions:
1. Reduce batch size: `--batch-size 2`
2. Use a GPU (install a CUDA build of PyTorch: `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`)
3. Reduce epochs: `--epochs 20`
Issue: Model overfits (great on train, poor on new recordings)
Solutions:
1. Record more diverse variations
2. Add data augmentation (pitch shift, speed change)
3. Reduce epochs or add regularization
Issue: Out of memory during training
Solutions:
1. Reduce batch size: `--batch-size 1`
2. Use gradient accumulation
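3. Freeze the convolutional feature encoder with `model.freeze_feature_encoder()`. Switching to `facebook/wav2vec2-base` (no 960h) does not help: it has the same architecture and size as the 960h checkpoint.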
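Gradient accumulation (solution 2) keeps the effective batch size while holding only one small micro-batch in memory at a time. The toy sketch below shows why it works, using a one-parameter model `y = w * x` with mean squared error: averaging the gradients of equally sized micro-batches gives the same gradient as the full batch. With the Hugging Face `Trainer` this corresponds to the `gradient_accumulation_steps` argument of `TrainingArguments`; whether `finetune_wav2vec.py` exposes it as a flag is not stated in this guide.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error with respect to w for y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Average per-micro-batch gradients, as gradient accumulation does."""
    micro_batches = [
        data[i:i + micro_batch_size]
        for i in range(0, len(data), micro_batch_size)
    ]
    total = 0.0
    for micro_batch in micro_batches:
        # Each micro-batch gradient is scaled by 1 / number of steps.
        total += mse_gradient(w, micro_batch) / len(micro_batches)
    return total


data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2)]
full = mse_gradient(0.5, data)                               # batch size 4
accum = accumulated_gradient(0.5, data, micro_batch_size=1)  # 4 steps of size 1
print(abs(full - accum) < 1e-9)  # True
```

The equality is exact only when all micro-batches have the same size, so `--batch-size 1` with four accumulation steps behaves like `--batch-size 4` at roughly a quarter of the activation memory.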
---
📚 References
- Wav2Vec2 Paper: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)
- Fine-tuning Guide: [HuggingFace ASR Fine-tuning](https://huggingface.co/docs/transformers/tasks/asr)
- S2O (Audio-to-Object): Similar to [AudioCLIP](https://arxiv.org/abs/2106.13043) but for commands
---
🎯 Next Steps
1. ✅ Record training data (30-45 min)
2. ✅ Fine-tune model (1-2 hours)
3. ✅ Test accuracy (should be >95%)
4. 🔄 Iterate if needed (more data, more epochs)
5. 🎉 Use in live DJ sets!
---
💡 Future Improvements
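### Option 1: Add Data Augmentation
```python
# In finetune_wav2vec.py, augment audio:
import random

import torchaudio.functional as F
import torchaudio.transforms as T


def augment_audio(waveform, sr):
    """Randomly pitch-shift and/or speed-change a waveform tensor."""
    # Pitch shift by up to two semitones in either direction
    if random.random() > 0.5:
        pitch_shift = T.PitchShift(sr, n_steps=random.randint(-2, 2))
        waveform = pitch_shift(waveform)
    # Speed change: T.TimeStretch only accepts complex spectrograms, so
    # resample instead (this changes tempo and pitch together)
    if random.random() > 0.5:
        rate = random.choice([0.9, 0.95, 1.05, 1.1])
        waveform = F.resample(waveform, orig_freq=round(sr * rate), new_freq=sr)
    return waveform
```
### Option 2: Explore Pure Audio→Command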
If you want ~20ms latency and are willing to invest in training:
1. Collect 1000+ audio samples (10+ variations per command)
2. Train audio encoder (Wav2Vec2 features → command embeddings)
3. Use contrastive learning (similar to CLIP)
4. Deploy for ultra-low latency
But for now, fine-tuned Wav2Vec2 + retrieval is the sweet spot!
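For readers curious about the contrastive objective in step 3, here is a rough sketch of an InfoNCE-style loss in plain Python. It covers only the audio-to-command direction (CLIP averages both directions), the temperature of 0.07 is an assumed value, and a real implementation would use PyTorch tensors on batches of encoder outputs.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def info_nce_loss(audio_embs, command_embs, temperature=0.07):
    """Mean cross-entropy of matching audio_embs[i] to command_embs[i]."""
    audio = [normalize(v) for v in audio_embs]
    commands = [normalize(v) for v in command_embs]
    loss = 0.0
    for i, audio_vec in enumerate(audio):
        logits = [dot(audio_vec, cmd) / temperature for cmd in commands]
        # Numerically stable log-sum-exp over all candidate commands
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        loss += log_sum_exp - logits[i]
    return loss / len(audio)


aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)  # True
```

Training pushes each audio embedding toward its own command embedding and away from the others in the batch, which is why the aligned pairs give a much lower loss than the swapped ones.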
---
Good luck with your fine-tuning! 🎤🎛️