Grand Diomande Research · Full HTML Reader

Fine-tune Wav2Vec2 for DJ Commands - Complete Guide

**Why this approach?** - ✅ **Speed**: <100ms total latency (acceptable for DJing) - ✅ **Accuracy**: Fine-tuned ASR + semantic retrieval = best of both worlds - ✅ **Flexibility**: Can add new commands without retraining audio model - ✅ **Debugging**: Can see transcribed text - ✅ **Practical**: Uses pre-trained models with fine-tuning

Agents That Account for Themselves research note experiment writeup candidate score 32 .md

Full Public Reader

Fine-tune Wav2Vec2 for DJ Commands - Complete Guide

🎯 Goal

Train Wav2Vec2 to recognize your voice saying DJ commands with >95

📊 Architecture Decision

What We're Doing ✅

Voice → Wav2Vec2 (fine-tuned) → Text → Embedding Gemma → Retrieval → Command
        ↑ 60ms                      ↑ 45ms
        Total: ~105ms latency

What We're NOT Doing ❌

Voice → Direct Audio Embedding → Command (Pure S2O)
        ↑ 20ms but requires custom training

Why this approach?
- ✅ Speed: <100ms total latency (acceptable for DJing)
- ✅ Accuracy: Fine-tuned ASR + semantic retrieval = best of both worlds
- ✅ Flexibility: Can add new commands without retraining audio model
- ✅ Debugging: Can see transcribed text
- ✅ Practical: Uses pre-trained models with fine-tuning

vs Pure S2O (Audio→Command):
- Faster (~20ms) but:
- Requires large audio-command dataset
- Less flexible (fixed command vocabulary)
- Harder to debug
- More complex training pipeline

🚀 Quick Start (3 Steps)

Step 1: Record Training Data (30-45 min)

bash
# Install UI dependencies
pip install sounddevice soundfile

# Launch recording UI
python dj_agent/scripts/record_training_data_ui.py

What you'll do:
1. UI shows a command (e.g., "play left")
2. Click "RECORD" button
3. Speak the command clearly after countdown
4. Repeat 3 times per command (for robustness)
5. ~40 commands × 3 variations = 120 recordings

Tips:
- Speak naturally (as you would while DJing)
- Use consistent pronunciation
- Record in similar environment to where you'll DJ
- Take breaks every 20 commands

Output:
- `training_data/recordings/*.wav` (audio files)
- `training_data/recordings/manifest.jsonl` (metadata)

---

Step 2: Fine-tune Wav2Vec2 (1-2 hours)

bash
# Install training dependencies
pip install datasets evaluate jiwer accelerate

# Fine-tune the model
python dj_agent/scripts/finetune_wav2vec.py \
    --data training_data/recordings/manifest.jsonl \
    --output models/wav2vec2-dj-finetuned \
    --epochs 30 \
    --batch-size 4

What happens:
1. Loads facebook/wav2vec2-base-960h (base model)
2. Trains on your recordings for 30 epochs
3. Saves fine-tuned model to `models/wav2vec2-dj-finetuned/`

Hardware:
- CPU: 1-2 hours (M1 Mac: ~45 min)
- GPU: 10-15 minutes

Expected Results:
- Before: WER ~30-40
- After: WER <5

---

Step 3: Use Fine-tuned Model

python
# Edit: dj_agent/voice_control/wav2vec_asr.py
# Line 38, change:

# From:
model_name = "facebook/wav2vec2-base-960h"

# To:
model_name = "models/wav2vec2-dj-finetuned"

Then test:

bash
./START_REKORDBOX_VOICE_WAV2VEC.sh

Expected output:

📝 Wav2Vec2 (complete): "play left"  ← Accurate!
   ⏱ ASR latency: 58.3 ms          ← Fast!
📝 ASR (text): "play left"
   🔎 Rekordbox: execute 3006 (Z) reason=approved
   ✓ Pressed Rekordbox shortcut: Z
   ⏱ Rekordbox latency: 42.1 ms

---

📈 Performance Comparison

### Before Fine-tuning (Base Wav2Vec2)
| Metric | Value | Quality |
|--------|-------|---------|
| ASR Accuracy | 60-70
| WER | 30-40
| "play left" → | "hey laughed", "they left" | ❌ Wrong |
| Latency | 60ms | ✅ Fast |

### After Fine-tuning (Your Voice)
| Metric | Value | Quality |
|--------|-------|---------|
| ASR Accuracy | 95-98
| WER | 2-5
| "play left" → | "play left" | ✅ Correct! |
| Latency | 60ms | ✅ Fast |

### vs Whisper
| Metric | Wav2Vec2 (Fine-tuned) | Whisper (tiny) | Whisper (base) |
|--------|----------------------|----------------|----------------|
| Accuracy | 95-98
| Latency | 60ms ✅ | 150ms | 300ms ❌ |
| Setup | 2 hours | 5 min | 5 min |
| Offline | ✅ Yes | ✅ Yes | ✅ Yes |

Conclusion: Fine-tuned Wav2Vec2 = best speed + accuracy balance

---

🎛️ Advanced: Architecture Deep Dive

Full Pipeline Latency Breakdown

┌─────────────────────────────────────────────────────────┐
│ Voice Input (microphone)                                │
└──────────────┬──────────────────────────────────────────┘
               │ Audio buffer (16kHz, mono)
               ↓
┌─────────────────────────────────────────────────────────┐
│ Wav2Vec2 ASR (fine-tuned on your voice)                │
│   - Model: wav2vec2-dj-finetuned                       │
│   - Latency: ~60ms (GPU) / ~80ms (CPU)                 │
│   - Output: "play left"                                 │
└──────────────┬──────────────────────────────────────────┘
               │ Text string
               ↓
┌─────────────────────────────────────────────────────────┐
│ Phonetic Normalization (optional, +5ms)                │
│   - "hey laughed" → "play left"                        │
│   - Catches ASR errors                                  │
└──────────────┬──────────────────────────────────────────┘
               │ Normalized text
               ↓
┌─────────────────────────────────────────────────────────┐
│ Embedding Gemma (semantic embedding)                    │
│   - Model: google/gemma-2-2b-it                        │
│   - Latency: ~35ms                                      │
│   - Output: 768-dim vector                              │
└──────────────┬──────────────────────────────────────────┘
               │ Embedding vector
               ↓
┌─────────────────────────────────────────────────────────┐
│ Rekordbox Index (FAISS/cosine similarity)               │
│   - Search command database                             │
│   - Latency: ~10ms                                      │
│   - Output: Top-5 command IDs                           │
└──────────────┬──────────────────────────────────────────┘
               │ Command ID + confidence
               ↓
┌─────────────────────────────────────────────────────────┐
│ Constraints & Safety Layer                              │
│   - Check deck state, timing, safety rules              │
│   - Latency: <1ms                                       │
└──────────────┬──────────────────────────────────────────┘
               │ Approved command
               ↓
┌─────────────────────────────────────────────────────────┐
│ Rekordbox Bridge (keyboard/MIDI)                        │
│   - Send shortcut to Rekordbox                          │
│   - Latency: <1ms                                       │
└──────────────┬──────────────────────────────────────────┘
               │ Keyboard event
               ↓
┌─────────────────────────────────────────────────────────┐
│ Rekordbox (receives command)                            │
└─────────────────────────────────────────────────────────┘

Total Latency: ~105ms (acceptable for live DJing)

Why Not Pure Audio→Command (S2O)?

S2O Model (like Speech2Object) directly embeds audio into command space:

Voice Audio → Audio Encoder → 768-dim embedding
                ↓
          Cosine similarity with command embeddings
                ↓
          Closest command (20ms total)

Pros:
- ⚡ Super fast (~20ms)
- No ASR errors

Cons:
- ❌ Requires training on paired audio-command data (thousands of samples)
- ❌ Fixed vocabulary (can't easily add new commands)
- ❌ Harder to debug (no text intermediate)
- ❌ More complex training pipeline
- ❌ Less research/tooling available

Our approach (ASR + Retrieval):
- ✅ Leverage pre-trained ASR (fine-tune with 100 samples)
- ✅ Flexible vocabulary (add commands via text)
- ✅ Debuggable (see transcription)
- ✅ Fast enough (<100ms)

Conclusion: ASR + Retrieval is the pragmatic choice for DJ commands.

---

🔧 Troubleshooting

Issue: WER still high after fine-tuning (>10

Solutions:
1. Record more variations per command (5 instead of 3)
2. Increase training epochs (50 instead of 30)
3. Check for inconsistent pronunciation
4. Review and delete bad recordings

Issue: Fine-tuning takes too long (>3 hours)

Solutions:
1. Reduce batch size: `--batch-size 2`
2. Use GPU (install `pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118`)
3. Reduce epochs: `--epochs 20`

Issue: Model overfits (great on train, poor on new recordings)

Solutions:
1. Record more diverse variations
2. Add data augmentation (pitch shift, speed change)
3. Reduce epochs or add regularization

Issue: Out of memory during training

Solutions:
1. Reduce batch size: `--batch-size 1`
2. Use gradient accumulation
3. Use smaller base model: `facebook/wav2vec2-base` (no 960h)

---

📚 References

  • Wav2Vec2 Paper: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477)
  • Fine-tuning Guide: [HuggingFace ASR Fine-tuning](https://huggingface.co/docs/transformers/tasks/asr)
  • S2O (Audio-to-Object): Similar to [AudioCLIP](https://arxiv.org/abs/2106.13043) but for commands

---

🎯 Next Steps

1. ✅ Record training data (30-45 min)
2. ✅ Fine-tune model (1-2 hours)
3. ✅ Test accuracy (should be >95
4. 🔄 Iterate if needed (more data, more epochs)
5. 🎉 Use in live DJ sets!

---

💡 Future Improvements

Option 1: Add Data Augmentation

python
# In finetune_wav2vec.py, augment audio:
import torchaudio.transforms as T

def augment_audio(waveform, sr):
    # Pitch shift
    if random.random() > 0.5:
        pitch_shift = T.PitchShift(sr, n_steps=random.randint(-2, 2))
        waveform = pitch_shift(waveform)

    # Time stretch
    if random.random() > 0.5:
        rate = random.uniform(0.9, 1.1)
        waveform = T.TimeStretch()(waveform, rate)

    return waveform

### Option 2: Explore Pure Audio→Command
If you want <20ms latency and are willing to invest in training:

1. Collect 1000+ audio samples (10+ variations per command)
2. Train audio encoder (Wav2Vec2 features → command embeddings)
3. Use contrastive learning (similar to CLIP)
4. Deploy for ultra-low latency

But for now, fine-tuned Wav2Vec2 + retrieval is the sweet spot!

---

Good luck with your fine-tuning! 🎤🎛️

Promotion Decision

Attach run IDs, datasets, metrics, and reproduction commands.

Source Anchor

projects/Documentation/02-projects/dj-agent/studio/docs/FINE_TUNE_GUIDE.md

Detected Structure

Method · Evaluation · References · Code Anchors · Architecture