# Improving Wav2Vec2 ASR Accuracy for DJ Commands
## Problem

Wav2Vec2 is misrecognizing DJ commands:
- "play left" → "hey laughed", "they left", "lay left"
- Short, domain-specific phrases give a general-purpose ASR model little context for telling similar-sounding words apart
## Solutions (Ranked by Effectiveness)

### ✅ Option 1: Phonetic Matching (Quick Fix - 30 min)

Instead of requiring an exact text match, snap the ASR output to the closest known command. The example uses Jaro-Winkler string similarity, which tolerates the small spelling differences that near-homophones produce.

Implementation:

```bash
# Install the string-similarity library
pip install jellyfish
```

```python
# Use in your command matching:
from typing import Optional

from jellyfish import jaro_winkler_similarity

def fuzzy_match_command(asr_text: str, commands: list[str]) -> Optional[str]:
    """Match ASR text to the closest command using string similarity."""
    best_match = None
    best_score = 0.0
    for cmd in commands:
        score = jaro_winkler_similarity(asr_text.lower(), cmd.lower())
        if score > best_score:
            best_score = score
            best_match = cmd
    # Reject weak matches instead of guessing
    return best_match if best_score > 0.85 else None

# Example:
asr_text = "lay left"
commands = ["play left", "play right", "loop left", "sync left"]
match = fuzzy_match_command(asr_text, commands)
# Returns: "play left" (score ≈ 0.96)
```

Pros: Fast, works immediately

Cons: Only corrects near misses; a badly garbled transcript such as "hey laughed" still falls below the threshold
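If adding a dependency is not an option, the same snap-to-closest idea can be sketched with the standard library. This is an illustrative alternative, not the matcher used elsewhere in this document: `closest_command` is a hypothetical helper, and `difflib` scores strings differently from Jaro-Winkler, so the `0.85` cutoff is an assumption that would need tuning on real transcripts.

```python
from __future__ import annotations

from difflib import get_close_matches
from typing import Optional


def closest_command(asr_text: str, commands: list[str], cutoff: float = 0.85) -> Optional[str]:
    """Return the command most similar to asr_text, or None if nothing clears cutoff."""
    # Compare in lowercase but return the command exactly as it was registered
    lowered = {cmd.lower(): cmd for cmd in commands}
    matches = get_close_matches(asr_text.lower().strip(), list(lowered), n=1, cutoff=cutoff)
    return lowered[matches[0]] if matches else None


commands = ["play left", "play right", "loop left", "sync left"]
near_miss = closest_command("lay left", commands)   # one missing letter
garbled = closest_command("hey laughed", commands)  # too far from every command
print(near_miss, garbled)
```

Returning `None` for the garbled transcript is deliberate: executing the wrong deck command is usually worse than asking the user to repeat it.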
---
### ✅ Option 2: Constrained Decoding (Medium - 2 hours)

Bias Wav2Vec2's CTC decoding toward the words used in your DJ commands. pyctcdecode's hotword boosting does this without training a language model.

Implementation:

```bash
# Install required packages
pip install pyctcdecode
```

```python
import torch
from pyctcdecode import build_ctcdecoder

# Define your command vocabulary
commands = [
    "play left",
    "play right",
    "loop left",
    "loop right",
    "sync left",
    "sync right",
    # ... all your commands
]
# Hotwords are single words, so split the phrases
hotwords = sorted({word for cmd in commands for word in cmd.split()})

def build_command_decoder(processor):
    """Build a CTC decoder whose labels follow the model's output order."""
    vocab = processor.tokenizer.get_vocab()
    labels = [tok.lower() for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
    return build_ctcdecoder(labels=labels)  # no KenLM model needed for hotword boosting

# Use in wav2vec_asr.py:
def transcribe_constrained(waveform, sample_rate):
    processor, model, device = _load_wav2vec2()
    decoder = build_command_decoder(processor)  # build once and cache in real code

    # Get logits (the processor raises if sample_rate is not the model's 16 kHz)
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits

    # Decode with the command words boosted
    logits_np = logits.cpu().numpy()[0]
    text = decoder.decode(logits_np, hotwords=hotwords, hotword_weight=10.0)
    return text.lower()
```

Pros: Usually fewer substitutions on command words, still fast

Cons: Requires an additional library; hotwords bias the output rather than strictly limiting it to the command list
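For a small, closed command set there is a stricter form of constraint: skip free-form decoding and score every command directly against the CTC output, then return the most likely one. The sketch below implements the standard CTC forward algorithm in plain Python on a toy four-symbol vocabulary. The names `ctc_log_likelihood` and `best_command` are illustrative and are not part of pyctcdecode or transformers, and the toy `labels` and frame probabilities are made up for the demonstration.

```python
from __future__ import annotations

import math

NEG_INF = float("-inf")


def _logsumexp(*xs: float) -> float:
    m = max(xs)
    if m == NEG_INF:
        return NEG_INF
    return m + math.log(sum(math.exp(x - m) for x in xs))


def ctc_log_likelihood(log_probs: list[list[float]], target: list[int], blank: int = 0) -> float:
    """Log P(target | frames) under CTC, computed with the forward algorithm."""
    # Interleave blanks around the labels: [blank, t1, blank, t2, ..., blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    size = len(ext)
    alpha = [NEG_INF] * size
    alpha[0] = log_probs[0][ext[0]]
    if size > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [NEG_INF] * size
        for s in range(size):
            paths = [alpha[s]]
            if s >= 1:
                paths.append(alpha[s - 1])
            # Skipping a blank is only allowed between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                paths.append(alpha[s - 2])
            new[s] = _logsumexp(*paths) + frame[ext[s]]
        alpha = new
    # A valid alignment ends on the last label or on the trailing blank
    return _logsumexp(alpha[-1], alpha[-2]) if size > 1 else alpha[-1]


def best_command(log_probs, labels, commands, blank=0):
    """Return (command, log-likelihood) for the best-scoring command."""
    index = {label: i for i, label in enumerate(labels)}
    scored = [
        (ctc_log_likelihood(log_probs, [index[ch] for ch in cmd], blank), cmd)
        for cmd in commands
    ]
    score, cmd = max(scored)
    return cmd, score


# Toy example: three frames over the vocabulary [blank, g, o, n]
labels = ["_", "g", "o", "n"]
frame_probs = [
    [0.1, 0.5, 0.1, 0.3],
    [0.6, 0.1, 0.2, 0.1],
    [0.1, 0.05, 0.8, 0.05],
]
log_probs = [[math.log(p) for p in frame] for frame in frame_probs]
cmd, score = best_command(log_probs, labels, ["go", "no", "on"])
print(cmd, round(math.exp(score), 3))
```

With a real model, `log_probs` would be the log-softmax of the logits for one utterance and `labels` the tokenizer vocabulary in index order. Commands then have to be spelled in the tokenizer's own alphabet; for `facebook/wav2vec2-base-960h` that should mean uppercase letters with `|` in place of spaces, but check the vocabulary of the checkpoint you load.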
---
### ✅ Option 3: Fine-tune Wav2Vec2 on Your Voice (Best - 1 day)

Record yourself saying commands and fine-tune the model.

#### Step 1: Record Training Data

Create a recording script:

```python
# dj_agent/scripts/record_training_data.py
import os
import time

import sounddevice as sd
import soundfile as sf

commands = [
    "play left", "play right",
    "loop left", "loop right",
    "sync left", "sync right",
    "stop left", "stop right",
    # ... add all commands
]
output_dir = "training_data/recordings"
os.makedirs(output_dir, exist_ok=True)

for i, cmd in enumerate(commands):
    print(f"\n[{i+1}/{len(commands)}] Say: '{cmd}'")
    print("Recording in 2 seconds...")
    time.sleep(2)
    # Record 3 variations
    for j in range(3):
        print(f" Variation {j+1}/3 - SPEAK NOW!")
        # 3 seconds of mono audio at 16 kHz, the rate Wav2Vec2 expects
        audio = sd.rec(int(3 * 16000), samplerate=16000, channels=1, dtype='float32')
        sd.wait()
        # Save
        filename = f"{output_dir}/{cmd.replace(' ', '_')}_{j+1}.wav"
        sf.write(filename, audio, 16000)
        print(f" ✅ Saved: {filename}")
```

Three variations of each command gives roughly 100-150 audio files (for example, 3 × 50 commands = 150).
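The manifest described in Step 2 does not have to be typed by hand. Because the recording script encodes the command in each filename (`play_left_2.wav`), the transcript can be recovered from the name. The helper below is a minimal sketch under that assumption; `build_manifest` is a hypothetical name, and it only works if command phrases contain no underscores of their own.

```python
from __future__ import annotations

import json
from pathlib import Path


def build_manifest(recordings_dir: Path, manifest_path: Path) -> int:
    """Write one JSON line per WAV file and return the number of entries written."""
    entries = []
    for wav in sorted(recordings_dir.glob("*.wav")):
        # "play_left_2" -> ("play_left", "2"): drop the trailing variation number
        stem, _, variation = wav.stem.rpartition("_")
        if not stem or not variation.isdigit():
            continue  # skip files that do not follow the naming scheme
        entries.append({
            # Store paths relative to the manifest's own directory
            "audio": wav.relative_to(manifest_path.parent).as_posix(),
            "text": stem.replace("_", " "),
        })
    with manifest_path.open("w", encoding="utf-8") as fh:
        for entry in entries:
            fh.write(json.dumps(entry) + "\n")
    return len(entries)


# Typical call, run from the project root after recording:
# build_manifest(Path("training_data/recordings"), Path("training_data/manifest.jsonl"))
```

Regenerating the manifest this way after every recording session keeps it in sync with the files on disk.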
#### Step 2: Create Training Manifest

Save one JSON object per line in `training_data/manifest.jsonl`:

```json
{"audio": "recordings/play_left_1.wav", "text": "play left"}
{"audio": "recordings/play_left_2.wav", "text": "play left"}
{"audio": "recordings/play_left_3.wav", "text": "play left"}
```

Continue in the same pattern for every recording.

#### Step 3: Fine-tune Wav2Vec2
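```python
# dj_agent/scripts/finetune_wav2vec.py
import soundfile as sf
from datasets import load_dataset
from transformers import (
    Trainer,
    TrainingArguments,
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
)

# Fine-tune (simplified)
model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name, ctc_loss_reduction="mean")
model.freeze_feature_encoder()  # small dataset: leave the convolutional front end untouched

# Load your recordings
dataset = load_dataset('json', data_files={'train': 'training_data/manifest.jsonl'})

def prepare(example):
    # Manifest paths are relative to training_data/
    audio, sr = sf.read(f"training_data/{example['audio']}", dtype="float32")
    example["input_values"] = processor(audio, sampling_rate=sr).input_values[0]
    # This checkpoint's vocabulary is uppercase
    example["labels"] = processor.tokenizer(example["text"].upper()).input_ids
    return example

train = dataset["train"].map(prepare, remove_columns=["audio", "text"])

def collate(features):
    # Pad audio and labels separately; -100 makes the CTC loss ignore label padding
    batch = processor.feature_extractor.pad(
        [{"input_values": f["input_values"]} for f in features], return_tensors="pt"
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return batch

training_args = TrainingArguments(
    output_dir="./wav2vec2-dj-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    save_steps=50,
    logging_steps=10,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train,
    data_collator=collate,
)
trainer.train()
model.save_pretrained("./wav2vec2-dj-finetuned")
processor.save_pretrained("./wav2vec2-dj-finetuned")  # so the directory loads on its own
```

Then use your fine-tuned model:

```python
# In wav2vec_asr.py, change:
model_name = "./wav2vec2-dj-finetuned"  # Your model
```

Pros: Best accuracy, learns your voice

Cons: Takes time to record and train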
---
### ✅ Option 4: Hybrid Approach (Recommended - 1 hour)

Combine phonetic matching + semantic embeddings for robust matching.

```python
# dj_agent/scripts/run_rekordbox_voice_wav2vec_hybrid.py
import time
from typing import Optional

from jellyfish import jaro_winkler_similarity

def hybrid_command_match(
    asr_text: str,
    embedder,  # EmbeddingGemmaProvider
    orbiter,   # RekordboxOrbiter
) -> tuple[Optional[str], float]:
    """
    Match command using both phonetic + semantic similarity.

    Returns:
        (command_id, confidence_score)
    """
    # Step 1: Phonetic matching for known patterns
    phonetic_matches = {
        "play": ["hey", "they", "lay", "pay"],
        "loop": ["lube", "look", "luke"],
        "sync": ["sink", "think"],
        "left": ["laughed", "lift"],
        "right": ["write", "bright"],
    }
    variant_to_canonical = {
        variant: canonical
        for canonical, variants in phonetic_matches.items()
        for variant in variants
    }
    # Replace whole words only; substring replacement would turn "play" into "pplay" via "lay"
    words = asr_text.lower().split()
    normalized = " ".join(variant_to_canonical.get(word, word) for word in words)
    # Now normalized might be "play left" even if ASR said "hey laughed"

    # Step 2: Use normalized text for embedding search
    emb = embedder.embed_text(normalized)
    hits = orbiter.index.search(emb, top_k=5)
    if not hits:
        return None, 0.0

    # Step 3: Re-rank every hit by the same combined score
    best_hit, best_score = None, -1.0
    for hit in hits:
        cmd_text = hit.metadata.get("name", "").lower()
        phonetic_score = jaro_winkler_similarity(normalized, cmd_text)
        combined_score = 0.6 * hit.score + 0.4 * phonetic_score
        if combined_score > best_score:
            best_score = combined_score
            best_hit = hit
    return best_hit.command_id, best_score

# Use in on_text callback:
def on_text(text: str):
    t0 = time.time()
    print(f'\n📝 ASR (raw): "{text}"')
    command_id, confidence = hybrid_command_match(text, embedder, orbiter)
    latency_ms = (time.time() - t0) * 1000
    if command_id and confidence > 0.75:
        print(f' 🎯 Matched: {command_id} (confidence: {confidence:.2f}, {latency_ms:.0f} ms)')
        # Execute command
    else:
        print(f' ⚠️ Low confidence: {confidence:.2f}')
```

Pros: Best balance of accuracy and speed

Cons: Requires a hand-maintained phonetic mapping for each language
---
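### ✅ Option 5: Use Whisper Instead (Alternative - 30 min)

OpenAI's Whisper is generally more accurate than the base Wav2Vec2 checkpoint on everyday English speech, at the cost of higher latency.

```bash
# Install Whisper
pip install openai-whisper
```

```python
# Replace wav2vec_asr.py with whisper_asr.py:
import re
from functools import lru_cache

import numpy as np
import whisper

@lru_cache(maxsize=1)
def _load_whisper():
    # Use tiny.en for speed, base.en for accuracy
    model = whisper.load_model("tiny.en")  # or "base.en"
    return model

def transcribe(waveform: np.ndarray, sample_rate: int) -> str:
    if sample_rate != 16000:
        raise ValueError("Whisper expects 16 kHz audio; resample before calling transcribe()")
    model = _load_whisper()
    # Whisper expects mono float32 samples in [-1, 1]
    audio = np.asarray(waveform, dtype=np.float32).reshape(-1)
    result = model.transcribe(audio, language="en")
    # Whisper adds punctuation ("Play left."), which would break command matching
    text = re.sub(r"[^\w\s']", "", result["text"])
    return text.strip().lower()
```

Pros: Much better accuracy out-of-box

Cons: Slower (100-300ms vs 60ms), larger model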
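Latency figures like these depend heavily on hardware and model size, so it is worth measuring both backends on the machine that will run the DJ agent. The helper below is a small sketch using only the standard library; `median_latency_ms` is a hypothetical name, and the commented call assumes a `transcribe(waveform, sample_rate)` function like the ones in this document.

```python
from __future__ import annotations

import statistics
import time
from typing import Callable


def median_latency_ms(fn: Callable[[], object], runs: int = 20, warmup: int = 2) -> float:
    """Median wall-clock time of fn() in milliseconds, after a few warm-up calls."""
    for _ in range(warmup):
        fn()  # warm-up absorbs one-off costs such as model loading
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)


# Example with a real backend (waveform is a 16 kHz float32 array):
# print(median_latency_ms(lambda: transcribe(waveform, 16000)))
```

The median is used rather than the mean so that a single slow run, for example from a background process, does not skew the comparison.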
---
## 📊 Comparison
| Method | Accuracy | Speed | Setup Time | Maintenance |
|---|---|---|---|---|
| Phonetic Matching | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 30 min | Low |
| Constrained Decoding | ⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 2 hours | Low |
| Fine-tuning | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | 1 day | Medium |
| Hybrid Approach | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 1 hour | Low |
| Whisper | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | 30 min | Low |
---
## 🚀 Recommended Approach
### Immediate (Today):
1. Implement Hybrid Approach (phonetic + semantic matching)
2. Test with your voice commands
3. Measure accuracy improvement
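Measuring the improvement needs a fixed set of labeled transcripts, so that every matcher is scored on the same input. The sketch below is one way to do that: `command_accuracy` and `exact_match` are hypothetical names, and the sample transcripts are the misrecognitions listed in the Problem section. In practice the samples would come from logging real ASR output alongside the command that was actually spoken.

```python
from __future__ import annotations

from collections import Counter
from typing import Callable, Optional

Matcher = Callable[[str], Optional[str]]


def command_accuracy(matcher: Matcher, samples: list[tuple[str, str]]) -> tuple[float, Counter]:
    """Fraction of samples mapped to the expected command, plus a tally of misses."""
    if not samples:
        raise ValueError("samples must not be empty")
    misses: Counter = Counter()
    correct = 0
    for asr_text, expected in samples:
        predicted = matcher(asr_text)
        if predicted == expected:
            correct += 1
        else:
            misses[(asr_text, predicted)] += 1
    return correct / len(samples), misses


# (raw ASR text, command that was actually spoken)
samples = [
    ("hey laughed", "play left"),
    ("they left", "play left"),
    ("lay left", "play left"),
    ("play left", "play left"),
]


def exact_match(text: str) -> Optional[str]:
    """Baseline: accept the transcript only if it is already a known command."""
    return text if text in {"play left", "play right"} else None


baseline, misses = command_accuracy(exact_match, samples)
print(f"exact match: {baseline:.0%}")
```

Passing a different matcher, such as a fuzzy or hybrid one wrapped to take a single string, gives a number that is directly comparable with the baseline, and the `misses` tally shows which transcripts still fail.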
### This Week:
1. Record training data (3 variations × 50 commands = 150 files)
2. Fine-tune Wav2Vec2 on your recordings
3. A/B test: fine-tuned model vs hybrid approach
### Alternative:
1. Try Whisper if Wav2Vec2 accuracy is still unsatisfactory
2. Gemini Live (cloud) may be more accurate still, but it requires an internet connection
---
## 🔧 Quick Implementation
I can implement the Hybrid Approach for you right now. It will:
- Map phonetic variants ("hey" → "play", "laughed" → "left")
- Use semantic embeddings for final matching
- Give you ~90
Want me to implement this?
Source Anchor
projects/Documentation/02-projects/dj-agent/studio/IMPROVE_WAV2VEC_ACCURACY.md