Grand Diomande Research · Full HTML Reader

Improving Wav2Vec2 ASR Accuracy for DJ Commands



Problem

Wav2Vec2 is misrecognizing DJ commands:
- "play left" → "hey laughed", "they left", "lay left"
- Short, specific phrases are hard for general ASR models

Solutions (Ranked by Effectiveness)

✅ Option 1: Phonetic Matching (Quick Fix - 30 min)

Instead of exact text matching, pick the command whose text is closest to the ASR output. The example uses Jaro-Winkler string similarity, which serves here as a cheap stand-in for phonetic similarity.

Implementation:

python
# Install the string-similarity library first (run in a shell):
#   pip install jellyfish

from typing import Optional

from jellyfish import jaro_winkler_similarity

def fuzzy_match_command(asr_text: str, commands: list[str]) -> Optional[str]:
    """Return the command closest to the ASR text, or None if nothing is close enough."""
    best_match = None
    best_score = 0.0

    for cmd in commands:
        # Jaro-Winkler compares characters, so it only approximates how alike two phrases sound
        score = jaro_winkler_similarity(asr_text.lower(), cmd.lower())
        if score > best_score:
            best_score = score
            best_match = cmd

    # 0.85 is a starting threshold; tune it on your own recordings
    return best_match if best_score > 0.85 else None

# Example:
asr_text = "lay left"
commands = ["play left", "play right", "loop left", "sync left"]
match = fuzzy_match_command(asr_text, commands)
# Returns: "play left" (score ≈ 0.96)

Pros: Fast, works immediately
Cons: Still relies on imperfect ASR
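To see where the similarity score comes from, and where a 0.85 threshold stops helping, here is a dependency-free sketch of Jaro-Winkler. It is an illustrative reimplementation, not jellyfish's code; the names `jaro` and `jaro_winkler` are assumptions for this example, and small numeric differences from the library are possible.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: counts characters that match within a sliding window."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:
        return 0.0
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)
    used1 = [False] * len(s1)
    used2 = [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            if not used2[j] and s2[j] == ch:
                used1[i] = used2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Matched characters that appear in a different order count as transpositions
    seq1 = [c for c, u in zip(s1, used1) if u]
    seq2 = [c for c, u in zip(s2, used2) if u]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len(s1) + matches / len(s2)
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_scale: float = 0.1) -> float:
    """Boost the Jaro score for strings sharing a prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1.0 - base)


# Score the misrecognitions from the Problem section against the intended command
for heard in ["lay left", "they left", "hey laughed"]:
    print(f"{heard!r}: {jaro_winkler(heard, 'play left'):.3f}")
```

Under this sketch only "lay left" clears 0.85. "they left" and "hey laughed" score lower and would be rejected, so string similarity alone covers near-misses but not misrecognitions that replace whole words.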

---

✅ Option 2: Constrained Decoding (Medium - 2 hours)

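Bias Wav2Vec2's CTC decoding toward known DJ command words. With pyctcdecode, hotword boosting does this without a language model. For a stricter constraint, build a small KenLM n-gram model from the command list and pass it as kenlm_model_path.

Implementation:

python
# Install the decoder first (run in a shell):
#   pip install pyctcdecode

import torch
from pyctcdecode import build_ctcdecoder

# _load_wav2vec2() is the existing loader in wav2vec_asr.py
processor, model, device = _load_wav2vec2()

# Labels must follow the model's own CTC vocabulary order, so read them
# from the tokenizer instead of typing them by hand
vocab = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab.items(), key=lambda item: item[1])]

# Define your command vocabulary
commands = [
    "play left",
    "play right",
    "loop left",
    "loop right",
    "sync left",
    "sync right",
    # ... all your commands
]

# Without a KenLM model, `unigrams` alone does not restrict the output,
# so build a plain beam-search decoder and boost the command words instead
decoder = build_ctcdecoder(labels=labels, kenlm_model_path=None)

# Hotwords must match the vocabulary's casing
# (facebook/wav2vec2-base-960h, for example, is upper-case)
hotwords = sorted({word.upper() for cmd in commands for word in cmd.split()})

# Use in wav2vec_asr.py:
def transcribe_constrained(waveform, sample_rate):
    # Get logits (the model expects 16 kHz audio)
    inputs = processor(waveform, sampling_rate=sample_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values.to(device)).logits

    # Beam-search decoding biased toward the command words
    logits_np = logits.cpu().numpy()[0]
    text = decoder.decode(logits_np, hotwords=hotwords, hotword_weight=10.0)
    return text.lower()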

Pros: Much better accuracy, still fast
Cons: Requires additional library
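For a command set this small, a stricter alternative is to skip free-form decoding and score each command directly against the model's frame-level output, then pick the most likely one. The sketch below does this with the CTC forward algorithm in plain Python. It is not part of pyctcdecode: `ctc_log_likelihood`, `best_command`, and the toy three-symbol vocabulary are assumptions for illustration. Real use would pass log-softmaxed Wav2Vec2 logits and the tokenizer's IDs for each command.

```python
import math

NEG_INF = float("-inf")


def log_add(a: float, b: float) -> float:
    """Numerically stable log(exp(a) + exp(b))."""
    if a == NEG_INF:
        return b
    if b == NEG_INF:
        return a
    hi, lo = max(a, b), min(a, b)
    return hi + math.log1p(math.exp(lo - hi))


def ctc_log_likelihood(log_probs, target, blank=0):
    """Log P(target | audio) under CTC; log_probs is [frames][vocab] log-probabilities."""
    # Interleave blanks: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for tok in target:
        ext += [tok, blank]
    alpha = [NEG_INF] * len(ext)
    alpha[0] = log_probs[0][ext[0]]
    if len(ext) > 1:
        alpha[1] = log_probs[0][ext[1]]
    for frame in log_probs[1:]:
        new = [NEG_INF] * len(ext)
        for s, tok in enumerate(ext):
            total = alpha[s]  # stay on the same symbol
            if s >= 1:
                total = log_add(total, alpha[s - 1])  # advance by one symbol
            # Skipping a blank is allowed only between two different tokens
            if s >= 2 and tok != blank and tok != ext[s - 2]:
                total = log_add(total, alpha[s - 2])
            if total != NEG_INF:
                new[s] = total + frame[tok]
        alpha = new
    # Valid paths end on the last token or on the trailing blank
    result = alpha[-1]
    if len(ext) > 1:
        result = log_add(result, alpha[-2])
    return result


def best_command(log_probs, command_token_ids):
    """Return the command whose token sequence has the highest CTC likelihood."""
    return max(
        command_token_ids,
        key=lambda name: ctc_log_likelihood(log_probs, command_token_ids[name]),
    )


# Toy example: vocab = {0: blank, 1: "L", 2: "R"}, three audio frames
frames = [
    [math.log(p) for p in row]
    for row in [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.8, 0.1, 0.1]]
]
print(best_command(frames, {"left": [1], "right": [2]}))
```

On the toy frames the function picks "left". Because the output can only ever be one of the listed commands, this trades flexibility for a hard guarantee, which suits a fixed DJ vocabulary but not open dictation.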

---

✅ Option 3: Fine-tune Wav2Vec2 on Your Voice (Best - 1 day)

Record yourself saying commands and fine-tune the model.

Step 1: Record Training Data

Create a recording script:

python
# dj_agent/scripts/record_training_data.py
import os
import time  # needed for the countdown pause below

import sounddevice as sd
import soundfile as sf

commands = [
    "play left", "play right",
    "loop left", "loop right",
    "sync left", "sync right",
    "stop left", "stop right",
    # ... add all commands
]

output_dir = "training_data/recordings"
os.makedirs(output_dir, exist_ok=True)

for i, cmd in enumerate(commands):
    print(f"\n[{i+1}/{len(commands)}] Say: '{cmd}'")
    print("Recording in 2 seconds...")
    time.sleep(2)

    # Record 3 variations
    for j in range(3):
        print(f"  Variation {j+1}/3 - SPEAK NOW!")
        audio = sd.rec(int(3 * 16000), samplerate=16000, channels=1, dtype='float32')
        sd.wait()

        # Save
        filename = f"{output_dir}/{cmd.replace(' ', '_')}_{j+1}.wav"
        sf.write(filename, audio, 16000)
        print(f"  ✅ Saved: {filename}")

Recording 3 variations of each command yields about 100-150 audio files for a set of 35-50 commands.

Step 2: Create Training Manifest

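Save the manifest as training_data/manifest.jsonl, with one JSON object per line. JSON Lines has no comment syntax, so the file must contain only the entries themselves:

json
{"audio": "recordings/play_left_1.wav", "text": "play left"}
{"audio": "recordings/play_left_2.wav", "text": "play left"}
{"audio": "recordings/play_left_3.wav", "text": "play left"}

Continue with one line per recording.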
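Writing the manifest by hand becomes error-prone once there are 100+ files. The recording script names each file after its command (spaces replaced by underscores) plus a take number, so the transcript can be recovered from the filename. A minimal sketch, assuming that naming scheme and a `recordings` folder inside the data directory; `build_manifest` is a hypothetical helper, not part of the project.

```python
import json
from pathlib import Path


def build_manifest(data_dir: str, manifest_name: str = "manifest.jsonl") -> int:
    """Write one JSON line per WAV in <data_dir>/recordings; return the entry count."""
    root = Path(data_dir)
    rows = []
    for wav in sorted((root / "recordings").glob("*.wav")):
        # "play_left_2" -> command part "play_left", take "2"
        command_part, _, take = wav.stem.rpartition("_")
        if not command_part or not take.isdigit():
            continue  # skip files that do not follow the naming scheme
        rows.append({
            "audio": wav.relative_to(root).as_posix(),
            "text": command_part.replace("_", " "),
        })
    with open(root / manifest_name, "w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")
    return len(rows)


# Usage (assumes the recordings already exist):
#   build_manifest("training_data")
```

Deriving the text from the filename keeps audio and transcript in sync, at the cost of requiring that command names never end in an underscore followed by digits.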

Step 3: Fine-tune Wav2Vec2

python
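# dj_agent/scripts/finetune_wav2vec.py
import soundfile as sf
from datasets import load_dataset
from transformers import (
    Wav2Vec2ForCTC,
    Wav2Vec2Processor,
    Trainer,
    TrainingArguments
)

model_name = "facebook/wav2vec2-base-960h"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# Load your recordings (the manifest holds file paths and transcripts only)
dataset = load_dataset('json', data_files={'train': 'training_data/manifest.jsonl'})

def prepare(example):
    """Turn one manifest row into model inputs and CTC labels."""
    # Manifest paths are relative to training_data/
    audio, sr = sf.read(f"training_data/{example['audio']}", dtype="float32")
    example["input_values"] = processor(audio, sampling_rate=sr).input_values[0]
    # This checkpoint's vocabulary is upper-case
    example["labels"] = processor.tokenizer(example["text"].upper()).input_ids
    return example

train_dataset = dataset['train'].map(prepare, remove_columns=["audio", "text"])

def collate(features):
    """Pad audio and labels separately; -100 makes the loss ignore label padding."""
    batch = processor.feature_extractor.pad(
        [{"input_values": f["input_values"]} for f in features], return_tensors="pt"
    )
    labels = processor.tokenizer.pad(
        [{"input_ids": f["labels"]} for f in features], return_tensors="pt"
    )
    batch["labels"] = labels["input_ids"].masked_fill(labels["attention_mask"].ne(1), -100)
    return batch

# Fine-tune (simplified)
training_args = TrainingArguments(
    output_dir="./wav2vec2-dj-finetuned",
    per_device_train_batch_size=4,
    num_train_epochs=10,
    save_steps=50,
    logging_steps=10,
    # Step-wise evaluation is left out because it requires an eval_dataset
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    data_collator=collate,
)

trainer.train()
model.save_pretrained("./wav2vec2-dj-finetuned")
# Save the processor too, so the folder can be loaded as a complete model
processor.save_pretrained("./wav2vec2-dj-finetuned")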

Then use your fine-tuned model:

python
# In wav2vec_asr.py, change:
model_name = "./wav2vec2-dj-finetuned"  # Your model

Pros: Best accuracy, learns your voice
Cons: Takes time to record and train
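With only three takes per command, it is easy to end up evaluating on the same audio the model trained on. One simple way to get an honest accuracy number is to hold out one take of every command. A minimal sketch, assuming manifest rows shaped like those in Step 2; `split_manifest` and the one-take-per-command policy are illustrative choices, not something the source prescribes.

```python
import random
from collections import defaultdict


def split_manifest(rows, held_out_per_command=1, seed=0):
    """Split manifest rows into (train, validation), holding out takes of every command."""
    by_command = defaultdict(list)
    for row in rows:
        by_command[row["text"]].append(row)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, validation = [], []
    for command in sorted(by_command):
        takes = by_command[command][:]
        rng.shuffle(takes)
        # Never hold out every take of a command, or it would vanish from training
        n_val = min(held_out_per_command, len(takes) - 1)
        validation.extend(takes[:n_val])
        train.extend(takes[n_val:])
    return train, validation


# Example with two commands and three takes each
rows = [
    {"audio": f"recordings/{cmd.replace(' ', '_')}_{i}.wav", "text": cmd}
    for cmd in ["play left", "play right"]
    for i in (1, 2, 3)
]
train, validation = split_manifest(rows)
print(len(train), "train /", len(validation), "validation")
```

The validation rows can then be prepared the same way as the training rows and used for evaluation during or after fine-tuning.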

---

✅ Option 4: Hybrid Approach (Recommended - 1 hour)

Combine phonetic matching + semantic embeddings for robust matching.

python
# dj_agent/scripts/run_rekordbox_voice_wav2vec_hybrid.py

import time
from typing import Optional

from jellyfish import jaro_winkler_similarity

# Known mis-hearings for each canonical command word
PHONETIC_VARIANTS = {
    "play": ["hey", "they", "lay", "pay"],
    "loop": ["lube", "look", "luke"],
    "sync": ["sink", "think"],
    "left": ["laughed", "lift"],
    "right": ["write", "bright"],
}
# Reverse lookup: misheard word -> canonical word
VARIANT_TO_CANONICAL = {
    variant: canonical
    for canonical, variants in PHONETIC_VARIANTS.items()
    for variant in variants
}

def hybrid_command_match(
    asr_text: str,
    embedder,  # EmbeddingGemmaProvider
    orbiter,   # RekordboxOrbiter
) -> tuple[Optional[str], float]:
    """
    Match command using both phonetic + semantic similarity.

    Returns:
        (command_id, confidence_score), or (None, 0.0) if the search finds nothing
    """

    # Step 1: Replace known mis-hearings word by word. Substring replacement
    # would corrupt correct text: "lay" inside "play" would turn into "pplay".
    words = asr_text.lower().split()
    normalized = " ".join(VARIANT_TO_CANONICAL.get(word, word) for word in words)

    # Now normalized might be "play left" even if ASR said "hey laughed"

    # Step 2: Use normalized text for embedding search
    emb = embedder.embed_text(normalized)
    hits = orbiter.index.search(emb, top_k=5)

    if not hits:
        return None, 0.0

    # Step 3: Re-rank every hit by the combined score, so all candidates
    # are compared on the same scale
    best_hit = None
    best_score = -1.0

    for hit in hits:
        cmd_text = hit.metadata.get("name", "").lower()
        phonetic_score = jaro_winkler_similarity(normalized, cmd_text)
        combined_score = 0.6 * hit.score + 0.4 * phonetic_score

        if combined_score > best_score:
            best_score = combined_score
            best_hit = hit

    return best_hit.command_id, best_score

# Use in on_text callback:
def on_text(text: str):
    t0 = time.time()
    print(f'\n📝 ASR (raw): "{text}"')

    command_id, confidence = hybrid_command_match(text, embedder, orbiter)

    if command_id and confidence > 0.75:
        print(f'   🎯 Matched: {command_id} (confidence: {confidence:.2f})')
        # Execute command
    else:
        print(f'   ⚠️  Low confidence: {confidence:.2f}')

Pros: Best balance of accuracy and speed
Cons: Requires phonetic mapping for each language
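The re-ranking step is easier to reason about on concrete numbers. This sketch applies the same 0.6 / 0.4 weighting to two made-up search hits. The `Hit` class, the command IDs, the scores, and the use of `difflib.SequenceMatcher` as a standard-library stand-in for `jaro_winkler_similarity` are all assumptions for illustration; real search results from the index may look different.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class Hit:
    command_id: str
    name: str
    score: float  # semantic similarity, assumed to lie in [0, 1]


def text_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] (stand-in for Jaro-Winkler)."""
    return SequenceMatcher(None, a, b).ratio()


def rerank(normalized: str, hits: list[Hit], semantic_weight: float = 0.6):
    """Return (best_hit, combined_score) using a weighted sum of both signals."""
    scored = [
        (semantic_weight * h.score
         + (1 - semantic_weight) * text_similarity(normalized, h.name.lower()), h)
        for h in hits
    ]
    best_score, best_hit = max(scored, key=lambda pair: pair[0])
    return best_hit, best_score


# Hypothetical hits: the wrong deck is slightly ahead on semantic score alone
hits = [
    Hit("deck.right.play", "Play Right", 0.80),
    Hit("deck.left.play", "Play Left", 0.78),
]
best, score = rerank("play left", hits)
print(best.command_id, round(score, 2))
```

Here the text-similarity term overturns a narrow semantic lead for the wrong deck, which is the situation the weighted sum is meant to handle. The weights themselves are worth tuning on recorded utterances.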

---

✅ Option 5: Use Whisper Instead (Alternative - 30 min)

OpenAI's Whisper is typically more accurate than Wav2Vec2 out of the box.

python
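# Install Whisper first (run in a shell):
#   pip install openai-whisper

# Replace wav2vec_asr.py with whisper_asr.py:
from functools import lru_cache

import numpy as np
import whisper

@lru_cache(maxsize=1)
def _load_whisper():
    # Use tiny.en for speed, base.en for accuracy
    model = whisper.load_model("tiny.en")  # or "base.en"
    return model

def transcribe(waveform: np.ndarray, sample_rate: int) -> str:
    model = _load_whisper()

    # Whisper expects 16 kHz mono float32 in [-1, 1]; resample before calling if needed
    if sample_rate != 16000:
        raise ValueError(f"Expected 16 kHz audio, got {sample_rate} Hz")
    audio = np.asarray(waveform, dtype=np.float32).squeeze()
    result = model.transcribe(audio, language="en")
    return result["text"].strip().lower()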

Pros: Much better accuracy out-of-box
Cons: Slower (100-300ms vs 60ms), larger model
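Latency depends heavily on hardware and model size, so it is worth measuring on the target machine before switching. A minimal timing harness, assuming any `transcribe(waveform, sample_rate)` callable; `measure_latency_ms` and the dummy transcriber are placeholders for illustration.

```python
import statistics
import time


def measure_latency_ms(transcribe, waveform, sample_rate, runs=20, warmup=2):
    """Return (median_ms, max_ms) over repeated calls to a transcribe function."""
    for _ in range(warmup):
        transcribe(waveform, sample_rate)  # first calls often include model loading
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        transcribe(waveform, sample_rate)
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings), max(timings)


def dummy_transcribe(waveform, sample_rate):
    """Placeholder; swap in the Wav2Vec2 or Whisper transcribe function."""
    time.sleep(0.001)
    return "play left"


# One second of silence at 16 kHz as a stand-in for a recorded command
median_ms, worst_ms = measure_latency_ms(dummy_transcribe, [0.0] * 16000, 16000, runs=5)
print(f"median {median_ms:.1f} ms, worst {worst_ms:.1f} ms")
```

Reporting the worst case alongside the median matters for live use, since a single slow response is more noticeable than a slightly higher average.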

---

📊 Comparison

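| Method | Accuracy + Speed (combined stars) | Setup Time | Maintenance |
|---|---|---|---|
| Phonetic Matching | ⭐⭐⭐⭐⭐⭐⭐⭐ | 30 min | Low |
| Constrained Decoding | ⭐⭐⭐⭐⭐⭐⭐⭐ | 2 hours | Low |
| Fine-tuning | ⭐⭐⭐⭐⭐⭐⭐⭐⭐ | 1 day | Medium |
| Hybrid Approach | ⭐⭐⭐⭐⭐⭐⭐⭐⭐ | 1 hour | Low |
| Whisper | ⭐⭐⭐⭐⭐⭐⭐⭐ | 30 min | Low |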

---

🚀 Recommended Approach

### Immediate (Today):
1. Implement Hybrid Approach (phonetic + semantic matching)
2. Test with your voice commands
3. Measure accuracy improvement
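Measuring the improvement needs a number to compare before and after. A simple option is command accuracy: the fraction of test utterances mapped to the intended command, with rejected utterances (matcher returned None) and wrong commands counted separately. A minimal sketch; `command_accuracy` and the sample pairs are illustrative, not measured results.

```python
from collections import Counter


def command_accuracy(results):
    """results: list of (expected_command, predicted_command_or_None) pairs."""
    if not results:
        raise ValueError("no results to score")
    outcome = Counter()
    for expected, predicted in results:
        if predicted is None:
            outcome["rejected"] += 1  # below the confidence threshold
        elif predicted == expected:
            outcome["correct"] += 1
        else:
            outcome["wrong"] += 1     # a different command would have been executed
    total = len(results)
    return {
        "accuracy": outcome["correct"] / total,
        "rejected": outcome["rejected"] / total,
        "wrong": outcome["wrong"] / total,
    }


# Made-up outcomes for four test utterances
sample = [
    ("play left", "play left"),
    ("play left", None),
    ("loop right", "loop right"),
    ("sync left", "sync right"),
]
print(command_accuracy(sample))
```

Separating "rejected" from "wrong" is a deliberate choice: during a live set, ignoring a command is usually cheaper than triggering the wrong one, so the two error types deserve different weight when tuning thresholds.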

### This Week:
1. Record training data (3 variations × 50 commands = 150 files)
2. Fine-tune Wav2Vec2 on your recordings
3. A/B test: fine-tuned model vs hybrid approach

### Alternative:
1. Try Whisper if Wav2Vec2 accuracy is still unsatisfactory
2. Gemini Live (cloud) has even better accuracy but requires internet

---

🔧 Quick Implementation

I can implement the Hybrid Approach for you right now. It will:
- Map phonetic variants ("hey" → "play", "laughed" → "left")
- Use semantic embeddings for final matching
- Give you ~90

Want me to implement this?


Source Anchor

projects/Documentation/02-projects/dj-agent/studio/IMPROVE_WAV2VEC_ACCURACY.md
