DJ Voice Control: Retrieval-Centric Architecture
Introduction
The DJ Voice Control system adapts the retrieval-centric speech-to-order paradigm to real-time DJ performance control. Instead of matching spoken orders to menu items, it matches spoken commands to DJ actions and keyboard shortcuts. The design goal is higher accuracy than a traditional ASR + NLU pipeline, achieved by learning a direct semantic mapping from audio utterances to command intents rather than going through an intermediate transcript.
Key Advantages:
- Sub-second latency: Direct audio → command matching without transcription
- Robust to variations: Handles different phrasings, accents, and noise
- No cloud dependency: Runs entirely locally, so there is no network round-trip latency
- Deterministic execution: Constraint solver ensures valid command combinations
- Continuous improvement: Easy to add new commands and voice samples
System Architecture Overview
┌─────────────────┐
│ Microphone │
│ Audio Stream │
└────────┬────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Audio Front-End (streaming, 20ms chunks) │
│ • VAD (café noise → DJ booth noise) │
│ • Noise suppression (crowd, speakers, mixing) │
│ • Log-Mel features (80 bins, 25ms windows) │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Dual-Encoder Model │
│ • Audio Tower: CNN + Transformer │
│ • Text Tower: EmbeddingGemma │
│ • Shared 512-dim embedding space │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Vector Index (FAISS, flat inner-product) │
│ • ~1,000 document embeddings (~200 commands) │
│ • Metadata: deck, action_type, shortcuts │
│ • Sub-10ms search latency │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Constraint Solver │
│ • Deck validation (left/right/both) │
│ • Context awareness (loop active, playing) │
│ • Conflict resolution (invalid combos) │
└────────┬────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────┐
│ Keyboard Action Executor │
│ • Press key combinations │
│ • Execute command chains │
│ • Debouncing and cooldown │
└─────────────────────────────────────────────────┘
Phase 0: Infrastructure and Governance
Directory Structure
dj_agent/voice_control/
├── retrieval/ # NEW: Retrieval-centric system
│ ├── catalog/ # Command catalog management
│ │ ├── commands.yaml # All DJ commands with metadata
│ │ ├── constraints.yaml # Constraint rules
│ │ ├── aliases.yaml # Command variations/synonyms
│ │ └── loader.py # Catalog loader
│ ├── indexing/ # Vector index management
│ │ ├── document_generator.py # Create retrieval corpus
│ │ ├── embeddings.py # Text embedding generation
│ │ └── vector_index.py # FAISS index wrapper
│ ├── audio/ # Audio processing
│ │ ├── vad.py # Voice activity detection
│ │ ├── features.py # Log-Mel extraction
│ │ └── streaming.py # Streaming pipeline
│ ├── model/ # Dual-encoder training
│ │ ├── audio_tower.py # Audio encoder
│ │ ├── text_tower.py # Text encoder
│ │ ├── dual_encoder.py # Combined model
│ │ └── trainer.py # Training loop
│ ├── inference/ # Inference pipeline
│ │ ├── retriever.py # ANN search
│ │ ├── reranker.py # Cross-encoder reranking
│ │ └── pipeline.py # End-to-end pipeline
│ ├── constraints/ # Constraint solver
│ │ ├── solver.py # Rule engine
│ │ └── dialogue.py # Clarification logic
│ └── data/ # Data operations
│ ├── synthetic.py # TTS generation
│ ├── augmentation.py # Acoustic augmentation
│ └── manifest.py # Training data management
├── core/ # Existing system (for comparison)
│ ├── gemini_listener.py
│ └── voice_controller.py
└── scripts/
└── train_retrieval_model.py # Training script
Phase 1: Command Catalog and Index Foundations
Command Catalog Structure (`commands.yaml`)
commands:
  # Basic playback commands
  - id: "play_left"
    canonical: "play left"
    category: "playback"
    deck: "left"
    action_type: "transport"
    shortcut: "w"
    variations:
      - "play left deck"
      - "start left"
      - "left play"
      - "play the left"
    metadata:
      priority: "high"
      requires_deck_loaded: true
  - id: "play_right"
    canonical: "play right"
    category: "playback"
    deck: "right"
    action_type: "transport"
    shortcut: "s"
    variations:
      - "play right deck"
      - "start right"
      - "right play"
  # Cue points
  - id: "cue_1_left"
    canonical: "cue 1 left"
    category: "cue"
    deck: "left"
    action_type: "navigation"
    shortcut: "1"
    variations:
      - "cue one left"
      - "hot cue 1 left"
      - "jump to cue 1 left"
  # Chain commands
  - id: "play_next"
    canonical: "play next"
    category: "automation"
    deck: "context"  # Uses current deck
    action_type: "chain"
    chain:
      - action: "move_down"
        delay: 0.1
      - action: "load_deck"
        delay: 0.2
      - action: "play_deck"
        delay: 0.3
    variations:
      - "next song"
      - "play next track"
      - "load and play next"
Constraints (`constraints.yaml`)
constraints:
  # Deck must be specified unless context is clear
  - type: "deck_required"
    applies_to: ["playback", "cue", "loop", "effects"]
    exceptions: ["chain_commands"]
    resolution: "use_current_deck"
  # Can't cue while playing
  - type: "state_conflict"
    condition: "deck_playing == true"
    conflicts_with: ["cue_jump"]
    resolution: "allow_with_warning"
  # Loop commands need active loop
  - type: "prerequisite"
    command_pattern: "exit_loop_*"
    requires: "loop_active == true"
    resolution: "ignore_if_not_met"
Document Generation
The catalog expands into a retrieval corpus:
# Base command
{
"doc_id": "play_left_0",
"text": "play left",
"metadata": {
"command_id": "play_left",
"deck": "left",
"shortcut": "w",
"category": "playback",
"action_type": "transport"
}
}
# Variations
{
"doc_id": "play_left_1",
"text": "play left deck",
"metadata": { ... }
}
{
"doc_id": "play_left_2",
"text": "start left",
"metadata": { ... }
}
Corpus Statistics:
- ~200 base commands (from your existing command_map)
- ~5 variations per command
- ~1,000 total documents in retrieval corpus
- Flat index sufficient (sub-1ms search)
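The expansion described above can be sketched as a small function. This is a minimal illustration, not the project's `document_generator.py`: the name `expand_catalog` is hypothetical, and it assumes each catalog entry carries the fields shown in `commands.yaml`, with `shortcut` left as `None` for chain commands that define no single key.

```python
from typing import Any


def expand_catalog(commands: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Expand catalog entries into one retrieval document per phrasing.

    Hypothetical helper: assumes the fields used in commands.yaml.
    """
    docs = []
    for cmd in commands:
        metadata = {
            "command_id": cmd["id"],
            "deck": cmd["deck"],
            "shortcut": cmd.get("shortcut"),  # chain commands have no single key
            "category": cmd["category"],
            "action_type": cmd["action_type"],
        }
        # Index 0 is the canonical phrase; variations follow in catalog order.
        texts = [cmd["canonical"], *cmd.get("variations", [])]
        for i, text in enumerate(texts):
            docs.append({
                "doc_id": f"{cmd['id']}_{i}",
                "text": text,
                "metadata": dict(metadata),  # copy so documents do not share state
            })
    return docs


catalog = [{
    "id": "play_left",
    "canonical": "play left",
    "category": "playback",
    "deck": "left",
    "action_type": "transport",
    "shortcut": "w",
    "variations": ["play left deck", "start left", "left play", "play the left"],
}]
docs = expand_catalog(catalog)
```

Every document keeps the `command_id` of its source command, so a hit on any phrasing resolves to the same action.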
Text Embeddings
Use EmbeddingGemma-300m (same as menu system):
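from sentence_transformers import SentenceTransformer
model = SentenceTransformer("google/embeddinggemma-300m")
# "300m" is the parameter count, not the vector size (768-dim by default for this model).
# Normalize so that inner product equals cosine similarity.
embeddings = model.encode(command_texts, normalize_embeddings=True)
Build FAISS index:
import faiss
# Flat index for exact search (sufficient for ~1k docs)
dim = embeddings.shape[1]  # take the dimension from the data instead of hard-coding it
index = faiss.IndexFlatIP(dim)  # Inner product (cosine on normalized vectors)
index.add(embeddings.astype("float32"))  # FAISS expects float32
Phase 2: Audio Front-End Processing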
Adaptation for DJ Booth Environment
Key Differences from Café:
- Noise profile: Music playback, crowd noise, speaker feedback
- Speaking style: Louder, more forceful commands
- Latency requirement: Even tighter (<800ms total)
- Hands-free: Can't press a push-to-talk (PTT) button during performance
Pipeline Architecture
import librosa
class DJAudioFrontEnd:
    """Streaming audio processing for DJ commands."""
    def __init__(self):
        self.sample_rate = 16000
        self.chunk_size = 320  # samples (20ms at 16 kHz) for ultra-low latency
        # Aggressiveness 3 is the strictest WebRTC VAD mode: it rejects the most non-speech
        self.vad = WebRTCVAD(aggressiveness=3)
        # DJ-specific noise suppression
        self.noise_profiles = {
            'music': self._load_music_profile(),
            'crowd': self._load_crowd_profile(),
            'feedback': self._load_feedback_profile()
        }
    def process_chunk(self, audio_chunk):
        """Process 20ms audio chunk."""
        # 1. Voice activity detection
        is_speech = self.vad.is_speech(audio_chunk)
        if not is_speech:
            return None
        # 2. Noise suppression (spectral subtraction)
        clean_audio = self.suppress_noise(audio_chunk)
        # 3. Log-Mel features
        features = self.extract_logmel(clean_audio)
        return features
    def extract_logmel(self, audio):
        """Extract 80-bin log-Mel features."""
        # 25ms windows (n_fft=400), 10ms hop (hop_length=160) at 16 kHz.
        # A single 320-sample chunk is shorter than one window, so librosa pads it;
        # buffering several chunks before this call avoids the padding.
        mel = librosa.feature.melspectrogram(
            y=audio,
            sr=self.sample_rate,
            n_mels=80,
            n_fft=400,
            hop_length=160
        )
        log_mel = librosa.power_to_db(mel)
        return log_mel
Phase 3: Dual-Encoder Training
Architecture
Audio Tower:
import torch.nn as nn
import torch.nn.functional as F
class AudioTower(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        # Conv feature extractor
        self.conv = nn.Sequential(
            nn.Conv1d(80, 128, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=2),
            nn.ReLU(),
        )
        # Transformer encoder
        self.transformer = nn.TransformerEncoder(
            # batch_first=True so the encoder accepts [batch, time', 256]
            nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
            num_layers=4
        )
        # Projection to shared space
        self.projection = nn.Linear(256, 512)
    def forward(self, log_mel_features):
        # log_mel_features: [batch, time, 80]
        x = self.conv(log_mel_features.transpose(1, 2))  # [batch, 256, time']
        x = x.transpose(1, 2)  # [batch, time', 256]
        x = self.transformer(x)  # [batch, time', 256]
        x = x.mean(dim=1)  # Temporal pooling → [batch, 256]
        x = self.projection(x)  # [batch, 512]
        return F.normalize(x, dim=-1)  # L2 normalize
Text Tower:
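class TextTower(nn.Module):
    def __init__(self):
        super().__init__()  # required before registering submodules
        # Use EmbeddingGemma. encode() does not track gradients, so the embedder is
        # effectively frozen here and only the projection is trained; fine-tuning it
        # would need a regular forward pass instead of encode().
        self.embedder = SentenceTransformer("google/embeddinggemma-300m")
        # Query the output size (768 by default); "300m" is the parameter count
        embed_dim = self.embedder.get_sentence_embedding_dimension()
        self.projection = nn.Linear(embed_dim, 512)
    def forward(self, texts):
        # texts: List[str]
        embeddings = self.embedder.encode(texts, convert_to_tensor=True)
        x = self.projection(embeddings)
        return F.normalize(x, dim=-1)
Training Objective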
Contrastive Loss (InfoNCE):
import torch
import torch.nn.functional as F
def contrastive_loss(audio_embeds, text_embeds, temperature=0.07):
    """
    audio_embeds: [batch, 512]
    text_embeds: [batch, 512]
    """
    # Compute similarity matrix
    sim_matrix = torch.matmul(audio_embeds, text_embeds.T) / temperature
    # [batch, batch]
    # Positive pairs are on the diagonal; build labels on the same device as the inputs
    labels = torch.arange(len(audio_embeds), device=audio_embeds.device)
    # Cross-entropy loss (each row should predict diagonal)
    loss = F.cross_entropy(sim_matrix, labels)
    return loss
Training Data
Initial Dataset (Your Voice):
1. Record command corpus:
- Read each command 3 times (clean)
- Record in booth environment (with music)
- ~200 commands × 3 reps × 2 conditions = 1,200 samples
2. Synthetic augmentation:
- Speed perturbation: 0.9x, 1.0x, 1.1x
- Pitch shift: ±2 semitones
- Background music mixing (SNR 10-20 dB)
- Crowd noise mixing
- → 12,000 augmented samples
3. TTS expansion:
- Use Piper TTS with multiple voices
- Generate paraphrases of commands
- → 50,000 additional samples
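For the background music and crowd mixing step, one common way to hit a target SNR is to scale the noise relative to the measured speech power. The sketch below assumes mono float arrays at the same sample rate; `mix_at_snr` is a hypothetical helper name, not an existing function in the codebase, and it does not handle clipping or all-zero inputs.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add noise to speech so the speech-to-noise power ratio equals snr_db.

    Hypothetical helper; assumes non-silent float arrays at one sample rate.
    """
    # Loop or trim the noise so it covers the whole utterance.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)

    # Solve speech_power / (scale**2 * noise_power) = 10 ** (snr_db / 10) for scale.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise


# Example: a 1-second 220 Hz tone standing in for speech, mixed at 10 dB SNR.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)
noise = rng.standard_normal(4000)  # shorter than the speech, so it gets tiled
mixed = mix_at_snr(speech, noise, snr_db=10)

# Recover the SNR from the mixture to check the scaling.
residual = mixed - speech
measured_snr = 10 * np.log10(np.mean(speech ** 2) / np.mean(residual ** 2))
```

Lower `snr_db` values make the music or crowd louder relative to the voice, so the 10 dB end of the range is the harder condition.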
Later: Real-world collection
- Record actual performances
- Capture natural variations
- Handle disfluencies ("um", "uh")
Phase 4: Retrieval and Reranking Stack
ANN Service
import faiss
import numpy as np
class CommandRetriever:
    """Fast nearest neighbor search for commands."""
    def __init__(self, index_path, document_metadata):
        # Load FAISS index
        self.index = faiss.read_index(index_path)
        self.docs = document_metadata  # List[Dict]
    def search(self, audio_embedding, top_k=10, deck_filter=None):
        """
        Search for matching commands.
        Args:
            audio_embedding: [512] normalized vector
            top_k: Number of candidates
            deck_filter: Optional deck constraint ("left", "right", "both")
        Returns:
            List of dicts with command_id, score, shortcut, and metadata.
            May hold fewer than top_k items, since the deck filter runs after search.
        """
        # FAISS expects a float32 array of shape [n_queries, dim]
        query = np.asarray(audio_embedding, dtype="float32").reshape(1, -1)
        scores, indices = self.index.search(query, top_k)
        # Build results
        results = []
        for score, idx in zip(scores[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when fewer than top_k hits exist
                continue
            doc = self.docs[idx]
            # Apply deck filter
            if deck_filter and doc['metadata']['deck'] != deck_filter:
                continue
            results.append({
                'command_id': doc['metadata']['command_id'],
                'score': float(score),
                # Chain commands define no single shortcut, so this may be None
                'shortcut': doc['metadata'].get('shortcut'),
                'metadata': doc['metadata']
            })
        return results
Streaming Inference
import torch
import torch.nn.functional as F
class StreamingCommandDetector:
    """Detect commands from streaming audio."""
    def __init__(self, model, retriever):
        self.model = model
        self.retriever = retriever
        self.buffer = []
        self.embedding_history = []
        self.stability_threshold = 0.01
    def process_chunk(self, audio_features):
        """Process incoming audio chunk."""
        self.buffer.append(audio_features)
        # Generate embedding every 320ms (16 chunks of 20ms)
        if len(self.buffer) >= 16:
            # Encode audio
            audio_tensor = torch.cat(self.buffer, dim=0)
            embedding = self.model.encode_audio(audio_tensor)
            # Clear the buffer before any early return so it cannot grow unbounded
            self.buffer = []
            # Check stability against the previous window
            stable = False
            if len(self.embedding_history) > 0:
                prev_embedding = self.embedding_history[-1]
                similarity = F.cosine_similarity(
                    embedding, prev_embedding, dim=0
                )
                stable = 1.0 - similarity < self.stability_threshold
            self.embedding_history.append(embedding)
            # Trigger retrieval if stable
            if stable:
                return self.trigger_retrieval(embedding)
        return None
    def trigger_retrieval(self, embedding):
        """Execute retrieval when stable."""
        results = self.retriever.search(embedding, top_k=5)
        # Check confidence
        if len(results) == 0:
            return None
        top_score = results[0]['score']
        if top_score < 0.6:  # Low confidence
            return None
        # Check score gap
        if len(results) > 1:
            score_gap = results[0]['score'] - results[1]['score']
            if score_gap < 0.1:  # Ambiguous
                # clarify() is not defined in this sketch; it is expected to ask
                # the DJ to choose between the two candidates
                return self.clarify(results[:2])
        return results[0]
Phase 5: Constraint Solver and Execution
Constraint Solver
import copy
class DJConstraintSolver:
    """Validate and resolve command constraints."""
    def __init__(self, constraints_config):
        self.constraints = self.load_constraints(constraints_config)
        self.current_state = {
            'left_deck': {'playing': False, 'loop_active': False},
            'right_deck': {'playing': False, 'loop_active': False},
            'current_deck': 'left'
        }
    def validate_command(self, command, context=None):
        """
        Validate command against constraints.
        Returns:
            (is_valid, resolved_command, issues)
        """
        issues = []
        # Check deck specification
        if command['metadata']['deck'] == 'context':
            command = self.resolve_deck_context(command)
        # Check prerequisites
        prereqs = self.get_prerequisites(command)
        for prereq in prereqs:
            if not self.check_prerequisite(prereq):
                issues.append(f"Prerequisite not met: {prereq}")
        # Check conflicts
        conflicts = self.check_conflicts(command)
        if conflicts:
            issues.append(f"Conflicts: {conflicts}")
        is_valid = len(issues) == 0
        return is_valid, command, issues
    def resolve_deck_context(self, command):
        """Resolve 'context' deck to current deck."""
        # Work on a copy so the shared catalog entry keeps deck == 'context'
        resolved = copy.deepcopy(command)
        resolved['metadata']['deck'] = self.current_state['current_deck']
        return resolved
Command Executor
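import time
from pynput.keyboard import Controller  # assumption: Controller is pynput's keyboard controller
class CommandExecutor:
    """Execute keyboard actions for commands."""
    def __init__(self, action_shortcuts=None):
        self.keyboard = Controller()
        # Maps chain action names (e.g. "move_down") to keys. The catalog's chain
        # steps name actions, not keys, so this lookup table is an assumption.
        self.action_shortcuts = action_shortcuts or {}
        self.last_command = None
        self.last_time = 0
        self.cooldown = 1.0  # seconds
    def execute(self, command):
        """Execute command with debouncing."""
        # Check cooldown
        if self.should_debounce(command):
            return False
        # Execute based on action type
        if command['metadata']['action_type'] == 'chain':
            self.execute_chain(command['chain'])
        else:
            # Single-key actions such as transport and navigation
            self.press_key(command['metadata']['shortcut'])
        # Update state
        self.last_command = command
        self.last_time = time.time()
        return True
    def should_debounce(self, command):
        """Skip a repeat of the same command inside the cooldown window (minimal sketch)."""
        if self.last_command is None:
            return False
        same = self.last_command.get('command_id') == command.get('command_id')
        return same and (time.time() - self.last_time) < self.cooldown
    def press_key(self, key):
        """Tap a single key."""
        self.keyboard.press(key)
        self.keyboard.release(key)
    def execute_chain(self, chain):
        """Execute sequence of actions."""
        for step in chain:
            self.press_key(self.action_shortcuts[step['action']])
            time.sleep(step['delay'])
Phase 6: Data Operations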
Recording Your Voice
Script: `scripts/record_command_corpus.py`
"""
Record your voice saying each command.
Usage:
    python record_command_corpus.py --output-dir ./data/recordings
"""
import argparse
import wave
from pathlib import Path
import pyaudio
import yaml
def record_command_corpus(output_dir):
    """Record all commands from catalog."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # Load catalog (path is relative to the working directory)
    with open('../retrieval/catalog/commands.yaml') as f:
        commands = yaml.safe_load(f)['commands']
    audio = pyaudio.PyAudio()
    for cmd in commands:
        print(f"\n=== Command: {cmd['canonical']} ===")
        input("Press Enter when ready to record...")
        # Record 3 repetitions
        for rep in range(3):
            print(f"Recording rep {rep+1}/3... Say: '{cmd['canonical']}'")
            # Record 2 seconds
            stream = audio.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=16000,
                input=True,
                frames_per_buffer=1024
            )
            frames = []
            for _ in range(0, int(16000 / 1024 * 2)):
                data = stream.read(1024)
                frames.append(data)
            stream.stop_stream()
            stream.close()
            # Save as 16-bit mono WAV at 16 kHz
            filename = f"{cmd['id']}_rep{rep}.wav"
            filepath = output_dir / filename
            with wave.open(str(filepath), 'wb') as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)
                wf.setframerate(16000)
                wf.writeframes(b''.join(frames))
            print(f"✓ Saved: {filename}")
    audio.terminate()  # release the audio device
    print("\n✓ Recording complete!")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Record the DJ command corpus")
    parser.add_argument("--output-dir", required=True)
    record_command_corpus(parser.parse_args().output_dir)
Augmentation Pipeline
class AudioAugmentation:
    """Augment recorded commands."""
    def __init__(self):
        self.music_noise = self.load_music_samples()
        self.crowd_noise = self.load_crowd_samples()
    def augment(self, audio, sample_rate=16000):
        """Apply augmentations."""
        augmented = []
        # Original
        augmented.append(('original', audio))
        # Speed perturbation
        for rate in [0.9, 1.1]:
            aug = self.speed_perturb(audio, rate)
            augmented.append((f'speed_{rate}', aug))
        # Pitch shift
        for semitones in [-2, 2]:
            aug = self.pitch_shift(audio, semitones)
            augmented.append((f'pitch_{semitones}', aug))
        # Music mixing
        for snr in [10, 15, 20]:
            aug = self.mix_music(audio, snr)
            augmented.append((f'music_snr{snr}', aug))
        # Crowd noise
        for snr in [10, 15, 20]:
            aug = self.mix_crowd(audio, snr)
            augmented.append((f'crowd_snr{snr}', aug))
        return augmented
Phase 7: Training Pipeline
Training Script
# 1. Record your voice
python scripts/record_command_corpus.py --output-dir ./data/recordings
# 2. Augment recordings
python scripts/augment_recordings.py \
--input-dir ./data/recordings \
--output-dir ./data/augmented
# 3. Generate manifest
python scripts/create_manifest.py \
--recordings-dir ./data/augmented \
--catalog ./retrieval/catalog/commands.yaml \
--output ./data/train_manifest.jsonl
# 4. Train model
python scripts/train_retrieval_model.py \
--manifest ./data/train_manifest.jsonl \
--output-dir ./models/dual_encoder \
--epochs 50 \
--batch-size 32
Phase 8: Evaluation and Deployment
Metrics
1. Retrieval Accuracy:
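- Top-1 accuracy: Share of utterances whose correct command is ranked first
- Top-5 accuracy: Share of utterances whose correct command appears in the top 5 results
- MRR: Mean reciprocal rank of the correct command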
2. Latency:
- VAD latency: Time to detect speech start
- Encoding latency: Audio → embedding time
- Search latency: Embedding → results time
- Total latency: Mic → keyboard press
3. Robustness:
- Accuracy with music at different volumes
- Accuracy with crowd noise
- Accuracy at different speaking volumes
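The retrieval accuracy metrics listed above can be computed from ranked results with a few lines of Python. This is a sketch under stated assumptions: `retrieval_metrics` is a hypothetical helper, each ranked list is assumed to hold `command_id` values with duplicates already collapsed (several corpus documents map to one command), and an utterance whose correct command is missing from its list contributes zero to every metric.

```python
def retrieval_metrics(ranked_ids, gold_ids, k=5):
    """Compute top-1 accuracy, top-k accuracy, and MRR.

    ranked_ids: one ranked list of command_ids per utterance (best first).
    gold_ids:   the correct command_id for each utterance.
    Hypothetical helper for illustration.
    """
    top1 = 0
    topk = 0
    reciprocal_rank_sum = 0.0
    for ranked, gold in zip(ranked_ids, gold_ids):
        if ranked and ranked[0] == gold:
            top1 += 1
        if gold in ranked[:k]:
            topk += 1
        if gold in ranked:
            # Ranks are 1-based, so the first position scores 1.0.
            reciprocal_rank_sum += 1.0 / (ranked.index(gold) + 1)
    n = len(gold_ids)
    return {
        "top1": top1 / n,
        f"top{k}": topk / n,
        "mrr": reciprocal_rank_sum / n,
    }


# Three utterances: correct at rank 1, correct at rank 2, and a miss.
ranked = [
    ["play_left", "play_right"],
    ["play_right", "play_left"],
    ["cue_1_left", "play_left"],
]
gold = ["play_left", "play_left", "play_right"]
metrics = retrieval_metrics(ranked, gold, k=5)
```

Running the same function separately on clean, music, and crowd-noise test subsets would give the per-condition robustness numbers.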
Deployment
# Production inference
from dj_agent.voice_control.retrieval import RetrievalVoiceController
controller = RetrievalVoiceController(
model_path='./models/dual_encoder/best.pt',
index_path='./models/index.faiss',
catalog_path='./retrieval/catalog/commands.yaml'
)
# Start streaming
controller.start() # Blocking, listens continuously
Comparison: Retrieval vs Gemini Live
| Aspect | Gemini Live | Retrieval System |
|---|---|---|
| Latency | 800ms-2s (network + buffering) | 200-400ms (all local) |
| Accuracy | 90-95% | |
| Privacy | Cloud processing | 100% local |
| Customization | Limited (system prompts) | Full control (fine-tune model) |
| Cost | API usage costs | One-time training cost |
| Offline | ❌ Requires internet | ✅ Fully offline |
| Robustness | General ASR | Optimized for DJ booth |
Next Steps
1. Phase 1 (This week):
- ✅ Design architecture (done)
- ⏳ Create catalog from existing command_map
- ⏳ Record 200 commands × 3 reps × 2 conditions (clean and booth) = 1,200 samples
2. Phase 2 (Next week):
- Implement audio front-end
- Build augmentation pipeline
- Generate 12,000 training samples
3. Phase 3 (Week after):
- Implement dual-encoder
- Train on augmented data
- Evaluate on held-out test set
4. Phase 4 (Final week):
- Build retrieval pipeline
- Integrate with keyboard executor
- Test in live DJ session
Estimated Timeline: 4 weeks to production-ready system
---
Advantages of This Approach:
1. No more fragmentation: Direct audio → command matching
2. Your voice, optimized: Model learns your speaking patterns
3. Booth-optimized: Trained with music/crowd noise
4. Sub-400ms latency: All processing local
5. Easy to extend: Just record new commands and retrain
6. Deterministic: Constraint solver ensures valid actions
7. Transparent: Full visibility into what's happening
Ready to start Phase 1? 🎧🎛️
Source Anchor
Comp-Core/apps/web/cc-studio/docs/dj_agent/voice_control/retrieval_architecture.md