Grand Diomande Research · Full HTML Reader

DJ Voice Control: Retrieval-Centric Architecture

Introduction

The DJ Voice Control system adapts the speech-to-order retrieval-centric paradigm for real-time DJ performance control. Instead of matching spoken orders to menu items, we match spoken commands to DJ actions and keyboard shortcuts. This approach is designed to be more accurate than traditional ASR + NLU pipelines because it learns a direct semantic mapping between audio utterances and command intents, so no intermediate transcript can introduce errors.

Key Advantages:
- Sub-second latency: Direct audio → command matching without transcription
- Robust to variations: Handles different phrasings, accents, and noise
- No cloud dependency: Runs entirely locally, so there is no network latency
- Deterministic execution: Constraint solver ensures valid command combinations
- Continuous improvement: Easy to add new commands and voice samples

System Architecture Overview

┌─────────────────┐
│  Microphone     │
│  Audio Stream   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────────────────┐
│  Audio Front-End (20ms chunks, 320ms windows)   │
│  • VAD (café noise → DJ booth noise)            │
│  • Noise suppression (crowd, speakers, mixing)  │
│  • Log-Mel features (80 bins, 25ms windows)     │
└────────┬────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────┐
│  Dual-Encoder Model                             │
│  • Audio Tower: CNN + Transformer               │
│  • Text Tower: EmbeddingGemma                   │
│  • Shared 512-dim embedding space               │
└────────┬────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────┐
│  Vector Index (FAISS flat, inner product)       │
│  • ~1,000 embeddings (~200 commands)            │
│  • Metadata: deck, action_type, shortcuts       │
│  • Sub-10ms search latency                      │
└────────┬────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────┐
│  Constraint Solver                              │
│  • Deck validation (left/right/both)            │
│  • Context awareness (loop active, playing)     │
│  • Conflict resolution (invalid combos)         │
└────────┬────────────────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────────────────┐
│  Keyboard Action Executor                       │
│  • Press key combinations                       │
│  • Execute command chains                       │
│  • Debouncing and cooldown                      │
└─────────────────────────────────────────────────┘

Phase 0: Infrastructure and Governance

Directory Structure

dj_agent/voice_control/
├── retrieval/                    # NEW: Retrieval-centric system
│   ├── catalog/                  # Command catalog management
│   │   ├── commands.yaml         # All DJ commands with metadata
│   │   ├── constraints.yaml      # Constraint rules
│   │   ├── aliases.yaml          # Command variations/synonyms
│   │   └── loader.py             # Catalog loader
│   ├── indexing/                 # Vector index management
│   │   ├── document_generator.py # Create retrieval corpus
│   │   ├── embeddings.py         # Text embedding generation
│   │   └── vector_index.py       # FAISS index wrapper
│   ├── audio/                    # Audio processing
│   │   ├── vad.py                # Voice activity detection
│   │   ├── features.py           # Log-Mel extraction
│   │   └── streaming.py          # Streaming pipeline
│   ├── model/                    # Dual-encoder training
│   │   ├── audio_tower.py        # Audio encoder
│   │   ├── text_tower.py         # Text encoder
│   │   ├── dual_encoder.py       # Combined model
│   │   └── trainer.py            # Training loop
│   ├── inference/                # Inference pipeline
│   │   ├── retriever.py          # ANN search
│   │   ├── reranker.py           # Cross-encoder reranking
│   │   └── pipeline.py           # End-to-end pipeline
│   ├── constraints/              # Constraint solver
│   │   ├── solver.py             # Rule engine
│   │   └── dialogue.py           # Clarification logic
│   └── data/                     # Data operations
│       ├── synthetic.py          # TTS generation
│       ├── augmentation.py       # Acoustic augmentation
│       └── manifest.py           # Training data management
├── core/                         # Existing system (for comparison)
│   ├── gemini_listener.py
│   └── voice_controller.py
└── scripts/
    └── train_retrieval_model.py  # Training script

Phase 1: Command Catalog and Index Foundations

Command Catalog Structure (`commands.yaml`)

yaml
commands:
  # Basic playback commands
  - id: "play_left"
    canonical: "play left"
    category: "playback"
    deck: "left"
    action_type: "transport"
    shortcut: "w"
    variations:
      - "play left deck"
      - "start left"
      - "left play"
      - "play the left"
    metadata:
      priority: "high"
      requires_deck_loaded: true

  - id: "play_right"
    canonical: "play right"
    category: "playback"
    deck: "right"
    action_type: "transport"
    shortcut: "s"
    variations:
      - "play right deck"
      - "start right"
      - "right play"

  # Cue points
  - id: "cue_1_left"
    canonical: "cue 1 left"
    category: "cue"
    deck: "left"
    action_type: "navigation"
    shortcut: "1"
    variations:
      - "cue one left"
      - "hot cue 1 left"
      - "jump to cue 1 left"

  # Chain commands
  - id: "play_next"
    canonical: "play next"
    category: "automation"
    deck: "context"  # Uses current deck
    action_type: "chain"
    chain:
      - action: "move_down"
        delay: 0.1
      - action: "load_deck"
        delay: 0.2
      - action: "play_deck"
        delay: 0.3
    variations:
      - "next song"
      - "play next track"
      - "load and play next"

Constraints (`constraints.yaml`)

yaml
constraints:
  # Deck must be specified unless context is clear
  - type: "deck_required"
    applies_to: ["playback", "cue", "loop", "effects"]
    exceptions: ["chain_commands"]
    resolution: "use_current_deck"

  # Cue jumps during playback are allowed, but flagged with a warning
  - type: "state_conflict"
    condition: "deck_playing == true"
    conflicts_with: ["cue_jump"]
    resolution: "allow_with_warning"

  # Loop commands need active loop
  - type: "prerequisite"
    command_pattern: "exit_loop_*"
    requires: "loop_active == true"
    resolution: "ignore_if_not_met"
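
As a minimal sketch of how a `prerequisite` rule like the last one might be evaluated: the helper names `parse_condition` and `check_prerequisite` are hypothetical, and the condition grammar is assumed to be limited to `name == true|false`, which is all the rules above use. The real rule engine in `solver.py` may work differently.

```python
from fnmatch import fnmatch


def parse_condition(expr):
    """Split 'loop_active == true' into ('loop_active', True)."""
    name, op, value = expr.split()
    if op != "==" or value not in ("true", "false"):
        raise ValueError(f"unsupported condition: {expr!r}")
    return name, value == "true"


def check_prerequisite(rule, command_id, deck_state):
    """Return the action to take for command_id under one prerequisite rule.

    'execute' means the rule does not apply or is satisfied; otherwise the
    rule's resolution string (e.g. 'ignore_if_not_met') is returned.
    """
    if not fnmatch(command_id, rule["command_pattern"]):
        return "execute"
    name, expected = parse_condition(rule["requires"])
    if deck_state.get(name, False) == expected:
        return "execute"
    return rule["resolution"]


rule = {
    "type": "prerequisite",
    "command_pattern": "exit_loop_*",
    "requires": "loop_active == true",
    "resolution": "ignore_if_not_met",
}

no_loop = check_prerequisite(rule, "exit_loop_left", {"loop_active": False})
with_loop = check_prerequisite(rule, "exit_loop_left", {"loop_active": True})
unrelated = check_prerequisite(rule, "play_left", {"loop_active": False})
print(no_loop, with_loop, unrelated)
```

Keeping the rule data declarative and the evaluator generic means new prerequisites can be added in YAML without touching Python.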

Document Generation

The catalog expands into a retrieval corpus:

python
# Base command
{
  "doc_id": "play_left_0",
  "text": "play left",
  "metadata": {
    "command_id": "play_left",
    "deck": "left",
    "shortcut": "w",
    "category": "playback",
    "action_type": "transport"
  }
}

# Variations
{
  "doc_id": "play_left_1",
  "text": "play left deck",
  "metadata": { ... }
}

{
  "doc_id": "play_left_2",
  "text": "start left",
  "metadata": { ... }
}

Corpus Statistics:
- ~200 base commands (from your existing command_map)
- ~5 variations per command
- ~1,000 total documents in retrieval corpus
- Flat index sufficient (sub-1ms search)
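
The expansion from catalog entries to documents could be sketched as follows. The function name `expand_catalog` is hypothetical; the field names follow the `commands.yaml` and document examples above, and the catalog is given as an inline list so the sketch runs without a YAML parser.

```python
def expand_catalog(commands):
    """Turn catalog entries into retrieval documents (canonical text first)."""
    docs = []
    for cmd in commands:
        metadata = {
            "command_id": cmd["id"],
            "deck": cmd["deck"],
            "shortcut": cmd.get("shortcut"),  # chain commands have none
            "category": cmd["category"],
            "action_type": cmd["action_type"],
        }
        texts = [cmd["canonical"], *cmd.get("variations", [])]
        for i, text in enumerate(texts):
            docs.append({
                "doc_id": f"{cmd['id']}_{i}",
                "text": text,
                "metadata": dict(metadata),  # copy so documents do not share state
            })
    return docs


catalog = [{
    "id": "play_left",
    "canonical": "play left",
    "category": "playback",
    "deck": "left",
    "action_type": "transport",
    "shortcut": "w",
    "variations": ["play left deck", "start left", "left play", "play the left"],
}]

docs = expand_catalog(catalog)
print(len(docs), docs[1]["doc_id"], docs[1]["text"])
```

Each document keeps the `command_id` of its parent, so any paraphrase that matches at search time resolves back to the same command.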

Text Embeddings

Use EmbeddingGemma-300m (same as menu system):

python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("google/embeddinggemma-300m")
embeddings = model.encode(command_texts, normalize_embeddings=True)  # 768-dim unit vectors by default

Build FAISS index:

python
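import faiss
import numpy as np

# Flat index for exact search (sufficient for ~1k docs)
embeddings = np.ascontiguousarray(embeddings, dtype="float32")
faiss.normalize_L2(embeddings)  # Inner product equals cosine only for unit vectors
index = faiss.IndexFlatIP(embeddings.shape[1])  # Take the width from the data
index.add(embeddings)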
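
The reason normalization matters for an inner-product index can be shown with a small pure-Python sketch. It does not use FAISS; `flat_search` is a hypothetical stand-in that scores every document exhaustively, which is the same scoring an exact inner-product index performs. The vectors are made-up 2-D examples.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def flat_search(query, corpus, top_k=1):
    """Exhaustive inner-product search; returns (score, index) pairs, best first."""
    scored = sorted(((dot(query, doc), i) for i, doc in enumerate(corpus)), reverse=True)
    return scored[:top_k]


raw_corpus = [[3.0, 0.0], [1.0, 1.0], [0.0, 0.5]]
raw_query = [2.0, 1.9]

# Without normalization the long vector [3, 0] wins on magnitude alone.
unnormalized_best = flat_search(raw_query, raw_corpus)[0][1]

# After normalization the inner product equals cosine similarity,
# so the vector pointing in nearly the same direction wins.
corpus = [l2_normalize(v) for v in raw_corpus]
query = l2_normalize(raw_query)
normalized_best = flat_search(query, corpus)[0][1]

print(unnormalized_best, normalized_best)
```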

Phase 2: Audio Front-End Processing

Adaptation for DJ Booth Environment

Key Differences from Café:
- Noise profile: Music playback, crowd noise, speaker feedback
- Speaking style: Louder, more forceful commands
- Latency requirement: Even tighter (<800ms total)
- Hands-free: Can't press a push-to-talk (PTT) button during performance

Pipeline Architecture

python
class DJAudioFrontEnd:
    """Streaming audio processing for DJ commands."""

    def __init__(self):
        self.sample_rate = 16000
        self.chunk_size = 320  # 20ms chunks for ultra-low latency
        self.vad = WebRTCVAD(aggressiveness=3)  # Most aggressive non-speech filtering

        # DJ-specific noise suppression
        self.noise_profiles = {
            'music': self._load_music_profile(),
            'crowd': self._load_crowd_profile(),
            'feedback': self._load_feedback_profile()
        }

    def process_chunk(self, audio_chunk):
        """Process 20ms audio chunk."""
        # 1. Voice activity detection
        is_speech = self.vad.is_speech(audio_chunk)
        if not is_speech:
            return None

        # 2. Noise suppression (spectral subtraction)
        clean_audio = self.suppress_noise(audio_chunk)

        # 3. Log-Mel features
        features = self.extract_logmel(clean_audio)

        return features

    def extract_logmel(self, audio):
        """Extract 80-bin log-Mel features."""
        # 25ms windows, 10ms hop
        mel = librosa.feature.melspectrogram(
            y=audio,
            sr=self.sample_rate,
            n_mels=80,
            n_fft=400,
            hop_length=160
        )
        log_mel = librosa.power_to_db(mel)
        return log_mel
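
The timing parameters above can be checked with a little arithmetic. This sketch assumes centered STFT framing (librosa's default), where the frame count is `1 + num_samples // hop_length`; the helper names are hypothetical.

```python
SAMPLE_RATE = 16000


def ms_to_samples(ms, sample_rate=SAMPLE_RATE):
    return sample_rate * ms // 1000


def centered_frame_count(num_samples, hop_length):
    """Frames produced by an STFT with centered, padded windows."""
    return 1 + num_samples // hop_length


chunk = ms_to_samples(20)      # 320 samples per streaming chunk
window = ms_to_samples(25)     # 400 samples (n_fft)
hop = ms_to_samples(10)        # 160 samples (hop_length)
segment = ms_to_samples(320)   # 5120 samples, i.e. 16 chunks

frames_per_segment = centered_frame_count(segment, hop)
chunk_shorter_than_window = chunk < window

print(chunk, window, hop, segment, frames_per_segment, chunk_shorter_than_window)
```

One consequence worth noting: a single 20 ms chunk is shorter than the 25 ms analysis window, so log-Mel features are probably better computed over buffered audio (for example the 320 ms segment) than chunk by chunk.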

Phase 3: Dual-Encoder Training

Architecture

Audio Tower:

python
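import torch.nn as nn
import torch.nn.functional as F


class AudioTower(nn.Module):
    def __init__(self):
        super().__init__()  # Required before registering submodules

        # Conv feature extractor
        self.conv = nn.Sequential(
            nn.Conv1d(80, 128, kernel_size=3, stride=2),
            nn.ReLU(),
            nn.Conv1d(128, 256, kernel_size=3, stride=2),
            nn.ReLU(),
        )

        # Transformer encoder (batch_first so inputs are [batch, time, dim])
        self.transformer = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True),
            num_layers=4
        )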

        # Projection to shared space
        self.projection = nn.Linear(256, 512)

    def forward(self, log_mel_features):
        # log_mel_features: [batch, time, 80]
        x = self.conv(log_mel_features.transpose(1, 2))  # [batch, 256, time']
        x = x.transpose(1, 2)  # [batch, time', 256]
        x = self.transformer(x)  # [batch, time', 256]
        x = x.mean(dim=1)  # Temporal pooling → [batch, 256]
        x = self.projection(x)  # [batch, 512]
        return F.normalize(x, dim=-1)  # L2 normalize
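
To see how many time steps reach the transformer, the two convolutions can be traced with the standard `Conv1d` output-length formula. The helper names are hypothetical, and the input frame counts (33 frames for 320 ms, 201 frames for 2 s) assume a 10 ms hop with centered framing.

```python
def conv1d_out_len(length, kernel_size, stride, padding=0, dilation=1):
    """Output length of a 1-D convolution (PyTorch Conv1d formula)."""
    return (length + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


def audio_tower_time_steps(num_frames):
    """Time steps left after the two stride-2, kernel-3 convolutions."""
    after_first = conv1d_out_len(num_frames, kernel_size=3, stride=2)
    return conv1d_out_len(after_first, kernel_size=3, stride=2)


steps_320ms = audio_tower_time_steps(33)    # 33 -> 16 -> 7
steps_2s = audio_tower_time_steps(201)      # 201 -> 100 -> 49

print(steps_320ms, steps_2s)
```

The roughly 4x temporal downsampling keeps self-attention cheap, but a 320 ms window leaves only a handful of positions for the transformer to attend over.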

Text Tower:

python
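class TextTower(nn.Module):
    def __init__(self):
        super().__init__()  # Required before registering submodules
        # EmbeddingGemma is effectively frozen here: encode() runs without gradients
        self.embedder = SentenceTransformer("google/embeddinggemma-300m")
        # Read the embedding width from the model (768 by default) instead of hardcoding it
        embed_dim = self.embedder.get_sentence_embedding_dimension()
        self.projection = nn.Linear(embed_dim, 512)

    def forward(self, texts):
        # texts: List[str]
        embeddings = self.embedder.encode(texts, convert_to_tensor=True)
        x = self.projection(embeddings)
        return F.normalize(x, dim=-1)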

Training Objective

Contrastive Loss (InfoNCE):

python
def contrastive_loss(audio_embeds, text_embeds, temperature=0.07):
    """
    audio_embeds: [batch, 512]
    text_embeds: [batch, 512]
    """
    # Compute similarity matrix
    sim_matrix = torch.matmul(audio_embeds, text_embeds.T) / temperature
    # [batch, batch]

    # Positive pairs are on the diagonal
    labels = torch.arange(len(audio_embeds), device=audio_embeds.device)

    # Cross-entropy loss (each row should predict diagonal)
    loss = F.cross_entropy(sim_matrix, labels)
    return loss
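
To make the objective concrete without a deep-learning dependency, here is a pure-Python sketch on hand-written similarity matrices. Note that it implements a symmetric variant (audio-to-text plus text-to-audio), which is a common choice for dual encoders but is a swap from the one-directional loss above; the matrices are made-up values.

```python
import math


def log_softmax_row(row):
    peak = max(row)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def info_nce(sim, temperature=0.07):
    """Symmetric InfoNCE over a square similarity matrix (positives on the diagonal)."""
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    transposed = [list(col) for col in zip(*scaled)]
    audio_to_text = -sum(log_softmax_row(scaled[i])[i] for i in range(n)) / n
    text_to_audio = -sum(log_softmax_row(transposed[i])[i] for i in range(n)) / n
    return 0.5 * (audio_to_text + text_to_audio)


aligned = [[0.9, 0.1, 0.0],
           [0.1, 0.8, 0.2],
           [0.0, 0.2, 0.9]]
shuffled = [[0.1, 0.9, 0.0],
            [0.8, 0.1, 0.2],
            [0.0, 0.2, 0.9]]
uniform = [[0.5] * 3 for _ in range(3)]

loss_aligned = info_nce(aligned)
loss_shuffled = info_nce(shuffled)
loss_uniform = info_nce(uniform)  # equals log(3): every candidate is equally likely

print(loss_aligned < loss_uniform < loss_shuffled)
```

The loss falls below `log(batch_size)` only when matching pairs score higher than the in-batch negatives, which is what training pushes toward.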

Training Data

Initial Dataset (Your Voice):
1. Record command corpus:
- Read each command 3 times (clean)
- Record in booth environment (with music)
- ~200 commands × 3 reps × 2 conditions = 1,200 samples

2. Synthetic augmentation:
- Speed perturbation: 0.9x, 1.0x, 1.1x
- Pitch shift: ±2 semitones
- Background music mixing (SNR 10-20 dB)
- Crowd noise mixing
- → 12,000 augmented samples

3. TTS expansion:
- Use Piper TTS with multiple voices
- Generate paraphrases of commands
- → 50,000 additional samples
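
The sample counts above follow from simple multiplication. This sketch assumes the music and crowd mixes each use three SNR levels (10, 15, 20 dB), as in the Phase 6 augmentation code, and that the 1.0x speed setting is the unmodified original and therefore not counted as a new variant.

```python
commands = 200
repetitions = 3
conditions = 2  # clean and booth

recorded = commands * repetitions * conditions

# New variants generated per recording
variants = {
    "speed": 2,   # 0.9x, 1.1x
    "pitch": 2,   # -2, +2 semitones
    "music": 3,   # SNR 10, 15, 20 dB
    "crowd": 3,   # SNR 10, 15, 20 dB
}
augmented = recorded * sum(variants.values())
tts = 50_000

total = recorded + augmented + tts
print(recorded, augmented, total)
```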

Later: Real-world collection
- Record actual performances
- Capture natural variations
- Handle disfluencies ("um", "uh")

Phase 4: Retrieval and Reranking Stack

ANN Service

python
import faiss
import numpy as np


class CommandRetriever:
    """Nearest-neighbor search over command embeddings."""

    def __init__(self, index_path, document_metadata):
        # Load FAISS index
        self.index = faiss.read_index(index_path)
        self.docs = document_metadata  # List[Dict]

    def search(self, audio_embedding, top_k=10, deck_filter=None):
        """
        Search for matching commands.

        Args:
            audio_embedding: [512] normalized vector (NumPy array or CPU tensor)
            top_k: Number of candidates
            deck_filter: Optional deck constraint ("left", "right", "both")

        Returns:
            List of result dicts (command_id, score, shortcut, metadata),
            best first, with at most one entry per command
        """
        # FAISS expects a float32 array of shape [1, dim]
        query = np.asarray(audio_embedding, dtype=np.float32).reshape(1, -1)
        scores, indices = self.index.search(query, top_k)

        # Build results
        results = []
        seen = set()
        for score, idx in zip(scores[0], indices[0]):
            if idx < 0:  # FAISS pads with -1 when it has fewer than top_k hits
                continue
            doc = self.docs[idx]
            command_id = doc['metadata']['command_id']

            # Variations of one command share a command_id; keep only the best hit
            if command_id in seen:
                continue

            # Apply deck filter
            if deck_filter and doc['metadata']['deck'] != deck_filter:
                continue

            seen.add(command_id)
            results.append({
                'command_id': command_id,
                'score': float(score),
                'shortcut': doc['metadata'].get('shortcut'),  # None for chain commands
                'metadata': doc['metadata']
            })

        return results

Streaming Inference

python
class StreamingCommandDetector:
    """Detect commands from streaming audio."""

    def __init__(self, model, retriever):
        self.model = model
        self.retriever = retriever
        self.buffer = []
        self.embedding_history = []
        self.stability_threshold = 0.01

    def process_chunk(self, audio_features):
        """Process incoming audio chunk."""
        self.buffer.append(audio_features)

        # Generate embedding every 320ms (16 chunks)
        if len(self.buffer) >= 16:
            # Encode audio, then clear the buffer for the next window
            audio_tensor = torch.cat(self.buffer, dim=0)
            embedding = self.model.encode_audio(audio_tensor)
            self.buffer = []

            # Record this window before any early return so state stays in sync
            prev_embedding = (
                self.embedding_history[-1] if self.embedding_history else None
            )
            self.embedding_history.append(embedding)

            # Trigger retrieval if stable
            if prev_embedding is not None:
                similarity = F.cosine_similarity(
                    embedding, prev_embedding, dim=0
                )
                if 1.0 - similarity < self.stability_threshold:
                    return self.trigger_retrieval(embedding)

        return None

    def trigger_retrieval(self, embedding):
        """Execute retrieval when stable."""
        results = self.retriever.search(embedding, top_k=5)

        # Check confidence
        if len(results) == 0:
            return None

        top_score = results[0]['score']
        if top_score < 0.6:  # Low confidence
            return None

        # Check score gap
        if len(results) > 1:
            score_gap = results[0]['score'] - results[1]['score']
            if score_gap < 0.1:  # Ambiguous
                return self.clarify(results[:2])

        return results[0]
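
The accept, clarify, or reject decision can be isolated into a pure function, which makes the thresholds easy to test. This is a sketch under two assumptions: the name `decide` is hypothetical, and `results` is sorted best first with at most one entry per command (otherwise two paraphrases of the same command would look ambiguous).

```python
def decide(results, min_score=0.6, min_gap=0.1):
    """Classify ranked retrieval results as 'reject', 'clarify', or 'accept'.

    results: list of dicts with 'command_id' and 'score', best first.
    Returns (decision, candidates).
    """
    if not results or results[0]["score"] < min_score:
        return "reject", []
    if len(results) > 1 and results[0]["score"] - results[1]["score"] < min_gap:
        return "clarify", results[:2]
    return "accept", results[:1]


confident = decide([
    {"command_id": "play_left", "score": 0.82},
    {"command_id": "play_right", "score": 0.55},
])
ambiguous = decide([
    {"command_id": "play_left", "score": 0.75},
    {"command_id": "play_right", "score": 0.70},
])
weak = decide([{"command_id": "cue_1_left", "score": 0.40}])

print(confident[0], ambiguous[0], weak[0])
```

In a live set, a missed command is usually cheaper than a wrong one, so the thresholds are worth tuning on held-out recordings rather than fixing at these values.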

Phase 5: Constraint Solver and Execution

Constraint Solver

python
class DJConstraintSolver:
    """Validate and resolve command constraints."""

    def __init__(self, constraints_config):
        self.constraints = self.load_constraints(constraints_config)
        self.current_state = {
            'left_deck': {'playing': False, 'loop_active': False},
            'right_deck': {'playing': False, 'loop_active': False},
            'current_deck': 'left'
        }

    def validate_command(self, command, context=None):
        """
        Validate command against constraints.

        Returns:
            (is_valid, resolved_command, issues)
        """
        issues = []

        # Check deck specification
        if command['metadata']['deck'] == 'context':
            command = self.resolve_deck_context(command)

        # Check prerequisites
        prereqs = self.get_prerequisites(command)
        for prereq in prereqs:
            if not self.check_prerequisite(prereq):
                issues.append(f"Prerequisite not met: {prereq}")

        # Check conflicts
        conflicts = self.check_conflicts(command)
        if conflicts:
            issues.append(f"Conflicts: {conflicts}")

        is_valid = len(issues) == 0
        return is_valid, command, issues

    def resolve_deck_context(self, command):
        """Resolve 'context' deck to current deck."""
        current = self.current_state['current_deck']
        # Return a copy so the shared catalog metadata keeps its 'context' marker
        return {**command, 'metadata': {**command['metadata'], 'deck': current}}

Command Executor

python
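import time

from pynput.keyboard import Controller  # Assumes pynput provides the keyboard controller


class CommandExecutor:
    """Execute keyboard actions for commands."""

    def __init__(self, action_shortcuts=None):
        self.keyboard = Controller()
        # Chain steps name an action rather than a key, so a lookup from action
        # name to shortcut is needed; this mapping is an assumed input
        self.action_shortcuts = action_shortcuts or {}
        self.last_command = None
        self.last_time = 0
        self.cooldown = 1.0  # seconds

    def execute(self, command):
        """Execute command with debouncing."""
        # Check cooldown
        if self.should_debounce(command):
            return False

        # Execute based on action type
        metadata = command['metadata']
        if metadata['action_type'] == 'chain':
            self.execute_chain(command['chain'])
        elif metadata.get('shortcut'):
            # Transport, navigation, and other single-key actions
            self.press_key(metadata['shortcut'])
        else:
            return False  # Nothing to press

        # Update state
        self.last_command = command
        self.last_time = time.time()

        return True

    def execute_chain(self, chain):
        """Execute sequence of actions."""
        for step in chain:
            self.press_key(self.action_shortcuts[step['action']])
            time.sleep(step['delay'])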
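
The debouncing step is not spelled out above, so here is one way it might work. The sketch assumes that only a repeat of the same command inside the cooldown window is suppressed; the `Debouncer` class and its injectable `clock` are hypothetical names chosen so the logic can be tested without waiting.

```python
import time


class Debouncer:
    """Suppress repeats of the same command inside a cooldown window."""

    def __init__(self, cooldown=1.0, clock=time.monotonic):
        self.cooldown = cooldown
        self.clock = clock  # injectable so tests can control time
        self.last_id = None
        self.last_time = None

    def allow(self, command_id):
        now = self.clock()
        repeated = (
            command_id == self.last_id
            and self.last_time is not None
            and now - self.last_time < self.cooldown
        )
        if repeated:
            return False
        self.last_id = command_id
        self.last_time = now
        return True


fake_now = [0.0]
debouncer = Debouncer(cooldown=1.0, clock=lambda: fake_now[0])

first = debouncer.allow("play_left")    # first occurrence
fake_now[0] = 0.4
repeat = debouncer.allow("play_left")   # same command, 0.4 s later
other = debouncer.allow("play_right")   # different command
fake_now[0] = 2.0
later = debouncer.allow("play_right")   # cooldown has elapsed

print(first, repeat, other, later)
```

Keying the cooldown on the command id lets a DJ issue two different commands back to back while still guarding against one utterance firing twice.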

Phase 6: Data Operations

Recording Your Voice

Script: `scripts/record_command_corpus.py`

python
"""
Record your voice saying each command.

Usage:
    python record_command_corpus.py --output-dir ./data/recordings
"""

from pathlib import Path

import pyaudio
import wave
import yaml

def record_command_corpus(output_dir):
    """Record all commands from catalog."""

    # Load catalog
    with open('../retrieval/catalog/commands.yaml') as f:
        commands = yaml.safe_load(f)['commands']

    audio = pyaudio.PyAudio()

    for cmd in commands:
        print(f"\n=== Command: {cmd['canonical']} ===")
        input("Press Enter when ready to record...")

        # Record 3 repetitions
        for rep in range(3):
            print(f"Recording rep {rep+1}/3... Say: '{cmd['canonical']}'")

            # Record 2 seconds
            stream = audio.open(
                format=pyaudio.paInt16,
                channels=1,
                rate=16000,
                input=True,
                frames_per_buffer=1024
            )

            frames = []
            for _ in range(0, int(16000 / 1024 * 2)):
                data = stream.read(1024)
                frames.append(data)

            stream.stop_stream()
            stream.close()

            # Save
            filename = f"{cmd['id']}_rep{rep}.wav"
            filepath = Path(output_dir) / filename
            with wave.open(str(filepath), 'wb') as wf:
                wf.setnchannels(1)
                wf.setsampwidth(2)
                wf.setframerate(16000)
                wf.writeframes(b''.join(frames))

            print(f"✓ Saved: {filename}")

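    audio.terminate()  # Release the audio device
    print("\n✓ Recording complete!")


if __name__ == "__main__":
    import argparse

    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument("--output-dir", required=True)
    record_command_corpus(parser.parse_args().output_dir)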

Augmentation Pipeline

python
class AudioAugmentation:
    """Augment recorded commands."""

    def __init__(self):
        self.music_noise = self.load_music_samples()
        self.crowd_noise = self.load_crowd_samples()

    def augment(self, audio, sample_rate=16000):
        """Apply augmentations."""
        augmented = []

        # Original
        augmented.append(('original', audio))

        # Speed perturbation
        for rate in [0.9, 1.1]:
            aug = self.speed_perturb(audio, rate)
            augmented.append((f'speed_{rate}', aug))

        # Pitch shift
        for semitones in [-2, 2]:
            aug = self.pitch_shift(audio, semitones)
            augmented.append((f'pitch_{semitones}', aug))

        # Music mixing
        for snr in [10, 15, 20]:
            aug = self.mix_music(audio, snr)
            augmented.append((f'music_snr{snr}', aug))

        # Crowd noise
        for snr in [10, 15, 20]:
            aug = self.mix_crowd(audio, snr)
            augmented.append((f'crowd_snr{snr}', aug))

        return augmented
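
The `mix_music` and `mix_crowd` helpers are not shown, so the following is a sketch of how mixing at a target SNR is typically done: scale the noise so that the speech-to-noise power ratio matches the requested value in dB, then add. It uses plain Python lists and synthetic sine waves in place of real audio; `mix_at_snr` is a hypothetical name.

```python
import math


def power(signal):
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise so the speech-to-noise power ratio equals snr_db, then add."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / power(noise))
    scaled = [gain * n for n in noise]
    mixed = [s + n for s, n in zip(speech, scaled)]
    return mixed, scaled


# Synthetic stand-ins: a 220 Hz "voice" and a 55 Hz "bass line" at 16 kHz
speech = [math.sin(2 * math.pi * 220 * t / 16000) for t in range(1600)]
noise = [0.3 * math.sin(2 * math.pi * 55 * t / 16000 + 1.0) for t in range(1600)]

mixed, scaled_noise = mix_at_snr(speech, noise, snr_db=10)
achieved = 10 * math.log10(power(speech) / power(scaled_noise))
print(round(achieved, 6))
```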

Phase 7: Training Pipeline

Training Script

bash
# 1. Record your voice
python scripts/record_command_corpus.py --output-dir ./data/recordings

# 2. Augment recordings
python scripts/augment_recordings.py \
    --input-dir ./data/recordings \
    --output-dir ./data/augmented

# 3. Generate manifest
python scripts/create_manifest.py \
    --recordings-dir ./data/augmented \
    --catalog ./retrieval/catalog/commands.yaml \
    --output ./data/train_manifest.jsonl

# 4. Train model
python scripts/train_retrieval_model.py \
    --manifest ./data/train_manifest.jsonl \
    --output-dir ./models/dual_encoder \
    --epochs 50 \
    --batch-size 32

Phase 8: Evaluation and Deployment

Metrics

1. Retrieval Accuracy:
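- Top-1 accuracy: Fraction of utterances whose correct command ranks first
- Top-5 accuracy: Fraction of utterances whose correct command appears in the top five
- MRR: Mean reciprocal rank of the correct command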

2. Latency:
- VAD latency: Time to detect speech start
- Encoding latency: Audio → embedding time
- Search latency: Embedding → results time
- Total latency: Mic → keyboard press

3. Robustness:
- Accuracy with music at different volumes
- Accuracy with crowd noise
- Accuracy at different speaking volumes

Deployment

python
# Production inference
from dj_agent.voice_control.retrieval import RetrievalVoiceController

controller = RetrievalVoiceController(
    model_path='./models/dual_encoder/best.pt',
    index_path='./models/index.faiss',
    catalog_path='./retrieval/catalog/commands.yaml'
)

# Start streaming
controller.start()  # Blocking, listens continuously

Comparison: Retrieval vs Gemini Live

| Aspect | Gemini Live | Retrieval System |
|---|---|---|
| Latency | 800ms-2s (network + buffering) | 200-400ms (all local) |
| Accuracy | 90-95% | |
| Privacy | Cloud processing | 100% |
| Customization | Limited (system prompts) | Full control (fine-tune model) |
| Cost | API usage costs | One-time training cost |
| Offline | ❌ Requires internet | ✅ Fully offline |
| Robustness | General ASR | Optimized for DJ booth |

Next Steps

1. Phase 1 (This week):
- ✅ Design architecture (done)
- ⏳ Create catalog from existing command_map
- ⏳ Record 200 commands × 3 reps = 600 samples

2. Phase 2 (Next week):
- Implement audio front-end
- Build augmentation pipeline
- Generate 12,000 training samples

3. Phase 3 (Week after):
- Implement dual-encoder
- Train on augmented data
- Evaluate on held-out test set

4. Phase 4 (Final week):
- Build retrieval pipeline
- Integrate with keyboard executor
- Test in live DJ session

Estimated Timeline: 4 weeks to production-ready system

---

Advantages of This Approach:
1. No more fragmentation: Direct audio → command matching
2. Your voice, optimized: Model learns your speaking patterns
3. Booth-optimized: Trained with music/crowd noise
4. Sub-400ms latency: All processing local
5. Easy to extend: Just record new commands and retrain
6. Deterministic: Constraint solver ensures valid actions
7. Transparent: Full visibility into what's happening

Ready to start Phase 1? 🎧🎛️

Source Anchor

Comp-Core/apps/web/cc-studio/docs/dj_agent/voice_control/retrieval_architecture.md