multimodal guide
Multi-Modal Creative Control - Voice + Gestures Working Together
Philosophy
Voice and gestures are NOT separate control methods - they're complementary modalities that enhance each other for expressive, creative DJ performance.
Think of it like playing a musical instrument:
- Voice = Melody (what you want to do)
- Gestures = Rhythm (how you want to do it)
- Together = Musical expression
---
Creative Interaction Patterns
Pattern 1: Voice Sets Intent, Gesture Executes (Voice → Gesture)
Use Case: Quick deck switching without repeating words
You: "left deck" [voice]
→ System: Left deck selected (context set)
You: [swipe right] [gesture]
→ Action: Play left deck
You: [circle clockwise] [gesture]
→ Action: Loop left deck, 4 beats
You: [shake vertically] [gesture]
→ Action: Sync left deck
You: "right deck" [voice]
→ System: Right deck selected (context switched)
You: [swipe right] [gesture]
→ Action: Play right deck
Why it's creative:
- Say "left deck" ONCE, then perform multiple gestures on it
- Fluid workflow - voice sets stage, gestures perform
- Reduces verbal repetition during performance
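
The voice-sets-context flow can be sketched as a small state holder. This is a minimal illustration, not the project's implementation: `DeckContext`, `GESTURE_ACTIONS`, and the action strings are hypothetical names chosen to mirror the dialogue in this pattern.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical gesture-to-action table; names are illustrative only.
GESTURE_ACTIONS = {
    "swipe_right": "play",
    "circle_cw": "loop_4_beats",
    "shake_vertical": "sync",
}


@dataclass
class DeckContext:
    """Remembers the deck chosen by voice so later gestures can target it."""
    active_deck: Optional[str] = None
    log: List[str] = field(default_factory=list)

    def on_voice(self, phrase: str) -> None:
        # Only deck-selection phrases change the context in this sketch.
        if phrase in ("left deck", "right deck"):
            self.active_deck = phrase.split()[0]

    def on_gesture(self, gesture: str) -> Optional[str]:
        action = GESTURE_ACTIONS.get(gesture)
        if action is None or self.active_deck is None:
            return None  # unknown gesture, or no deck selected yet
        command = f"{action}:{self.active_deck}"
        self.log.append(command)
        return command


ctx = DeckContext()
ctx.on_voice("left deck")
print(ctx.on_gesture("swipe_right"))  # play:left
print(ctx.on_gesture("circle_cw"))    # loop_4_beats:left
ctx.on_voice("right deck")
print(ctx.on_gesture("swipe_right"))  # play:right
```

The deck name is spoken once and then reused by every following gesture until another deck is selected.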
---
Pattern 2: Gesture Triggers, Voice Refines (Gesture → Voice)
Use Case: Quick action, precise parameter
You: [circle clockwise] [gesture]
→ System: Loop detected (waiting for beat count...)
You: "8 beats" [voice]
→ Action: Loop 8 beats
---
You: [tilt left] [gesture]
→ System: Crossfader moving left...
You: "halfway" [voice]
→ Action: Crossfade to 50% left
---
You: [pinch fingers] [gesture]
→ System: Volume adjustment...
You: "minus 3 dB" [voice]
→ Action: Reduce volume by 3 dB
Why it's creative:
- Quick gesture starts action
- Voice provides precise control
- Best of both: speed + precision
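
One way to model "gesture triggers, voice refines" is a pending action that waits for its parameter. The sketch below covers only the loop example. `PendingLoop` and the `"<n> beats"` phrase grammar are assumptions made for illustration.

```python
import re
from typing import Optional


class PendingLoop:
    """Holds a loop started by gesture until voice supplies the beat count."""

    def __init__(self) -> None:
        self.waiting = False

    def on_gesture(self, gesture: str) -> None:
        if gesture == "circle_cw":
            self.waiting = True  # loop detected, waiting for beat count

    def on_voice(self, phrase: str) -> Optional[str]:
        if not self.waiting:
            return None  # no gesture has opened a loop yet
        match = re.fullmatch(r"(\d+) beats?", phrase.strip())
        if match is None:
            return None  # not a beat count; keep waiting
        self.waiting = False
        return f"loop:{int(match.group(1))}"


pending = PendingLoop()
pending.on_gesture("circle_cw")
print(pending.on_voice("8 beats"))  # loop:8
```

A production system would probably add a timeout so that an unanswered gesture falls back to a default value; that is left out here.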
---
Pattern 3: Simultaneous for Emphasis (Voice + Gesture)
Use Case: High-confidence critical actions
You: "sync" [voice] + [shake phone] [gesture]
→ System: Both modalities agree (confidence: 95%)
→ Action: Sync with visual confirmation
---
You: "play" [voice] + [swipe right] [gesture]
→ System: Emphatic play command
→ Action: Play with strong intention signal
---
You: "drop" [voice] + [slam gesture] [gesture]
→ System: Drop detected with emphasis
→ Action: Trigger drop with visual effect
Why it's creative:
- Adds physical emphasis to voice commands
- Reduces false positives (both must agree)
- Feels more "performative" - body + voice aligned
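
The confidence boost can be worked through with the formula used by the fusion logic in the Technical Implementation section: average the two scores, add 0.15 when they agree, cap at 0.95, and scale the voice score by 0.9 on conflict. The helper name `fused_confidence` and the sample scores are hypothetical.

```python
def fused_confidence(voice_conf: float, gesture_conf: float, agree: bool) -> float:
    """Confidence after cross-validating a voice and a gesture input."""
    if agree:
        # Average the two scores, add a 0.15 agreement bonus, cap at 0.95.
        return min(0.95, (voice_conf + gesture_conf) / 2 + 0.15)
    # On conflict the voice command is kept but its confidence drops by 10%.
    return voice_conf * 0.9


print(round(fused_confidence(0.80, 0.70, agree=True), 2))   # 0.9
print(round(fused_confidence(0.90, 0.90, agree=True), 2))   # 0.95 (capped)
print(round(fused_confidence(0.80, 0.70, agree=False), 2))  # 0.72
```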
---
Pattern 4: Gesture Continuous, Voice Interrupts (Gesture ⊙ Voice)
Use Case: Real-time parameter control
You: [tilt phone left, holding] [gesture]
→ System: Crossfader moving left... 10%... 20%... 30%...
You: "stop" [voice]
→ Action: Halt crossfader at current position
---
You: [slow circle gesture, looping] [gesture]
→ System: Effect parameter modulating...
You: "freeze" [voice]
→ Action: Lock parameter at current value
Why it's creative:
- Gesture provides continuous control (like a fader)
- Voice provides discrete events (like a button)
- Combination = expressive parameter manipulation
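
The continuous-plus-discrete split might look like the following sketch, in which tilt samples move a crossfader and a spoken "stop" ends the movement. The event tuples, the percent scale, and `run_crossfader` are illustrative assumptions.

```python
from typing import Iterable, Tuple, Union

# Each event is (kind, value): ("tilt", step_percent) or ("voice", word).
Event = Tuple[str, Union[int, str]]


def run_crossfader(events: Iterable[Event], start: int = 0) -> int:
    """Apply tilt steps to a 0-100% crossfader until a voice "stop" arrives."""
    position = start
    for kind, value in events:
        if kind == "voice" and value == "stop":
            break  # the discrete voice event halts the continuous gesture
        if kind == "tilt" and isinstance(value, int):
            position = min(100, max(0, position + value))  # clamp to 0..100
    return position


events = [("tilt", 10), ("tilt", 10), ("tilt", 10), ("voice", "stop"), ("tilt", 10)]
print(run_crossfader(events))  # 30
```

The tilt sample that arrives after "stop" is ignored, so the fader holds at the position it had when the word was spoken.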
---
Pattern 5: Voice Macro + Gesture Trigger (Voice ⊕ Gesture)
Use Case: Complex sequences with gestural timing
You: "transition to right deck" [voice]
→ System: Transition macro loaded (waiting for trigger...)
You: [swipe right] [gesture]
→ Action: Execute transition NOW
1. Sync right deck
2. Start crossfade
3. Apply filter to left
---
You: "build up" [voice]
→ System: Build-up macro loaded...
You: [upward swipe x3] [gesture]
→ Action: Trigger build-up in 3 stages
Stage 1: Add high-pass filter
Stage 2: Increase tempo
Stage 3: Remove bass
Why it's creative:
- Voice defines WHAT to do
- Gesture controls WHEN to do it
- Timing is physical, not verbal
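
A sketch of the arm-then-fire idea: voice loads a macro and the gesture decides when it runs. `MACROS`, `MacroTrigger`, and the step names are hypothetical. The steps mirror the transition example in this pattern.

```python
from typing import Dict, List, Optional

# Hypothetical macro table; step names follow the transition example.
MACROS: Dict[str, List[str]] = {
    "transition to right deck": [
        "sync_right_deck",
        "start_crossfade",
        "apply_filter_left",
    ],
}


class MacroTrigger:
    def __init__(self) -> None:
        self.armed: Optional[List[str]] = None

    def on_voice(self, phrase: str) -> bool:
        """Load a macro by name; returns True if one was armed."""
        self.armed = MACROS.get(phrase)
        return self.armed is not None

    def on_gesture(self, gesture: str) -> List[str]:
        """Fire the armed macro on the trigger gesture, then disarm."""
        if gesture != "swipe_right" or self.armed is None:
            return []
        steps, self.armed = self.armed, None
        return steps


trigger = MacroTrigger()
trigger.on_voice("transition to right deck")
print(trigger.on_gesture("swipe_right"))
```

Disarming after one trigger means a stray second swipe does not replay the transition.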
---
Pattern 6: Contextual Gestures (Context-Aware)
Use Case: Same gesture, different meaning depending on the current playback state or the voice-selected mode
Context: Playing
---
You: [swipe right] [gesture]
→ Action: Pause (playing → stopped)
Context: Stopped
---
You: [swipe right] [gesture]
→ Action: Play (stopped → playing)
Context: Looping
---
You: [swipe right] [gesture]
→ Action: Exit loop (looping → normal playback)
---
Voice changes context:
---
You: "filter mode" [voice]
→ System: Filter context active
You: [tilt left] [gesture]
→ Action: Apply high-pass filter (not crossfade!)
You: "normal mode" [voice]
→ System: Normal context restored
You: [tilt left] [gesture]
→ Action: Crossfade left (normal behavior)
Why it's creative:
- One gesture = multiple actions
- Voice switches "modes"
- Reduces gesture vocabulary needed
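
Mode switching can be expressed as a lookup keyed on (mode, gesture). The table below encodes only the tilt example from this pattern. `ACTIONS`, `MODE_PHRASES`, and `GestureRouter` are hypothetical names.

```python
from typing import Dict, Optional, Tuple

# (mode, gesture) -> action, built from the tilt example in this pattern.
ACTIONS: Dict[Tuple[str, str], str] = {
    ("normal", "tilt_left"): "crossfade_left",
    ("filter", "tilt_left"): "high_pass_filter",
}

# Spoken phrase -> mode name.
MODE_PHRASES = {"filter mode": "filter", "normal mode": "normal"}


class GestureRouter:
    def __init__(self) -> None:
        self.mode = "normal"

    def on_voice(self, phrase: str) -> None:
        # Unrecognised phrases leave the current mode untouched.
        self.mode = MODE_PHRASES.get(phrase, self.mode)

    def on_gesture(self, gesture: str) -> Optional[str]:
        return ACTIONS.get((self.mode, gesture))


router = GestureRouter()
print(router.on_gesture("tilt_left"))  # crossfade_left
router.on_voice("filter mode")
print(router.on_gesture("tilt_left"))  # high_pass_filter
```

Adding a mode then means adding rows to the table rather than teaching the performer a new gesture.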
---
Creative Performance Scenarios
Scenario 1: Live Transition (Multi-Modal Flow)
Goal: Smooth transition from left to right deck using voice + gestures
[Left deck playing, right deck cued]
You: "right deck" [voice]
→ Context: Right deck selected
You: [sync gesture: shake] [gesture]
→ Action: Sync right to left
You: [play gesture: swipe right] [gesture]
→ Action: Play right deck
You: [slow tilt right, continuous] [gesture]
→ Action: Crossfader moving right... 20%... 40%... 60%...
You: "stop" [voice]
→ Action: Halt crossfader at 60%
You: [circle gesture] [gesture]
→ Action: Loop right deck, 4 beats
You: [continue tilt right] [gesture]
→ Action: Resume crossfade... 70%... 80%... 100%
[Transition complete - right deck now dominant]
Total time: ~15 seconds
Voice commands: 2 ("right deck", "stop")
Gestures: 5
Result: Smooth, expressive transition with voice providing structure and gestures providing flow
---
Scenario 2: Creative Loop Building (Gesture-Heavy, Voice-Refined)
Goal: Build complex loop pattern using mostly gestures, voice for precision
[Left deck playing]
You: [circle gesture] [gesture]
→ System: Loop intent detected...
You: "4 beats" [voice]
→ Action: Loop 4 beats on left
You: [double-tap gesture] [gesture]
→ Action: Halve loop → 2 beats
You: [circle again] [gesture]
→ Action: Double loop → 4 beats
You: "offset by 1 beat" [voice]
→ Action: Shift loop start by 1 beat
You: [spread fingers gesture] [gesture]
→ Action: Increase loop volume
You: [pinch fingers gesture] [gesture]
→ Action: Decrease loop volume (fade out)
You: "exit" [voice]
→ Action: Exit loop gracefully
Total time: ~20 seconds
Voice commands: 3 (parameters + exit)
Gestures: 5
Result: Complex loop manipulation with gestural flow and voice precision
---
Scenario 3: Effect Performance (Simultaneous Multi-Modal)
Goal: Apply effects with emphasis and real-time control
[Both decks playing]
You: "effect mode" [voice]
→ System: Effect context active
You: "reverb" [voice] + [upward swipe] [gesture]
→ Action: Apply reverb with strong intent (high confidence)
You: [tilt phone left→right continuously] [gesture]
→ Action: Reverb wet/dry modulating in real-time
You: "freeze" [voice]
→ Action: Lock reverb at current value
You: "filter" [voice] + [downward swipe] [gesture]
→ Action: Add low-pass filter
You: [circular gesture, continuous] [gesture]
→ Action: Filter cutoff frequency modulating
You: [slam gesture] [gesture]
→ Action: Remove all effects (drop)
You: "normal mode" [voice]
→ System: Effect context deactivated
Total time: ~25 seconds
Voice commands: 5 (mode switches, effect types, freeze)
Gestures: 5 (triggers + continuous control)
Result: Expressive effect performance with voice structure and gesture expression
---
Advanced Techniques
Technique 1: Gesture Sequences (Combos)
Use Case: Chain gestures for compound actions
You: [circle CW] + [swipe right] + [tap twice] [gesture sequence]
→ Interpreted as: "Loop → Play → Cue next track"
→ Action: Start loop, begin playback, queue next track
Implementation:
from typing import List, Optional

def detect_gesture_sequence(gestures: List[str], window: float = 2.0) -> Optional[str]:
    """
    Detect gesture combinations within time window.

    Args:
        gestures: List of recent gestures
        window: Time window for sequence (seconds)

    Returns:
        Compound action or None
    """
    # NOTE: `window` is not enforced here. The list carries no timestamps, so
    # this sketch assumes the caller passes only gestures inside the window.
    # Check for known sequences
    if gestures == ['circle_cw', 'swipe_right', 'tap_twice']:
        return "loop_play_cue_next"
    # ... more sequences
    return None
Technique 2: Voice Modifiers (Gesture Intensity)
Use Case: Voice changes gesture behavior
You: "gentle" [voice]
→ System: Gesture intensity modifier = 0.5
You: [tilt gesture] [gesture]
→ Action: Slow crossfade (gentle)
---
You: "hard" [voice]
→ System: Gesture intensity modifier = 2.0
You: [tilt gesture] [gesture]
→ Action: Fast crossfade (hard/aggressive)
Implementation:
class GestureModifiers:
    # Spoken modifier -> multiplier applied to the gesture value
    intensity_map = {
        'gentle': 0.5,
        'soft': 0.5,
        'hard': 2.0,
        'fast': 2.0,
        'slow': 0.3,
    }

    def apply_modifier(self, gesture_value: float, modifier: str) -> float:
        # Unknown modifiers leave the gesture value unchanged (multiplier 1.0)
        multiplier = self.intensity_map.get(modifier, 1.0)
        return gesture_value * multiplier
Technique 3: Gesture Echo (Voice Looper)
Use Case: Record gesture, repeat on voice command
You: "record gesture" [voice]
→ System: Recording...
You: [circle CW] → [swipe right] → [tap] [gesture sequence]
→ System: Gesture recorded (3-step sequence)
You: "play gesture" [voice]
→ Action: Repeat recorded gesture sequence
You: "play gesture x4" [voice]
→ Action: Repeat sequence 4 times
Why it's creative:
- Record complex gesture patterns
- Replay them verbally
- Create "gesture loops"
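
A gesture looper of this kind could be sketched as below. `GestureEcho` is a hypothetical name, and two behaviours are assumptions because the guide does not specify them: recording stops when a play command arrives, and the repeat count is parsed from an `xN` suffix.

```python
from typing import List


class GestureEcho:
    def __init__(self) -> None:
        self.recording = False
        self.sequence: List[str] = []

    def on_voice(self, phrase: str) -> List[str]:
        """Return the gestures to replay (empty unless a play command)."""
        if phrase == "record gesture":
            self.recording, self.sequence = True, []
            return []
        if phrase.startswith("play gesture"):
            self.recording = False  # assumption: playback ends recording
            suffix = phrase[len("play gesture"):].strip()  # e.g. "x4" or ""
            has_count = suffix.startswith("x") and suffix[1:].isdigit()
            repeats = int(suffix[1:]) if has_count else 1
            return self.sequence * repeats
        return []

    def on_gesture(self, gesture: str) -> None:
        if self.recording:
            self.sequence.append(gesture)


echo = GestureEcho()
echo.on_voice("record gesture")
for g in ("circle_cw", "swipe_right", "tap"):
    echo.on_gesture(g)
print(echo.on_voice("play gesture"))          # the 3 recorded gestures
print(len(echo.on_voice("play gesture x4")))  # 12
```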
---
Technical Implementation
Multi-Modal Fusion Logic
from typing import Optional

# VoiceCommand, GestureCommand and FusedCommand are assumed to be defined
# elsewhere in the project.

class MultiModalFusion:
    """Combine voice + gesture with intelligent fusion."""

    def fuse(
        self,
        voice: Optional[VoiceCommand],
        gesture: Optional[GestureCommand],
        time_delta: float,
    ) -> Optional[FusedCommand]:
        """
        Fuse voice and gesture inputs.

        Assumption: time_delta = gesture time - voice time, in seconds,
        so its sign tells which input came first.

        Fusion Rules:
        1. If within 0.5 s → Simultaneous; if within 3 s → Sequential
        2. If both present → Cross-validate (confidence boost)
        3. If one missing → Use available modality
        4. If conflict → Voice takes priority (explicit > implicit)
        """
        # Case 1: Voice + Gesture (simultaneous)
        if voice and gesture and abs(time_delta) < 0.5:
            return self._fuse_simultaneous(voice, gesture)
        # Case 2: Voice → Gesture (sequential, gesture arrives later)
        elif voice and gesture and 0.5 <= time_delta < 3.0:
            return self._fuse_voice_then_gesture(voice, gesture)
        # Case 3: Gesture → Voice (sequential, voice arrives later)
        elif voice and gesture and -3.0 < time_delta <= -0.5:
            return self._fuse_gesture_then_voice(gesture, voice)
        # Case 4: Voice only
        elif voice and not gesture:
            return self._voice_only(voice)
        # Case 5: Gesture only
        elif gesture and not voice:
            return self._gesture_only(gesture)
        else:
            return None  # No input, or inputs 3 s or more apart

    def _fuse_simultaneous(self, voice, gesture):
        """Both inputs at same time - cross-validate."""
        # Check if they agree
        if self._are_compatible(voice.intent, gesture.intent):
            # Agreement → High confidence
            return FusedCommand(
                action=voice.intent,  # Voice is explicit
                confidence=min(0.95, (voice.confidence + gesture.confidence) / 2 + 0.15),
                source='voice+gesture',
            )
        else:
            # Conflict → Voice wins (explicit)
            return FusedCommand(
                action=voice.intent,
                confidence=voice.confidence * 0.9,
                source='voice (conflicted)',
            )
---
Performance Optimization
Latency Budget
| Input | Processing | Fusion | Total |
|---|---|---|---|
| Voice (Gemini) | 200ms | 10ms | 210ms |
| Gesture (Sensor) | 20ms | 10ms | 30ms |
| Voice + Gesture | max(200, 20) = 200ms | 10ms | 210ms |
Key insight: Sensor gestures are MUCH faster than voice, so:
- Critical actions → Gesture-triggered, voice-refined
- Complex actions → Voice-triggered, gesture-parametrized
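
The table's arithmetic can be checked with a short helper. This assumes, as the combined row implies, that the two recognisers run in parallel, so the slower one dominates before the 10 ms fusion step. `fused_latency_ms` is a hypothetical name.

```python
from typing import List


def fused_latency_ms(processing_ms: List[int], fusion_ms: int = 10) -> int:
    """Total latency when inputs are processed in parallel, then fused."""
    return max(processing_ms) + fusion_ms


print(fused_latency_ms([200]))      # voice only: 210
print(fused_latency_ms([20]))       # gesture only: 30
print(fused_latency_ms([200, 20]))  # voice + gesture: 210
```

Adding a gesture to a voice command costs no extra latency under this model, while a gesture alone responds about seven times faster than voice.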
---
Summary
Why Multi-Modal?
Voice Strengths:
- Explicit intent (clear what you want)
- Complex commands (multi-word)
- Precise parameters (numbers, names)
Gesture Strengths:
- Fast execution (<50ms)
- Continuous control (faders, knobs)
- Physical expression (performative)
Together:
- Voice provides structure (what + when)
- Gestures provide flow (how + expression)
- Result: Expressive, creative control
Design Principles
1. Neither modality is primary - They're equal partners
2. Context flows between modalities - Voice → Gesture or Gesture → Voice
3. Redundancy increases confidence - Both agree = high certainty
4. User has full freedom - Mix and match as desired
---
Make your DJ performance an instrument, not just a controller! 🎤+👋=🎧
Author: Computational Choreography
Version: 1.0 - Multi-Modal Creative Guide
Source Anchor
Comp-Core/apps/web/cc-studio/docs/dj_agent/gesture_control/multimodal_guide.md