Grand Diomande Research · Full HTML Reader

Multi-Modal Creative Control - Voice + Gestures Working Together

Philosophy

Voice and gestures are NOT separate control methods - they're complementary modalities that enhance each other for expressive, creative DJ performance.

Think of it like playing a musical instrument:
- Voice = Melody (what you want to do)
- Gestures = Rhythm (how you want to do it)
- Together = Musical expression

---

Creative Interaction Patterns

Pattern 1: Voice Sets Intent, Gesture Executes (Voice → Gesture)

Use Case: Quick deck switching without repeating words

You: "left deck" [voice]
→ System: Left deck selected (context set)

You: [swipe right] [gesture]
→ Action: Play left deck

You: [circle clockwise] [gesture]
→ Action: Loop left deck, 4 beats

You: [shake vertically] [gesture]
→ Action: Sync left deck

You: "right deck" [voice]
→ System: Right deck selected (context switched)

You: [swipe right] [gesture]
→ Action: Play right deck

Why it's creative:
- Say "left deck" ONCE, then perform multiple gestures on it
- Fluid workflow - voice sets stage, gestures perform
- Reduces verbal repetition during performance
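
The flow above can be sketched as a small piece of state that remembers the last deck named by voice and applies each gesture to it. This is a minimal illustration, not the project's implementation: the `DeckContext` class, the gesture names, and the `action:deck` command strings are all assumptions made for the example.

```python
from typing import List, Optional

# Hypothetical gesture-to-action table, taken from the example dialogue above.
GESTURE_ACTIONS = {
    "swipe_right": "play",
    "circle_cw": "loop_4_beats",
    "shake_vertical": "sync",
}


class DeckContext:
    """Remembers the deck selected by the most recent voice command."""

    def __init__(self) -> None:
        self.active_deck: Optional[str] = None
        self.log: List[str] = []

    def on_voice(self, phrase: str) -> None:
        # Only the two deck-selection phrases are handled in this sketch.
        if phrase in ("left deck", "right deck"):
            self.active_deck = phrase.split()[0]

    def on_gesture(self, gesture: str) -> Optional[str]:
        action = GESTURE_ACTIONS.get(gesture)
        if action is None or self.active_deck is None:
            return None  # Unknown gesture, or no deck selected yet
        command = f"{action}:{self.active_deck}"
        self.log.append(command)
        return command


ctx = DeckContext()
ctx.on_voice("left deck")
print(ctx.on_gesture("swipe_right"))  # play:left
print(ctx.on_gesture("circle_cw"))    # loop_4_beats:left
ctx.on_voice("right deck")
print(ctx.on_gesture("swipe_right"))  # play:right
```

The deck is named once and every later gesture inherits it, which is what removes the verbal repetition.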

---

Pattern 2: Gesture Triggers, Voice Refines (Gesture → Voice)

Use Case: Quick action, precise parameter

You: [circle clockwise] [gesture]
→ System: Loop detected (waiting for beat count...)

You: "8 beats" [voice]
→ Action: Loop 8 beats

---

You: [tilt left] [gesture]
→ System: Crossfader moving left...

You: "halfway" [voice]
→ Action: Crossfade to 50% left

---

You: [pinch fingers] [gesture]
→ System: Volume adjustment...

You: "minus 3 dB" [voice]
→ Action: Reduce volume by 3dB

Why it's creative:
- Quick gesture starts action
- Voice provides precise control
- Best of both: speed + precision
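
One way to model the "waiting for beat count" state is a pending action that a voice phrase can complete within a short timeout. The sketch below is an assumption-laden illustration: `PendingLoop`, its 3-second timeout, and the `"<n> beats"` phrase format are hypothetical and chosen only to mirror the loop example above.

```python
import re
from typing import Optional


class PendingLoop:
    """Holds a loop request opened by a circle gesture until voice supplies the length."""

    def __init__(self, timeout: float = 3.0) -> None:
        self.timeout = timeout  # Assumed limit on how long the gesture stays pending
        self.started_at: Optional[float] = None

    def on_circle(self, now: float) -> None:
        # The gesture only opens the request; no loop is created yet.
        self.started_at = now

    def on_voice(self, phrase: str, now: float) -> Optional[int]:
        """Return the loop length in beats, or None if nothing could be refined."""
        if self.started_at is None or now - self.started_at > self.timeout:
            return None
        match = re.fullmatch(r"(\d+) beats?", phrase.strip())
        if match is None:
            return None  # Unrelated phrase; keep waiting
        self.started_at = None  # Request consumed
        return int(match.group(1))


pending = PendingLoop()
pending.on_circle(now=0.0)
print(pending.on_voice("8 beats", now=1.2))  # 8
```

Timestamps are passed in explicitly so the logic can be exercised without a real clock.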

---

Pattern 3: Simultaneous for Emphasis (Voice + Gesture)

Use Case: High-confidence critical actions

You: "sync" [voice] + [shake phone] [gesture]
→ System: Both modalities agree (confidence: 95%)
→ Action: Sync with visual confirmation

---

You: "play" [voice] + [swipe right] [gesture]
→ System: Emphatic play command
→ Action: Play with strong intention signal

---

You: "drop" [voice] + [slam gesture] [gesture]
→ System: Drop detected with emphasis
→ Action: Trigger drop with visual effect

Why it's creative:
- Adds physical emphasis to voice commands
- Reduces false positives (both must agree)
- Feels more "performative" - body + voice aligned

---

Pattern 4: Gesture Continuous, Voice Interrupts (Gesture ⊙ Voice)

Use Case: Real-time parameter control

You: [tilt phone left, holding] [gesture]
→ System: Crossfader moving left... 10%... 20%... 30%...

You: "stop" [voice]
→ Action: Halt crossfader at current position

---

You: [slow circle gesture, looping] [gesture]
→ System: Effect parameter modulating...

You: "freeze" [voice]
→ Action: Lock parameter at current value

Why it's creative:
- Gesture provides continuous control (like a fader)
- Voice provides discrete events (like a button)
- Combination = expressive parameter manipulation
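
The fader-plus-button idea can be sketched as a ramp that advances while a tilt is held and halts when a voice event arrives. Everything here is illustrative: the `CrossfaderRamp` class, the 0.0 to 1.0 position range, and the per-second rate are assumptions, not values from the system.

```python
class CrossfaderRamp:
    """Advances a 0.0-1.0 position while a tilt is held; voice can halt it."""

    def __init__(self, rate_per_second: float = 0.1) -> None:
        self.position = 0.0
        self.rate = rate_per_second
        self.halted = False

    def on_tilt_start(self) -> None:
        # A fresh tilt resumes movement after an earlier "stop".
        self.halted = False

    def on_tilt_tick(self, dt: float) -> float:
        # Called repeatedly while the tilt is held; dt is seconds since the last tick.
        if not self.halted:
            self.position = min(1.0, self.position + self.rate * dt)
        return self.position

    def on_voice(self, phrase: str) -> None:
        # Discrete event: lock the parameter at its current value.
        if phrase in ("stop", "freeze"):
            self.halted = True


ramp = CrossfaderRamp(rate_per_second=0.2)
ramp.on_tilt_start()
ramp.on_tilt_tick(1.0)
ramp.on_voice("stop")
ramp.on_tilt_tick(1.0)  # Ignored: the ramp is halted
print(round(ramp.position, 2))  # 0.2
```

Resuming requires a new `on_tilt_start()` call, which matches the "continue tilt" step in the live transition scenario.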

---

Pattern 5: Voice Macro + Gesture Trigger (Voice ⊕ Gesture)

Use Case: Complex sequences with gestural timing

You: "transition to right deck" [voice]
→ System: Transition macro loaded (waiting for trigger...)

You: [swipe right] [gesture]
→ Action: Execute transition NOW
   1. Sync right deck
   2. Start crossfade
   3. Apply filter to left

---

You: "build up" [voice]
→ System: Build-up macro loaded...

You: [upward swipe x3] [gesture]
→ Action: Trigger build-up in 3 stages
   Stage 1: Add high-pass filter
   Stage 2: Increase tempo
   Stage 3: Remove bass

Why it's creative:
- Voice defines WHAT to do
- Gesture controls WHEN to do it
- Timing is physical, not verbal
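
A macro that is loaded by voice and fired by gesture can be sketched as an "armed" step list. The `MacroArmer` class, the step names, and the choice of `swipe_right` as the only trigger are hypothetical, picked to match the first example above.

```python
from typing import Dict, List, Optional

# Hypothetical macro table; step names mirror the transition example above.
MACROS: Dict[str, List[str]] = {
    "transition to right deck": ["sync_right", "start_crossfade", "filter_left"],
}
TRIGGER_GESTURE = "swipe_right"


class MacroArmer:
    """Voice loads a macro (WHAT); a gesture fires it (WHEN)."""

    def __init__(self) -> None:
        self.armed: Optional[List[str]] = None

    def on_voice(self, phrase: str) -> bool:
        steps = MACROS.get(phrase)
        if steps is None:
            return False
        self.armed = steps
        return True

    def on_gesture(self, gesture: str) -> List[str]:
        # Nothing runs unless a macro is armed and the trigger gesture arrives.
        if gesture != TRIGGER_GESTURE or self.armed is None:
            return []
        steps, self.armed = self.armed, None
        return steps


armer = MacroArmer()
armer.on_voice("transition to right deck")
print(armer.on_gesture("swipe_right"))
```

Disarming after one trigger keeps a stray second swipe from replaying the transition.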

---

Pattern 6: Contextual Gestures (Context-Aware)

Use Case: Same gesture, different meaning based on playback state or voice-set mode

Context: Playing
---
You: [swipe right] [gesture]
→ Action: Pause (playing → stopped)

Context: Stopped
---
You: [swipe right] [gesture]
→ Action: Play (stopped → playing)

Context: Looping
---
You: [swipe right] [gesture]
→ Action: Exit loop (looping → normal playback)

---

Voice changes context:
---
You: "filter mode" [voice]
→ System: Filter context active

You: [tilt left] [gesture]
→ Action: Apply high-pass filter (not crossfade!)

You: "normal mode" [voice]
→ System: Normal context restored

You: [tilt left] [gesture]
→ Action: Crossfade left (normal behavior)

Why it's creative:
- One gesture = multiple actions
- Voice switches "modes"
- Reduces gesture vocabulary needed
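
Context-dependent gestures reduce to lookups keyed by the current state or mode. The sketch below assumes hypothetical names throughout (`ContextualGestures`, the mode phrases, the action strings) and covers only the swipe and tilt examples shown above.

```python
from typing import Dict, Optional, Tuple

# (mode, gesture) -> action, per the "filter mode" / "normal mode" example.
MODE_ACTIONS: Dict[Tuple[str, str], str] = {
    ("normal", "tilt_left"): "crossfade_left",
    ("filter", "tilt_left"): "high_pass_filter",
}
MODE_PHRASES = {"filter mode": "filter", "normal mode": "normal"}

# playback state -> (action, next state) for a right swipe.
SWIPE_RIGHT_BY_STATE: Dict[str, Tuple[str, str]] = {
    "playing": ("pause", "stopped"),
    "stopped": ("play", "playing"),
    "looping": ("exit_loop", "playing"),
}


class ContextualGestures:
    def __init__(self, state: str = "stopped") -> None:
        self.mode = "normal"
        self.state = state

    def on_voice(self, phrase: str) -> None:
        # Unrecognized phrases leave the mode unchanged.
        self.mode = MODE_PHRASES.get(phrase, self.mode)

    def on_gesture(self, gesture: str) -> Optional[str]:
        if gesture == "swipe_right":
            action, self.state = SWIPE_RIGHT_BY_STATE[self.state]
            return action
        return MODE_ACTIONS.get((self.mode, gesture))


g = ContextualGestures(state="stopped")
print(g.on_gesture("swipe_right"))  # play
g.on_voice("filter mode")
print(g.on_gesture("tilt_left"))    # high_pass_filter
```

Adding a new mode means adding rows to the table rather than teaching the performer a new gesture.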

---

Creative Performance Scenarios

Scenario 1: Live Transition (Multi-Modal Flow)

Goal: Smooth transition from left to right deck using voice + gestures

[Left deck playing, right deck cued]

You: "right deck" [voice]
→ Context: Right deck selected

You: [sync gesture: shake] [gesture]
→ Action: Sync right to left

You: [play gesture: swipe right] [gesture]
→ Action: Play right deck

You: [slow tilt right, continuous] [gesture]
→ Action: Crossfader moving right... 20%... 40%... 60%...

You: "stop" [voice]
→ Action: Halt crossfader at 60%

You: [circle gesture] [gesture]
→ Action: Loop right deck, 4 beats

You: [continue tilt right] [gesture]
→ Action: Resume crossfade... 70%... 80%... 100%

[Transition complete - right deck now dominant]

Total time: ~15 seconds
Voice commands: 2 ("right deck", "stop")
Gestures: 5
Result: Smooth, expressive transition with voice providing structure and gestures providing flow

---

Scenario 2: Creative Loop Building (Gesture-Heavy, Voice-Refined)

Goal: Build complex loop pattern using mostly gestures, voice for precision

[Left deck playing]

You: [circle gesture] [gesture]
→ System: Loop intent detected...

You: "4 beats" [voice]
→ Action: Loop 4 beats on left

You: [double-tap gesture] [gesture]
→ Action: Halve loop → 2 beats

You: [circle again] [gesture]
→ Action: Double loop → 4 beats

You: "offset by 1 beat" [voice]
→ Action: Shift loop start by 1 beat

You: [spread fingers gesture] [gesture]
→ Action: Increase loop volume

You: [pinch fingers gesture] [gesture]
→ Action: Decrease loop volume (fade out)

You: "exit" [voice]
→ Action: Exit loop gracefully

Total time: ~20 seconds
Voice commands: 3 (parameters + exit)
Gestures: 5
Result: Complex loop manipulation with gestural flow and voice precision

---

Scenario 3: Effect Performance (Simultaneous Multi-Modal)

Goal: Apply effects with emphasis and real-time control

[Both decks playing]

You: "effect mode" [voice]
→ System: Effect context active

You: "reverb" [voice] + [upward swipe] [gesture]
→ Action: Apply reverb with strong intent (high confidence)

You: [tilt phone left→right continuously] [gesture]
→ Action: Reverb wet/dry modulating in real-time

You: "freeze" [voice]
→ Action: Lock reverb at current value

You: "filter" [voice] + [downward swipe] [gesture]
→ Action: Add low-pass filter

You: [circular gesture, continuous] [gesture]
→ Action: Filter cutoff frequency modulating

You: [slam gesture] [gesture]
→ Action: Remove all effects (drop)

You: "normal mode" [voice]
→ System: Effect context deactivated

Total time: ~25 seconds
Voice commands: 5 (mode switches, effect types)
Gestures: 5 (triggers + continuous control)
Result: Expressive effect performance with voice structure and gesture expression

---

Advanced Techniques

Technique 1: Gesture Sequences (Combos)

Use Case: Chain gestures for compound actions

You: [circle CW] + [swipe right] + [tap twice]  [gesture sequence]
→ Interpreted as: "Loop → Play → Cue next track"
→ Action: Start loop, begin playback, queue next track

Implementation:

python
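from typing import List, Optional

# Known gesture combos mapped to compound actions.
KNOWN_SEQUENCES = {
    ('circle_cw', 'swipe_right', 'tap_twice'): "loop_play_cue_next",
    # ... more sequences
}

def detect_gesture_sequence(gestures: List[str], window: float = 2.0) -> Optional[str]:
    """
    Detect gesture combinations within time window.

    Args:
        gestures: List of recent gestures
        window: Time window for sequence (seconds). Not enforced here:
            this version assumes the caller passes only gestures that
            fall inside the window.

    Returns:
        Compound action or None
    """
    # Check for known sequences; unknown combinations return None
    return KNOWN_SEQUENCES.get(tuple(gestures))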
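
The detector above takes gesture names without timestamps, so something upstream has to apply the time window. A minimal sketch of that filtering step follows; the `recent_gestures` helper and the `(timestamp, name)` event format are assumptions for illustration.

```python
from typing import List, Tuple


def recent_gestures(
    events: List[Tuple[float, str]], now: float, window: float = 2.0
) -> List[str]:
    """Return the names of gestures that occurred within `window` seconds of `now`."""
    return [name for ts, name in events if 0.0 <= now - ts <= window]


events = [
    (0.0, "tap"),          # Too old at now=3.0, dropped
    (1.2, "circle_cw"),
    (1.8, "swipe_right"),
    (2.5, "tap_twice"),
]
print(recent_gestures(events, now=3.0))
# ['circle_cw', 'swipe_right', 'tap_twice']
```

The filtered list can then be handed to the sequence detector as its `gestures` argument.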

Technique 2: Voice Modifiers (Gesture Intensity)

Use Case: Voice changes gesture behavior

You: "gentle" [voice]
→ System: Gesture intensity modifier = 0.5

You: [tilt gesture] [gesture]
→ Action: Slow crossfade (gentle)

---

You: "hard" [voice]
→ System: Gesture intensity modifier = 2.0

You: [tilt gesture] [gesture]
→ Action: Fast crossfade (hard/aggressive)

Implementation:

python
class GestureModifiers:
    intensity_map = {
        'gentle': 0.5,
        'soft': 0.5,
        'hard': 2.0,
        'fast': 2.0,
        'slow': 0.3,
    }

    def apply_modifier(self, gesture_value: float, modifier: str) -> float:
        multiplier = self.intensity_map.get(modifier, 1.0)
        return gesture_value * multiplier

Technique 3: Gesture Echo (Voice Looper)

Use Case: Record gesture, repeat on voice command

You: "record gesture" [voice]
→ System: Recording...

You: [circle CW] → [swipe right] → [tap]  [gesture sequence]
→ System: Gesture recorded (3-step sequence)

You: "play gesture" [voice]
→ Action: Repeat recorded gesture sequence

You: "play gesture x4" [voice]
→ Action: Repeat sequence 4 times

Why it's creative:
- Record complex gesture patterns
- Replay them verbally
- Create "gesture loops"
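
A gesture looper needs only a recording flag and a stored sequence. The sketch below is hypothetical: `GestureRecorder` and the parsing of the `"play gesture xN"` phrase are assumptions based on the dialogue above.

```python
import re
from typing import List


class GestureRecorder:
    """Records gestures between "record gesture" and "play gesture"."""

    def __init__(self) -> None:
        self.recording = False
        self.sequence: List[str] = []

    def on_voice(self, phrase: str) -> List[str]:
        """Return the gestures to replay (empty unless a play phrase was heard)."""
        if phrase == "record gesture":
            self.recording = True
            self.sequence = []
            return []
        match = re.fullmatch(r"play gesture(?: x(\d+))?", phrase)
        if match:
            self.recording = False  # Playback ends any recording in progress
            repeats = int(match.group(1) or 1)
            return self.sequence * repeats
        return []

    def on_gesture(self, gesture: str) -> None:
        if self.recording:
            self.sequence.append(gesture)


rec = GestureRecorder()
rec.on_voice("record gesture")
for name in ("circle_cw", "swipe_right", "tap"):
    rec.on_gesture(name)
print(rec.on_voice("play gesture"))
print(len(rec.on_voice("play gesture x4")))  # 12
```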

---

Technical Implementation

Multi-Modal Fusion Logic

python
from __future__ import annotations  # Lets the annotations below name types defined elsewhere

from typing import Optional

# VoiceCommand, GestureCommand and FusedCommand are assumed to be defined elsewhere.


class MultiModalFusion:
    """Combine voice + gesture with intelligent fusion."""

    def fuse(
        self,
        voice: Optional[VoiceCommand],
        gesture: Optional[GestureCommand],
        time_delta: float,
    ) -> Optional[FusedCommand]:
        """
        Fuse voice and gesture inputs.

        Assumption: time_delta is signed, gesture time minus voice time in
        seconds. Positive means the gesture followed the voice command.

        Fusion Rules:
        1. If within 0.5 seconds → Treat as simultaneous;
           if within 3 seconds → Treat as a coordinated sequence
        2. If both present → Cross-validate (confidence boost)
        3. If one missing → Use available modality
        4. If conflict → Voice takes priority (explicit > implicit)
        """
        if voice and gesture:
            gap = abs(time_delta)

            # Case 1: Voice + Gesture (simultaneous)
            if gap < 0.5:
                return self._fuse_simultaneous(voice, gesture)

            # Case 2: Voice → Gesture (sequential)
            if gap < 3.0 and time_delta > 0:
                return self._fuse_voice_then_gesture(voice, gesture)

            # Case 3: Gesture → Voice (sequential)
            if gap < 3.0:
                return self._fuse_gesture_then_voice(gesture, voice)

            return None  # Inputs too far apart to fuse

        # Case 4: Voice only
        if voice:
            return self._voice_only(voice)

        # Case 5: Gesture only
        if gesture:
            return self._gesture_only(gesture)

        return None  # No valid input

    def _fuse_simultaneous(self, voice, gesture):
        """Both inputs at same time - cross-validate."""
        # Check if they agree
        if self._are_compatible(voice.intent, gesture.intent):
            # Agreement → High confidence
            return FusedCommand(
                action=voice.intent,  # Voice is explicit
                confidence=min(0.95, (voice.confidence + gesture.confidence) / 2 + 0.15),
                source='voice+gesture',
            )
        else:
            # Conflict → Voice wins (explicit)
            return FusedCommand(
                action=voice.intent,
                confidence=voice.confidence * 0.9,
                source='voice (conflicted)',
            )
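
The fusion class refers to command types it does not define. The sketch below fills them in with minimal dataclasses and works through the confidence arithmetic for the simultaneous case. The field names are inferred from how the fusion code uses them; the exact-match compatibility check and the `boosted_confidence` helper are assumptions added for illustration.

```python
from dataclasses import dataclass


@dataclass
class VoiceCommand:
    intent: str
    confidence: float


@dataclass
class GestureCommand:
    intent: str
    confidence: float


@dataclass
class FusedCommand:
    action: str
    confidence: float
    source: str


def boosted_confidence(voice_conf: float, gesture_conf: float,
                       boost: float = 0.15, cap: float = 0.95) -> float:
    """Average the two confidences, add the agreement boost, cap the result."""
    return min(cap, (voice_conf + gesture_conf) / 2 + boost)


def fuse_simultaneous(voice: VoiceCommand, gesture: GestureCommand) -> FusedCommand:
    # Assumed compatibility rule: intents must match exactly.
    if voice.intent == gesture.intent:
        return FusedCommand(voice.intent,
                            boosted_confidence(voice.confidence, gesture.confidence),
                            "voice+gesture")
    # Conflict: voice wins, with a 10% confidence penalty.
    return FusedCommand(voice.intent, voice.confidence * 0.9, "voice (conflicted)")


agree = fuse_simultaneous(VoiceCommand("sync", 0.9), GestureCommand("sync", 0.9))
print(agree.confidence)  # 0.95 (0.9 average + 0.15 boost exceeds the cap)
clash = fuse_simultaneous(VoiceCommand("sync", 0.8), GestureCommand("play", 0.9))
print(clash.action, clash.source)  # sync voice (conflicted)
```

With two inputs at 0.9 the boosted value would be 1.05, so the cap is what produces the 95% figure quoted for agreeing modalities.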

---

Performance Optimization

Latency Budget

| Input | Processing | Fusion | Total |
|---|---|---|---|
| Voice (Gemini) | 200ms | 10ms | 210ms |
| Gesture (Sensor) | 20ms | 10ms | 30ms |
| Voice + Gesture | MAX(200, 20) | + 10ms | 210ms |

Key insight: Sensor gestures are about 7x faster than voice end to end (30ms vs 210ms), so:
- Critical actions → Gesture-triggered, voice-refined
- Complex actions → Voice-triggered, gesture-parametrized
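
The latency figures follow from one rule: modalities are processed in parallel, so the slowest one sets the pace, and fusion is added on top. A small sketch of that arithmetic, using the numbers from this section (the function and constant names are illustrative):

```python
VOICE_MS = 200    # Voice processing time from the latency budget
GESTURE_MS = 20   # Sensor gesture processing time
FUSION_MS = 10    # Fusion step


def total_latency_ms(*processing_ms: int, fusion_ms: int = FUSION_MS) -> int:
    """Parallel inputs cost as much as the slowest one, plus fusion."""
    return max(processing_ms) + fusion_ms


print(total_latency_ms(VOICE_MS))              # 210
print(total_latency_ms(GESTURE_MS))            # 30
print(total_latency_ms(VOICE_MS, GESTURE_MS))  # 210
```

Adding a gesture to a voice command costs nothing extra in this model, while a gesture-only action avoids the voice path entirely.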

---

Summary

Why Multi-Modal?

Voice Strengths:
- Explicit intent (clear what you want)
- Complex commands (multi-word)
- Precise parameters (numbers, names)

Gesture Strengths:
- Fast execution (<50ms)
- Continuous control (faders, knobs)
- Physical expression (performative)

Together:
- Voice provides structure (what + when)
- Gestures provide flow (how + expression)
- Result: Expressive, creative control

Design Principles

1. Neither modality is primary - They're equal partners
2. Context flows between modalities - Voice → Gesture or Gesture → Voice
3. Redundancy increases confidence - Both agree = high certainty
4. User has full freedom - Mix and match as desired

---

Make your DJ performance an instrument, not just a controller! 🎤+👋=🎧

Author: Computational Choreography
Version: 1.0 - Multi-Modal Creative Guide

Source Anchor

Comp-Core/apps/web/cc-studio/docs/dj_agent/gesture_control/multimodal_guide.md
