Phase 3: Real-Time Gesture Streaming - Architecture
**Phase 3** connects the production training system (Phases 1 & 2) to live DJ performance, enabling real-time gesture recognition that triggers keyboard shortcuts or MIDI commands, or integrates with voice control.
---
System Architecture
┌─────────────────────────────────────────────────────────────────┐
│                    PHASE 3: REAL-TIME SYSTEM                    │
└─────────────────────────────────────────────────────────────────┘

┌──────────────┐
│    Phone     │
│  (Sensors +  │
│    Video)    │
└──────┬───────┘
       │ WebSocket (Phase 1)
       ↓
┌──────────────────────────────────────┐
│  Real-Time Gesture Stream Processor  │ ← NEW
│  - Continuous recognition            │
│  - Gesture debouncing                │
│  - Confidence filtering              │
│  - Performance monitoring            │
└──────┬───────────────────────────────┘
       │
       ├─────→ [Gesture Database] ← Phase 2
       │        (Trained templates)
       │
       ├─────→ [Gesture Recognizer] ← Phase 2
       │        (Real-time matching)
       │
       ↓
┌──────────────────────────────────────┐
│  Gesture → Action Mapper             │ ← NEW
│  - Template → Keyboard shortcut      │
│  - Template → MIDI command           │
│  - Template → DJ software action     │
└──────┬───────────────────────────────┘
       │
       ├────────┐
       ↓        ↓
  ┌─────────┐  ┌──────────────────┐
  │Keyboard │  │   Multimodal     │
  │Emulator │  │   Controller     │
  └─────────┘  │ (Voice+Gesture)  │
               └──────────────────┘
                        │
                        ↓
                ┌────────────────┐
                │  DJ Software   │
                │  (Rekordbox,   │
                │  Serato, etc.) │
                └────────────────┘
---
Components
1. Real-Time Gesture Stream Processor
Purpose: Continuous gesture recognition with debouncing and filtering.
Key Features:
- Sliding window analysis (0.5-2 seconds)
- Gesture debouncing (prevent double triggers)
- Confidence thresholding
- Multi-gesture detection (simultaneous gestures)
- Performance monitoring (latency, accuracy)
Input:
- Continuous sensor stream (Phase 1)
- Trained gesture templates (Phase 2)
Output:
- Recognized gesture events
- Confidence scores
- Timing metadata
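To make the processor's responsibilities concrete, here is a minimal sketch of how a sliding window, a confidence threshold, and a per-gesture debounce could fit together. The class name `StreamProcessor`, the `GestureEvent` type, the `(x, y, z)` sample shape, and the stub recognizer are all assumptions made for illustration; they are not the project's actual API. The default values mirror the settings listed in the Configuration section.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, Optional, Sequence, Tuple

Sample = Tuple[float, float, float]  # hypothetical (x, y, z) accelerometer reading


@dataclass
class GestureEvent:
    name: str
    confidence: float
    timestamp: float  # seconds, on the same clock as the samples


class StreamProcessor:
    """Sliding window + confidence threshold + per-gesture debounce."""

    def __init__(
        self,
        recognize: Callable[[Sequence[Sample]], Tuple[Optional[str], float]],
        window_size: float = 1.0,
        confidence_threshold: float = 0.7,
        debounce_cooldown: float = 0.2,
    ) -> None:
        self.recognize = recognize
        self.window_size = window_size
        self.confidence_threshold = confidence_threshold
        self.debounce_cooldown = debounce_cooldown
        self.buffer: Deque[Tuple[float, Sample]] = deque()
        self.last_fired: Dict[str, float] = {}

    def add_sample(self, timestamp: float, sample: Sample) -> None:
        self.buffer.append((timestamp, sample))
        # Evict samples that have slid out of the analysis window.
        while timestamp - self.buffer[0][0] > self.window_size:
            self.buffer.popleft()

    def analyze(self, now: float) -> Optional[GestureEvent]:
        if not self.buffer:
            return None
        name, confidence = self.recognize([s for _, s in self.buffer])
        if name is None or confidence < self.confidence_threshold:
            return None  # confidence filter
        last = self.last_fired.get(name)
        if last is not None and now - last < self.debounce_cooldown:
            return None  # the same gesture fired too recently
        self.last_fired[name] = now
        return GestureEvent(name, confidence, now)


def stub_recognize(window: Sequence[Sample]) -> Tuple[Optional[str], float]:
    # Stand-in for the Phase 2 recognizer: "swipe" when mean x exceeds 1.0.
    mean_x = sum(s[0] for s in window) / len(window)
    return ("swipe_right", 0.92) if mean_x > 1.0 else (None, 0.0)


processor = StreamProcessor(stub_recognize)
for i in range(10):
    processor.add_sample(i * 0.01, (2.0, 0.0, 0.0))
first = processor.analyze(now=0.10)   # fires
second = processor.analyze(now=0.20)  # 100 ms later: debounced
third = processor.analyze(now=0.35)   # 250 ms after the first: fires again
```

Tracking the last firing time per gesture name, rather than globally, lets two different gestures fire back to back while still suppressing double triggers of the same one.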
2. Gesture → Action Mapper
Purpose: Map recognized gestures to executable actions.
Mapping Types:
1. Keyboard Shortcuts
- Gesture → Key combo (e.g., swipe_right → Cmd+Right)
- Platform-specific (macOS, Windows, Linux)
2. MIDI Commands
- Gesture → MIDI note/CC
- For DJ controllers, DAWs
3. Direct Integration
- Gesture → Rekordbox/Serato API call
- Gesture → Custom Python function
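The mapping types above can be sketched as a small dispatch table keyed on an entry's `action` field, using the same JSON shape as this section's configuration. The `ActionMapper` class and the handler functions are hypothetical names; the handlers only record what would be sent, so the sketch runs without a keyboard or MIDI backend.

```python
import json
from typing import Callable, Dict, List, Tuple

sent: List[Tuple] = []  # stands in for real keyboard/MIDI output


def send_keyboard(entry: dict) -> None:
    sent.append(("keyboard", entry["shortcut"]))


def send_midi(entry: dict) -> None:
    sent.append(("midi", entry["note"], entry["channel"]))


# One handler per mapping type; "direct integration" would add another key here.
HANDLERS: Dict[str, Callable[[dict], None]] = {
    "keyboard": send_keyboard,
    "midi": send_midi,
}


class ActionMapper:
    def __init__(self, mapping: Dict[str, dict]) -> None:
        # Fail at load time, not mid-performance, on an unknown action type.
        for name, entry in mapping.items():
            if entry.get("action") not in HANDLERS:
                raise ValueError(f"{name}: unknown action {entry.get('action')!r}")
        self.mapping = mapping

    def dispatch(self, gesture_name: str) -> bool:
        """Run the mapped action; return False for an unmapped gesture."""
        entry = self.mapping.get(gesture_name)
        if entry is None:
            return False
        HANDLERS[entry["action"]](entry)
        return True


CONFIG = json.loads("""
{
  "swipe_right": {"action": "keyboard", "shortcut": "Cmd+Right",
                  "description": "Play/Pause"},
  "circle_cw": {"action": "midi", "note": 60, "channel": 1,
                "description": "Loop In"}
}
""")

mapper = ActionMapper(CONFIG)
mapper.dispatch("swipe_right")
mapper.dispatch("circle_cw")
```

Validating the whole mapping in the constructor is a deliberate choice in this sketch: a typo in the configuration surfaces during setup instead of as a silent no-op on stage.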
Configuration:
{
  "swipe_right": {
    "action": "keyboard",
    "shortcut": "Cmd+Right",
    "description": "Play/Pause"
  },
  "circle_cw": {
    "action": "midi",
    "note": 60,
    "channel": 1,
    "description": "Loop In"
  }
}
3. Multimodal Integration
Purpose: Combine gestures with voice commands.
Interaction Patterns:
1. Voice sets context → Gesture executes
2. Gesture starts → Voice refines
3. Simultaneous voice + gesture
4. Gesture sequence (macro)
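Pattern 1 (voice sets context, gesture executes) could be sketched as a small state holder that remembers the last spoken context and lets it expire. The `ContextTracker` class is a hypothetical illustration, not the project's `MultiModalController`; the 2-second expiry follows the `GESTURE_TIMEOUT` value in the Configuration section.

```python
from typing import Optional, Tuple

GESTURE_TIMEOUT = 2.0  # seconds before a spoken context is cleared


class ContextTracker:
    """Remembers the most recent voice context for a limited time."""

    def __init__(self, timeout: float = GESTURE_TIMEOUT) -> None:
        self.timeout = timeout
        self._context: Optional[str] = None
        self._set_at = 0.0

    def on_voice(self, phrase: str, now: float) -> None:
        self._context = phrase
        self._set_at = now

    def on_gesture(self, gesture: str, now: float) -> Tuple[str, Optional[str]]:
        """Pair the gesture with the active context, or None if it expired."""
        if self._context is not None and now - self._set_at > self.timeout:
            self._context = None
        return (gesture, self._context)


tracker = ContextTracker()
tracker.on_voice("left deck", now=0.0)
in_time = tracker.on_gesture("swipe_right", now=1.0)
too_late = tracker.on_gesture("swipe_right", now=3.5)
```

Here `in_time` carries the `"left deck"` context, while `too_late` arrives after the timeout and falls back to a context-free gesture.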
Example:
User: "left deck" [voice]
→ System: Context = left deck
User: [swipe right] [gesture]
→ System: Execute "play left deck"
→ Output: Keyboard shortcut for left deck play
4. Live Performance Dashboard
Purpose: Monitor gesture recognition during performance.
Displays:
- Last 10 recognized gestures
- Confidence scores
- Recognition latency
- Gesture frequency heatmap
- Accuracy trends
Alerts:
- Low confidence warnings
- Sensor disconnection
- High latency spikes
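A rolling summary like the one the dashboard displays can be kept in a bounded queue. The sketch below is an assumption about how that might look: the `DashboardStats` and `Recognition` names are hypothetical, the low-confidence alert reuses the 0.7 recognition threshold, the latency alert uses the 100 ms "acceptable" end-to-end bound, and the latency figures in the usage lines are invented for the demonstration.

```python
from collections import deque
from dataclasses import dataclass
from statistics import mean
from typing import Deque, Dict, List


@dataclass
class Recognition:
    gesture: str
    confidence: float  # 0.0 to 1.0
    latency_ms: float


class DashboardStats:
    def __init__(self, history: int = 10, low_confidence: float = 0.7,
                 high_latency_ms: float = 100.0) -> None:
        # maxlen keeps only the last `history` recognitions.
        self.recent: Deque[Recognition] = deque(maxlen=history)
        self.low_confidence = low_confidence
        self.high_latency_ms = high_latency_ms

    def record(self, item: Recognition) -> None:
        self.recent.append(item)

    def summary(self) -> Dict[str, float]:
        if not self.recent:
            return {"count": 0, "avg_confidence": 0.0, "avg_latency_ms": 0.0}
        return {
            "count": len(self.recent),
            "avg_confidence": mean(r.confidence for r in self.recent),
            "avg_latency_ms": mean(r.latency_ms for r in self.recent),
        }

    def alerts(self, sensor_connected: bool = True) -> List[str]:
        found: List[str] = []
        if not sensor_connected:
            found.append("sensor disconnected")
        if self.recent:
            latest = self.recent[-1]
            if latest.confidence < self.low_confidence:
                found.append("low confidence")
            if latest.latency_ms > self.high_latency_ms:
                found.append("high latency")
        return found


stats = DashboardStats()
stats.record(Recognition("swipe_right", 0.92, 30.0))
stats.record(Recognition("circle_cw", 0.88, 34.0))
stats.record(Recognition("tap_twice", 0.94, 38.0))
healthy_summary = stats.summary()
healthy_alerts = stats.alerts()
stats.record(Recognition("swipe_left", 0.55, 140.0))
degraded_alerts = stats.alerts()
```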
---
Data Flow
Recognition Pipeline
Sensor Reading
↓
[Buffer Window: 0.5-2s]
↓
[Feature Extraction]
↓
[Template Matching] ← Trained Templates
↓
[Confidence Check: ≥70%]
↓
[Debounce Filter: 200ms cooldown]
↓
[Action Mapper]
↓
[Keyboard/MIDI/API]
↓
DJ Software
Timing Requirements
| Stage | Target Latency | Acceptable |
|---|---|---|
| Sensor → Buffer | <10ms | <20ms |
| Feature extraction | <5ms | <10ms |
| Template matching | <30ms | <50ms |
| Action execution | <10ms | <20ms |
| Total (end-to-end) | <60ms | <100ms |
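The budgets in this table can be checked mechanically. The per-stage targets sum to 55 ms, leaving 5 ms of headroom under the 60 ms end-to-end target, while the per-stage acceptable values sum to exactly 100 ms, leaving none. A minimal sketch, in which the dictionary keys and the `classify` and `timed` helpers are hypothetical names:

```python
import time
from typing import Any, Callable, Dict, Tuple

# (target_ms, acceptable_ms) per stage, copied from the timing table.
STAGE_BUDGETS: Dict[str, Tuple[float, float]] = {
    "sensor_to_buffer": (10, 20),
    "feature_extraction": (5, 10),
    "template_matching": (30, 50),
    "action_execution": (10, 20),
}

total_target = sum(target for target, _ in STAGE_BUDGETS.values())
total_acceptable = sum(acceptable for _, acceptable in STAGE_BUDGETS.values())


def classify(stage: str, elapsed_ms: float) -> str:
    """Compare one measured stage latency against its budget."""
    target, acceptable = STAGE_BUDGETS[stage]
    if elapsed_ms < target:
        return "target"
    if elapsed_ms < acceptable:
        return "acceptable"
    return "over budget"


def timed(stage: str, fn: Callable[..., Any], *args: Any) -> Tuple[Any, float, str]:
    """Run fn, returning its result, elapsed milliseconds, and budget label."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return result, elapsed_ms, classify(stage, elapsed_ms)


# Example: a 34 ms template match misses its target but is still acceptable.
label = classify("template_matching", 34.0)
```

`time.perf_counter()` is used rather than `time.time()` because it is monotonic and intended for measuring short intervals.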
---
Configuration
Gesture Window Settings
WINDOW_SIZE = 1.0 # Seconds of data to analyze
WINDOW_OVERLAP = 0.5 # 50% overlap for smoothness
MIN_WINDOW_SIZE = 0.3 # Minimum gesture duration
MAX_WINDOW_SIZE = 3.0 # Maximum gesture duration
Debouncing Settings
DEBOUNCE_COOLDOWN = 0.2 # 200ms between same gesture
GESTURE_TIMEOUT = 2.0 # Clear context after 2s
CONFIDENCE_THRESHOLD = 0.7 # Minimum confidence
Action Mapping
# Keyboard mapping
KEYBOARD_PLATFORM = "macos" # or "windows", "linux"
# MIDI mapping
MIDI_OUTPUT_PORT = "IAC Driver"
MIDI_CHANNEL = 1
# DJ software integration
DJ_SOFTWARE = "rekordbox" # or "serato", "traktor"
---
Performance Optimization
1. Sliding Window Efficiency
Instead of analyzing every frame:
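# Efficient: 10 Hz analysis (one pass every 100 ms)
analysis_interval = 0.1 # seconds
# A gesture waits at most 100 ms for the next pass, so with <100 ms of
# processing the worst-case total latency still stays under 200 ms
2. Template Caching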
Already implemented in Phase 2:
recognizer = GestureRecognizer(
    enable_caching=True,  # Cache templates
    enable_calibration=True,
)
# 85-95% cache hit rate = 50% faster
3. Parallel Recognition
For multi-gesture detection:
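from concurrent.futures import ThreadPoolExecutor

# Run recognition over several window variants in parallel (one task per variant)
with ThreadPoolExecutor() as executor:
    futures = [
        executor.submit(recognizer.recognize, features)
        for features in window_variants
    ]
    # Collect results in submission order
    results = [f.result() for f in futures]
4. Adaptive Thresholding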
Lower threshold during high-confidence periods:
# If recent gestures are high confidence (>90%)
if recent_avg_confidence > 0.9:
    current_threshold = 0.6  # Lower threshold
else:
    current_threshold = 0.7  # Standard threshold
---
Integration Points
A. Keyboard Emulation
Libraries:
- macOS: `pynput`, `pyautogui`
- Windows: `pyautogui`, `keyboard`
- Linux: `pynput`, `python-xlib`
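Because the mapping is platform-specific, a shortcut string such as `Cmd+Right` has to be normalized before any of these libraries can send it. The sketch below shows one way to do the string half of that job. The function name `normalize_shortcut` and the alias table are assumptions, in particular the choice to translate `Cmd` to `Ctrl` on Windows and Linux, which may not match every DJ application's bindings.

```python
from typing import Dict, List

# Assumption: "Cmd" shortcuts are bound to Ctrl outside macOS.
MODIFIER_ALIASES: Dict[str, Dict[str, str]] = {
    "macos": {"cmd": "cmd", "command": "cmd", "option": "alt"},
    "windows": {"cmd": "ctrl", "command": "ctrl", "option": "alt"},
    "linux": {"cmd": "ctrl", "command": "ctrl", "option": "alt"},
}


def normalize_shortcut(shortcut: str, platform: str = "macos") -> List[str]:
    """Split "Cmd+Right" into lowercase key names for the given platform."""
    if platform not in MODIFIER_ALIASES:
        raise ValueError(f"unsupported platform: {platform!r}")
    tokens = [part.strip().lower() for part in shortcut.split("+")]
    if not all(tokens):
        raise ValueError(f"malformed shortcut: {shortcut!r}")
    aliases = MODIFIER_ALIASES[platform]
    return [aliases.get(token, token) for token in tokens]


mac_keys = normalize_shortcut("Cmd+Right")
win_keys = normalize_shortcut("Cmd+Right", platform="windows")
```

With `pynput`, multi-character names such as `cmd` or `right` would then be looked up on `pynput.keyboard.Key` (for example `Key.cmd`, `Key.right`), while single characters can be passed to `press()` as they are.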
Example:
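from pynput.keyboard import Key, Controller

keyboard = Controller()

def execute_keyboard_action(shortcut: str) -> None:
    """Execute a keyboard shortcut such as "Cmd+Right"."""
    # parse_shortcut is not part of pynput; it is assumed to be defined
    # elsewhere and to return pynput keys, e.g. [Key.cmd, Key.right]
    keys = parse_shortcut(shortcut)
    # Press keys in order (modifiers first)
    for key in keys:
        keyboard.press(key)
    # Release keys in reverse order
    for key in reversed(keys):
        keyboard.release(key)
B. MIDI Output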
Library: `python-rtmidi` or `mido`
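Whichever library is used, the bytes on the wire are the same: a status byte that combines the message type with the zero-based channel, followed by two 7-bit data bytes. The helpers below are an illustrative sketch (the function names are not from either library) showing how the 1-based `channel` value from the mapping configuration would be encoded.

```python
from typing import List

NOTE_OFF = 0x80
NOTE_ON = 0x90
CONTROL_CHANGE = 0xB0


def _status(kind: int, channel: int) -> int:
    """Combine a message type with a 1-based channel (1-16)."""
    if not 1 <= channel <= 16:
        raise ValueError(f"channel must be 1-16, got {channel}")
    return kind | (channel - 1)


def _data(value: int, label: str) -> int:
    """Validate a 7-bit MIDI data byte."""
    if not 0 <= value <= 127:
        raise ValueError(f"{label} must be 0-127, got {value}")
    return value


def note_on(note: int, velocity: int = 127, channel: int = 1) -> List[int]:
    return [_status(NOTE_ON, channel), _data(note, "note"), _data(velocity, "velocity")]


def note_off(note: int, channel: int = 1) -> List[int]:
    return [_status(NOTE_OFF, channel), _data(note, "note"), 0]


def control_change(controller: int, value: int, channel: int = 1) -> List[int]:
    return [_status(CONTROL_CHANGE, channel), _data(controller, "controller"),
            _data(value, "value")]


loop_in = note_on(60, channel=1)       # middle C on channel 1
loop_in_ch2 = note_on(60, channel=2)   # same note, channel 2
```

Each returned list can be handed directly to `send_message()` on a `python-rtmidi` output.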
Example:
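import time

import rtmidi

midi_out = rtmidi.MidiOut()
ports = midi_out.get_ports()
# Open the first listed port (e.g. the IAC Driver virtual bus on macOS)
midi_out.open_port(0)

def send_midi_note(note: int, velocity: int = 127) -> None:
    """Send a MIDI note on channel 1, held for 100 ms."""
    # Note On (0x90 = Note On, channel 1)
    midi_out.send_message([0x90, note, velocity])
    # Hold the note for 100 ms; this blocks the calling thread
    time.sleep(0.1)
    # Note Off (0x80 = Note Off, channel 1)
    midi_out.send_message([0x80, note, 0])
C. Multimodal Controller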
Integration:
from src.dj_agent.gesture_control.multimodal_controller import (
    MultiModalController
)
controller = MultiModalController()
# Process gesture
action = controller.process_gesture(
    gesture_name="swipe_right",
    confidence=0.92,
)
if action:
    execute_action(action)
---
Safety & Error Handling
1. Sensor Disconnection
if not sensor_bridge.is_connected():
    logger.warning("Sensor disconnected, pausing recognition")
    # Auto-reconnect (Phase 1)
    await sensor_bridge.reconnect()
2. Low Confidence Filtering
if result.confidence < confidence_threshold:
    logger.debug(f"Low confidence: {result.confidence:.0%}")
    # Optionally: Trigger visual feedback for user
    return None
3. Gesture Flooding
# Prevent rapid-fire gestures
if time.time() - last_gesture_time < debounce_cooldown:
    logger.debug("Debouncing gesture")
    return None
4. Graceful Degradation
try:
    result = recognizer.recognize(features)
except Exception as e:
    logger.error(f"Recognition failed: {e}")
    # Fall back to basic motion detection
    result = detect_basic_motion(features)
---
Live Performance Workflow
Setup (5 minutes)
1. Start real-time system:
python run_live_gesture_control.py
2. Load trained gestures:
✅ Loaded 8 gesture templates
✅ Sensor connected
✅ Keyboard emulator ready
3. Test each gesture:
Testing: swipe_right
→ ✅ Triggered: Play/Pause
During Performance
1. Perform gestures naturally
2. System recognizes and executes
3. Dashboard shows real-time feedback
Monitoring
LIVE PERFORMANCE DASHBOARD
═══════════════════════════════════════════════════════════════
📊 Recognition Stats (Last 60s):
Total gestures: 45
Avg confidence: 89%
Avg latency: 34ms
Accuracy: 96%
🎯 Recent Gestures:
14:23:45 swipe_right 92% ✅ Play left deck
14:23:48 circle_cw 88% ✅ Loop 8 beats
14:23:52 tap_twice 94% ✅ Cue point
⚡ Performance:
Sensor rate: 98 Hz
Recognition rate: 10 Hz
Cache hit rate: 91%
⚠️ Warnings: None
---
Deployment Modes
Mode 1: Standalone Gesture Control
Phone → Gesture Recognition → Keyboard/MIDI → DJ Software
No voice, pure gesture control.
Mode 2: Multimodal (Voice + Gesture)
Phone → Gesture Recognition ──┐
                              ├→ Multimodal Controller → DJ Software
Gemini Live (Voice) ──────────┘
Voice and gestures work together.
Mode 3: Hybrid (Fallback)
Primary: Gesture Control
Fallback: Voice Control (if gesture fails)
Best reliability for live performance.
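The document does not spell out what counts as a gesture "failing", so the sketch below makes an assumption: the gesture path is considered unusable when the sensor is disconnected, no gesture was recognized, or its confidence is below the 0.7 threshold. The `select_command` function is a hypothetical name used only for this illustration.

```python
from typing import Optional, Tuple

CONFIDENCE_THRESHOLD = 0.7  # same minimum confidence as the gesture pipeline


def select_command(
    gesture: Optional[Tuple[str, float]],
    voice: Optional[str],
    sensor_connected: bool = True,
) -> Optional[Tuple[str, str]]:
    """Return (source, command): gesture when usable, otherwise voice."""
    gesture_usable = (
        sensor_connected
        and gesture is not None
        and gesture[1] >= CONFIDENCE_THRESHOLD
    )
    if gesture_usable:
        return ("gesture", gesture[0])
    if voice is not None:
        return ("voice", voice)
    return None  # nothing actionable from either channel


primary = select_command(("swipe_right", 0.92), voice="play")
fallback = select_command(("swipe_right", 0.40), voice="play")
offline = select_command(("swipe_right", 0.92), voice="play", sensor_connected=False)
```

In this sketch `primary` comes from the gesture channel, while `fallback` and `offline` both resolve to the voice command.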
---
Performance Benchmarks
Target Metrics
| Metric | Target | Acceptable | Critical |
|---|---|---|---|
| End-to-end latency | <60ms | <100ms | <150ms |
| Recognition accuracy | >95% | | |
| False positive rate | <2% | | |
| Gesture throughput | 10 Hz | 5 Hz | 2 Hz |
| Sensor dropout rate | <0.1% | | |
Real-World Performance
Based on production system (Phases 1 & 2):
- Recognition latency: 20-40ms ✅
- Cache hit rate: 85-95%
- Sensor reconnection: Automatic ✅
- Data loss: Zero (auto-save) ✅
---
Next Steps
Phase 3 Implementation:
1. Real-time gesture stream processor
2. Gesture → action mapper
3. Keyboard emulation
4. MIDI output
5. Multimodal integration
6. Live performance dashboard
7. Deployment scripts
Documentation:
- Real-time integration guide
- Keyboard/MIDI mapping reference
- Live performance best practices
- Troubleshooting guide
---
Phase 3 Architecture - Version 1.0
Author: Computational Choreography
Source Anchor
Comp-Core/apps/web/cc-studio/docs/dj_agent/gesture_control/realtime_architecture.md