Gesture Control Architecture - Motion-Based DJ Interface
Vision
Transform your phone into a motion-controlled DJ remote using:
- Gemini Live Video: Visual gesture interpretation
- Sensor Logger: High-precision IMU data (accelerometer, gyroscope, magnetometer)
- Fusion Engine: Combines both streams for robust recognition
- Training UI: Practice and refine gestures for accuracy
---
System Architecture
```
┌───────────────────────────────────────────────────────────────────┐
│  PHONE (Input Device)                                             │
├─────────────────────────┬─────────────────────────────────────────┤
│  Camera (Video)         │  Sensor Logger (IMU Data)               │
│  - Hand gestures        │  - Accelerometer                        │
│  - Body movements       │  - Gyroscope                            │
│  - Spatial context      │  - Magnetometer                         │
└──────────┬──────────────┴──────────────┬──────────────────────────┘
           │                             │
           │ WebRTC/                     │ WebSocket
           │ Video Stream                │ JSON Stream
           ▼                             ▼
┌───────────────────────────────────────────────────────────────────┐
│  GESTURE RECOGNITION SYSTEM                                       │
├───────────────────────────────────────────────────────────────────┤
│                                                                   │
│  ┌─────────────────────┐        ┌──────────────────────┐          │
│  │ Gemini Live Video   │        │ Sensor Analyzer      │          │
│  │ - Visual interpret  │        │ - Pattern matching   │          │
│  │ - Semantic meaning  │        │ - Numerical precision│          │
│  │ - Confidence score  │        │ - Temporal analysis  │          │
│  └──────────┬──────────┘        └──────────┬───────────┘          │
│             │                              │                      │
│             │     ┌────────────────────────┘                      │
│             │     │                                               │
│             ▼     ▼                                               │
│  ┌─────────────────────────────────────────────┐                  │
│  │ Gesture Fusion Engine                       │                  │
│  │ - Combine video + sensor data               │                  │
│  │ - Cross-validate interpretations            │                  │
│  │ - Calculate combined confidence             │                  │
│  │ - Reduce false positives                    │                  │
│  └──────────────────────┬──────────────────────┘                  │
│                         │                                         │
│                         ▼                                         │
│  ┌─────────────────────────────────────────────┐                  │
│  │ Gesture Matcher                             │                  │
│  │ - Compare to trained gesture database       │                  │
│  │ - Find best match                           │                  │
│  │ - Apply confidence threshold                │                  │
│  └──────────────────────┬──────────────────────┘                  │
└─────────────────────────┼─────────────────────────────────────────┘
                          │
                          │ Matched Gesture
                          ▼
┌───────────────────────────────────────────────────────────────────┐
│  KEYBOARD MAPPING LAYER                                           │
├───────────────────────────────────────────────────────────────────┤
│  Gesture      →  Keyboard Shortcut  →  Rekordbox Action           │
│                                                                   │
│  swipe_right  →  Cmd+Right          →  play/pause                 │
│  tap_twice    →  Space              →  cue                        │
│  circle_cw    →  L                  →  loop 4 beats               │
│  tilt_left    →  [                  →  crossfade left             │
│  shake_vert   →  S                  →  sync                       │
└─────────────────────────┬─────────────────────────────────────────┘
                          │
                          ▼
┌───────────────────────────────────────────────────────────────────┐
│  REKORDBOX                                                        │
│  (Receives keyboard events)                                       │
└───────────────────────────────────────────────────────────────────┘
```
---
Component Details
1. Input Sources
#### Gemini Live Video Stream
- Purpose: Semantic understanding of gestures
- Strengths:
  - Natural language interpretation ("user is swiping right")
  - Contextual awareness (hand position, body orientation)
  - Handles complex gestures (circles, waves, etc.)
- Weaknesses:
  - Network latency (100-300ms)
  - Less precise numerical data
  - Lighting dependent
Data Format:
```json
{
  "gesture": "swipe_right",
  "confidence": 0.92,
  "description": "Hand moves rapidly from left to right",
  "hand_position": {"x": 0.6, "y": 0.4}
}
```
#### Sensor Logger IMU Data
- Purpose: High-precision motion capture
- Strengths:
  - <10ms latency (nearly instantaneous)
  - Numerical precision (exact acceleration/rotation values)
  - Lighting independent
  - Works even when camera view is obscured
- Weaknesses:
  - No visual context
  - Requires pattern matching (not semantic)
  - Needs calibration
Data Format:
```json
{
  "timestamp": 1234567890.123,
  "accelerometer": {"x": 2.5, "y": -0.3, "z": 9.8},
  "gyroscope": {"x": 0.1, "y": -0.05, "z": 0.02},
  "magnetometer": {"x": 20.1, "y": 15.3, "z": -40.2}
}
```
2. Gesture Fusion Engine
Purpose: Combine video + sensor streams for robust recognition
Fusion Strategy:
```python
# GestureInterpretation, SensorPattern, FusedGesture, and match_sensor_pattern
# are assumed to be defined elsewhere in the project.
def fuse_gesture(
    video_interpretation: GestureInterpretation,
    sensor_pattern: SensorPattern,
) -> FusedGesture:
    """
    Combine video + sensor data.

    Fusion rules:
    1. If both agree (video says "swipe_right" AND sensor shows X-axis spike):
       → High confidence: average of both scores + 0.1, capped at 0.95
         (0.9-0.95 when both scores exceed 0.8)
    2. If one source is confident (>0.8) and the other is uncertain (<0.5):
       → Medium confidence: the confident score scaled by 0.9 (0.72-0.9)
    3. If sources disagree or both are uncertain:
       → Reject gesture (returned as 'unknown' with confidence 0.0)
    """
    # Extract video gesture
    video_gesture = video_interpretation.gesture_name
    video_conf = video_interpretation.confidence

    # Extract sensor gesture (from pattern matching)
    sensor_match = match_sensor_pattern(sensor_pattern)
    sensor_conf = sensor_match.confidence

    # Check agreement (compare gesture names, not the match object itself)
    if video_gesture == sensor_match.gesture:
        # Both agree - HIGH confidence
        combined_conf = min(0.95, (video_conf + sensor_conf) / 2 + 0.1)
        return FusedGesture(
            gesture=video_gesture,
            confidence=combined_conf,
            source='both',
        )
    elif video_conf > 0.8 and sensor_conf < 0.5:
        # Video confident, sensor uncertain - MEDIUM confidence
        return FusedGesture(
            gesture=video_gesture,
            confidence=video_conf * 0.9,
            source='video_primary',
        )
    elif sensor_conf > 0.8 and video_conf < 0.5:
        # Sensor confident, video uncertain - MEDIUM confidence
        return FusedGesture(
            gesture=sensor_match.gesture,
            confidence=sensor_conf * 0.9,
            source='sensor_primary',
        )
    else:
        # Disagreement or both uncertain - REJECT
        return FusedGesture(
            gesture='unknown',
            confidence=0.0,
            source='none',
        )
```
Benefits of Fusion:
- Reduces false positives (both must agree for high confidence)
- Handles partial failures (one source can compensate for the other)
- Improves accuracy (cross-validation between modalities)
3. Gesture Training System
Purpose: Record, label, and refine gesture patterns
Training Workflow:
```
1. Record Gesture
   ↓
   User performs gesture (e.g., "swipe right")
   System captures:
   - Video frames (5fps)
   - Sensor readings (100Hz)
   ↓
2. Label Gesture
   ↓
   User assigns name: "swipe_right"
   User assigns action: "play/pause"
   ↓
3. Extract Features
   ↓
   Video features:
   - Hand trajectory
   - Speed
   - Direction
   Sensor features:
   - Peak acceleration (X/Y/Z axes)
   - Duration
   - Rotation pattern
   ↓
4. Save to Database
   ↓
   gesture_database.json:
   {
     "swipe_right": {
       "video_features": {...},
       "sensor_features": {...},
       "keyboard_shortcut": "Cmd+Right",
       "samples": 15  // Number of training examples
     }
   }
   ↓
5. Practice Mode
   ↓
   User practices gesture
   System shows:
   - Confidence score
   - Which features matched
   - Suggestions for improvement
```
Training UI (Proposed):
```
┌─────────────────────────────────────────────────────────────┐
│  Gesture Trainer - Practice Mode                            │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Current Gesture: SWIPE RIGHT                               │
│                                                             │
│  ┌───────────────────────┐  ┌───────────────────────────┐   │
│  │ Video Feed            │  │ Sensor Visualization      │   │
│  │                       │  │                           │   │
│  │ [Camera view]         │  │ Accel: ████░░░░░          │   │
│  │                       │  │ Gyro:  ░░░░░░░░░          │   │
│  │                       │  │ Mag:   ░░░██████          │   │
│  └───────────────────────┘  └───────────────────────────┘   │
│                                                             │
│  Last Attempt: ██████████████████░░░░ 85% match             │
│                                                             │
│  ✓ Direction: Correct (right)                               │
│  ✓ Acceleration: Good (2.3 m/s²)                            │
│  ⚠ Duration: Too slow (0.8s, target: 0.3-0.5s)              │
│                                                             │
│  Suggestions:                                               │
│  • Move phone faster                                        │
│  • Keep motion more horizontal                              │
│                                                             │
│  [Practice]  [Save Sample]  [Test Against Rekordbox]        │
└─────────────────────────────────────────────────────────────┘
```
4. Gesture Mappings
Example Gesture Catalog:
| Gesture | Description | Video Cue | Sensor Pattern | Keyboard | Rekordbox Action |
|---|---|---|---|---|---|
| swipe_right | Swipe phone right | Hand right | accel_x > 2.0 | Cmd+Right | Play/Pause |
| swipe_left | Swipe phone left | Hand left | accel_x < -2.0 | Cmd+Left | Skip back |
| tap_twice | Double tap | Two taps | accel_z spikes x2 | Space | Cue |
| circle_cw | Draw clockwise circle | Hand circles | gyro rotation CW | L | Loop 4 beats |
| circle_ccw | Draw counter-clockwise | Hand circles | gyro rotation CCW | O | Exit loop |
| tilt_left | Tilt phone left | Hand tilts | accel_x < -1.5 hold | [ | Crossfade left |
| tilt_right | Tilt phone right | Hand tilts | accel_x > 1.5 hold | ] | Crossfade right |
| shake_vert | Shake up/down | Hand shakes | accel_y oscillates | S | Sync |
| pinch | Pinch fingers | Fingers together | N/A | - | Volume down |
| spread | Spread fingers | Fingers apart | N/A | + | Volume up |
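
The Sensor Pattern column above can be read as simple threshold rules. The following is a minimal sketch of the swipe rows only, assuming the thresholds are in m/s² and that a window of Sensor Logger samples has already been collected. `SensorMatch`, `match_swipe`, and the linear confidence formula are hypothetical and not taken from the project code.

```python
import json
from dataclasses import dataclass

# Threshold and shortcuts copied from the catalog table (units assumed: m/s²).
SWIPE_THRESHOLD = 2.0
SHORTCUTS = {"swipe_right": "Cmd+Right", "swipe_left": "Cmd+Left"}


@dataclass
class SensorMatch:
    gesture: str
    confidence: float


def match_swipe(samples: list[dict]) -> SensorMatch:
    """Classify a window of sensor samples as a left or right swipe."""
    if not samples:
        return SensorMatch("unknown", 0.0)
    # Take the x-axis reading with the largest magnitude in the window.
    peak = max((s["accelerometer"]["x"] for s in samples), key=abs)
    if abs(peak) <= SWIPE_THRESHOLD:
        return SensorMatch("unknown", 0.0)
    gesture = "swipe_right" if peak > 0 else "swipe_left"
    # Hypothetical confidence: grows linearly past the threshold, capped at 1.0.
    confidence = min(1.0, 0.5 + (abs(peak) - SWIPE_THRESHOLD) / 2.0)
    return SensorMatch(gesture, confidence)


raw = '{"timestamp": 1234567890.123, "accelerometer": {"x": 2.8, "y": -0.3, "z": 9.8}}'
match = match_swipe([json.loads(raw)])
print(match.gesture, round(match.confidence, 2), SHORTCUTS.get(match.gesture))
```

Rows such as `tilt_left` ("hold") and `shake_vert` ("oscillates") depend on how the signal changes over time, so they would need more than a single peak comparison.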
---
Implementation Phases
### Phase 1: Foundation (Weeks 1-2)
- ✅ Sensor Logger bridge (WebSocket server)
- ✅ Gemini Video analyzer (video stream processing)
- ⬜ Basic gesture fusion engine
- ⬜ Keyboard shortcut injection
### Phase 2: Training System (Weeks 3-4)
- ⬜ Gesture recorder (capture video + sensor data)
- ⬜ Feature extraction (video + sensor)
- ⬜ Gesture database (storage and retrieval)
- ⬜ Training UI (practice mode)
### Phase 3: Recognition Pipeline (Weeks 5-6)
- ⬜ Sensor pattern matching
- ⬜ Gesture fusion algorithm
- ⬜ Confidence calibration
- ⬜ Real-time recognition loop
### Phase 4: Integration & Polish (Weeks 7-8)
- ⬜ Integrate with existing voice control
- ⬜ Multi-modal control (voice + gestures)
- ⬜ Performance optimization (<50ms sensor-to-keyboard latency, <100ms fused)
- ⬜ User documentation and demos
---
Performance Targets
| Metric | Target | Notes |
|---|---|---|
| Gesture Recognition Accuracy | >90% | |
| False Positive Rate | <5% | |
| Latency (Sensor) | <50ms | Sensor → Keyboard |
| Latency (Video) | <300ms | Video → Keyboard (Gemini API overhead) |
| Latency (Fused) | <100ms | Combined (sensor is faster) |
| Training Time | <5min | Per gesture (15 samples) |
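
The accuracy and false positive targets in this table could be checked against a labeled practice session. This is a minimal sketch, assuming each trial is logged as an (intended, recognized) pair and that "false positive rate" means the share of idle windows that still triggered a gesture. The function name and the `"none"` label are hypothetical.

```python
def recognition_metrics(trials: list[tuple[str, str]]) -> dict:
    """Compute accuracy and false positive rate from (intended, recognized) pairs.

    An intended value of "none" marks a window where the user made no gesture.
    """
    gestures = [(i, r) for i, r in trials if i != "none"]
    idle = [(i, r) for i, r in trials if i == "none"]
    correct = sum(1 for i, r in gestures if i == r)
    false_pos = sum(1 for _, r in idle if r != "none")
    return {
        "accuracy": correct / len(gestures) if gestures else 0.0,
        "false_positive_rate": false_pos / len(idle) if idle else 0.0,
    }


session = (
    [("swipe_right", "swipe_right")] * 9
    + [("swipe_right", "none")]
    + [("none", "none")] * 19
    + [("none", "tap_twice")]
)
print(recognition_metrics(session))
```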
---
Technical Challenges & Solutions
Challenge 1: Network Latency (Gemini Video)
Problem: Gemini Live API adds 100-300ms latency
Solutions:
1. Sensor-First Mode: Use sensor data for time-critical gestures (play/pause, sync)
2. Predictive Buffering: Start action on sensor spike, confirm with video
3. Hybrid Approach: Quick gestures = sensor only, complex gestures = both
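
The hybrid approach amounts to a routing decision made as soon as the sensor match arrives. This sketch shows one way it might look. The `SENSOR_ONLY` set is an assumed split of the catalog into quick gestures, and the 0.8 cutoff is borrowed from the "confident" threshold in the fusion code. Neither is confirmed by the project.

```python
# Hypothetical set of quick gestures that may fire without video confirmation.
SENSOR_ONLY = {"swipe_right", "swipe_left", "tap_twice", "shake_vert"}


def route(sensor_gesture: str, sensor_conf: float, threshold: float = 0.8) -> str:
    """Decide whether to act on the sensor match now or wait for video."""
    if sensor_gesture in SENSOR_ONLY and sensor_conf >= threshold:
        # Sensor-first: skip the 100-300ms video round trip.
        return "fire_now"
    if sensor_gesture == "unknown":
        return "ignore"
    # Complex or low-confidence gestures go through fusion with video.
    return "await_video"


print(route("swipe_right", 0.9))
print(route("circle_cw", 0.9))
```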
Challenge 2: Gesture Ambiguity
Problem: "Swipe right" for play vs "swipe right" for next track
Solutions:
1. Context-Aware Mappings: Different mappings based on deck state
   - If playing: swipe = pause
   - If stopped: swipe = play
2. Multi-Gesture Sequences: Combine gestures for compound actions
   - Swipe right + tap = play right deck
3. Voice + Gesture: Use voice to disambiguate
   - Say "left" + swipe = action on left deck
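
Context-aware mapping can be sketched as a lookup that takes deck state into account before choosing an action. The `DeckContext` fields and the action names below are hypothetical placeholders. How the resulting action becomes a keyboard shortcut is left to the mapping layer.

```python
from dataclasses import dataclass


@dataclass
class DeckContext:
    selected_deck: str = "left"  # e.g. set by a voice command such as "left"
    playing: bool = False


def resolve_action(gesture: str, ctx: DeckContext) -> str:
    """Map a gesture to an action name using the current deck state."""
    if gesture == "swipe_right":
        # Same gesture, different meaning depending on playback state.
        verb = "pause" if ctx.playing else "play"
        return f"{verb}_{ctx.selected_deck}"
    if gesture == "circle_cw":
        return f"loop_4_beats_{ctx.selected_deck}"
    return "none"


print(resolve_action("swipe_right", DeckContext()))
print(resolve_action("swipe_right", DeckContext(selected_deck="right", playing=True)))
```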
Challenge 3: Sensor Calibration
Problem: Different phones have different sensor characteristics
Solutions:
1. Auto-Calibration: Measure baseline (gravity) on startup
2. Adaptive Thresholds: Learn user's typical motion range during training
3. Per-Device Profiles: Save calibration per phone model
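
Auto-calibration can be approximated by averaging accelerometer readings while the phone is held still and subtracting that baseline afterwards. The sketch below assumes samples in the Sensor Logger JSON shape shown earlier. Both function names are hypothetical. Subtracting a fixed baseline is only valid while the phone keeps roughly the same orientation, so a real implementation would likely need orientation tracking as well.

```python
from statistics import fmean


def estimate_baseline(samples: list[dict]) -> dict:
    """Average accelerometer readings taken while the phone is at rest."""
    return {axis: fmean(s["accelerometer"][axis] for s in samples) for axis in "xyz"}


def remove_baseline(sample: dict, baseline: dict) -> dict:
    """Subtract the resting baseline so thresholds see only user motion."""
    return {axis: sample["accelerometer"][axis] - baseline[axis] for axis in "xyz"}


still = [
    {"accelerometer": {"x": 0.1, "y": -0.3, "z": 9.8}},
    {"accelerometer": {"x": -0.1, "y": -0.3, "z": 9.8}},
]
baseline = estimate_baseline(still)
moving = {"accelerometer": {"x": 2.5, "y": -0.3, "z": 9.8}}
print(remove_baseline(moving, baseline))
```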
---
Example Usage Scenarios
Scenario 1: Live Performance (Sensor-Primary Mode)
Goal: Minimal latency, simple gestures only
```
User: [Swipe right rapidly]
  → Sensor detects: accel_x spike (2.8 m/s²)
  → Pattern match: "swipe_right" (confidence: 0.9)
  → Keyboard: Cmd+Right
  → Rekordbox: Play/Pause
  → Latency: 45ms ✅
```
No video stream needed - sensor is sufficient for simple gestures
Scenario 2: Studio Practice (Fusion Mode)
Goal: Learn complex gestures, high accuracy
```
User: [Draws circle clockwise]
  → Video: "circle_cw" (confidence: 0.85)
  → Sensor: gyro rotation pattern matches CW (confidence: 0.90)
  → Fusion: Both agree, combined confidence: 0.95 ✅
  → Keyboard: L
  → Rekordbox: Loop 4 beats
  → Latency: 180ms (acceptable for practice)
```
Both streams used - high accuracy for learning
Scenario 3: Multi-Modal Control
Goal: Combine voice + gestures for complex actions
```
User: "left deck" [voice]
  → Context: Left deck selected

User: [Swipe right]
  → Sensor: "swipe_right"
  → Context-aware mapping: Play left deck
  → Keyboard: Cmd+Left → Cmd+Right
  → Rekordbox: Play left deck

User: [Circle clockwise]
  → Sensor + Video: "circle_cw"
  → Context-aware: Loop left deck, 4 beats
  → Keyboard: L
  → Rekordbox: Loop 4 beats on left
```
Voice sets context, gestures perform actions
---
Next Steps
1. Test Sensor Logger Integration
```bash
# Install Sensor Logger app on phone
# Configure to stream to ws://your-computer-ip:8765
python sensor_logger_bridge.py
```
2. Test Gemini Video Analyzer
```bash
# Ensure camera access
python gemini_video_analyzer.py
```
3. Design Gesture Catalog
- Which gestures map to which Rekordbox actions?
- Priority gestures for live performance?
4. Build Training UI
- Start with simple web-based UI
- Record 15 samples per gesture
- Measure recognition accuracy
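
For the recording step, the sensor features named in the training workflow (peak acceleration per axis, duration) could be extracted and appended to `gesture_database.json` roughly as follows. This is a sketch under assumptions: the function names and the per-sample `sensor_samples` list are hypothetical, and the real database layout may differ from the one shown here.

```python
import json
from pathlib import Path


def extract_sensor_features(samples: list[dict]) -> dict:
    """Peak absolute acceleration per axis plus recording duration in seconds."""
    peaks = {
        axis: max(abs(s["accelerometer"][axis]) for s in samples) for axis in "xyz"
    }
    duration = samples[-1]["timestamp"] - samples[0]["timestamp"]
    return {"peak_accel": peaks, "duration_s": duration}


def save_sample(db_path: Path, name: str, shortcut: str, features: dict) -> int:
    """Append one training sample to the JSON database and return the sample count."""
    db = json.loads(db_path.read_text()) if db_path.exists() else {}
    entry = db.setdefault(name, {"keyboard_shortcut": shortcut, "sensor_samples": []})
    entry["sensor_samples"].append(features)
    entry["samples"] = len(entry["sensor_samples"])
    db_path.write_text(json.dumps(db, indent=2))
    return entry["samples"]


recording = [
    {"timestamp": 0.0, "accelerometer": {"x": 0.1, "y": 0.0, "z": 9.8}},
    {"timestamp": 0.4, "accelerometer": {"x": -2.8, "y": 0.2, "z": 9.7}},
]
features = extract_sensor_features(recording)
print(features)
```

Calling `save_sample(Path("gesture_database.json"), "swipe_right", "Cmd+Right", features)` once per recording would build up the 15 samples per gesture mentioned above.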
---
Future Enhancements
### Multi-User Gesture Profiles
- Save gesture patterns per user
- Different users have different motion styles
- Auto-detect user from gesture "signature"
### Gesture Macros
- Chain multiple gestures into sequences
- Example: Circle + swipe = "loop and transition"
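
A gesture macro could be detected by looking for two recognized gestures that occur back to back within a short time window. The sketch below is one possible shape for that check. The `MACROS` table, the 1.5 second window, and the choice of `circle_cw` followed by `swipe_right` for the example are all assumptions.

```python
from typing import Optional

# Hypothetical macro table: (first gesture, second gesture) -> macro name.
MACROS = {("circle_cw", "swipe_right"): "loop_and_transition"}


def detect_macro(
    events: list[tuple[float, str]], window_s: float = 1.5
) -> Optional[str]:
    """Return the first macro found in a time-ordered list of (timestamp, gesture)."""
    for (t1, g1), (t2, g2) in zip(events, events[1:]):
        if (g1, g2) in MACROS and t2 - t1 <= window_s:
            return MACROS[(g1, g2)]
    return None


print(detect_macro([(10.0, "circle_cw"), (10.8, "swipe_right")]))
```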
### Adaptive Learning
- System improves recognition over time
- Learns user's specific motion patterns
- Reduces false positives with experience
### Haptic Feedback
- Phone vibrates on successful gesture recognition
- Different vibration patterns for different actions
- Helps user learn correct gesture execution
---
Ready to turn your phone into a DJ controller! 🎧📱
Author: Computational Choreography
Version: 1.0 - Gesture Control Architecture
Source Anchor
Comp-Core/apps/web/cc-studio/docs/dj_agent/gesture_control/architecture.md