MotionMix Architecture
> Your body is the phrase. Motion discovers sound. Every session expands what's possible.
System Overview
MotionMix is a motion-to-music performance system. A webcam tracks body movement via pixel differencing, maps motion zones to audio parameters, and drives a real-time music engine. Multiple phone cameras provide visual coverage with AI director auto-switching. The system learns progressively: each training session captures new motion vocabulary, unlocking new sound transformations.
                ┌────────────────────────────────────────┐
                │        DIRECTOR DASHBOARD :9404        │
                │  ┌──────────┐  ┌──────────────────┐    │
                │  │ HERO:    │  │ ORB VISUALIZER   │    │
                │  │ Active   │  │ Canvas 2D /      │    │
                │  │ Phone    │  │ future: TD       │    │
                │  │ Camera   │  │ (audio-reactive) │    │
                │  └──────────┘  └──────────────────┘    │
                │ [params strip: BRIGHT|DENSITY|ENERGY]  │
                │ [phone thumbnails with director cuts]  │
                └───────────────────┬────────────────────┘
                                    │
                ┌───────────────────┴────────────────────┐
                │              AUDIO ENGINE              │
                │                                        │
Webcam ───────► │ SimpleHandTracker (pixel-diff 60fps)   │
(hidden)        │           │                            │
                │           ▼                            │
                │ ParamMapper (brightness/density/energy)│
                │           │                            │
                │           ▼                            │
                │ ToneHouse (Tone.js: kick/clap/hat/     │
                │   bass/pad + filter + reverb)          │
                │           │                            │
                │           ▼                            │
                │ Browser Audio Output (speakers)        │
                └───────────────────┬────────────────────┘
                                    │
                ┌───────────────────┴────────────────────┐
                │         DIRECTOR ENGINE (Rust)         │
                │                                        │
Phone 1 ──────► │ Murch Scoring (energy/motion/face/     │
Phone 2 ──────► │   staleness) every 500ms               │
Phone 3 ──────► │           │                            │
Phone 4 ──────► │           ▼                            │
Phone 5 ──────► │ Auto-cut with BreathPhase pacing       │
                │   (inhale/hold/exhale cycle)           │
                │           │                            │
                │           ▼                            │
                │ WebSocket → Dashboard hero switches    │
                └────────────────────────────────────────┘

Core Principle: Separation of Concerns
| Concern | Owner | Network |
|---|---|---|
| Audio generation | Mac1 webcam (pixel-diff tracker) | Local only |
| Visual coverage | 5 phone cameras (MJPEG streams) | WiFi to Mac1 |
| Director switching | Rust multicam-server scoring | WebSocket |
| Recording (video) | Each phone records 4K locally | None during session |
| Recording (audio) | Browser MediaRecorder on Mac1 | None |
| Visualizer (live) | Canvas 2D orb in dashboard | Local only |
| Visualizer (production) | TouchDesigner on Mac2 | OSC/bridged |
Phones NEVER produce audio. They are camera-only nodes.
Current LUME Extension: Body Truth Layer
The LUME/DYK + K11 Rekordbox architecture adds one missing authority layer above
the raw camera and motion feeds: `LumeBodyTruth`.
Current live split:
| Machine | Responsibility |
|---|---|
| Mac4 | Unity DYK visuals, Femto Mega depth/RGB, first Sony/mocopi integration target |
| K11 | Rekordbox, AirDeck gesture bridge, pose viewer, keyboard/MIDI command safety gate |
| MotionMix hub | Shared fused body bus and future `LumeBodyTruth` publisher |
| iPhones | Supplemental SAN/camera/motion evidence and training data |
`LumeBodyTruth` is specified in
[`LUME-BODY-TRUTH-CONTRACT.md`](./LUME-BODY-TRUTH-CONTRACT.md). It wraps the
current `fused_body_state`/`/skeleton-3d` path with stricter performer presence,
confidence, stillness, burst, tracking-loss, and gesture-intent semantics.
The invariant: Unity may visualize intent, but K11 remains the final Rekordbox
command gate.
File Map
Dashboard (served via `include_str!` from Rust binary)
multicam-server/
  src/
    main.rs                    # Rust: HTTP/WS server, director engine, Murch scoring
    dashboard.html             # The full dashboard UI (HTML + CSS + inline JS)
  static/js/
    hand-tracker-simple.js     # Pixel-diff body zone tracker (THE audio driver)
    param-mapper.js            # Motion → audio parameter mapping
    tone-house.js              # Tone.js music engine (6 instruments, genre patterns)
    visualizer.js              # Canvas 2D orb + radial bars + waveform ring + particles
    strudel-engine.js          # Strudel pattern engine (secondary, future primary)
    strudel-motion-bundle.js   # Strudel runtime bundle
    chest-flex.js              # TF.js chest flex detector (one-shot triggers)

Web App (Vercel: motionmix.vercel.app)
js/
  app.js                   # Main app entry, wires everything together
  hand-tracker-simple.js   # Same pixel-diff tracker (canonical copy)
  param-mapper.js          # Same mapper
  tone-house.js            # Same engine
  visualizer.js            # Same orb
  movement-vocab.js        # Movement vocabulary classifier (future: per-session learning)
  session-exporter.js      # Exports session data to Supabase
  director-client.js       # Connects to multicam director for phone integration
  mediapipe-tracker.js     # MediaPipe Hands (legacy, replaced by pixel-diff)
  gesture-map.js           # Gesture → Strudel transformation mapping
  lyria-engine.js          # Google Lyria integration (experimental)
  gemini-live.js           # Gemini Live voice control (experimental)
  gemini-vision.js         # Gemini Vision scene analysis (experimental)
  sensors.js               # Phone accelerometer/gyroscope bridge
  voice-control.js         # Voice commands for genre/BPM changes

iOS App (MotionMixApp)
Desktop/MotionMixApp/
  MotionMixApp/
    Services/
      CameraService.swift        # Camera capture + MJPEG streaming
      DirectorHubClient.swift    # WebSocket to multicam-server
      AudioEngine.swift          # DISABLED (camera-only mode)
      MusicPlayerService.swift   # DISABLED
      RecordingService.swift     # 4K local recording
      StrudelWebEngine.swift     # DISABLED
    Views/
      SettingsView.swift         # Camera settings, resolution, stream toggle

Computational Choreography (Rust, future integration)
Desktop/computational-choreography/
  crates/
    runtime/core-rs/src/
      lim_rps/
        config.rs         # LIM-RPS configuration
        dynamics.rs       # Latent dynamics extraction
        latent_state.rs   # Latent state representation
        operator.rs       # CrossModalOperator, LimRpsOperator
        solver.rs         # Latent trajectory solver
        spectral.rs       # Spectral analysis of latent motion
      motion/             # Motion capture ingestion
      semantic/           # Semantic motion labeling
      echelon/            # EchelonDiT model integration
      retrieval/          # Motion retrieval from library
  paper/paper.md          # Research paper with LIM-RPS formalization

Audio Pipeline (Current: Session 1)
Webcam (640x480) ──► Downscale to 160x120
        │
Pixel Diff (frame N vs N-1)
        │
10 Body Zones:
┌─────────────────────────────────┐
│ rightArm    head     leftArm    │
│ rightBody   center   leftBody   │
│         centerL  centerR        │
│              legs               │
│       full (entire frame)       │
└─────────────────────────────────┘
        │
Zone Motion → Params:
  rightArm motion → brightness (pad volume + filter)
  leftArm motion  → density (hat + clap volume)
  full motion     → energy (kick volume, overall)
  legs + center   → bouncing (bass volume)
  head motion     → energy boost
  core motion     → open hat + reverb
        │
ParamMapper:
  Presence detection (5-frame absent → wind down)
  Velocity boost (sudden moves = punchy changes)
  Snap-back (motion stops → fast decay to baseline)
        │
ToneHouse:
  Kick: -8dB baseline → 0dB at full energy
  Pad:  -22dB baseline → -2dB at full brightness
  Hats: silent → -4dB at full density
  Clap: silent → -4dB at full density
  Bass: silent → 0dB at full bounce
  OHat: silent → -10dB at full core motion

### Thresholds (tuned for seated/desk distance)
| Parameter | Value | Purpose |
|-----------|-------|---------|
| Pixel diff threshold | 28 | Per-pixel RGB diff to count as motion |
| Noise floor | 0.008 | Full-frame motion below this = zero all params |
| Presence threshold | 0.03 | Per-zone motion to count as "present" |
| Hard floor | 0.01 | Smoothed values below this snap to zero |
| Arm multiplier | 5x | Boosted for desk-distance arm motion |
| Energy multiplier | 3x full + 1x head | Boosted for seated upper-body use |
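
As a rough illustration of how the pixel-diff threshold and noise floor above could be applied, here is a minimal sketch. The function names (`zoneMotion`, `gateZones`), the rectangle format for a zone, and the choice to average the R, G and B differences are assumptions made for this example; the source states only that a per-pixel RGB diff above 28 counts as motion.

```javascript
// Thresholds from the table above; everything else is an illustrative assumption.
const PIXEL_DIFF_THRESHOLD = 28; // per-pixel RGB diff that counts as motion
const NOISE_FLOOR = 0.008;       // full-frame motion below this zeroes all params

// Fraction of pixels inside `zone` that changed between two RGBA frames.
// `prev` and `curr` are flat RGBA byte arrays (e.g. ImageData.data) of equal size.
// ASSUMPTION: a pixel "moves" when the mean absolute difference of its
// R, G and B channels exceeds the threshold. The real tracker may differ.
function zoneMotion(prev, curr, width, zone) {
  let changed = 0;
  for (let y = zone.y; y < zone.y + zone.h; y++) {
    for (let x = zone.x; x < zone.x + zone.w; x++) {
      const i = (y * width + x) * 4; // 4 bytes per pixel, alpha ignored
      const diff =
        (Math.abs(curr[i] - prev[i]) +
          Math.abs(curr[i + 1] - prev[i + 1]) +
          Math.abs(curr[i + 2] - prev[i + 2])) / 3;
      if (diff > PIXEL_DIFF_THRESHOLD) changed++;
    }
  }
  return changed / (zone.w * zone.h);
}

// Apply the noise floor: if the whole frame is nearly still, report zero everywhere.
function gateZones(zones, fullMotion) {
  if (fullMotion < NOISE_FLOOR) {
    return Object.fromEntries(Object.keys(zones).map((k) => [k, 0]));
  }
  return zones;
}
```

In a browser the two byte arrays would typically come from `getImageData` on a canvas that the 160x120 downscaled frame has been drawn into.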
### Gesture Detection
| Gesture | Trigger | Audio Effect |
|---------|---------|-------------|
| raise_arms | Head zone + arm present, energy > 0.3 | Pad + hat open up, filter sweep to 6kHz |
| bounce | Rhythmic peaks in legs+center | Bass volume builds |
| punch | Body energy > 0.6 | Kick emphasis, 200ms boost |
| freeze | Body energy < 0.03 | Drums mute, pad-only breakdown, 4s recovery |
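
The single-frame triggers in the table can be sketched as a small classifier. This is a hedged illustration, not the project's detector: the function name `classifyGesture` and the precedence between overlapping triggers are assumptions, and `bounce` is left out because it depends on a history of rhythmic peaks rather than one frame.

```javascript
// Per-zone motion needed to count as "present" (from the thresholds table).
const PRESENCE_THRESHOLD = 0.03;

// ASSUMPTION: precedence freeze > punch > raise_arms is illustrative;
// the source does not say how overlapping triggers are resolved.
// `bounce` is omitted because it needs a history of peaks, not one frame.
function classifyGesture(zones, energy) {
  const present = (name) => (zones[name] ?? 0) >= PRESENCE_THRESHOLD;
  if (energy < 0.03) return "freeze";
  if (energy > 0.6) return "punch";
  const armPresent = present("rightArm") || present("leftArm");
  if (present("head") && armPresent && energy > 0.3) return "raise_arms";
  return null; // no gesture recognized this frame
}
```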
Progressive Learning Pipeline (Sessions 1-15)
### Session 1 (Current)
- Static mapping: zone motion → volume/filter
- 6 genres: house, techno, hiphop, afrobeat, jazz, ambient
- Movement detection: basic pixel-diff zones
- Audio: ToneHouse (Tone.js, 6 voices)
- Data captured: raw motion params per frame (brightness, density, energy + zone data)
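
To make the static mapping concrete, the sketch below smooths a raw parameter and converts it to a voice level in decibels, using the baseline and ceiling values listed in the audio pipeline. The helper names (`smoothParam`, `paramToDb`), the smoothing rates, and the -60 dB fade-in point for "silent" voices are assumptions for illustration only.

```javascript
const HARD_FLOOR = 0.01; // smoothed values below this snap to zero (thresholds table)

// One-pole smoothing with separate attack and release rates. The release rate
// is higher so values decay quickly once motion stops ("snap-back").
// ASSUMPTION: 0.3 and 0.6 are placeholder rates, not the tuned values.
function smoothParam(prev, target, attack = 0.3, release = 0.6) {
  const rate = target > prev ? attack : release;
  const next = prev + (target - prev) * rate;
  return next < HARD_FLOOR ? 0 : next;
}

// Linear map from a 0..1 param to decibels between a baseline and a ceiling.
// Pass baselineDb = -Infinity for voices that start silent.
function paramToDb(param, baselineDb, maxDb) {
  const p = Math.min(1, Math.max(0, param));
  if (baselineDb === -Infinity) {
    // ASSUMPTION: silent voices fade in from -60 dB; the source only says "silent".
    return p === 0 ? -Infinity : -60 + (maxDb + 60) * p;
  }
  return baselineDb + (maxDb - baselineDb) * p;
}
```

For example, `paramToDb(energy, -8, 0)` would cover the kick range and `paramToDb(density, -Infinity, -4)` the hats.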
### Sessions 2-5: Movement Vocabulary Discovery
- `movement-vocab.js` classifier starts learning YOUR specific gestures
- Each session captures motion trajectories tagged with audio outcomes
- Overnight training: movement classifier on Mac5 (MLX LoRA on gesture data)
- New gestures get labeled: "chest pump sequence", "wave flow", "shoulder roll"
- Each recognized gesture maps to a specific Strudel pattern transformation
### Sessions 6-10: Strudel Pattern Transformations
- Body goes from volume control → pattern EDITING
- Strudel's pattern language unlocks:
| Motion | Strudel Transformation | Audio Result |
|---|---|---|
| Arm wave | `rev` | Reverse the current phrase |
| Chest pump | `fast(2)` | Double-time the beat |
| Slow flow | `slow(2) . jux(rev)` | Half-time + stereo split |
| Shoulder roll | `striate(8)` | Interleaved sample slicing |
| Full body surge | `superimpose(fast(2))` | Layer double-time over original |
| Freeze + hold | `chop(16)` | Granular slice the current phrase |
| Spin/turn | Genre cycle | Switch entire pattern set |
- Each new motion discovered = new transformation slot unlocked
- The system PROPOSES sounds when you enter unexplored motion regions
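
One way to picture "transformation slots" is a lookup table that only answers for gestures that have already been discovered. The transformation strings below are copied from the table; the gesture keys, the `TransformSlots` class and its methods are hypothetical names for this sketch, and nothing here calls the Strudel runtime.

```javascript
// Transformation strings copied from the table above.
// ASSUMPTION: gesture keys and the unlock mechanism are hypothetical.
const TRANSFORMS = new Map([
  ["arm_wave", "rev"],
  ["chest_pump", "fast(2)"],
  ["slow_flow", "slow(2) . jux(rev)"],
  ["shoulder_roll", "striate(8)"],
  ["full_body_surge", "superimpose(fast(2))"],
  ["freeze_hold", "chop(16)"],
]);

class TransformSlots {
  constructor() {
    this.unlocked = new Set();
  }
  // Record a newly recognized gesture; returns true only the first time it is seen.
  discover(gesture) {
    if (!TRANSFORMS.has(gesture) || this.unlocked.has(gesture)) return false;
    this.unlocked.add(gesture);
    return true;
  }
  // Transformation for a gesture, or null if it has not been unlocked yet.
  lookup(gesture) {
    return this.unlocked.has(gesture) ? TRANSFORMS.get(gesture) : null;
  }
}
```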
### Nightly Training Cron (`server/cc_training_cron.py`)
- LaunchAgent `com.cc.nightly-training` fires at 2 AM
- Tiered training based on accumulated session count:
- >= 5 sessions: retrain chest flex model (local, 10 epochs, 30 min)
- >= 15 sessions: start CC-MotionGen training (SSH to Mac5, 50 epochs)
- >= 30 sessions: start EchelonDiT training (SSH to Mac5, 100 epochs, 2h)
- Always runs CLIP embedding pipeline on new sessions first
- State tracked in `[home-path]`
- Session data in `Desktop/MotionMix/sessions/`, models to `Desktop/MotionMix/models/`
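
The tier selection can be sketched as below. It is shown in JavaScript purely for illustration (the cron script itself is Python), and the job names, the data shape, and the reading that every tier at or below the session count runs are assumptions; only the thresholds, hosts and epoch counts come from the list above.

```javascript
// Tier thresholds, hosts and epochs copied from the list above.
// ASSUMPTION: job names and this data shape are illustrative.
const TIERS = [
  { minSessions: 5, job: "chest-flex-retrain", where: "local", epochs: 10 },
  { minSessions: 15, job: "cc-motiongen", where: "mac5", epochs: 50 },
  { minSessions: 30, job: "echelon-dit", where: "mac5", epochs: 100 },
];

// Returns the ordered list of jobs for one nightly run.
// ASSUMPTION: tiers are cumulative, so 30 sessions triggers all three.
function planNightlyJobs(sessionCount) {
  // The CLIP embedding pass always runs first.
  const jobs = [{ job: "clip-embeddings" }];
  for (const tier of TIERS) {
    if (sessionCount >= tier.minSessions) jobs.push(tier);
  }
  return jobs;
}
```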
### Sessions 11-15: LIM-RPS Latent Frame Integration
LIM-RPS = Lipschitz-constrained Implicit Map for Recursive Polymodal Synthesis
- A unified equilibrium solver for cross-modal latent fusion
- Finds equilibrium latents z* where motion, audio, and other modalities converge through spectrally-normalized fixed-point iteration
- Motion goes through the Computational Choreography latent encoder
- LIM-RPS extracts: norm, velocity norm, acceleration norm, predictability
- Latent trajectory QUALITY determines musical COMPLEXITY
- High predictability = simple, resolved phrases
- Low predictability (novel motion) = complex, exploratory patterns
- Commitment scalar = how strongly the phrase resolves
- `CrossModalOperator` maps latent motion → Strudel parameters in latent space
- The system discovers sound regions you haven't heard before
- Full body = phrase editor, not phrase trigger
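
One common way to read "spectrally-normalized fixed-point iteration" is as a contraction mapping. The notation below ($f_\theta$, $x_{\text{motion}}$, $x_{\text{audio}}$, $L$) is introduced here as an assumption and is not taken from the source; the formalization in `paper/paper.md` may differ.

```latex
% Iterate the update map until it stops changing; z^* is the equilibrium latent.
z_{k+1} = f_\theta\bigl(z_k;\, x_{\text{motion}}, x_{\text{audio}}\bigr),
\qquad
z^\ast = f_\theta\bigl(z^\ast;\, x_{\text{motion}}, x_{\text{audio}}\bigr)

% If f_\theta is L-Lipschitz in z with L < 1 (the role of the Lipschitz
% constraint), the Banach fixed-point theorem gives a unique z^* and
% geometric convergence from any starting point z_0.
\bigl\| f_\theta(z; x) - f_\theta(z'; x) \bigr\| \le L \,\| z - z' \|,
\quad L < 1
\;\Longrightarrow\;
\| z_k - z^\ast \| \le L^{k} \,\| z_0 - z^\ast \|
```

Under this reading, spectral normalization is what keeps $L$ bounded, since it caps the spectral norm of each weight matrix in $f_\theta$.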
### The Key Insight
> "Our body is the phrase and we're redefining phrases. Training a song is perhaps just training a different phrase itself, because we are determining the structure."
Traditional: Verse → Chorus → Verse → Bridge
MotionMix: Your body determines what a "phrase" IS. A chest pump sequence = a phrase. A wave flow = a phrase. The TRANSITION between them = the musical transition.
Director Engine (Rust)
### Scoring Algorithm (Murch-Inspired)
Every 500ms, all connected cameras are scored:
| Factor | Weight | Source |
|---|---|---|
| Energy | 35 | |
| Motion | 25 | |
| Staleness | 25 | |
| Face score | 15 | |
When you walk toward a phone camera, its face_score and energy rise. When you walk away, they drop. The director cuts to whichever phone has the best composite score, subject to BreathPhase pacing (inhale = accept cuts freely, hold = lock current, exhale = only accept strong cuts).
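
The weighted score and the phase gating can be sketched as follows. This is written in JavaScript for readability rather than as a port of the Rust code in `main.rs`. The weights come from the table; the 0..1 normalization of each factor, the meaning of a high staleness value, the margin numbers and the function names are all assumptions.

```javascript
// Weights from the table above (they sum to 100).
const WEIGHTS = { energy: 35, motion: 25, staleness: 25, face: 15 };

// ASSUMPTION: each factor is already normalized to 0..1, and a higher
// staleness value makes a camera more deserving of airtime.
function compositeScore(cam) {
  return (
    WEIGHTS.energy * cam.energy +
    WEIGHTS.motion * cam.motion +
    WEIGHTS.staleness * cam.staleness +
    WEIGHTS.face * cam.face
  );
}

// ASSUMPTION: the margins are placeholders. The source says only that inhale
// accepts cuts freely, hold locks the current camera, and exhale needs a strong cut.
const MARGIN = { inhale: 0, hold: Infinity, exhale: 20 };

// Returns the id of the camera to show next.
function pickCamera(cameras, currentId, phase) {
  if (cameras.length === 0) return currentId;
  const best = cameras.reduce((a, b) =>
    compositeScore(b) > compositeScore(a) ? b : a
  );
  const current = cameras.find((c) => c.id === currentId);
  if (!current) return best.id; // current camera disconnected: take the best one
  const lead = compositeScore(best) - compositeScore(current);
  return lead > MARGIN[phase] ? best.id : currentId;
}
```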
### Ghost Mode
In Ghost Mode the director proposes a cut and waits 2 seconds. If you don't veto it (press V), the cut executes. This gives you manual override during a performance.
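
A propose/veto window like this can be modeled as a tiny state machine. The sketch below is an assumption about structure, not the engine's code: `GhostCut` and its methods are hypothetical names, and the clock is passed in explicitly so the logic can be exercised without real timers.

```javascript
const VETO_WINDOW_MS = 2000; // from the text: the director waits 2 seconds

// ASSUMPTION: class and method names are hypothetical.
class GhostCut {
  constructor() {
    this.pending = null;
  }
  // Director proposes a cut at time nowMs (milliseconds).
  propose(cameraId, nowMs) {
    this.pending = { cameraId, at: nowMs };
  }
  // Performer vetoes the pending cut, e.g. bound to the V key.
  veto() {
    this.pending = null;
  }
  // Call periodically; returns the camera id to cut to once the window has
  // passed without a veto, otherwise null.
  tick(nowMs) {
    if (this.pending && nowMs - this.pending.at >= VETO_WINDOW_MS) {
      const { cameraId } = this.pending;
      this.pending = null;
      return cameraId;
    }
    return null;
  }
}
```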
Recording & Production Pipeline
### During Performance
1. Press PLAY (or S key) → ToneHouse starts, webcam tracker drives audio
2. Director auto-switches between phone cameras in the hero view
3. Press REC ALL → phones start 4K local recording, browser captures audio
### Post-Production (Option C: Production Quality)
1. Transfer 4K phone videos to Mac4 (AirDrop/Syncthing)
2. Transfer Mac1 audio (.webm) to Mac4
3. Export director cut log as EDL (timestamps of each camera switch)
4. Premiere Pro on Mac4: import all angles, apply EDL for auto-cuts
5. TD on Mac2 renders production visualizer from motion param timeline
6. Composite TD output over Premiere edit (Screen blend mode)
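
For step 3, the cut log first has to become a list of which camera is live over which time range. The sketch below does only that conversion, using the `{ t, from, to }` shape of `director_cuts` from the session schema. The function name and the segment shape are hypothetical, and the output is not a real EDL format such as CMX3600; a further export step would still be needed for Premiere Pro.

```javascript
// Turn a director cut log into per-camera segments for an edit list.
// `cuts` uses the { t, from, to } shape from the session schema's director_cuts.
// ASSUMPTION: the segment shape is hypothetical and is not a real EDL format.
function cutsToSegments(cuts, durationS) {
  const sorted = [...cuts].sort((a, b) => a.t - b.t);
  if (sorted.length === 0) return [];
  // The camera that was live before the first cut covers 0..first cut time.
  const segments = [{ camera: sorted[0].from, start: 0, end: sorted[0].t }];
  sorted.forEach((cut, i) => {
    // Each cut's target camera stays live until the next cut or the session end.
    const end = i + 1 < sorted.length ? sorted[i + 1].t : durationS;
    segments.push({ camera: cut.to, start: cut.t, end });
  });
  return segments;
}
```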
### Quick Reels (Screen Recording)
- Screen record the dashboard directly (macOS Cmd+Shift+5)
- Captures: director cuts + orb visualizer + audio in one take
- FIT/CROP toggle in header controls letterboxing
Infrastructure
| Machine | Role | Key Services |
|---|---|---|
| Mac1 | Dashboard host, audio engine, orchestrator | multicam-server :9404, webcam tracker, ToneHouse |
| Mac2 | TouchDesigner visualizer (future), motion capture | TD :9510, Mocopi receiver :9501 |
| Mac3 | Storage, creative workstation | Content, Serenity |
| Mac4 | Compute, Premiere Pro editing | Adobe CC, Ollama, exo master |
| Mac5 | ML training | MLX, LoRA fine-tuning, movement vocab training |
| Phones (5) | Camera-only nodes | MJPEG :8081, pose data, 4K local recording |
### Network
- All devices on Tailscale mesh VPN
- MJPEG streams: phone → Mac1 over WiFi (1-3 Mbps each)
- Director WebSocket: Mac1 :9404 ↔ dashboard
- Pose data: phone → Mac1 via DirectorHubClient WebSocket
- 5GHz WiFi recommended for 5+ camera streams
External Tools
| Tool | Status | Purpose |
|---|---|---|
| OBS | Installed (`/Applications/OBS.app`) | Future live streaming (TikTok, etc) |
| TouchDesigner | Mac2 (CC Echelon World) | Production visualizer, replaces Canvas 2D orb |
| Premiere Pro | Mac4 | Multi-angle edit with EDL auto-cuts |
| Mocopi | Available (Sony, 6 sensors) | 27-bone skeleton capture for TD and LIM-RPS |
| Strudel | Bundled in JS | Pattern language for advanced audio transformations |
Key Files Reference
| File | Location | What It Does |
|---|---|---|
| `hand-tracker-simple.js` | `multicam-server/static/js/` | THE audio driver. Pixel-diff on 160x120, 10 body zones |
| `param-mapper.js` | `multicam-server/static/js/` | Maps zone motion to brightness/density/energy with velocity + snap-back |
| `tone-house.js` | `multicam-server/static/js/` | Tone.js: kick, clap, hat, bass, pad, genre patterns, filter/reverb |
| `visualizer.js` | `multicam-server/static/js/` | Canvas 2D orb + radial bars + waveform + particles + beat detection |
| `dashboard.html` | `multicam-server/src/` | Full dashboard UI, embedded via `include_str!` |
| `main.rs` | `multicam-server/src/` | Rust director engine, HTTP/WS, Murch scoring, BreathPhase |
| `lim_rps/` | `computational-choreography/crates/runtime/core-rs/src/` | Latent frame encoder for future integration |
| `movement-vocab.js` | `js/` | Movement vocabulary classifier (to be trained per-session) |
| `strudel-engine.js` | `multicam-server/static/js/` | Strudel pattern engine (future primary audio) |
| `session-exporter.js` | `js/` | Exports motion + audio data to Supabase for training |
Session Data Schema (for nightly training)
Each session captures:
{
  "session_id": "uuid",
  "timestamp": "ISO",
  "duration_s": 300,
  "genre": "house",
  "bpm": 124,
  "frames": [
    {
      "t": 0.016,
      "brightness": 0.42,
      "density": 0.18,
      "energy": 0.65,
      "zones": {
        "rightArm": 0.31,
        "leftArm": 0.12,
        "head": 0.08,
        "center": 0.22,
        "legs": 0.05,
        "full": 0.19
      },
      "gesture": "raise_arms",
      "gesture_intensity": 0.7,
      "audio_state": {
        "kick_vol": -4,
        "pad_vol": -10,
        "hat_vol": -14,
        "filter_hz": 4200
      }
    }
  ],
  "director_cuts": [
    { "t": 12.5, "from": "iphone-14-pro", "to": "ipad-a16", "confidence": 0.82, "reason": "face-score-drop" }
  ]
}

This data feeds:
1. Movement vocabulary classifier training (Mac5, nightly)
2. Strudel transformation mapping refinement
3. Director scoring calibration
4. LIM-RPS latent encoder training (sessions 11+)
5. TD replay for production visualizer renders
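
Before session data reaches any of these consumers, a light sanity check on each frame is cheap insurance. The sketch below is an assumption about what such a check might look like: the function names are hypothetical, and the 0..1 range for the three main params is inferred from the example frame rather than stated as a formal schema.

```javascript
// Check that a frame has the core fields shown in the schema.
// Returns a list of field names that failed; an empty list means the frame passed.
// ASSUMPTION: the 0..1 range is inferred from the example, not a formal spec.
function validateFrame(frame) {
  const errors = [];
  if (typeof frame.t !== "number" || frame.t < 0) errors.push("t");
  for (const key of ["brightness", "density", "energy"]) {
    const v = frame[key];
    if (typeof v !== "number" || v < 0 || v > 1) errors.push(key);
  }
  if (typeof frame.zones !== "object" || frame.zones === null) errors.push("zones");
  return errors;
}

// Mean energy over a session, a simple summary a training job might start from.
function meanEnergy(session) {
  if (session.frames.length === 0) return 0;
  const total = session.frames.reduce((sum, f) => sum + f.energy, 0);
  return total / session.frames.length;
}
```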
Source Anchor
MotionMix/ARCHITECTURE.md