Grand Diomande Research · Full HTML Reader

MotionMix Architecture

MotionMix is a motion-to-music performance system. A webcam tracks body movement via pixel differencing, maps motion zones to audio parameters, and drives a real-time music engine. Multiple phone cameras provide visual coverage with AI director auto-switching. The system learns progressively: each training session captures new motion vocabulary, unlocking new sound transformations.

Language as Infrastructure architecture technical paper candidate score 54 .md

Full Public Reader

MotionMix Architecture

> Your body is the phrase. Motion discovers sound. Every session expands what's possible.

System Overview

MotionMix is a motion-to-music performance system. A webcam tracks body movement via pixel differencing, maps motion zones to audio parameters, and drives a real-time music engine. Multiple phone cameras provide visual coverage with AI director auto-switching. The system learns progressively: each training session captures new motion vocabulary, unlocking new sound transformations.

                    ┌──────────────────────────────────────┐
                    │        DIRECTOR DASHBOARD :9404       │
                    │  ┌──────────┐  ┌──────────────────┐  │
                    │  │  HERO:   │  │  ORB VISUALIZER  │  │
                    │  │  Active  │  │  Canvas 2D /     │  │
                    │  │  Phone   │  │  future: TD      │  │
                    │  │  Camera  │  │  (audio-reactive) │  │
                    │  └──────────┘  └──────────────────┘  │
                    │  [params strip: BRIGHT|DENSITY|ENERGY]│
                    │  [phone thumbnails with director cuts]│
                    └──────────────┬───────────────────────┘
                                   │
                    ┌──────────────┴───────────────────────┐
                    │          AUDIO ENGINE                 │
                    │                                       │
         Webcam ──► │  SimpleHandTracker (pixel-diff 60fps) │
         (hidden)   │       │                               │
                    │       ▼                               │
                    │  ParamMapper (brightness/density/energy)│
                    │       │                               │
                    │       ▼                               │
                    │  ToneHouse (Tone.js: kick/clap/hat/   │
                    │    bass/pad + filter + reverb)         │
                    │       │                               │
                    │       ▼                               │
                    │  Browser Audio Output (speakers)      │
                    └───────────────────────────────────────┘
                                   │
                    ┌──────────────┴───────────────────────┐
                    │      DIRECTOR ENGINE (Rust)           │
                    │                                       │
    Phone 1 ──────► │  Murch Scoring (energy/motion/face/   │
    Phone 2 ──────► │    staleness) every 500ms             │
    Phone 3 ──────► │       │                               │
    Phone 4 ──────► │       ▼                               │
    Phone 5 ──────► │  Auto-cut with BreathPhase pacing     │
                    │  (inhale/hold/exhale cycle)            │
                    │       │                               │
                    │       ▼                               │
                    │  WebSocket → Dashboard hero switches   │
                    └───────────────────────────────────────┘

Core Principle: Separation of Concerns

ConcernOwnerNetwork
Audio generationMac1 webcam (pixel-diff tracker)Local only
Visual coverage5 phone cameras (MJPEG streams)WiFi to Mac1
Director switchingRust multicam-server scoringWebSocket
Recording (video)Each phone records 4K locallyNone during session
Recording (audio)Browser MediaRecorder on Mac1None
Visualizer (live)Canvas 2D orb in dashboardLocal only
Visualizer (production)TouchDesigner on Mac2OSC/bridged

Phones NEVER produce audio. They are camera-only nodes.

Current LUME Extension: Body Truth Layer

The LUME/DYK + K11 Rekordbox architecture adds one missing authority layer above
the raw camera and motion feeds: `LumeBodyTruth`.

Current live split:

MachineResponsibility
Mac4Unity DYK visuals, Femto Mega depth/RGB, first Sony/mocopi integration target
K11Rekordbox, AirDeck gesture bridge, pose viewer, keyboard/MIDI command safety gate
MotionMix hubShared fused body bus and future `LumeBodyTruth` publisher
iPhonesSupplemental SAN/camera/motion evidence and training data

`LumeBodyTruth` is specified in
[`LUME-BODY-TRUTH-CONTRACT.md`](./LUME-BODY-TRUTH-CONTRACT.md). It wraps the
current `fused_body_state`/`/skeleton-3d` path with stricter performer presence,
confidence, stillness, burst, tracking-loss, and gesture-intent semantics.

The invariant: Unity may visualize intent, but K11 remains the final Rekordbox
command gate.

File Map

Dashboard (served via `include_str!` from Rust binary)

multicam-server/
  src/
    main.rs              # Rust: HTTP/WS server, director engine, Murch scoring
    dashboard.html       # The full dashboard UI (HTML + CSS + inline JS)
  static/js/
    hand-tracker-simple.js   # Pixel-diff body zone tracker (THE audio driver)
    param-mapper.js          # Motion → audio parameter mapping
    tone-house.js            # Tone.js music engine (6 instruments, genre patterns)
    visualizer.js            # Canvas 2D orb + radial bars + waveform ring + particles
    strudel-engine.js        # Strudel pattern engine (secondary, future primary)
    strudel-motion-bundle.js # Strudel runtime bundle
    chest-flex.js            # TF.js chest flex detector (one-shot triggers)

Web App (Vercel: motionmix.vercel.app)

js/
  app.js                 # Main app entry, wires everything together
  hand-tracker-simple.js # Same pixel-diff tracker (canonical copy)
  param-mapper.js        # Same mapper
  tone-house.js          # Same engine
  visualizer.js          # Same orb
  movement-vocab.js      # Movement vocabulary classifier (future: per-session learning)
  session-exporter.js    # Exports session data to Supabase
  director-client.js     # Connects to multicam director for phone integration
  mediapipe-tracker.js   # MediaPipe Hands (legacy, replaced by pixel-diff)
  gesture-map.js         # Gesture → Strudel transformation mapping
  lyria-engine.js        # Google Lyria integration (experimental)
  gemini-live.js         # Gemini Live voice control (experimental)
  gemini-vision.js       # Gemini Vision scene analysis (experimental)
  sensors.js             # Phone accelerometer/gyroscope bridge
  voice-control.js       # Voice commands for genre/BPM changes

iOS App (MotionMixApp)

Desktop/MotionMixApp/
  MotionMixApp/
    Services/
      CameraService.swift         # Camera capture + MJPEG streaming
      DirectorHubClient.swift     # WebSocket to multicam-server
      AudioEngine.swift           # DISABLED (camera-only mode)
      MusicPlayerService.swift    # DISABLED
      RecordingService.swift      # 4K local recording
      StrudelWebEngine.swift      # DISABLED
    Views/
      SettingsView.swift          # Camera settings, resolution, stream toggle

Computational Choreography (Rust, future integration)

Desktop/computational-choreography/
  crates/
    runtime/core-rs/src/
      lim_rps/
        config.rs        # LIM-RPS configuration
        dynamics.rs      # Latent dynamics extraction
        latent_state.rs  # Latent state representation
        operator.rs      # CrossModalOperator, LimRpsOperator
        solver.rs        # Latent trajectory solver
        spectral.rs      # Spectral analysis of latent motion
    motion/              # Motion capture ingestion
    semantic/            # Semantic motion labeling
    echelon/             # EchelonDiT model integration
    retrieval/           # Motion retrieval from library
  paper/paper.md         # Research paper with LIM-RPS formalization

Audio Pipeline (Current: Session 1)

Webcam (640x480) ──► Downscale to 160x120
                           │
                     Pixel Diff (frame N vs N-1)
                           │
                     10 Body Zones:
                     ┌─────────────────────────────────┐
                     │  rightArm   head    leftArm     │
                     │  rightBody  center  leftBody    │
                     │  centerL    centerR             │
                     │         legs                    │
                     │         full (entire frame)     │
                     └─────────────────────────────────┘
                           │
                     Zone Motion → Params:
                       rightArm motion  → brightness (pad volume + filter)
                       leftArm motion   → density (hat + clap volume)
                       full motion      → energy (kick volume, overall)
                       legs + center    → bouncing (bass volume)
                       head motion      → energy boost
                       core motion      → open hat + reverb
                           │
                     ParamMapper:
                       Presence detection (5-frame absent → wind down)
                       Velocity boost (sudden moves = punchy changes)
                       Snap-back (motion stops → fast decay to baseline)
                           │
                     ToneHouse:
                       Kick:  -8dB baseline → 0dB at full energy
                       Pad:   -22dB baseline → -2dB at full brightness
                       Hats:  silent → -4dB at full density
                       Clap:  silent → -4dB at full density
                       Bass:  silent → 0dB at full bounce
                       OHat:  silent → -10dB at full core motion

### Thresholds (tuned for seated/desk distance)
| Parameter | Value | Purpose |
|-----------|-------|---------|
| Pixel diff threshold | 28 | Per-pixel RGB diff to count as motion |
| Noise floor | 0.008 | Full-frame motion below this = zero all params |
| Presence threshold | 0.03 | Per-zone motion to count as "present" |
| Hard floor | 0.01 | Smoothed values below this snap to zero |
| Arm multiplier | 5x | Boosted for desk-distance arm motion |
| Energy multiplier | 3x full + 1x head | Boosted for seated upper-body use |

### Gesture Detection
| Gesture | Trigger | Audio Effect |
|---------|---------|-------------|
| raise_arms | Head zone + arm present, energy > 0.3 | Pad + hat open up, filter sweep to 6kHz |
| bounce | Rhythmic peaks in legs+center | Bass volume builds |
| punch | Body energy > 0.6 | Kick emphasis, 200ms boost |
| freeze | Body energy < 0.03 | Drums mute, pad-only breakdown, 4s recovery |

Progressive Learning Pipeline (Sessions 1-15)

### Session 1 (Current)
- Static mapping: zone motion → volume/filter
- 6 genres: house, techno, hiphop, afrobeat, jazz, ambient
- Movement detection: basic pixel-diff zones
- Audio: ToneHouse (Tone.js, 6 voices)
- Data captured: raw motion params per frame (brightness, density, energy + zone data)

### Sessions 2-5: Movement Vocabulary Discovery
- `movement-vocab.js` classifier starts learning YOUR specific gestures
- Each session captures motion trajectories tagged with audio outcomes
- Overnight training: movement classifier on Mac5 (MLX LoRA on gesture data)
- New gestures get labeled: "chest pump sequence", "wave flow", "shoulder roll"
- Each recognized gesture maps to a specific Strudel pattern transformation

### Sessions 6-10: Strudel Pattern Transformations
- Body goes from volume control → pattern EDITING
- Strudel's pattern language unlocks:

MotionStrudel TransformationAudio Result
Arm wave`rev`Reverse the current phrase
Chest pump`fast(2)`Double-time the beat
Slow flow`slow(2) . jux(rev)`Half-time + stereo split
Shoulder roll`striate(8)`Spectral slicing
Full body surge`superimpose(fast(2))`Layer double-time over original
Freeze + hold`chop(16)`Granular slice the current phrase
Spin/turnGenre cycleSwitch entire pattern set
  • Each new motion discovered = new transformation slot unlocked
  • The system PROPOSES sounds when you enter unexplored motion regions

### Nightly Training Cron (`server/cc_training_cron.py`)
- LaunchAgent `com.cc.nightly-training` fires at 2 AM
- Tiered training based on accumulated session count:
- >= 5 sessions: retrain chest flex model (local, 10 epochs, 30 min)
- >= 15 sessions: start CC-MotionGen training (SSH to Mac5, 50 epochs)
- >= 30 sessions: start EchelonDiT training (SSH to Mac5, 100 epochs, 2h)
- Always runs CLIP embedding pipeline on new sessions first
- State tracked in `[home-path]`
- Session data in `Desktop/MotionMix/sessions/`, models to `Desktop/MotionMix/models/`

### Sessions 11-15: LIM-RPS Latent Frame Integration
LIM-RPS = Lipschitz-constrained Implicit Map for Recursive Polymodal Synthesis
- A unified equilibrium solver for cross-modal latent fusion
- Finds equilibrium latents z* where motion, audio, and other modalities converge through spectrally-normalized fixed-point iteration
- Motion goes through the Computational Choreography latent encoder
- LIM-RPS extracts: norm, velocity norm, acceleration norm, predictability
- Latent trajectory QUALITY determines musical COMPLEXITY
- High predictability = simple, resolved phrases
- Low predictability (novel motion) = complex, exploratory patterns
- Commitment scalar = how strongly the phrase resolves
- `CrossModalOperator` maps latent motion → Strudel parameters in latent space
- The system discovers sound regions you haven't heard before
- Full body = phrase editor, not phrase trigger

### The Key Insight
> "Our body is the phrase and we're redefining phrases. Training a song is perhaps just training a different phrase itself, because we are determining the structure."

Traditional: Verse → Chorus → Verse → Bridge
MotionMix: Your body determines what a "phrase" IS. A chest pump sequence = a phrase. A wave flow = a phrase. The TRANSITION between them = the musical transition.

Director Engine (Rust)

### Scoring Algorithm (Murch-Inspired)
Every 500ms, all connected cameras are scored:

FactorWeightSource
Energy35
Motion25
Staleness25
Face score15

When you walk toward a phone camera, its face_score and energy rise. When you walk away, they drop. The director cuts to whichever phone has the best composite score, subject to BreathPhase pacing (inhale = accept cuts freely, hold = lock current, exhale = only accept strong cuts).

### Ghost Mode
Director proposes a cut, waits 2 seconds. If you don't veto (press V), it executes. Gives you manual override power during a performance.

Recording & Production Pipeline

### During Performance
1. Press PLAY (or S key) → ToneHouse starts, webcam tracker drives audio
2. Director auto-switches between phone cameras in the hero view
3. Press REC ALL → phones start 4K local recording, browser captures audio

### Post-Production (Option C: Production Quality)
1. Transfer 4K phone videos to Mac4 (AirDrop/Syncthing)
2. Transfer Mac1 audio (.webm) to Mac4
3. Export director cut log as EDL (timestamps of each camera switch)
4. Premiere Pro on Mac4: import all angles, apply EDL for auto-cuts
5. TD on Mac2 renders production visualizer from motion param timeline
6. Composite TD output over Premiere edit (Screen blend mode)

### Quick Reels (Screen Recording)
- Screen record the dashboard directly (macOS Cmd+Shift+5)
- Captures: director cuts + orb visualizer + audio in one take
- FIT/CROP toggle in header controls letterboxing

Infrastructure

MachineRoleKey Services
Mac1Dashboard host, audio engine, orchestratormulticam-server :9404, webcam tracker, ToneHouse
Mac2TouchDesigner visualizer (future), motion captureTD :9510, Mocopi receiver :9501
Mac3Storage, creative workstationContent, Serenity
Mac4Compute, Premiere Pro editingAdobe CC, Ollama, exo master
Mac5ML trainingMLX, LoRA fine-tuning, movement vocab training
Phones (5)Camera-only nodesMJPEG :8081, pose data, 4K local recording

### Network
- All devices on Tailscale mesh VPN
- MJPEG streams: phone → Mac1 over WiFi (1-3 Mbps each)
- Director WebSocket: Mac1 :9404 ↔ dashboard
- Pose data: phone → Mac1 via DirectorHubClient WebSocket
- 5GHz WiFi recommended for 5+ camera streams

External Tools

ToolStatusPurpose
OBSInstalled (`/Applications/OBS.app`)Future live streaming (TikTok, etc)
TouchDesignerMac2 (CC Echelon World)Production visualizer, replaces Canvas 2D orb
Premiere ProMac4Multi-angle edit with EDL auto-cuts
MocopiAvailable (Sony, 6 sensors)27-bone skeleton capture for TD and LIM-RPS
StrudelBundled in JSPattern language for advanced audio transformations

Key Files Reference

FileLocationWhat It Does
`hand-tracker-simple.js``multicam-server/static/js/`THE audio driver. Pixel-diff on 160x120, 10 body zones
`param-mapper.js``multicam-server/static/js/`Maps zone motion to brightness/density/energy with velocity + snap-back
`tone-house.js``multicam-server/static/js/`Tone.js: kick, clap, hat, bass, pad, genre patterns, filter/reverb
`visualizer.js``multicam-server/static/js/`Canvas 2D orb + radial bars + waveform + particles + beat detection
`dashboard.html``multicam-server/src/`Full dashboard UI, embedded via `include_str!`
`main.rs``multicam-server/src/`Rust director engine, HTTP/WS, Murch scoring, BreathPhase
`lim_rps/``computational-choreography/crates/runtime/core-rs/src/`Latent frame encoder for future integration
`movement-vocab.js``js/`Movement vocabulary classifier (to be trained per-session)
`strudel-engine.js``multicam-server/static/js/`Strudel pattern engine (future primary audio)
`session-exporter.js``js/`Exports motion + audio data to Supabase for training

Session Data Schema (for nightly training)

Each session captures:

json
{
  "session_id": "uuid",
  "timestamp": "ISO",
  "duration_s": 300,
  "genre": "house",
  "bpm": 124,
  "frames": [
    {
      "t": 0.016,
      "brightness": 0.42,
      "density": 0.18,
      "energy": 0.65,
      "zones": {
        "rightArm": 0.31,
        "leftArm": 0.12,
        "head": 0.08,
        "center": 0.22,
        "legs": 0.05,
        "full": 0.19
      },
      "gesture": "raise_arms",
      "gesture_intensity": 0.7,
      "audio_state": {
        "kick_vol": -4,
        "pad_vol": -10,
        "hat_vol": -14,
        "filter_hz": 4200
      }
    }
  ],
  "director_cuts": [
    { "t": 12.5, "from": "iphone-14-pro", "to": "ipad-a16", "confidence": 0.82, "reason": "face-score-drop" }
  ]
}

This data feeds:
1. Movement vocabulary classifier training (Mac5, nightly)
2. Strudel transformation mapping refinement
3. Director scoring calibration
4. LIM-RPS latent encoder training (sessions 11+)
5. TD replay for production visualizer renders

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

MotionMix/ARCHITECTURE.md

Detected Structure

Method · Evaluation · Code Anchors · Architecture