Grand Diomande Research · Full HTML Reader

MotionMix Architecture

MotionMix is a motion-to-music performance system. A webcam feed is analyzed by pixel differencing, motion in each body zone is mapped to audio parameters, and those parameters drive a real-time music engine. Multiple phone cameras provide visual coverage, with an AI director switching between them automatically. The system learns progressively: each training session captures new motion vocabulary, which unlocks new sound transformations.

> Your body is the phrase. Motion discovers sound. Every session expands what's possible.

System Overview

                    ┌──────────────────────────────────────┐
                    │        DIRECTOR DASHBOARD :9404       │
                    │  ┌──────────┐  ┌──────────────────┐  │
                    │  │  HERO:   │  │  ORB VISUALIZER  │  │
                    │  │  Active  │  │  Canvas 2D /     │  │
                    │  │  Phone   │  │  future: TD      │  │
                    │  │  Camera  │  │  (audio-reactive) │  │
                    │  └──────────┘  └──────────────────┘  │
                    │  [params strip: BRIGHT|DENSITY|ENERGY]│
                    │  [phone thumbnails with director cuts]│
                    └──────────────┬───────────────────────┘
                                   │
                    ┌──────────────┴───────────────────────┐
                    │          AUDIO ENGINE                 │
                    │                                       │
         Webcam ──► │  SimpleHandTracker (pixel-diff 60fps) │
         (hidden)   │       │                               │
                    │       ▼                               │
                    │  ParamMapper (brightness/density/energy)│
                    │       │                               │
                    │       ▼                               │
                    │  ToneHouse (Tone.js: kick/clap/hat/   │
                    │    bass/pad + filter + reverb)         │
                    │       │                               │
                    │       ▼                               │
                    │  Browser Audio Output (speakers)      │
                    └───────────────────────────────────────┘
                                   │
                    ┌──────────────┴───────────────────────┐
                    │      DIRECTOR ENGINE (Rust)           │
                    │                                       │
    Phone 1 ──────► │  Murch Scoring (energy/motion/face/   │
    Phone 2 ──────► │    staleness) every 500ms             │
    Phone 3 ──────► │       │                               │
    Phone 4 ──────► │       ▼                               │
    Phone 5 ──────► │  Auto-cut with BreathPhase pacing     │
                    │  (inhale/hold/exhale cycle)            │
                    │       │                               │
                    │       ▼                               │
                    │  WebSocket → Dashboard hero switches   │
                    └───────────────────────────────────────┘

Core Principle: Separation of Concerns

| Concern | Owner | Network |
|---------|-------|---------|
| Audio generation | Mac1 webcam (pixel-diff tracker) | Local only |
| Visual coverage | 5 phone cameras (MJPEG streams) | WiFi to Mac1 |
| Director switching | Rust multicam-server scoring | WebSocket |
| Recording (video) | Each phone records 4K locally | None during session |
| Recording (audio) | Browser MediaRecorder on Mac1 | None |
| Visualizer (live) | Canvas 2D orb in dashboard | Local only |
| Visualizer (production) | TouchDesigner on Mac2 | OSC/bridged |

Phones NEVER produce audio. They are camera-only nodes.

Current LUME Extension: Body Truth Layer

The LUME/DYK + K11 Rekordbox architecture adds one missing authority layer above
the raw camera and motion feeds: `LumeBodyTruth`.

Current live split:

| Machine | Responsibility |
|---------|----------------|
| Mac4 | Unity DYK visuals, Femto Mega depth/RGB, first Sony/mocopi integration target |
| K11 | Rekordbox, AirDeck gesture bridge, pose viewer, keyboard/MIDI command safety gate |
| MotionMix hub | Shared fused body bus and future `LumeBodyTruth` publisher |
| iPhones | Supplemental SAN/camera/motion evidence and training data |

`LumeBodyTruth` is specified in
[`LUME-BODY-TRUTH-CONTRACT.md`](./LUME-BODY-TRUTH-CONTRACT.md). It wraps the
current `fused_body_state`/`/skeleton-3d` path with stricter performer presence,
confidence, stillness, burst, tracking-loss, and gesture-intent semantics.

The invariant: Unity may visualize intent, but K11 remains the final Rekordbox
command gate.

File Map

Dashboard (served via `include_str!` from Rust binary)

multicam-server/
  src/
    main.rs              # Rust: HTTP/WS server, director engine, Murch scoring
    dashboard.html       # The full dashboard UI (HTML + CSS + inline JS)
  static/js/
    hand-tracker-simple.js   # Pixel-diff body zone tracker (THE audio driver)
    param-mapper.js          # Motion → audio parameter mapping
    tone-house.js            # Tone.js music engine (6 instruments, genre patterns)
    visualizer.js            # Canvas 2D orb + radial bars + waveform ring + particles
    strudel-engine.js        # Strudel pattern engine (secondary, future primary)
    strudel-motion-bundle.js # Strudel runtime bundle
    chest-flex.js            # TF.js chest flex detector (one-shot triggers)

Web App (Vercel: motionmix.vercel.app)

js/
  app.js                 # Main app entry, wires everything together
  hand-tracker-simple.js # Same pixel-diff tracker (canonical copy)
  param-mapper.js        # Same mapper
  tone-house.js          # Same engine
  visualizer.js          # Same orb
  movement-vocab.js      # Movement vocabulary classifier (future: per-session learning)
  session-exporter.js    # Exports session data to Supabase
  director-client.js     # Connects to multicam director for phone integration
  mediapipe-tracker.js   # MediaPipe Hands (legacy, replaced by pixel-diff)
  gesture-map.js         # Gesture → Strudel transformation mapping
  lyria-engine.js        # Google Lyria integration (experimental)
  gemini-live.js         # Gemini Live voice control (experimental)
  gemini-vision.js       # Gemini Vision scene analysis (experimental)
  sensors.js             # Phone accelerometer/gyroscope bridge
  voice-control.js       # Voice commands for genre/BPM changes

iOS App (MotionMixApp)

Desktop/MotionMixApp/
  MotionMixApp/
    Services/
      CameraService.swift         # Camera capture + MJPEG streaming
      DirectorHubClient.swift     # WebSocket to multicam-server
      AudioEngine.swift           # DISABLED (camera-only mode)
      MusicPlayerService.swift    # DISABLED
      RecordingService.swift      # 4K local recording
      StrudelWebEngine.swift      # DISABLED
    Views/
      SettingsView.swift          # Camera settings, resolution, stream toggle

Computational Choreography (Rust, future integration)

Desktop/computational-choreography/
  crates/
    runtime/core-rs/src/
      lim_rps/
        config.rs        # LIM-RPS configuration
        dynamics.rs      # Latent dynamics extraction
        latent_state.rs  # Latent state representation
        operator.rs      # CrossModalOperator, LimRpsOperator
        solver.rs        # Latent trajectory solver
        spectral.rs      # Spectral analysis of latent motion
    motion/              # Motion capture ingestion
    semantic/            # Semantic motion labeling
    echelon/             # EchelonDiT model integration
    retrieval/           # Motion retrieval from library
  paper/paper.md         # Research paper with LIM-RPS formalization

Audio Pipeline (Current: Session 1)

Webcam (640x480) ──► Downscale to 160x120
                           │
                     Pixel Diff (frame N vs N-1)
                           │
                     10 Body Zones:
                     ┌─────────────────────────────────┐
                     │  rightArm   head    leftArm     │
                     │  rightBody  center  leftBody    │
                     │  centerL    centerR             │
                     │         legs                    │
                     │         full (entire frame)     │
                     └─────────────────────────────────┘
                           │
                     Zone Motion → Params:
                       rightArm motion  → brightness (pad volume + filter)
                       leftArm motion   → density (hat + clap volume)
                       full motion      → energy (kick volume, overall)
                       legs + center    → bouncing (bass volume)
                       head motion      → energy boost
                       core motion      → open hat + reverb
                           │
                     ParamMapper:
                       Presence detection (5-frame absent → wind down)
                       Velocity boost (sudden moves = punchy changes)
                       Snap-back (motion stops → fast decay to baseline)
                           │
                     ToneHouse:
                       Kick:  -8dB baseline → 0dB at full energy
                       Pad:   -22dB baseline → -2dB at full brightness
                       Hats:  silent → -4dB at full density
                       Clap:  silent → -4dB at full density
                       Bass:  silent → 0dB at full bounce
                       OHat:  silent → -10dB at full core motion
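
The pixel-diff stage above could be sketched roughly as follows. This is a minimal sketch, not the code in `hand-tracker-simple.js`. It assumes frames arrive as RGBA byte arrays (the layout of a canvas `ImageData.data`), that a pixel counts as moving when the mean absolute difference of its R, G and B channels exceeds the pixel diff threshold of 28, and that zones are axis-aligned rectangles. `zoneMotion` and the zone rectangles are hypothetical names.

```javascript
// Assumption: frames are RGBA byte arrays laid out like canvas ImageData.data.
const WIDTH = 160;
const HEIGHT = 120;
const PIXEL_DIFF_THRESHOLD = 28;

// Returns the fraction (0..1) of pixels inside `zone` that changed between frames.
function zoneMotion(prev, curr, zone, width = WIDTH) {
  let moved = 0;
  for (let y = zone.y; y < zone.y + zone.h; y++) {
    for (let x = zone.x; x < zone.x + zone.w; x++) {
      const i = (y * width + x) * 4; // 4 bytes per pixel: R, G, B, A
      const diff =
        (Math.abs(curr[i] - prev[i]) +
          Math.abs(curr[i + 1] - prev[i + 1]) +
          Math.abs(curr[i + 2] - prev[i + 2])) / 3;
      if (diff > PIXEL_DIFF_THRESHOLD) moved++;
    }
  }
  return moved / (zone.w * zone.h);
}

// Usage: two synthetic frames where only the top-left 40x40 block changes.
const prev = new Uint8ClampedArray(WIDTH * HEIGHT * 4);
const curr = new Uint8ClampedArray(WIDTH * HEIGHT * 4);
for (let y = 0; y < 40; y++) {
  for (let x = 0; x < 40; x++) {
    const i = (y * WIDTH + x) * 4;
    curr[i] = curr[i + 1] = curr[i + 2] = 200;
  }
}
const full = { x: 0, y: 0, w: WIDTH, h: HEIGHT };
const topLeft = { x: 0, y: 0, w: 40, h: 40 };
console.log(zoneMotion(prev, curr, topLeft)); // 1
console.log(zoneMotion(prev, curr, full));    // 1600 / 19200, about 0.083
```

Returning a fraction rather than a raw pixel count keeps zones of different sizes comparable, which is what lets a single presence threshold apply to every zone.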

### Thresholds (tuned for seated/desk distance)
| Parameter | Value | Purpose |
|-----------|-------|---------|
| Pixel diff threshold | 28 | Per-pixel RGB diff to count as motion |
| Noise floor | 0.008 | Full-frame motion below this = zero all params |
| Presence threshold | 0.03 | Per-zone motion to count as "present" |
| Hard floor | 0.01 | Smoothed values below this snap to zero |
| Arm multiplier | 5x | Boosted for desk-distance arm motion |
| Energy multiplier | 3x full + 1x head | Boosted for seated upper-body use |
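
To make the interaction of these thresholds concrete, here is a hedged sketch of one mapping step. The constants come from the table. The exponential smoothing and its `alpha` factor are assumptions (the actual smoothing in `param-mapper.js` is not documented here), and `mapParams` is a hypothetical name. Velocity boost and snap-back are left out.

```javascript
const NOISE_FLOOR = 0.008;  // full-frame motion below this zeroes all params
const HARD_FLOOR = 0.01;    // smoothed values below this snap to zero
const ARM_MULTIPLIER = 5;   // desk-distance arm boost
const clamp01 = (v) => Math.min(1, Math.max(0, v));

// Assumption: simple exponential smoothing toward the target value.
function mapParams(zones, previous, alpha = 0.5) {
  const quiet = zones.full < NOISE_FLOOR;
  const target = quiet
    ? { brightness: 0, density: 0, energy: 0 }
    : {
        brightness: clamp01(zones.rightArm * ARM_MULTIPLIER),
        density: clamp01(zones.leftArm * ARM_MULTIPLIER),
        energy: clamp01(zones.full * 3 + zones.head * 1), // 3x full + 1x head
      };
  const out = {};
  for (const key of Object.keys(target)) {
    const smoothed = previous[key] + alpha * (target[key] - previous[key]);
    out[key] = smoothed < HARD_FLOOR ? 0 : smoothed;
  }
  return out;
}

// Usage: one step from rest with moderate right-arm motion.
const rest = { brightness: 0, density: 0, energy: 0 };
const active = mapParams({ rightArm: 0.1, leftArm: 0.02, head: 0.05, full: 0.1 }, rest);
console.log(active);
```

The noise floor acts on the raw full-frame value before mapping, while the hard floor acts on the smoothed output, so a decaying parameter reaches exactly zero instead of approaching it indefinitely.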

### Gesture Detection
| Gesture | Trigger | Audio Effect |
|---------|---------|-------------|
| raise_arms | Head zone + arm present, energy > 0.3 | Pad + hat open up, filter sweep to 6kHz |
| bounce | Rhythmic peaks in legs+center | Bass volume builds |
| punch | Body energy > 0.6 | Kick emphasis, 200ms boost |
| freeze | Body energy < 0.03 | Drums mute, pad-only breakdown, 4s recovery |
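
The threshold-based rows of this table can be read as a small rule set. The sketch below is illustrative only. It assumes "body energy" is the mapped `energy` parameter, that "present" means a zone exceeds the presence threshold of 0.03, and that `freeze` is checked first and `raise_arms` before `punch` (the real precedence is not stated). `bounce` is omitted because it needs a history of peaks rather than a single frame. `detectGesture` is a hypothetical name.

```javascript
const PRESENCE_THRESHOLD = 0.03;

// Returns a gesture name from the table, or null when no rule fires.
function detectGesture(zones, energy) {
  if (energy < 0.03) return "freeze";
  const armPresent =
    zones.rightArm > PRESENCE_THRESHOLD || zones.leftArm > PRESENCE_THRESHOLD;
  const headPresent = zones.head > PRESENCE_THRESHOLD;
  if (headPresent && armPresent && energy > 0.3) return "raise_arms";
  if (energy > 0.6) return "punch";
  return null;
}

console.log(detectGesture({ rightArm: 0.2, leftArm: 0, head: 0.1 }, 0.4)); // raise_arms
console.log(detectGesture({ rightArm: 0, leftArm: 0, head: 0 }, 0.7));     // punch
console.log(detectGesture({ rightArm: 0, leftArm: 0, head: 0 }, 0.01));    // freeze
```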

Progressive Learning Pipeline (Sessions 1-15)

### Session 1 (Current)
- Static mapping: zone motion → volume/filter
- 6 genres: house, techno, hiphop, afrobeat, jazz, ambient
- Movement detection: basic pixel-diff zones
- Audio: ToneHouse (Tone.js, 6 voices)
- Data captured: raw motion params per frame (brightness, density, energy + zone data)
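
The static mapping amounts to interpolating each voice between a baseline level and a ceiling. Here is a minimal sketch using the kick, pad and hat figures from the ToneHouse stage of the audio pipeline. It assumes the interpolation is linear in decibels and that "silent" voices fade in from an arbitrary -60 dB floor; neither is specified in this document. `VOICE_RANGES` and `voiceDb` are hypothetical names.

```javascript
// dB ranges taken from the ToneHouse stage; baseDb null means "silent at zero".
const VOICE_RANGES = {
  kick: { param: "energy", baseDb: -8, maxDb: 0 },
  pad: { param: "brightness", baseDb: -22, maxDb: -2 },
  hat: { param: "density", baseDb: null, maxDb: -4 },
};
const SILENT_FLOOR_DB = -60; // assumption: where a silent voice starts fading in

function voiceDb(voice, params) {
  const { param, baseDb, maxDb } = VOICE_RANGES[voice];
  const amount = Math.min(1, Math.max(0, params[param]));
  if (baseDb === null) {
    if (amount === 0) return -Infinity; // fully muted
    return SILENT_FLOOR_DB + amount * (maxDb - SILENT_FLOOR_DB);
  }
  return baseDb + amount * (maxDb - baseDb);
}

const params = { energy: 0.5, brightness: 0.5, density: 0 };
console.log(voiceDb("kick", params)); // -4
console.log(voiceDb("pad", params));  // -12
console.log(voiceDb("hat", params));  // -Infinity
```

Values like these could be assigned to a Tone.js volume parameter, which is expressed in decibels.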

### Sessions 2-5: Movement Vocabulary Discovery
- `movement-vocab.js` classifier starts learning YOUR specific gestures
- Each session captures motion trajectories tagged with audio outcomes
- Overnight training: movement classifier on Mac5 (MLX LoRA on gesture data)
- New gestures get labeled: "chest pump sequence", "wave flow", "shoulder roll"
- Each recognized gesture maps to a specific Strudel pattern transformation

### Sessions 6-10: Strudel Pattern Transformations
- Body goes from volume control → pattern EDITING
- Strudel's pattern language unlocks:

| Motion | Strudel Transformation | Audio Result |
|--------|------------------------|--------------|
| Arm wave | `rev` | Reverse the current phrase |
| Chest pump | `fast(2)` | Double-time the beat |
| Slow flow | `slow(2) . jux(rev)` | Half-time + stereo split |
| Shoulder roll | `striate(8)` | Spectral slicing |
| Full body surge | `superimpose(fast(2))` | Layer double-time over original |
| Freeze + hold | `chop(16)` | Granular slice the current phrase |
| Spin/turn | Genre cycle | Switch entire pattern set |

Each newly discovered motion unlocks a new transformation slot. The system PROPOSES sounds when you enter unexplored motion regions.
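
One way to picture the slot-unlocking idea is a lookup that only answers for motions the performer has already discovered. This is a sketch under stated assumptions. The motion keys (`arm_wave`, `chest_pump`, and so on) are hypothetical identifiers for the rows of the table, the transformations are kept as plain strings rather than evaluated by Strudel, and `transformationFor` is not a function from `gesture-map.js`.

```javascript
// Transformation text copied from the table; keys are hypothetical motion labels.
const TRANSFORMATIONS = new Map([
  ["arm_wave", "rev"],
  ["chest_pump", "fast(2)"],
  ["slow_flow", "slow(2) . jux(rev)"],
  ["shoulder_roll", "striate(8)"],
  ["full_body_surge", "superimpose(fast(2))"],
  ["freeze_hold", "chop(16)"],
]);

// Returns the transformation for a motion, or null while its slot is locked.
function transformationFor(motion, discovered) {
  if (!discovered.has(motion)) return null;
  return TRANSFORMATIONS.get(motion) ?? null;
}

// Usage: after a session in which only the arm wave has been learned.
const discovered = new Set(["arm_wave"]);
console.log(transformationFor("arm_wave", discovered));   // rev
console.log(transformationFor("chest_pump", discovered)); // null
```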

### Nightly Training Cron (`server/cc_training_cron.py`)
- LaunchAgent `com.cc.nightly-training` fires at 2 AM
- Tiered training based on accumulated session count:
  - ≥ 5 sessions: retrain chest flex model (local, 10 epochs, 30 min)
  - ≥ 15 sessions: start CC-MotionGen training (SSH to Mac5, 50 epochs)
  - ≥ 30 sessions: start EchelonDiT training (SSH to Mac5, 100 epochs, 2h)
- Always runs CLIP embedding pipeline on new sessions first
- State tracked in `[home-path]`
- Session data in `Desktop/MotionMix/sessions/`, models to `Desktop/MotionMix/models/`
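
The tier selection can be sketched as a pure function of the session count. It is written in JavaScript only for consistency with the other examples here; the actual cron is a Python script. The sketch assumes the tiers are cumulative (at 30 sessions all three training jobs run), which the list does not state explicitly. The job names are hypothetical labels.

```javascript
// Returns the ordered list of jobs for a nightly run.
// Assumption: tiers are cumulative; job names are illustrative labels only.
function trainingJobs(sessionCount) {
  const jobs = ["clip_embeddings"]; // always runs first on new sessions
  if (sessionCount >= 5) jobs.push("chest_flex");     // local, 10 epochs
  if (sessionCount >= 15) jobs.push("cc_motiongen");  // Mac5, 50 epochs
  if (sessionCount >= 30) jobs.push("echelon_dit");   // Mac5, 100 epochs
  return jobs;
}

console.log(trainingJobs(3));  // [ 'clip_embeddings' ]
console.log(trainingJobs(16)); // [ 'clip_embeddings', 'chest_flex', 'cc_motiongen' ]
```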

### Sessions 11-15: LIM-RPS Latent Frame Integration
LIM-RPS = Lipschitz-constrained Implicit Map for Recursive Polymodal Synthesis
- A unified equilibrium solver for cross-modal latent fusion
- Finds equilibrium latents z* where motion, audio, and other modalities converge through spectrally-normalized fixed-point iteration
- Motion goes through the Computational Choreography latent encoder
- LIM-RPS extracts: norm, velocity norm, acceleration norm, predictability
- Latent trajectory QUALITY determines musical COMPLEXITY:
  - High predictability = simple, resolved phrases
  - Low predictability (novel motion) = complex, exploratory patterns
  - Commitment scalar = how strongly the phrase resolves
- `CrossModalOperator` maps latent motion → Strudel parameters in latent space
- The system discovers sound regions you haven't heard before
- Full body = phrase editor, not phrase trigger
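
Three of the extracted quantities (norm, velocity norm, acceleration norm) can be illustrated with finite differences over the last three latent samples. This is a generic numerical sketch, not the `lim_rps` Rust implementation. It assumes latents are plain numeric vectors sampled at a fixed interval `dt`. Predictability is omitted because its definition is not given here. `trajectoryStats` is a hypothetical name.

```javascript
const norm = (v) => Math.hypot(...v);
const sub = (a, b) => a.map((x, i) => x - b[i]);
const scale = (v, s) => v.map((x) => x * s);

// Backward finite differences over the three most recent latent vectors.
function trajectoryStats(latents, dt) {
  if (latents.length < 3) throw new Error("need at least 3 latent samples");
  const n = latents.length;
  const [z0, z1, z2] = [latents[n - 3], latents[n - 2], latents[n - 1]];
  const v0 = scale(sub(z1, z0), 1 / dt); // velocity one step back
  const v1 = scale(sub(z2, z1), 1 / dt); // latest velocity
  const a = scale(sub(v1, v0), 1 / dt);  // latest acceleration
  return { norm: norm(z2), velocityNorm: norm(v1), accelerationNorm: norm(a) };
}

// Usage: constant-velocity motion has zero acceleration.
console.log(trajectoryStats([[0, 0], [3, 4], [6, 8]], 1));
```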

### The Key Insight
> "Our body is the phrase and we're redefining phrases. Training a song is perhaps just training a different phrase itself, because we are determining the structure."

Traditional: Verse → Chorus → Verse → Bridge
MotionMix: Your body determines what a "phrase" IS. A chest pump sequence = a phrase. A wave flow = a phrase. The TRANSITION between them = the musical transition.

Director Engine (Rust)

### Scoring Algorithm (Murch-Inspired)
Every 500ms, all connected cameras are scored:

| Factor | Weight | Source |
|--------|--------|--------|
| Energy | 35 | |
| Motion | 25 | |
| Staleness | 25 | |
| Face score | 15 | |

When you walk toward a phone camera, its face_score and energy rise. When you walk away, they drop. The director cuts to whichever phone has the best composite score, subject to BreathPhase pacing (inhale = accept cuts freely, hold = lock current, exhale = only accept strong cuts).
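
The scoring and pacing described above can be sketched as a weighted sum plus a phase-dependent gate. The real engine is Rust (`main.rs`); this version is JavaScript only for consistency with the other examples. It assumes each factor is normalized to 0..1 (so the composite ranges over 0..100), that a higher staleness value favors cutting to that camera, and that a "strong" cut during exhale means beating the current camera by 20 points. That margin is purely illustrative. All function names are hypothetical.

```javascript
const WEIGHTS = { energy: 35, motion: 25, staleness: 25, face: 15 };

// Assumption: every factor is already normalized to the range 0..1.
function compositeScore(cam) {
  return (
    WEIGHTS.energy * cam.energy +
    WEIGHTS.motion * cam.motion +
    WEIGHTS.staleness * cam.staleness +
    WEIGHTS.face * cam.face
  );
}

// BreathPhase gate: inhale accepts any improvement, hold locks, exhale needs a margin.
function shouldCut(phase, currentScore, candidateScore, strongMargin = 20) {
  if (phase === "hold") return false;
  if (phase === "exhale") return candidateScore - currentScore >= strongMargin;
  return candidateScore > currentScore; // inhale
}

function pickCamera(cameras, currentId, phase) {
  const scored = cameras.map((cam) => ({ id: cam.id, score: compositeScore(cam) }));
  const current = scored.find((c) => c.id === currentId);
  const best = scored.reduce((a, b) => (b.score > a.score ? b : a));
  return shouldCut(phase, current.score, best.score) ? best.id : currentId;
}

// Usage: the performer walks toward phone "b".
const cameras = [
  { id: "a", energy: 0.2, motion: 0.2, staleness: 0, face: 0.1 },
  { id: "b", energy: 0.8, motion: 0.6, staleness: 0.5, face: 0.9 },
];
console.log(pickCamera(cameras, "a", "inhale")); // b
console.log(pickCamera(cameras, "a", "hold"));   // a
```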

### Ghost Mode
In Ghost Mode the director proposes a cut and waits 2 seconds. If you don't veto it (press V), the cut executes. This gives you manual override power during a performance.
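
The propose, wait, execute-unless-vetoed flow can be modeled as a small state object driven by explicit timestamps. This is a hedged sketch rather than the Rust implementation. `GhostCut` and its methods are hypothetical names, and it assumes a veto only counts while the proposal is still pending.

```javascript
const GHOST_DELAY_MS = 2000; // the 2-second veto window

class GhostCut {
  constructor(toCamera, proposedAtMs) {
    this.toCamera = toCamera;
    this.proposedAtMs = proposedAtMs;
    this.vetoed = false;
  }

  // Returns "pending", "vetoed", or "execute" for the given clock time.
  status(nowMs) {
    if (this.vetoed) return "vetoed";
    return nowMs - this.proposedAtMs >= GHOST_DELAY_MS ? "execute" : "pending";
  }

  // Called when the performer presses V; returns true if the veto took effect.
  veto(nowMs) {
    if (this.status(nowMs) !== "pending") return false;
    this.vetoed = true;
    return true;
  }
}

// Usage: a cut proposed at t = 1000 ms.
const proposal = new GhostCut("ipad-a16", 1000);
console.log(proposal.status(1500)); // pending
console.log(proposal.status(3000)); // execute
```

Passing the clock in as an argument, instead of reading it inside the class, keeps the logic testable without real timers.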

Recording & Production Pipeline

### During Performance
1. Press PLAY (or S key) → ToneHouse starts, webcam tracker drives audio
2. Director auto-switches between phone cameras in the hero view
3. Press REC ALL → phones start 4K local recording, browser captures audio

### Post-Production (Option C: Production Quality)
1. Transfer 4K phone videos to Mac4 (AirDrop/Syncthing)
2. Transfer Mac1 audio (.webm) to Mac4
3. Export director cut log as EDL (timestamps of each camera switch)
4. Premiere Pro on Mac4: import all angles, apply EDL for auto-cuts
5. TD on Mac2 renders production visualizer from motion param timeline
6. Composite TD output over Premiere edit (Screen blend mode)
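
Step 3 needs the cut log expressed as per-camera segments with timecodes. The sketch below shows only that conversion. It does not emit any particular EDL file format, and it assumes a 30 fps non-drop-frame timeline and cut entries shaped like the `director_cuts` items in the session data schema. `toTimecode` and `cutsToSegments` are hypothetical names.

```javascript
const FPS = 30; // assumption: 30 fps, non-drop-frame

// Converts seconds to an HH:MM:SS:FF timecode string.
function toTimecode(seconds, fps = FPS) {
  const totalFrames = Math.round(seconds * fps);
  const frames = totalFrames % fps;
  const totalSeconds = Math.floor(totalFrames / fps);
  const pad = (n) => String(n).padStart(2, "0");
  const hh = Math.floor(totalSeconds / 3600);
  const mm = Math.floor((totalSeconds % 3600) / 60);
  const ss = totalSeconds % 60;
  return `${pad(hh)}:${pad(mm)}:${pad(ss)}:${pad(frames)}`;
}

// Turns a list of cuts into contiguous per-camera segments covering the session.
function cutsToSegments(cuts, durationS) {
  if (cuts.length === 0) return [];
  const segments = [];
  let camera = cuts[0].from;
  let start = 0;
  for (const cut of cuts) {
    segments.push({ camera, tcIn: toTimecode(start), tcOut: toTimecode(cut.t) });
    camera = cut.to;
    start = cut.t;
  }
  segments.push({ camera, tcIn: toTimecode(start), tcOut: toTimecode(durationS) });
  return segments;
}

console.log(cutsToSegments([{ t: 12.5, from: "iphone-14-pro", to: "ipad-a16" }], 300));
```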

### Quick Reels (Screen Recording)
- Screen record the dashboard directly (macOS Cmd+Shift+5)
- Captures: director cuts + orb visualizer + audio in one take
- FIT/CROP toggle in header controls letterboxing

Infrastructure

| Machine | Role | Key Services |
|---------|------|--------------|
| Mac1 | Dashboard host, audio engine, orchestrator | multicam-server :9404, webcam tracker, ToneHouse |
| Mac2 | TouchDesigner visualizer (future), motion capture | TD :9510, Mocopi receiver :9501 |
| Mac3 | Storage, creative workstation | Content, Serenity |
| Mac4 | Compute, Premiere Pro editing | Adobe CC, Ollama, exo master |
| Mac5 | ML training | MLX, LoRA fine-tuning, movement vocab training |
| Phones (5) | Camera-only nodes | MJPEG :8081, pose data, 4K local recording |

### Network
- All devices on Tailscale mesh VPN
- MJPEG streams: phone → Mac1 over WiFi (1-3 Mbps each)
- Director WebSocket: Mac1 :9404 ↔ dashboard
- Pose data: phone → Mac1 via DirectorHubClient WebSocket
- 5GHz WiFi recommended for 5+ camera streams
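
As a quick sanity check on the WiFi recommendation, the aggregate MJPEG load follows directly from the per-stream figure above. The helper name is hypothetical, and the result ignores pose data and protocol overhead.

```javascript
// Aggregate bandwidth range for N MJPEG streams at 1-3 Mbps each.
function aggregateMbps(streams, perStreamMbps = { min: 1, max: 3 }) {
  return { min: streams * perStreamMbps.min, max: streams * perStreamMbps.max };
}

console.log(aggregateMbps(5)); // { min: 5, max: 15 }
```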

External Tools

| Tool | Status | Purpose |
|------|--------|---------|
| OBS | Installed (`/Applications/OBS.app`) | Future live streaming (TikTok, etc) |
| TouchDesigner | Mac2 (CC Echelon World) | Production visualizer, replaces Canvas 2D orb |
| Premiere Pro | Mac4 | Multi-angle edit with EDL auto-cuts |
| Mocopi | Available (Sony, 6 sensors) | 27-bone skeleton capture for TD and LIM-RPS |
| Strudel | Bundled in JS | Pattern language for advanced audio transformations |

Key Files Reference

| File | Location | What It Does |
|------|----------|--------------|
| `hand-tracker-simple.js` | `multicam-server/static/js/` | THE audio driver. Pixel-diff on 160x120, 10 body zones |
| `param-mapper.js` | `multicam-server/static/js/` | Maps zone motion to brightness/density/energy with velocity + snap-back |
| `tone-house.js` | `multicam-server/static/js/` | Tone.js: kick, clap, hat, bass, pad, genre patterns, filter/reverb |
| `visualizer.js` | `multicam-server/static/js/` | Canvas 2D orb + radial bars + waveform + particles + beat detection |
| `dashboard.html` | `multicam-server/src/` | Full dashboard UI, embedded via `include_str!` |
| `main.rs` | `multicam-server/src/` | Rust director engine, HTTP/WS, Murch scoring, BreathPhase |
| `lim_rps/` | `computational-choreography/crates/runtime/core-rs/src/` | Latent frame encoder for future integration |
| `movement-vocab.js` | `js/` | Movement vocabulary classifier (to be trained per-session) |
| `strudel-engine.js` | `multicam-server/static/js/` | Strudel pattern engine (future primary audio) |
| `session-exporter.js` | `js/` | Exports motion + audio data to Supabase for training |

Session Data Schema (for nightly training)

Each session captures:

```json
{
  "session_id": "uuid",
  "timestamp": "ISO",
  "duration_s": 300,
  "genre": "house",
  "bpm": 124,
  "frames": [
    {
      "t": 0.016,
      "brightness": 0.42,
      "density": 0.18,
      "energy": 0.65,
      "zones": {
        "rightArm": 0.31,
        "leftArm": 0.12,
        "head": 0.08,
        "center": 0.22,
        "legs": 0.05,
        "full": 0.19
      },
      "gesture": "raise_arms",
      "gesture_intensity": 0.7,
      "audio_state": {
        "kick_vol": -4,
        "pad_vol": -10,
        "hat_vol": -14,
        "filter_hz": 4200
      }
    }
  ],
  "director_cuts": [
    { "t": 12.5, "from": "iphone-14-pro", "to": "ipad-a16", "confidence": 0.82, "reason": "face-score-drop" }
  ]
}
```

This data feeds:
1. Movement vocabulary classifier training (Mac5, nightly)
2. Strudel transformation mapping refinement
3. Director scoring calibration
4. LIM-RPS latent encoder training (sessions 11+)
5. TD replay for production visualizer renders
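
Before any of these consumers train on a session, a quick summary pass is a cheap way to confirm the capture looks sane. This is a minimal sketch that reads only fields shown in the schema (`frames[].energy`, `frames[].gesture`, `director_cuts`, `duration_s`). `summarizeSession` is a hypothetical helper, not part of `session-exporter.js`.

```javascript
// Derives a few aggregate statistics from one session object.
function summarizeSession(session) {
  const frames = session.frames;
  const meanEnergy =
    frames.length === 0
      ? 0
      : frames.reduce((sum, f) => sum + f.energy, 0) / frames.length;
  const gestureCounts = {};
  for (const f of frames) {
    if (f.gesture) gestureCounts[f.gesture] = (gestureCounts[f.gesture] || 0) + 1;
  }
  const cutsPerMinute =
    session.duration_s > 0
      ? (session.director_cuts.length / session.duration_s) * 60
      : 0;
  return { meanEnergy, gestureCounts, cutsPerMinute };
}

// Usage with a tiny hand-written session.
const sample = {
  duration_s: 300,
  frames: [
    { energy: 0.5, gesture: "raise_arms" },
    { energy: 0.25, gesture: null },
    { energy: 0.75, gesture: "raise_arms" },
  ],
  director_cuts: [{ t: 12.5 }, { t: 40 }, { t: 95 }, { t: 180 }, { t: 250 }],
};
console.log(summarizeSession(sample));
```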

Source Anchor

MotionMix/ARCHITECTURE.md
