Grand Diomande Research · Full HTML Reader

MotionMix — Technical Architecture

> Full system architecture: hardware sensors → Rust engine → neural synthesis → multi-machine rendering
> Last updated: 2026-04-16


---

System Overview

MotionMix is a real-time motion-to-music synthesis platform. A performer's body movement is captured via wearable sensors and iPhone cameras, processed through a Rust engine (Echelon) into a 128-dimensional canonical vector, fed through a 5-layer neural network (SAN), and output as audio parameters, camera decisions, visual effects, and 3D body rendering across a fleet of devices.

Sony Mocopi (27 bones)  ──┐
iPhone IMU (60Hz)        ──┤
iPhone Camera (Vision)   ──┤──→ Echelon (Rust) ──→ SAN (5-layer) ──→ Audio + Camera + Visuals
Apple Watch (wrist)      ──┘         │                                       │
                                     │                                       ├──→ LiveDirector (Mac)
                                     ├──→ Avatar Pipeline ──→ Metal mesh     ├──→ TouchDesigner (Mac5)
                                     └──→ Accountability  ──→ rep/sleep      └──→ Unity VFX (Mac4)

---

Layer 0: Hardware Sensors

### Sony Mocopi
- 6 wearable IMU sensors (head, hip, wrists, ankles); the Mocopi app reconstructs a 27-bone skeleton from them
- Streams skeleton data at 30Hz via UDP/OSC
- Each bone: world-space position [x,y,z] + orientation quaternion [qw,qx,qy,qz]
- OSC formats: `/mcp/sklt/joint/N` (per-bone), `/mocopi/skel` (flat 189 floats), `/mcp/BonePos/{name}` (position-only)
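
As a rough illustration of the flat `/mocopi/skel` payload, the sketch below splits 189 floats into 27 bones of 7 floats each. The bone-major ordering (position first, then `qw,qx,qy,qz`) is an assumption taken from the per-bone description above, and `Bone` and `split_flat_skeleton` are hypothetical names, not part of the MotionMix code.

```python
from typing import List, NamedTuple, Sequence, Tuple

BONE_COUNT = 27
FLOATS_PER_BONE = 7  # [x, y, z, qw, qx, qy, qz]


class Bone(NamedTuple):
    position: Tuple[float, ...]  # (x, y, z)
    rotation: Tuple[float, ...]  # (qw, qx, qy, qz)


def split_flat_skeleton(values: Sequence[float]) -> List[Bone]:
    """Split a flat 189-float payload into 27 bones (assumed bone-major order)."""
    expected = BONE_COUNT * FLOATS_PER_BONE
    if len(values) != expected:
        raise ValueError(f"expected {expected} floats, got {len(values)}")
    bones = []
    for i in range(BONE_COUNT):
        chunk = values[i * FLOATS_PER_BONE:(i + 1) * FLOATS_PER_BONE]
        bones.append(Bone(tuple(chunk[:3]), tuple(chunk[3:])))
    return bones
```

Rejecting payloads of the wrong length up front keeps a truncated UDP packet from silently shifting every bone after the gap.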

### iPhone CoreMotion
- Accelerometer, gyroscope, gravity vector, attitude quaternion
- 60Hz sampling via CMMotionManager
- Feeds the LIM-RPS latent space dynamics in Rust

### iPhone Camera + Vision
- AVCaptureSession rear camera
- Apple Vision VNDetectHumanBodyPoseRequest extracts 14 body joints
- Each joint: normalized [x, y, confidence]
- Derived metrics: body energy, bouncing (hip Y oscillation), core motion, leg motion

### Apple Watch
- WatchConnectivity session for wrist energy + jerk
- Supplements IMU data when available

---

Layer 1: Sensor Ingestion (Swift)

All ingestion services run in the MotionMixApp iOS application.

### MocopiReceiver.swift
- `@MainActor final class MocopiReceiver: ObservableObject`
- NWListener on UDP `:9500` using Apple Network framework
- Pure-Swift OSC parser (no third-party dependencies)
- Auto-detects quaternion ordering (qw-first vs qw-last) by magnitude comparison
- 27-bone accumulator with auto-flush at 23+ bones or 33ms time boundary
- Output: `echelonBridge.mocopiExtractor.update(bones:)` on MainActor
- All parsing methods are `nonisolated` for background queue safety
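
The receiver's parser is written in Swift; the Python sketch below only illustrates the OSC 1.0 wire format it has to handle (null-terminated strings padded to 4 bytes, a type tag string, big-endian arguments). `parse_osc_message` is a hypothetical name, and the argument layout of each Mocopi address is not specified here, so the function simply returns whatever arguments the type tags declare.

```python
import struct
from typing import List, Tuple, Union

OscArg = Union[int, float, str]


def _read_string(data: bytes, offset: int) -> Tuple[str, int]:
    """Read a null-terminated OSC string and skip its padding."""
    end = data.index(b"\x00", offset)
    text = data[offset:end].decode("ascii")
    # Skip the terminator and pad up to the next 4-byte boundary.
    return text, (end + 4) & ~3


def parse_osc_message(data: bytes) -> Tuple[str, List[OscArg]]:
    """Parse one OSC message with int32, float32 and string arguments."""
    address, offset = _read_string(data, 0)
    if not address.startswith("/"):
        raise ValueError("not an OSC message")
    tags, offset = _read_string(data, offset)
    if not tags.startswith(","):
        raise ValueError("missing type tag string")
    args: List[OscArg] = []
    for tag in tags[1:]:
        if tag == "f":
            args.append(struct.unpack_from(">f", data, offset)[0])
            offset += 4
        elif tag == "i":
            args.append(struct.unpack_from(">i", data, offset)[0])
            offset += 4
        elif tag == "s":
            text, offset = _read_string(data, offset)
            args.append(text)
        else:
            raise ValueError(f"unsupported type tag {tag!r}")
    return address, args
```

Because the function takes immutable bytes and keeps no shared state, an equivalent Swift routine can run off the main actor, which is the property the `nonisolated` parsing methods rely on.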

### SensorService.swift
- CMMotionManager at 60Hz
- Buffers device motion frames
- Watch session via WCSession
- Bonjour mesh relay discovery for multi-device sync
- Output: `echelonBridge.updateSensor()` + `onLatentUpdate` callback

### CameraService.swift
- AVCaptureSession with configurable position (front/back)
- Distributes CGImage frames to: PoseService, LiveStreamServer, RecordingService, FaceAnalyzer
- Gemini analysis frames every ~3 seconds

### PoseService.swift
- VNDetectHumanBodyPoseRequest on every camera frame
- Extracts 14 joints with confidence thresholds
- Computes 6 pose features: meanX, meanY, stdX, stdY, rangeX, rangeY
- These features overwrite 128D vector positions [63:69]
- Output: `onPoseUpdated` callback with MotionMixPoseFrame
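
A minimal sketch of the six pose features, assuming joints arrive as `(x, y, confidence)` triples and that low-confidence joints are dropped before the statistics are taken. The 0.3 threshold, the use of population standard deviation, and the name `pose_features` are illustrative assumptions rather than values from PoseService.swift.

```python
from statistics import fmean, pstdev
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, confidence), normalized coordinates


def pose_features(joints: Sequence[Joint], min_confidence: float = 0.3) -> List[float]:
    """Return [meanX, meanY, stdX, stdY, rangeX, rangeY] over confident joints."""
    kept = [(x, y) for x, y, c in joints if c >= min_confidence]
    if not kept:
        return [0.0] * 6  # no usable joints this frame
    xs = [x for x, _ in kept]
    ys = [y for _, y in kept]
    return [
        fmean(xs), fmean(ys),
        pstdev(xs), pstdev(ys),
        max(xs) - min(xs), max(ys) - min(ys),
    ]
```

The returned list has the same order as the six slots at [63:69], so it can be written into the vector with one slice assignment.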

---

Layer 2: Rust Engine — Echelon

Source: `Desktop/Comp-Core/core/audio-media/cc-echelon/`
Binary: `Desktop/MotionMixApp/Frameworks/libechelon_ios.a` (22MB arm64)
Header: `Desktop/MotionMixApp/Frameworks/include/echelon.h`

2a: EchelonCore (cc-brain)

The core motion intelligence engine.

LIM-RPS Latent Space
- 16-dimensional Riemannian manifold
- Geodesic dynamics with velocity, curvature, jerk
- State: z[16] (position), velocity[16], plus 7 scalar features (norm, speed, curvature, grounding, verticality, rotation, coherence)

Lexicon
- 8 expressive scalars derived from latent dynamics:
- tension, divergence, transition_intensity, dissolution
- reformation, resolution, energy, expressivity

Section State Machine
- 7 states: Entry → MicroInitiation → StableSection → Divergence → Transitional → Reformation → Resolution
- Transitions driven by latent space dynamics (speed, curvature, divergence thresholds)

Motion Pipeline
- Anticipation Kernel: commitment, uncertainty, transition_pressure, recovery_margin, phase_stiffness, novelty, stability
- Gesture Classifier: class_id, confidence, commitment (from pose joint trajectories)
- Bar Boundary Detector: tempo-aware beat tracking for musical structure

2b: SAN Pipeline (Somatic Adaptive Network)

5-layer neural network replacing 200+ hardcoded thresholds with learned mappings.
135K parameters. Runs at 30Hz.

128D input ──→ [L1: FAN] ──→ [L2: FuseMoE] ──→ [L3: NHA] ──→ [L4: TTT] ──→ [L5: FiLM Heads]
                  │               │                  │              │               │
            Normalizer      6 experts          7-mode ODE      Hebbian         5 output
            (input→128D)    top-2 gate         RK4 integr.     fast weights    heads
                            128D→40D/expert    learned phase    bar-boundary    (44D total)
| Layer | Name | Params | Function |
|---|---|---|---|
| L1 | FAN (Feature-Aligned Normalizer) | ~400 | Running mean/var normalization for 128D input |
| L2 | FuseMoE (Fused Mixture of Experts) | ~51K | 6 experts (128D→40D each), top-2 gating, load balancing |
| L3 | NHA (Neural Harmonic Attractor) | ~15K | 7-mode ODE with RK4 integration, learned phase coupling |
| L4 | TTT (Test-Time Training) | ~6.8K | Hebbian fast weights, adapts at bar boundaries within a session |
| L5 | FiLM Heads | ~62K | 5 output projections with feature-wise linear modulation |
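
To make the top-2 gating in L2 concrete, here is a small pure-Python sketch: the gate keeps the two highest-scoring experts, applies a softmax over just those two, and mixes their outputs. This is the generic top-k routing pattern under assumed names (`top2_gate`, `mix_experts`); the actual FuseMoE gate, its load-balancing term, and its weight shapes are not reproduced here.

```python
import math
from typing import List, Sequence


def top2_gate(logits: Sequence[float]) -> List[float]:
    """Per-expert weights: softmax over the two largest logits, zero elsewhere."""
    if len(logits) < 2:
        raise ValueError("need at least two experts")
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:2]
    peak = logits[top[0]]  # subtract the max for numerical stability
    exps = {i: math.exp(logits[i] - peak) for i in top}
    total = sum(exps.values())
    weights = [0.0] * len(logits)
    for i, e in exps.items():
        weights[i] = e / total
    return weights


def mix_experts(weights: Sequence[float],
                expert_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Weighted sum of expert outputs; experts with zero weight are skipped."""
    dim = len(expert_outputs[0])
    out = [0.0] * dim
    for w, vec in zip(weights, expert_outputs):
        if w == 0.0:
            continue  # an unselected expert never needs to be evaluated
        for d in range(dim):
            out[d] += w * vec[d]
    return out
```

With 6 experts and top-2 routing, only a third of the expert parameters are exercised per frame, which is the usual motivation for sparse gating in a 30Hz on-device loop.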

Output Heads (44D total):
- Audio (20D): 4 instruments (kick, hihat, bass, pad) x 5 params each
- Camera (9D): 7 angle scores + cut timing + transition type
- Pattern (2D): intensity + variation
- Phrase (4D): forming + growing + stable + dissolving
- Gesture (9D): 8 class probabilities + confidence
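
The head sizes above can be checked and unpacked with a few lines. The sketch assumes the heads are laid out contiguously in the order listed (audio, camera, pattern, phrase, gesture); that ordering and the name `split_heads` are assumptions for illustration.

```python
from typing import Dict, List, Sequence

# (name, width) in the assumed on-wire order; widths sum to 44.
HEAD_SIZES = [("audio", 20), ("camera", 9), ("pattern", 2), ("phrase", 4), ("gesture", 9)]


def split_heads(output: Sequence[float]) -> Dict[str, List[float]]:
    """Slice a 44D SAN output vector into named heads."""
    total = sum(size for _, size in HEAD_SIZES)
    if len(output) != total:
        raise ValueError(f"expected {total} values, got {len(output)}")
    heads: Dict[str, List[float]] = {}
    start = 0
    for name, size in HEAD_SIZES:
        heads[name] = list(output[start:start + size])
        start += size
    return heads
```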

2c: 128D Canonical Vector Layout

The central data structure connecting all subsystems. Assembled from multiple sources in `getDynamics128()`:

Index      Source          Content
──────────────────────────────────────────────────────────────
[0:16]     Rust            z vector (16D LIM-RPS latent position)
[16:32]    Rust            z padding (zeros)
[32:48]    Rust            velocity (16D LIM-RPS latent velocity)
[48:63]    Rust            velocity padding (zeros)
[63:69]    Swift OVERRIDE  pose features [meanX, meanY, stdX, stdY, rangeX, rangeY]
                           (Mocopi 6D if fresh, else Vision 6D)
[69:75]    Rust            temporal scalars [internal_tempo, phase, periodicity,
                           grounding, verticality, rotation, coherence]
[75]       Swift           modality mask (camera=1, pocket=2, mocopi=4, watch=8, /15)
[76:100]   Swift           Mocopi 24D features (joint vel, limb ratios, symmetry)
[100:102]  Swift           Pocket IMU (pitch, roll)
[102:104]  Swift           Watch (HR normalized, wrist energy)
[104:128]  —               reserved (future: Femto Bolt, LiDAR, etc.)

Critical invariant: Swift overwrites [63:69] after Rust writes [0:76], so the Rust values at [64:69] (norm, speed, curvature, curvature_rate, jerk) are clobbered. Training data must match this runtime layout exactly.

Training note: V5 weights were trained on 104D only (dims [104:128] = zeros). The SAN Rust config already declares `input_dim: 128`. V6 retrain will activate the full 128D with Mocopi/Watch/IMU features.
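
The write order matters more than the individual slices, so a short sketch may help. It assembles the vector in the same sequence as the runtime: Rust block first, then the Swift overrides. The function and argument names are hypothetical, and the modality mask is read as a 4-bit flag set divided by 15, which is one interpretation of the `/15` note in the layout.

```python
from typing import Iterable, List, Sequence

DIM = 128
MODALITY_BITS = {"camera": 1, "pocket": 2, "mocopi": 4, "watch": 8}


def modality_mask(active: Iterable[str]) -> float:
    """Combine active modalities into one scalar in [0, 1]."""
    bits = 0
    for name in active:
        bits |= MODALITY_BITS[name]
    return bits / 15.0


def assemble_dynamics_128(rust_76: Sequence[float], pose_6: Sequence[float],
                          active: Iterable[str], mocopi_24: Sequence[float],
                          imu_2: Sequence[float], watch_2: Sequence[float]) -> List[float]:
    """Build the 128D vector; later writes deliberately clobber earlier ones."""
    sizes = [(rust_76, 76), (pose_6, 6), (mocopi_24, 24), (imu_2, 2), (watch_2, 2)]
    for block, size in sizes:
        if len(block) != size:
            raise ValueError(f"expected block of {size}, got {len(block)}")
    vec = [0.0] * DIM
    vec[0:76] = rust_76          # Rust writes first
    vec[63:69] = pose_6          # Swift override of the pose slots
    vec[75] = modality_mask(active)
    vec[76:100] = mocopi_24
    vec[100:102] = imu_2
    vec[102:104] = watch_2
    return vec                   # [104:128] stays zero (reserved)
```

A training-side loader that applies the same function to logged inputs reproduces the clobbered slots by construction, which is one way to keep offline data aligned with the runtime layout.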

2d: Avatar Pipeline (cc-brain/src/avatar/)

Ported from Meta's AI4AnimationPy (Facebook Research, Paul & Sebastian Starke). 3,115 lines of Rust.

| File | Lines | Purpose |
|---|---|---|
| skeleton.rs | 766 | Quaternion [w,x,y,z] math, 4x4 matrix transforms, forward kinematics chain |
| bvh.rs | 449 | BVH animation file parser, ZXY Euler convention, frame-by-frame playback |
| skinning.rs | 453 | Linear Blend Skinning + Dual Quaternion skinning, 4 bones/vertex max |
| glb.rs | 1,026 | Full glTF/GLB binary parser, accessor resolution, mesh primitives, skeleton extraction |
| mod.rs | 421 | AvatarPipeline struct + 9 FFI symbols |

Data flow:

Mocopi 27 bones [x,y,z,qw,qx,qy,qz] per bone
    ↓ avatar_update_bones()
Forward Kinematics (parent→child chain)
    ↓
World-space bone positions (27 x [x,y,z])
    ↓ Linear Blend Skinning
Deformed mesh vertices (N x [x,y,z])
    ↓ avatar_get_deformed_positions()
Metal vertex buffer → GPU rendering
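
The forward kinematics step in this flow can be sketched in a few lines. The version below assumes each bone carries a local offset and local rotation relative to its parent, with quaternions in `[w,x,y,z]` order as in skeleton.rs, and that parents appear before their children in the bone list. It is a generic FK chain, not a port of the Rust implementation.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Quat = Tuple[float, float, float, float]  # (w, x, y, z), unit length


def quat_mul(a: Quat, b: Quat) -> Quat:
    """Hamilton product a * b."""
    aw, ax, ay, az = a
    bw, bx, by, bz = b
    return (aw * bw - ax * bx - ay * by - az * bz,
            aw * bx + ax * bw + ay * bz - az * by,
            aw * by - ax * bz + ay * bw + az * bx,
            aw * bz + ax * by - ay * bx + az * bw)


def quat_rotate(q: Quat, v: Vec3) -> Vec3:
    """Rotate v by unit quaternion q via q * (0, v) * conj(q)."""
    w, x, y, z = q
    r = quat_mul(quat_mul(q, (0.0, *v)), (w, -x, -y, -z))
    return (r[1], r[2], r[3])


def forward_kinematics(parents: Sequence[Optional[int]],
                       local_pos: Sequence[Vec3],
                       local_rot: Sequence[Quat]) -> Tuple[List[Vec3], List[Quat]]:
    """Accumulate world transforms down the chain (parents listed before children)."""
    world_pos: List[Vec3] = []
    world_rot: List[Quat] = []
    for i, parent in enumerate(parents):
        if parent is None:  # root bone: local frame is the world frame
            world_pos.append(local_pos[i])
            world_rot.append(local_rot[i])
            continue
        pr, pp = world_rot[parent], world_pos[parent]
        offset = quat_rotate(pr, local_pos[i])
        world_pos.append((pp[0] + offset[0], pp[1] + offset[1], pp[2] + offset[2]))
        world_rot.append(quat_mul(pr, local_rot[i]))
    return world_pos, world_rot
```

For example, rotating the root 90 degrees about Z swings a child bone with local offset (1, 0, 0) onto the Y axis, and a grandchild with the same offset lands one unit further along Y.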

44 tests pass. Supports both live Mocopi input and BVH animation replay.

2e: Accountability Engine (cc-brain/src/accountability/)

Exercise detection and sleep/wake classification. 6 files, ~900 lines, 27 tests, 9 FFI symbols.

| File | Purpose |
|---|---|
| joint_angle.rs | `joint_angle_3d()` / `joint_angle_2d()` for elbow, knee, torso angles from bone positions |
| rep_counter.rs | EMA-filtered peak-valley rep detector (alpha=0.3, hysteresis=3 frames, min_separation=15 frames) |
| exercise.rs | PushUpDetector (elbow+torso FSM), SquatDetector (knee FSM), ExerciseClassifier with bout detection |
| sleep.rs | SleepWakeDetector: 30s EMA on hip height ratio + body energy, 5-min threshold |
| types.rs | ExerciseType, SleepState, RepEvent, AccountabilityEventFFI enums/structs |
| mod.rs | AccountabilityEngine + 9 FFI functions |
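
For reference, a joint angle computed from three bone positions is the angle at the middle point between the two adjoining segments. The sketch below is the standard dot-product formulation in Python; it mirrors the role of `joint_angle_3d()` but is not a transcription of the Rust code, and the zero-length fallback is an assumption.

```python
import math
from typing import Sequence


def joint_angle_3d(a: Sequence[float], b: Sequence[float], c: Sequence[float]) -> float:
    """Angle at b in degrees between segments b->a and b->c (e.g. shoulder, elbow, wrist)."""
    u = [a[i] - b[i] for i in range(3)]
    v = [c[i] - b[i] for i in range(3)]
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # degenerate segment: no meaningful angle
    cos_theta = sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # guard against rounding drift
    return math.degrees(math.acos(cos_theta))
```

A fully extended arm reads close to 180 degrees and a right-angle bend reads 90, which is the scale the detector thresholds are expressed in.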

Exercise Detection State Machines:

Push-up: IDLE → DOWN (elbow<120, torso<30, 3 frames) → UP (elbow>155, 3 frames) → counted
Squat:   STANDING → DESCENDING (knee<150) → BOTTOM (knee<110) → ASCENDING (knee>120) → counted (knee>155)
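
The push-up machine can be sketched as a two-phase counter with a consecutive-frame requirement. The thresholds come from the line above; returning to IDLE immediately after the rep is counted, and the class name `PushUpCounter`, are assumptions made for the sketch.

```python
class PushUpCounter:
    """IDLE -> DOWN -> UP (counted), each transition held for `hold_frames` frames."""

    def __init__(self, hold_frames: int = 3) -> None:
        self.hold_frames = hold_frames
        self.state = "IDLE"
        self.reps = 0
        self._streak = 0

    def update(self, elbow_deg: float, torso_deg: float) -> bool:
        """Feed one frame of angles; return True on the frame a rep is counted."""
        if self.state == "IDLE":
            hit = elbow_deg < 120.0 and torso_deg < 30.0  # lowered, body near horizontal
        else:  # DOWN: wait for the arms to extend again
            hit = elbow_deg > 155.0
        self._streak = self._streak + 1 if hit else 0
        if self._streak < self.hold_frames:
            return False
        self._streak = 0
        if self.state == "IDLE":
            self.state = "DOWN"
            return False
        self.state = "IDLE"  # UP reached: count the rep and rearm
        self.reps += 1
        return True
```

Requiring the condition to hold for several consecutive frames filters single-frame angle spikes, at the cost of roughly 100ms of latency per transition at 30Hz.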

Sleep States: AwakeActive (0), AwakeStill (1), Resting (2), Sleeping (3)
- Classification: height ratio vs standing + body energy threshold
- Sleeping requires sustained low activity for 5 minutes (9000 frames at 30Hz)

Bout Detection: Counts active-exercise frames (not rep events). Threshold: 90 frames (3 seconds at 30Hz).

---

Layer 3: Swift Bridge — EchelonBridge.swift

`@MainActor` class running at 60Hz via CADisplayLink.

Key invariant: EchelonBridge owns SANService as a non-optional `let` property. Never use @StateObject for SANService.

60Hz Loop

Display Link fires (60Hz)
    ↓
echelon_update_sensor() — feed buffered IMU frames to Rust
    ↓
echelon_step(dt) — advance Rust engine one timestep
    ↓
echelon_get_latent() — read latent state
echelon_get_lexicon() — read expressive scalars
echelon_get_anticipation() — read anticipation kernel
echelon_get_gesture() — read gesture classifier
    ↓
getDynamics128() — assemble 128D canonical vector:
  [0:76]    from Rust echelon_get_dynamics_128()
  [63:69]   overwrite with pose features (Mocopi or Vision)
  [75]      modality mask
  [76:100]  Mocopi 24D from MocopiFeatureExtractor
  [100:104] Pocket IMU + Watch features
  [104:128] reserved (zeros)
    ↓
san_step() — feed 128D to SAN pipeline (30Hz, every other frame)
    ↓
san_get_output() — read SAN output (44D)
    ↓
Distribute to consumers: AudioEngine, AutoDirector, StrudelEngine

### MocopiFeatureExtractor
- Receives 27 bones from MocopiReceiver
- Extracts 24D features: joint velocities, limb ratios, symmetry metrics
- Written into 128D vector at positions [76:100]

---

Layer 4: Output Consumers (Swift)

### AudioEngine
- AVAudioEngine with source node render callback
- SAN audio head drives 4 instruments: kick, hihat, bass, pad
- Each instrument: 5 parameters (energy, onset probability, spectral centroid, modulation, presence)
- StrudelWebEngine: WebKit-based pattern sequencer, receives SAN pattern head (intensity + variation)
- Mix factor crossfade: 0.0 = pure heuristic thresholds, 1.0 = pure SAN learned output
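
The 20D audio head and the mix factor can be illustrated together. The sketch assumes an instrument-major layout (five consecutive values per instrument, in the parameter order listed above) and a plain linear blend for the crossfade; both are assumptions, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence

INSTRUMENTS = ("kick", "hihat", "bass", "pad")
PARAMS = ("energy", "onset_probability", "spectral_centroid", "modulation", "presence")


def unpack_audio_head(audio_20: Sequence[float]) -> Dict[str, Dict[str, float]]:
    """Reshape the 20D audio head into instrument -> parameter -> value."""
    if len(audio_20) != len(INSTRUMENTS) * len(PARAMS):
        raise ValueError("audio head must have 20 values")
    result = {}
    for i, instrument in enumerate(INSTRUMENTS):
        start = i * len(PARAMS)
        result[instrument] = dict(zip(PARAMS, audio_20[start:start + len(PARAMS)]))
    return result


def crossfade(heuristic: Sequence[float], learned: Sequence[float], mix: float) -> List[float]:
    """Blend heuristic and SAN values: mix=0.0 is pure heuristic, 1.0 is pure SAN."""
    m = max(0.0, min(1.0, mix))
    return [(1.0 - m) * h + m * s for h, s in zip(heuristic, learned)]
```

Keeping the heuristic path alive behind the mix factor gives a safe fallback during a performance: dialing the factor to 0.0 removes the learned output without restarting the audio graph.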

### AutoDirector
- SAN camera head: 7 camera angle scores for multi-cam switching
- Cut timing + transition type (hard cut, crossfade, etc.)
- Connected to DirectorHubClient (WebSocket) for centralized orchestration

### LiveDirector (macOS — MotionMixLiveDirector)
- Receives pose telemetry from all iOS devices via persistent WebSocket
- MJPEG preview streams from each camera node
- Centralized camera switching decisions
- SwiftUI for Mac, built from `Desktop/MotionMixLiveDirector/`

### Visual Pipeline
- MetalRenderer: particle system driven by latent state (orb, spine, horizon parameters from UIState)
- ChestFlexDetector: maps pectoral flex to visual + audio triggers
- Flex direction feeds Metal shader for particle offset

---

Layer 5: 3D Rendering Pipeline (Cross-Machine)

### TouchDesigner (Mac5 — [ip])

mocopi_to_td.py (Mac1, HTTP :9407)
    ↓ converts to OSC
    ↓ fan-out to FANOUT_TARGETS
TouchDesigner :9501 — skeleton (27 bones pos + rot)
TouchDesigner :9500 — performance signals (energy, tension, brightness, density)
    ↓
Render: geometry → camera → lights → render → bloom → level → output
    ↓
1920x1080 real-time 3D body visualization

Scripts: `Desktop/cc-touchdesigner/`
- `cc_network_builder.py` — builds /cc container in TD (aurora + bloom pipeline)
- `mocopi_to_td.py` — HTTP→OSC bridge with fan-out relay
- `osc_bridge.py` — performance signal poller (20Hz)

### Unity (Mac4 — [ip])
- Unity 6 LTS, project: DepthReactiveVisuals
- Receives texture/data from Mac5 via Thunderbolt 5 direct link (sub-millisecond latency)
- VFX Graph production rendering for audience-facing display
- Adobe suite (Illustrator/Photoshop/AE/Premiere) for generative art runs alongside

### Fan-Out Relay
`mocopi_to_td.py` on Mac1 acts as a multi-destination relay:

```python
FANOUT_TARGETS = [
    ("[ip]", 9501),   # Mac5 TouchDesigner
    # ("192.168.1.X", 9500),    # iPhone MocopiReceiver (add device IP)
]
```

Every OSC message is duplicated to all targets. Enables simultaneous iPhone SAN processing + Mac5 3D rendering from a single Mocopi stream.
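
A minimal sketch of the duplication step, assuming plain UDP datagrams and a target list shaped like `FANOUT_TARGETS`. The `fan_out` helper is hypothetical and not taken from `mocopi_to_td.py`; it shows one reasonable policy, which is to keep sending to the remaining targets when one of them fails.

```python
import socket
from typing import Iterable, Tuple


def make_udp_socket() -> socket.socket:
    """Create an unconnected UDP socket suitable for sendto()."""
    return socket.socket(socket.AF_INET, socket.SOCK_DGRAM)


def fan_out(sock, payload: bytes, targets: Iterable[Tuple[str, int]]) -> int:
    """Send the same datagram to every target; return how many sends succeeded."""
    sent = 0
    for host, port in targets:
        try:
            sock.sendto(payload, (host, port))
            sent += 1
        except OSError:
            # One unreachable target should not stall the others.
            continue
    return sent
```

Usage would look like `fan_out(make_udp_socket(), osc_bytes, FANOUT_TARGETS)` once per incoming message, with the socket created once and reused.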

---

Layer 6: Training Pipeline (Offline)

### Capture
- SANTrajectoryLogger: JSONL capture at 5Hz on device
- Records: 128D input vector + SAN output + track metadata
- Stored in `Documents/san-training/*.jsonl` on device
- NSLock-protected I/O for thread safety
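
The capture behavior described above (rate-limited JSONL with lock-protected writes) can be sketched as follows. The record field names, the timestamp-based rate limit, and the class name `TrajectoryLogger` are assumptions for illustration; the on-device logger is Swift and uses NSLock rather than `threading.Lock`.

```python
import json
import threading
from typing import IO, Optional, Sequence


class TrajectoryLogger:
    """Write at most `rate_hz` JSONL records per second to a text stream."""

    def __init__(self, stream: IO[str], rate_hz: float = 5.0) -> None:
        self._stream = stream
        self._interval = 1.0 / rate_hz
        self._last: Optional[float] = None
        self._lock = threading.Lock()

    def log(self, timestamp: float, input_128: Sequence[float],
            output_44: Sequence[float], track: Optional[str] = None) -> bool:
        """Record one frame if enough time has passed; return True if written."""
        with self._lock:  # serialize the rate check and the write together
            if self._last is not None and timestamp - self._last < self._interval:
                return False
            self._last = timestamp
            record = {
                "t": timestamp,
                "input": list(input_128),
                "output": list(output_44),
                "track": track,
            }
            self._stream.write(json.dumps(record) + "\n")
            return True
```

One record per line means a capture interrupted mid-session still yields a file that parses up to the last complete line.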

### Transfer + Alignment

devicectl copy ← device Documents/san-training/
    ↓ /Volumes/HD1/training-phrases/device_captures/
build_v5_pairs.py
    ↓ align 128D captures with audio features from playlist NPZ
    ↓ per-frame: rms_energy, onset_strength, chroma[12], mfcc[20], spectral_centroid
Aligned training pairs (input: 128D, target: 44D)

### Training
- `train_san_v5.py`: MLX framework, AdamW optimizer, early stopping
- Current: V5 weights, 5,408 real training pairs, val loss 0.028
- Output: `san_v5_weights.bin` + `san_v5_manifest.json`

### Deployment

Copy weights to MotionMixApp/Resources/
    ↓
Rebuild libechelon_ios.a (if Rust code changed)
    ↓
xcodebuild -workspace ... ENABLE_DEBUG_DYLIB=NO
    ↓
xcrun devicectl device install app --device {ID} {APP}
xcrun devicectl device process launch --device {ID} com.openclaw.MotionMixApp

---

FFI Surface

All C-ABI symbols declared in `echelon.h`, linked via `libechelon_ios.a`:

| Module | Symbols | Purpose |
|---|---|---|
| EchelonCore | 18 | Lifecycle, sensor input, processing, latent/lexicon/pose output |
| SAN | 13 | Pipeline lifecycle, step, output, training data, weight loading, benchmark |
| ClaimBridge | 5 | N'Ko inscription detection from 128D latent vector |
| Avatar | 9 | Skeleton FK, GLB mesh, bone update, deformed vertices, triangle indices |
| Accountability | 9 | Exercise detection, sleep/wake, rep counting, event polling |
| Total | 54 | |

---

Device Fleet

| Device | ID | Role |
|---|---|---|
| iPhone 16 Plus | 880B4058 | Primary (full mode — all sensors + audio + SAN) |
| iPhone 16 Pro Max | 84109044 | Secondary (full mode) |
| iPhone 14 Pro Max | 45896348 | Camera node (pose streaming only, no SAN) |
| iPad A16 (Mohamed's) | 1DE6FABC | ShootView gallery |
| iPad A16 | 1938B9B3 | ShootView gallery |
| Mac1 | local | Build host, relay, orchestration |
| Mac4 | [ip] | Unity VFX + Adobe generative art |
| Mac5 | [ip] | TouchDesigner 3D rendering, ML compute |

---

Build Commands

```bash
# All paths are relative to the directory that contains Desktop/.
# Each cd runs in a subshell so the working directory resets between steps.

# Rust engine tests
(cd Desktop/Comp-Core/core/audio-media/cc-echelon && cargo test -p cc-brain --lib)

# Rust engine build (iOS arm64)
(cd Desktop/Comp-Core/core/audio-media/cc-echelon && \
  cargo build -p echelon-ios --target aarch64-apple-ios --release)
cp Desktop/Comp-Core/core/audio-media/cc-echelon/target/aarch64-apple-ios/release/libechelon_ios.a \
  Desktop/MotionMixApp/Frameworks/

# iOS app
(cd Desktop/MotionMixApp && \
  xcodebuild build -workspace MotionMixApp.xcworkspace -scheme MotionMixApp \
    -destination 'generic/platform=iOS' ENABLE_DEBUG_DYLIB=NO)

# LiveDirector (macOS)
(cd Desktop/MotionMixLiveDirector && \
  xcodebuild build -project MotionMixLiveDirector.xcodeproj \
    -scheme MotionMixLiveDirector -destination 'platform=macOS')

# Deploy to device (install does not launch; run both)
xcrun devicectl device install app --device {DEVICE_ID} {APP_PATH}
xcrun devicectl device process launch --device {DEVICE_ID} com.openclaw.MotionMixApp
```

---

Key Invariants

1. EchelonBridge owns SANService — `let san = SANService()` as non-optional. Never @StateObject.
2. 128D [63:69] overwrite — Swift pose features (Mocopi or Vision) clobber Rust values at these indices. Training data must match.
3. Rust/Python naming swap — MoE experts: Python `down` = Rust `up` (first projection). Weight loading cross-maps.
4. @MainActor for all @Published — No DispatchQueue.main.async wrappers. Direct synchronous updates.
5. No Mirror(reflecting:) — Banned in 30Hz hot path. Use direct tuple indexing for FFI structs.
6. Camera-node mode — Camera nodes skip SAN, audio, training capture. Early return in wireServices().
7. nonisolated parsing — MocopiReceiver parsing chain is nonisolated. Only flush() crosses to MainActor via Task.
8. Install + launch — devicectl install does NOT auto-launch. Always follow with devicectl process launch.

Source Anchor

MotionMixApp/ARCHITECTURE.md
