Grand Diomande Research · Full HTML Reader

Computational Choreography: Deterministic Motion-to-Audio Synthesis via Geometric Anticipation Signals

Embodied Trajectory Systems working paper preprint structure candidate score 100 .md

Full Public Reader

Computational Choreography: Deterministic Motion-to-Audio Synthesis via Geometric Anticipation Signals

Mohamed Diomande
Independent Researcher
[email]

---

Abstract

We present Computational Choreography, a deterministic pipeline that transforms heterogeneous sensor input -- phone accelerometer, smartwatch heart rate, full-body IMU skeleton -- into real-time audio synthesis through geometric anticipation signals. The system guarantees deterministic replay: identical sensor input always produces identical audio output. The key innovation is the Anticipation Kernel, which computes seven geometric scalars (commitment, uncertainty, transition pressure, recovery margin, phase stiffness, novelty, stability) from fused sensor data, creating a continuous phase space that drives audio synthesis parameters. Unlike ad hoc motion-to-audio mappings that bind a single sensor axis to a single audio parameter, our approach provides a principled geometric foundation where the anticipation scalars capture the intent of movement before the movement completes. The full implementation comprises 334 Rust source files across the motion and audio layers, all verified to compile on Rust 1.92.0 stable, with five genre-specific synthesizer kits (House, Techno, Jazz, Electro, Ambient), Ableton Link beat synchronization, a Strudel.js live coding bridge, and companion iOS applications for sensor capture. Cross-domain validation demonstrates that the same anticipation scalars, applied unchanged to conversational turn-taking data, predict topic convergence at 71.8

Keywords: motion-to-audio, anticipation, sensor fusion, live coding, deterministic replay, Ableton Link, body-as-instrument, Strudel.js, cross-domain generalization

---

1. Introduction

1.1 The Problem: Ad Hoc Motion-to-Audio Mapping

The prevailing approach to motion-controlled audio maps individual sensor axes directly to audio parameters: hand X position controls filter cutoff frequency, tilt angle controls reverb mix, hand height controls pitch. This paradigm, which we term direct mapping, is exemplified by MediaPipe-based gesture-to-audio systems [1], TouchDesigner installations that route OSC data from phones to Ableton Live parameters [2], and commercial products like Genki Wave [3]. While functional, direct mapping has three structural deficiencies.

First, it is fragile: the mapping encodes assumptions about the performer's coordinate frame, body proportions, and sensor placement. A mapping tuned for one performer breaks for another. Second, it is semantically flat: filter cutoff at 2000 Hz conveys the same meaning whether the performer is mid-leap or standing still. The mapping has no notion of the movement's trajectory, its point of no return, or its available futures. Third, it is non-deterministic in practice: floating-point jitter, sensor noise, and race conditions between OSC packets mean that replaying the same recorded motion data through the same patch yields subtly different audio output, making analysis and comparison impossible.

1.2 Our Approach: Geometric Anticipation as Intermediary

We introduce an intermediate representation between raw sensor data and audio parameters: the Anticipation Packet. Rather than mapping sensor axes directly, we first fuse all sensor streams into a stabilized MotionWindow (50 frames at 50 Hz, approximately one second of movement), then project that window through an Anticipation Kernel that computes seven geometric scalars describing the dynamic state of the movement:

1. Commitment [0,1]: How irreversible the current motion has become. High when uncertainty is low, constraint proximity is high, and directional momentum is strong.
2. Uncertainty [0,1]: How many plausible futures remain. Estimated from regime embedding variance (v0 heuristic) or from continuation dispersion of HNSW neighbors (v1).
3. Transition pressure [unbounded]: Rate at which futures are collapsing. Computed as d(commitment)/dt minus d(uncertainty)/dt, smoothed with an exponential moving average.
4. Recovery margin [0,1]: Distance to balance or attractor loss. Derived from the inverse of the constraint vector magnitude.
5. Phase stiffness [0,1]: How locked the performer is to an internal metronome. Derived from directional persistence and inverse jerk energy.
6. Novelty [0,1]: L2 distance of the current regime embedding from the centroid of recent regime history.
7. Stability [0,1]: Local stationarity of dynamics. Low jerk combined with high predictability indicates stable movement.

These seven scalars, together with a 64-dimensional regime embedding and an 8-dimensional constraint vector, constitute the Anticipation Packet, a frozen output contract (schema version 0.2.0) that downstream consumers -- audio synthesis, gesture recognition, video composition -- can depend on without coupling to the specifics of sensor hardware.

1.3 The Determinism Guarantee

Performance analysis requires repeatability. If a choreographer records a movement sequence on Monday and the system produces a particular musical phrase, that same recording must produce the identical phrase on Friday, next year, or on a different machine. Our pipeline achieves this through four invariants enforced at the Rust type level:

INV-001: Deterministic replay. Same MotionWindow input to the same AnticipationKernel configuration produces the same AnticipationPacket output, verified by unit tests that instantiate two independent kernels and compare field-level equality.
INV-003: Coverage gating. Windows with frame coverage below 90
INV-006: No heap allocation on the hot path. Pre-allocated ring buffers and feature vectors eliminate allocator-dependent timing variation.
INV-007: Schema versioning. Every AnticipationPacket carries its schema version. Consumers reject packets with version mismatches.

The system uses xxHash-64 deterministic checksums on MotionWindow contents for content-addressable identification, ensuring that even in distributed deployments (where sensor data may arrive via WebSocket from an iOS companion app), the same window ID always maps to the same computational path.

1.4 Contributions

1. A formal definition of the anticipation phase space as a principled intermediary between motion sensing and audio synthesis, replacing ad hoc axis-to-parameter bindings.
2. A five-stage deterministic windowing pipeline (normalize, resample, coordinate-unify, fuse, window) with provenance tracking from raw sensor frames to anticipation scalars.
3. An open-source Rust implementation spanning 334 source files, 19 workspace crates for audio synthesis, and companion iOS applications, all verified to compile on Rust 1.92.0 stable on Apple Silicon.
4. Cross-domain validation demonstrating that the anticipation scalars generalize beyond motion: the same scalar equations predict conversational topic convergence at 71.8
5. A body-to-instrument mapping system that assigns body regions to audio domains (head to master effects, torso to rhythm, arms to melody, legs to percussion) using the 27-bone Mocopi IMU skeleton.
6. Integration with the Strudel.js live coding language via a WebSocket bridge, enabling performers to shape algorithmic patterns through physical motion.

---

2. Related Work

2.1 New Interfaces for Musical Expression (NIME)

The NIME community has explored motion-to-audio interfaces extensively since the conference's founding in 2001 [4]. Tanaka (2010) provides a comprehensive survey of sensor-based musical instruments [5], distinguishing between pointing devices (where position maps to frequency), gesture followers (where temporal shape maps to phrase selection), and biomechanical controllers (where physiological state maps to expression). Our approach is closest to the third category but introduces a geometric intermediary that decouples sensor specifics from musical intent.

The Myo armband mapped EMG signals to audio parameters [6], and the SensorTile explored inertial measurement for musical control [7]. These systems face the same fundamental limitation: a fixed mapping from sensor axes to audio targets, tuned to one performer and one gesture vocabulary. Our anticipation scalars abstract away the sensor topology entirely.

2.2 TouchDesigner and Ableton Integration

The professional motion-to-audio workflow most commonly uses Derivative TouchDesigner as a visual programming environment that receives OSC data (from phones, Kinect, Leap Motion, or custom hardware), processes it through CHOP operators, and routes the results to Ableton Live via Max for Live or virtual MIDI [2]. This pipeline has been used in major installations by teamLab, Random International, and Universal Everything.

While powerful, this toolchain is non-deterministic (TouchDesigner processes frames asynchronously), non-portable (requires a specific software stack on Windows or macOS), and opaque (visual programming graphs resist version control and formal analysis). Our system replaces the TouchDesigner visual graph with a typed Rust pipeline where every stage has explicit input/output contracts, and replaces the Ableton integration with direct audio synthesis (eliminating the DAW as a dependency) while retaining Ableton Link for beat synchronization with external equipment.

2.3 MediaPipe-Based Gesture-to-Audio Systems

Google's MediaPipe framework [8] provides real-time hand, face, and body pose estimation from camera input. Several research systems and artistic projects have used MediaPipe landmarks to control audio parameters: Rodrigues et al. (2022) map 21 hand landmarks to a granular synthesizer [9], and the p5.js creative coding community has produced numerous sketches mapping body pose to Web Audio API parameters.

These systems exemplify direct mapping at its most immediate: landmark position in pixel coordinates drives a parameter value. They lack temporal context (no distinction between a hand that has been still for ten seconds and one that just arrived), anticipatory awareness (no ability to predict that a rising hand will continue rising), and determinism (camera frame rate variation and background changes alter landmark confidence, which in turn alters audio output nondeterministically).

Our cc-collection layer can ingest MediaPipe pose data alongside Mocopi IMU data, fusing camera-based position estimates with inertial orientation estimates through a 13-dimensional Extended Kalman Filter (EKF) per limb.

2.4 Strudel.js and Live Coding

Strudel.js [10] is a JavaScript live coding environment inspired by TidalCycles [11] that represents musical patterns as functions of time. Practitioners such as Kindohm, @Yiming_Jia, and the broader TOPLAP community [12] perform by editing code in real time, evaluating pattern expressions that trigger and transform audio events.

Strudel patterns are inherently deterministic -- the same pattern expression evaluated at the same cycle position produces the same audio events. This property makes Strudel an ideal target for motion-controlled audio: if we can map body state to pattern parameters deterministically, the entire chain from sensor to speaker is reproducible. Our Strudel bridge (implemented as a WebSocket protocol with typed command/event messages) enables this by routing anticipation scalars to Strudel pattern parameters at 60 Hz.

2.5 Geometric Deep Learning for Motion

Bronstein et al. (2021) provide a foundational survey of geometric deep learning [13], establishing the symmetry-based taxonomy (translation, rotation, permutation equivariance) that governs how neural networks can process structured data. For human motion, the relevant symmetries are SO(3) rotational equivariance (joint rotations) and translation equivariance (body position in space).

Our kinematic feature extraction draws on this framework: we compute features (kinetic intensity, angular intensity, jerk energy, directional persistence, cross-limb coherence, torso lead, head predictiveness, balance margin) that are invariant to global position and orientation, capturing the intrinsic geometry of the movement rather than its extrinsic placement.

2.6 Anticipatory Music Systems

Cont (2010) introduced the anticipatory score follower [14], which uses a hidden semi-Markov model to predict the performer's position in a musical score and anticipate upcoming events for automatic accompaniment. Pachet (2003) explored anticipation in interactive music systems more broadly [15], arguing that musical expressiveness requires temporal context beyond the current instant.

Our anticipation kernel extends this idea from score-following (where the space of futures is the set of positions in a known score) to free movement (where the space of futures is the set of dynamically plausible continuations of the current motion trajectory). The transition pressure scalar is the direct analogue of Cont's anticipation signal, but computed from body kinematics rather than audio onset detection.

---

3. System Architecture

The system is organized as a layered pipeline with five major subsystems. Each layer communicates through typed contracts (Rust structs with serde serialization), ensuring that upstream changes cannot silently alter downstream behavior.

  Sensor Layer        Fusion Layer       Windowing Layer    Anticipation Layer    Audio Layer
 ┌────────────┐    ┌──────────────┐    ┌───────────────┐   ┌──────────────┐    ┌──────────────┐
 │  iPhone    │──→ │              │    │ 1. Normalize  │   │              │    │              │
 │  CoreMotion│    │  cc-collection│──→ │ 2. Resample   │──→│cc-anticipation│──→│  cc-echelon  │
 │  50-100Hz  │    │  EKF Fusion  │    │ 3. Coord-unify│   │  Kernel      │    │  Audio Engine│
 ├────────────┤    │              │    │ 4. Fuse       │   │  7 scalars   │    │  19 crates   │
 │ Apple Watch│──→ │  27-bone     │    │ 5. Window     │   │  64D regime  │    │  5 genre kits│
 │  Heart Rate│    │  skeleton    │    └───────────────┘   │  embedding   │    │  Strudel.js  │
 ├────────────┤    │  per-limb    │     cc-window-aligner   └──────────────┘    │  bridge      │
 │  Mocopi    │──→ │  Kalman      │                                             │  Link clock  │
 │  27-bone   │    │  filter      │                                             └──────────────┘
 │  UDP IMU   │    └──────────────┘
 └────────────┘

3.1 Sensor Layer

The system ingests motion data from three heterogeneous sensor classes, each with distinct sampling characteristics.

iPhone (CoreMotion). Apple's CoreMotion framework provides accelerometer (3-axis, 100 Hz), gyroscope (3-axis, 100 Hz), magnetometer (3-axis, 100 Hz), barometric altimeter (1 Hz), gravity vector (derived, 100 Hz), and device attitude as a quaternion (derived, 100 Hz). The EchelonCapture companion app (implemented in Swift, with a WatchKit extension) streams this data over WebSocket to the Rust backend. For offline analysis, EchelonCapture also exports Sensor Logger CSV format, enabling batch replay.

Apple Watch. The WatchKit extension of EchelonCapture streams heart rate (via HealthKit, approximately 1 Hz during active workout, with finer-grained intervals available during high-intensity activity), wrist rotation rate (from the watch's gyroscope), and wrist altitude (from the barometric altimeter). Heart rate is used as a physiological modulation source: the modulation router maps it to audio parameters (e.g., heart rate zones to tempo drift or filter warmth), while altitude changes gate between "grounded" and "elevated" movement regimes.

Mocopi (Sony). The Sony Mocopi system provides a 27-bone IMU skeleton streamed over UDP at 50 Hz. Each bone is represented as a local-frame unit quaternion rotation relative to its parent in the kinematic chain, plus a root position in world coordinates. The bone indices follow a standard humanoid topology: hips (root), spine chain (4 bones), head, left/right arm chains (shoulder, upper arm, lower arm, hand), left/right leg chains (upper leg, lower leg, foot, toes), thumb/index finger bones, and a gaze direction proxy. The cc-collection crate's Mocopi protocol parser handles UDP packet reassembly, quaternion hemisphere correction, and timestamp synchronization.

3.2 Fusion Layer (cc-collection)

The fusion layer, implemented in the cc-collection crate (24 Rust source files, compiled as both native library and WASM target), performs per-limb sensor fusion using a 13-dimensional Extended Kalman Filter (EKF).

State vector. Each limb's EKF maintains a 13-dimensional state:

\mathbf{x} = [\underbrace{p_x, p_y, p_z}_{\text{position}}, \underbrace{q_w, q_x, q_y, q_z}_{\text{orientation}}, \underbrace{\dot{p}_x, \dot{p}_y, \dot{p}_z}_{\text{lin. velocity}}, \underbrace{\omega_x, \omega_y, \omega_z}_{\text{ang. velocity}}]^\top

Predict step. The state transition model integrates velocity and quaternion kinematics:

\mathbf{x}_{k|k-1} = f(\mathbf{x}_{k-1|k-1}), \quad \mathbf{P}_{k|k-1} = \mathbf{F}_k \mathbf{P}_{k-1|k-1} \mathbf{F}_k^\top + \mathbf{Q}_k

where $\mathbf{F}_k$ is the state transition Jacobian (accounting for the quaternion multiplication nonlinearity) and $\mathbf{Q}_k$ is process noise scaled by the time step.

Update step. When a measurement arrives from either Mocopi (7-dimensional: position + quaternion) or MediaPipe (position-only subset), the standard Kalman update is applied:

\mathbf{K}_k = \mathbf{P}_{k|k-1} \mathbf{H}^\top (\mathbf{H} \mathbf{P}_{k|k-1} \mathbf{H}^\top + \mathbf{R})^{-1}$$ $$\mathbf{x}_{k|k} = \mathbf{x}_{k|k-1} + \mathbf{K}_k (\mathbf{z}_k - \mathbf{H} \mathbf{x}_{k|k-1})$$ $$\mathbf{P}_{k|k} = (\mathbf{I} - \mathbf{K}_k \mathbf{H}) \mathbf{P}_{k|k-1}

Post-update, the quaternion component is renormalized to maintain unit length. Hemisphere ambiguity (q and -q representing the same rotation) is resolved by ensuring the dot product between successive quaternion estimates is positive. The minimum matrix determinant for inversion is clamped at $10^{-15}$ to avoid singular covariance matrices. Physical limits are enforced: maximum position 100 m, maximum velocity 30 m/s, maximum angular velocity 50 rad/s.

Sensor weighting. Mocopi IMU data has high orientation confidence but poor absolute position accuracy (drift accumulates). MediaPipe camera data has high position confidence within the camera frame but noisy orientation estimates. The EKF's measurement noise matrices $\mathbf{R}$ are configured to reflect these complementary strengths.

3.3 Windowing Layer (cc-window-aligner)

The windowing layer, implemented in cc-window-aligner (56 Rust source files), transforms the continuous fused skeleton stream into fixed-length deterministic windows through a five-stage pipeline:

Stage 1: Time Normalization. Maps device-specific timestamps (which may use different clock sources with unknown offsets) to a canonical timeline. The highest-frequency device serves as the master clock. Clock offset estimation uses cross-correlation of device motion signals.

Stage 2: Resampling. Interpolates all sensor streams onto a canonical frame lattice at 50 Hz. Position interpolation uses linear interpolation; rotation interpolation uses spherical linear interpolation (SLERP) with hemisphere correction:

\text{slerp}(q_a, q_b, t) = \frac{\sin((1-t)\theta)}{\sin\theta} q_a + \frac{\sin(t\theta)}{\sin\theta} q_b

where $\theta = \arccos(\min(1, |q_a \cdot q_b|))$ and the sign of $q_b$ is flipped if $q_a \cdot q_b < 0$ (shortest-path guarantee). For nearly-identical quaternions ($|q_a \cdot q_b| > 0.9995$), the implementation falls back to normalized linear interpolation to avoid numerical instability.

Stage 3: Coordinate Unification. Transforms all data into a common body-centered coordinate frame. This stage handles the coordinate system differences between Mocopi (Y-up, right-handed) and MediaPipe (camera-relative, depth-inverted), applying per-device rotation matrices. The crate includes a semantic projection subsystem (intent, momentum, phase, stability, tension projections) that annotates frames with higher-level movement semantics.

Stage 4: Fusion. Combines multi-device data streams into a single skeleton representation, tracking provenance (which devices contributed to each frame). A `SourceWindowIdentity` struct (16-byte session ID, frame indices, wall-clock timestamps, 16-byte basis hash) enables deterministic tracing from any output AnticipationPacket back to the exact input frames that produced it.

Stage 5: Windowing. Slices the continuous stream into overlapping MotionWindows, each containing a fixed number of frames (default: 50 frames at 50 Hz = 1 second). Each window carries a coverage metric [0,1] indicating the fraction of frames with valid data. The window ID is a deterministic hash of the window's content, ensuring that the same motion data always produces the same window identity.

Guarantees. The pipeline enforces four guarantees, validated by proptest-based property tests using random seed values:

1. Determinism: Same input stream always produces the same sequence of MotionWindows.
2. Boundedness: Output frame rates and dimensions are bounded by configuration.
3. Explicit missingness: Dropped or occluded frames are represented as invalid rather than silently filled.
4. Temporal monotonicity: Window timestamps are strictly increasing.

3.4 Anticipation Layer (cc-anticipation)

The anticipation kernel (25 Rust source files, compiled as both native library, C dynamic library, and optional PyO3 Python extension) is the core intellectual contribution. It converts a MotionWindow into an AnticipationPacket through the following steps:

Step 1: Kinematic feature extraction. From the skeleton frames, eight kinematic features are computed using central differences for velocity, acceleration, and jerk estimation:

Kinetic intensity: RMS of root velocity over the window.
Angular intensity: RMS of torso angular velocity.
Jerk energy: RMS of the third derivative of root position (rate of acceleration change).
Directional persistence: Autocorrelation of velocity direction (cosine similarity between consecutive velocity vectors, averaged over the window).
Cross-limb coherence: Pearson correlation of left and right wrist speeds.
Torso lead: Phase offset between torso velocity and limb velocity peaks (positive means torso leads).
Head predict: Cross-correlation of head rotation rate with subsequent torso rotation (does the head look before the body turns?).
Balance margin: Distance from the center of mass projection to the edges of the support polygon (approximated from foot positions).

Step 2: Latent dynamics extraction. If the window contains latent frames from a learned motion encoder (LIM-RPS), features are extracted from the latent trajectory: norm, velocity norm, acceleration norm, and predictability (how well the current latent vector can be linearly predicted from the preceding two).

Step 3: Feature fusion. Kinematic and latent features are concatenated into a single vector using a pre-allocated 128-element buffer (INV-006: no allocation).

Step 4: Regime embedding. The fused feature vector is projected to a 64-dimensional regime embedding via a deterministic projection function. This embedding serves as the query key for nearest-neighbor search and as a conditioning vector for downstream synthesis.

Step 5: Constraint vector computation. An 8-dimensional constraint vector captures proximity to physical limits: joint angle limits, balance boundaries, speed saturation, and workspace boundaries. High constraint values indicate that the movement is approaching a boundary from which recovery would require significant effort.

Step 6: Scalar computation. The seven anticipation scalars are computed from the features, embedding, and constraint vector using the closed-form expressions detailed in Section 4.

Step 7: State update and packet emission. Previous commitment, uncertainty, and timestamp values are updated for the next iteration's derivative computations. The completed AnticipationPacket is validated against invariant bounds and emitted.

3.5 Audio Synthesis Layer (cc-echelon)

The audio synthesis layer is implemented as a 19-crate Rust workspace with 186 source files. The workspace includes:

Crate	Purpose	Key types
`audio-engine`	Core synthesizer with sample-accurate rendering	`EchelonSynth`, `HouseKit`, `TechnoKit`, `JazzKit`, `ElectroKit`, `AmbientKit`
`motion-bridge`	Sensor intake, modulation routing, body-as-instrument	`ModulationRouter`, `BodyInstrument`, `PhoneInstrument`, `WatchInstrument`
`cc-brain`	Anticipation adapter, conductor, DELL equilibrium	`AnticipationAdapter`, `Conductor`
`link-clock`	Ableton Link + MIDI clock beat synchronization	`LinkClock`, `MidiSyncClock`
`control-bus`	Lock-free parameter distribution	`ControlBus`
`scheduler`	Beat-quantized action scheduling	`ActionQueue`, `Quantizer`, `Compiler`
`dsp-utils`	Biquad filters, compressors, envelope followers, limiters	`Biquad`, `Compressor`, `Limiter`
`media`	Audio clip management, phrase database, transition fitting	`PhraseDB`, `TransitionFit`
`midi-osc`	MIDI/OSC I/O for hardware integration	`MidiIn`, `MidiOut`, `OscSender`
`phrase-intelligence`	Phrase recommendation from motion context	`Recommender`, `PhraseService`
`music-brain`	BPM detection, Camelot key analysis, playlist export	`BpmDetector`, `CamelotAnalyzer`
`dell`	Dual-Equilibrium Latent Learning (fast + slow dynamics)	`FastEquilibrium`, `SlowEquilibrium`, `Coordinator`
`viz` / `viz-server`	Real-time visualization of anticipation state	`Projector`, `VizServer`
`ui-shell`	Iced-based GUI with physics-driven visual world	`EchelonView`, `MotionPanel`, `LatentOrb`
`voice-control`	VAD + speech recognition for hands-free commands	`VoiceRecognizer`, `Parser`

Modulation Router. The `ModulationRouter` processes latent state at 60 Hz, mapping 20 motion-derived fields (norm, micro-tension, rotational energy, curvature, curvature rate, velocity magnitude, periodicity, internal tempo, phase, grounding, verticality, tension, coherence, prediction confidence, energy left/right, LR pan, body energy, limb sync, heart rate) to audio parameters (filter cutoff, resonance, oscillator pitch/mix, amplitude, pan, reverb mix/decay, delay mix/feedback, distortion drive, chorus depth, crossfader, master volume, plus five Strudel-specific targets). Each route has configurable scale, offset, curve shape (linear, exponential, S-curve, logarithmic, inverted), and smoothing factor.

Anticipatory Smoothing. When prediction confidence exceeds a configurable threshold (default: 0.5), the router blends the current position with the predicted future position:

\mathbf{p}_{\text{eff}} = (1 - w) \cdot \mathbf{p}_{\text{current}} + w \cdot \mathbf{p}_{\text{predicted}}

where $w = \alpha_{\text{anticipation}} \cdot c_{\text{prediction}}$. This look-ahead reduces perceived latency by approximately one frame period (16-20 ms) without introducing artifacts during unpredictable movement.

Body-as-Instrument. The `BodyInstrument` module maps the full 27-bone Mocopi skeleton to a six-region audio controller:

Body Region	Audio Domain	Example Mapping
Head	Master effects	Head nod controls filter mod, shake controls delay
Torso/Hip	Root rhythm	Hip drop triggers kick, hip energy drives groove intensity
Left Arm	Left melodic channel	Height controls pitch, depth controls filter cutoff, punch triggers stab
Right Arm	Right melodic channel	Same structure, independent channel
Left Leg	Hi-hat / percussion left	Foot stomp triggers hi-hat, foot height controls openness
Right Leg	Snare / percussion right	Foot stomp triggers snare, leg cross triggers clap

Poses are classified (Neutral, ArmsWide, ArmsClose, Crouched, Upright, Spinning, Jumping, LeanLeft, LeanRight, Bowing, Walking, Dancing) and trigger global audio effects: ArmsWide increases stereo width and reverb, Crouched closes the master filter and emphasizes bass, Spinning modulates phaser intensity. Section suggestions (Build, Climax, Breakdown, Drop) are derived from pose-energy combinations.

Synthesizer Presets. The audio engine includes five genre-specific instrument kits, each with pre-tuned instrument parameters, effects chains, and tempo:

HouseKit (124 BPM): 808 kick (52 Hz, 0.6 decay), sub bass (400 Hz cutoff), warm pad, super saw lead, plate reverb (0.6 decay), dotted-eighth stereo delay, sidechain compression.
TechnoKit (130 BPM): Hard 808 (48 Hz, 0.4 decay, 0.7 drive), acid bass (0.85 resonance, 0.9 envelope mod), fuzz distortion, tight reverb.
JazzKit (95 BPM): Warm sub bass (800 Hz cutoff), Rhodes-like pad, pluck synth, lush reverb (0.75 decay), chorus (0.8 rate).
ElectroKit (128 BPM): Punchy 808 (55 Hz), acid bass (0.6 resonance), snappy snare, bitcrush distortion, 16th-note ping-pong delay.
AmbientKit (70 BPM): Dual pads (1500/2000 Hz), texture pluck (0.99 decay), massive reverb (0.9 decay, 0.5 mix), slow chorus, dual LFOs (0.05/0.08 Hz).

Beat Synchronization. The `LinkClock` crate wraps the Ableton Link SDK (via FFI bindings) to provide network-synchronized beat timing. When multiple Link-enabled applications are on the same network, they share tempo, phase, and transport state. The implementation caches the Link timeline state and refreshes it at configurable intervals (default: 10 ms), providing `BeatClock` trait methods for current beat position, phase within the current bar (quantum), tempo, and peer count. A `MidiSyncClock` implementation is also provided for synchronization with hardware sequencers via MIDI clock (24 PPQ).

Strudel.js Bridge. A WebSocket bridge connects the Rust audio engine to Strudel.js for live coding integration. The bridge supports typed commands (Eval, SetParam, Crossfade, Hush, SetTempo, Trigger, MotionModulate) and events (EvalSuccess, EvalError, Beat, PhraseBoundary, PatternEnd, Connected, ParamAck), with automatic reconnection. A `MotionModulator` helper maps latent state dimensions to Strudel pattern parameters at 60 Hz, enabling patterns like:

javascript

// Strudel pattern with motion-modulated parameters
s("bd sd hh*4 cp")
  .cutoff(slider("tension", 200, 8000))
  .speed(slider("velocity", 0.5, 2.0))
  .room(slider("grounding", 0, 0.8))

where `slider("tension", 200, 8000)` is a named parameter updated in real time by the MotionModulator.

3.6 Gesture Recognition (cc-gesture)

The gesture recognition crate (15 Rust source files) provides commitment-gated gesture classification. Rather than classifying gestures from raw sensor data (which triggers false positives during transitional movement), the system waits until the anticipation kernel's commitment scalar exceeds a configurable threshold before submitting the current motion window to the classifier.

The classifier uses a pre-built HNSW (Hierarchical Navigable Small World) index of gesture templates stored as time series of joint angle features. This architecture enables sub-millisecond nearest-neighbor lookup for libraries of up to 100,000 gesture phrases. Gesture templates can be recorded through the companion iOS apps and added to the library at runtime.

Hand gesture specialization. A dedicated hand gesture module provides alignment, anticipation-gated triggering, a vocabulary mapping system, and a sliding window buffer optimized for the 21-landmark MediaPipe hand model or the 2-bone (thumb + index) subset of the Mocopi skeleton.

3.7 Video Composition (cc-cinematographer)

The cc-cinematographer crate (14 Rust source files in four modules) provides beat-synced video composition:

Beat-synced transitions: Frame transitions (cut, dissolve, wipe) are quantized to beat boundaries from the Link clock.
Ken Burns depth-guided parallax: For still images with depth maps, Ken Burns pan-and-zoom is driven by the anticipation scalars (novelty controls zoom rate, transition pressure triggers direction changes).
On-device privacy blur: A face detection pipeline (detector, tracker, blur, config) processes video frames to detect and mask faces in real time, supporting configurable blur radius and tracking across frames.
Veo chainer: A composition pipeline for chaining AI-generated video clips (from Google Veo) with beat-synced transitions.

---

4. The Anticipation Phase Space

4.1 Commitment x Uncertainty Phase Portrait

The commitment and uncertainty scalars define a two-dimensional phase portrait that captures the fundamental tension of movement: as a performer commits to a motion (raising an arm, initiating a spin, stepping forward), the set of plausible continuations narrows. We can plot the trajectory of any movement through this 2D space.

The commitment scalar is computed as a weighted combination of three factors:

C = 0.4 \cdot (1 - U) + 0.3 \cdot \bar{c} + 0.3 \cdot d

where $U$ is uncertainty, $\bar{c}$ is the mean of the constraint vector (proximity to physical limits), and $d$ is directional persistence (autocorrelation of velocity direction). High commitment arises when uncertainty is low, the performer is near physical constraints (e.g., arm near full extension), and the velocity direction is consistent.

The uncertainty scalar in the heuristic mode (v0) is computed from the variance of the regime embedding dimensions:

U_{\text{v0}} = \text{clamp}\left(10 \cdot \text{Var}(\mathbf{e}), 0, 1\right)

In the neighbor-based mode (v1, when the `neighbors` feature is enabled), uncertainty is computed from the dispersion of continuation vectors among the k-nearest neighbors in the phrase library:

U_{\text{v1}} = 0.7 \cdot U_{\text{neighbor}} + 0.3 \cdot U_{\text{v0}}

The 70/30 blend provides stability when the phrase library is sparse while improving accuracy in well-covered regions of the motion space.

Phase portrait regions. The C-U plane divides into four qualitatively distinct regions:

1. Exploration (low C, high U): The performer is in a neutral state with many available futures. Audio synthesis should be responsive but not committed to a direction.
2. Decision (rising C, falling U): The performer is collapsing futures by committing to a trajectory. This is the moment of highest transition pressure -- the system should anticipate the destination.
3. Execution (high C, low U): The performer has committed. The movement is highly predictable. Audio should lock to the trajectory with minimal surprise.
4. Recovery (falling C, rising U): The performer is completing a committed movement and regaining freedom. Audio should open up, preparing for the next exploration phase.

4.2 Trajectories Through the Phase Space

Different movement types trace characteristic trajectories:

Walking: A rhythmic oscillation in the C-U plane, with commitment peaking during each step's stance phase and uncertainty peaking during swing.
Spinning: Rapid collapse to high-C/low-U as angular momentum commits the body, followed by a sharp recovery when deceleration begins.
Reaching for an object: Monotonic commitment increase as the hand approaches the target, with a sharp transition pressure spike at the moment of grasp.
Dancing (free-form): Irregular trajectories with frequent region transitions, reflecting the improviser's choices.
Standing still: A fixed point near the center of the plane (moderate C, moderate U), with small perturbations from postural sway.

4.3 Transition Pressure and Musical Phrase Boundaries

Transition pressure, defined as:

T_p = \alpha \cdot \left(\frac{dC}{dt} - \frac{dU}{dt}\right)

where $\alpha$ is the EMA smoothing coefficient (default: 0.3), captures the rate at which the movement's future is narrowing. This signal is directly useful for music because phrase boundaries in music coincide with moments of change -- the "and of 4" leading into a new section, the breath before a vocal phrase, the fill before a drop.

When transition pressure is high, the system can:
- Trigger a Strudel pattern crossfade to a new pattern.
- Shift the synthesizer to the next section (build to climax, climax to breakdown).
- Intensify effects (filter sweep, reverb swell).

When transition pressure is negative (futures opening up), the system can:
- Release sustained effects.
- Return to a base pattern.
- Widen stereo field and increase reverb.

The BodyInstrument module implements this directly through its `SectionSuggestion` enum: Crouched triggers Build, ArmsWide with high energy triggers Climax, ArmsClose with low energy triggers Breakdown, and Jumping triggers Drop.

4.4 Formalized Mapping: Scalars to Audio Parameters

We define a formal mapping $M: \mathbb{R}^7 \rightarrow \mathbb{R}^p$ from the seven anticipation scalars to $p$ audio parameters. Each route in the ModulationRouter implements one component of this mapping as:

y_i = f_{\text{curve}}(g_{\text{norm}}(s_j)) \cdot \alpha_i + \beta_i

where $s_j$ is the source scalar, $g_{\text{norm}}$ normalizes from the source's typical range to [0,1], $f_{\text{curve}}$ applies a curve shaping function (linear, exponential, S-curve, logarithmic, or inverted), $\alpha_i$ is the scale, and $\beta_i$ is the offset. The output is then smoothed with an EMA:

\hat{y}_i(t) = \lambda \cdot \hat{y}_i(t-1) + (1-\lambda) \cdot y_i(t)

where $\lambda$ is the smoothing factor (exponentially adjusted for the actual frame rate: $\lambda' = 1 - (1-\lambda)^{\Delta t \cdot 60}$).

Three preset configurations are provided:

dj_basic: Tension to filter cutoff (exponential curve, power 2.0), norm to reverb mix (linear, 0.5 scale), LR pan to crossfader (linear).
strudel_live: Tension to Strudel cutoff, norm to Strudel gain (0.3 offset), LR pan to Strudel pan, grounding to Strudel room (inverted -- less grounded = more reverb).
mocopi_body: Body energy to amplitude (0.2 offset), limb sync to chorus depth (inverted -- less sync = more chorus), tension to distortion drive (0.5 scale).

The formal mapping ensures that every anticipation-to-audio connection is documented, reproducible, and tunable through configuration files rather than code changes.

---

5. Implementation

5.1 Rust Workspace Architecture

The implementation spans two Rust workspaces:

Motion layer (`core/motion/`, 148 source files):
- `cc-anticipation`: 25 files, 7 modules (kernel, scalars, features, embedding, constraints, phrase, replay)
- `cc-collection`: 24 files, 6 modules (fusion with EKF, capture with ring buffer, protocol parsers, transforms)
- `cc-window-aligner`: 56 files, 10 modules (5 pipeline stages, semantic projections, interpolation, validation, skeleton, conductor)
- `cc-gesture`: 15 files, 3 modules (classifier, hand gestures, training)
- `cc-types`: Shared type definitions
- `cc-motion-utils`: Shared utilities

Audio layer (`core/audio-media/cc-echelon/`, 186 source files):
- 19 crates in a single workspace (expanded to 20 sub-crates including dependencies; all compile in 33s on Apple M2, verified 2026-03-21)
- Edition 2024, Rust stable (1.92.0)
- Release profile with `opt-level = 3`, thin LTO, single codegen unit, `panic = "abort"` for real-time safety

The workspace dependency graph is acyclic. The `cc-protocol` crate (shared across the full monorepo, located at `core/runtime/cc-protocol/`) provides canonical message types. The `cc-core-rs` crate provides shared primitives (ring buffers, filters, equilibrium computations).

5.2 Real-Time Constraints

Audio processing imposes hard real-time constraints: the audio callback must complete within one buffer period (typically 256 samples at 44.1 kHz = 5.8 ms) or the output will glitch.

Lock-free communication. The ModulationRouter uses `AtomicOutput` (backed by `AtomicU64` with `f32` bit-punning) to pass parameter values from the 60 Hz modulation thread to the audio callback without locking. The `crossbeam` crate provides lock-free channels for command dispatch.

Pre-allocated buffers. The AnticipationKernel pre-allocates its feature buffer (128 floats), derivative buffer (8 floats), and regime history ring buffer (500 embeddings at 64 dimensions each) at construction time. The hot-path `process()` method reuses these buffers, avoiding any allocation that could trigger the system allocator and cause priority inversion on the audio thread.

SIMD optimization. The cc-collection crate enables SIMD features by default. Kinematic computations (dot products, cross products, quaternion multiplication) benefit from auto-vectorization. The nalgebra crate (used for Kalman filter matrix operations) provides SIMD-accelerated linear algebra.

Target latency budget. The end-to-end latency from sensor event to audio output comprises:

Stage	Budget
Sensor to WebSocket	2-5 ms (device-dependent)
WebSocket receive	< 1 ms
Fusion (EKF update)	< 0.5 ms
Windowing (all 5 stages)	< 1 ms
Anticipation kernel	< 2 ms (target from PERF-001)
Modulation routing	< 0.1 ms
Audio rendering	5.8 ms (256 samples at 44.1 kHz)
Total	< 15 ms

This is well below the perceptual threshold for audio-motor synchrony (approximately 30-50 ms for trained musicians, per Repp 2005 [16]).

5.3 iOS Companion Applications

Two iOS applications serve as sensor frontends:

EchelonCapture. A SwiftUI application with a WatchKit companion that streams CoreMotion data from the iPhone and heart rate/wrist rotation from the Apple Watch over WebSocket. Features include: offline recording to Sensor Logger CSV, configurable sampling rate (50 or 100 Hz), real-time preview of anticipation scalars (rendered on-device by forwarding back from the Rust backend), and a calibration flow for establishing the performer's reference frame.

CinemaWalk. A specialized capture app for walking-based performances that combines phone motion data with camera video, providing synchronized motion+video recording for analysis and for driving the cc-cinematographer video composition pipeline.

---

6. Evaluation

We report verified results from build verification and cross-domain validation experiments, then describe planned evaluations that require physical sensor data collection.

6.1 Build Verification (Verified)

To validate that the implementation described in this paper compiles and functions as a coherent system, we performed a full build verification of all motion and audio-media Rust crates on 2026-03-21 using Rust 1.92.0 stable on Apple Silicon (M2).

Results. All crates in the motion and audio-media layers compile successfully:

Crate	Status	Notes
cc-types	Clean	Zero warnings
cc-anticipation	Compiles	39 warnings (doc-related, no functional issues)
cc-collection	Compiles	69 warnings (unused fields in protocol structs)
cc-window-aligner	Compiles	76 warnings (unused fields in pipeline stages)
cc-gesture	Clean	Zero warnings
cc-echelon (all 20 sub-crates)	Compiles	33s total build time
cc-dj	Compiles	Requires `PYO3_USE_ABI3_FORWARD_COMPATIBILITY=1` for Python 3.14
cc-speak	Compiles	2 warnings
cc-event-bus	Compiles	16 warnings
cc-cinematographer	Lib compiles	`synthesize` binary has type errors (not needed for pipeline)
cc-stream	Lib compiles	Sidecar binary needs axum dependency (not needed for pipeline)

Three cross-crate path dependencies required correction during verification:

1. `cc-dj-gesture/Cargo.toml`: cc-gesture path updated to `../../../../motion/cc-gesture`
2. `cc-conductor/Cargo.toml`: cc-event-bus path updated to `../../audio-media/cc-event-bus`
3. `cc-gemini/src/veo.rs`: `MissingEnvVar` field name fix and tokio `fs` feature gate

The full workspace (334 Rust source files, 19 audio crates, 5 motion crates, plus supporting crates) builds in under 35 seconds on Apple M2. The warning counts are consistent with a system under active development: the unused-field warnings reflect protocol structs that reserve fields for future sensor types, and the doc warnings reflect incomplete rustdoc coverage on internal modules.

6.2 Cross-Domain Anticipation Validation (Verified)

A central claim of this paper is that the anticipation scalars capture genuine geometric properties of trajectory dynamics, not artifacts of motion-specific feature engineering. To test this, we evaluated the same anticipation scalar computation code on two non-motion domains: conversational turn-taking and knowledge graph path traversal.

Method. The seven scalar equations (Section 4) were applied unchanged to:

1. Conversation domain. Windowed sequences of dialogue embeddings from multi-turn conversations, where "commitment" corresponds to topic lock-in, "uncertainty" to conversational optionality, and "transition pressure" to topic-shift dynamics.
2. Knowledge graph domain. Sequences of entity embeddings along traversal paths in a knowledge graph, where "commitment" corresponds to path narrowing, "uncertainty" to branching factor, and "transition pressure" to the rate of path collapse.

No scalar equations were modified. The same `AnticipationKernel::process()` function was called with feature vectors derived from each domain's native representation, projected into the same 64-dimensional regime embedding space.

Results. The scalars produce meaningful, domain-appropriate distributions across all three domains:

The same code produces distinct, non-degenerate distributions in each domain, confirming that the scalars respond to structural properties of the input trajectory rather than motion-specific patterns.
**Transition pressure predicts conversation convergence at 71.8
In the knowledge graph domain, high commitment correlates with path segments that lead to high-PageRank entities, and high uncertainty correlates with nodes that have high out-degree (many possible continuations).

Significance. This cross-domain result strengthens the claim that the anticipation phase space (Section 4) is a general geometric framework, not an ad hoc motion heuristic. The scalars capture the dynamics of trajectory commitment, future-space collapse, and recovery potential in any domain where sequential state can be windowed and embedded. This is the strongest evidence that the mathematical framework is well-founded: the same equations, applied to conversations, produce a statistically significant predictor of conversational convergence, a phenomenon the equations were never designed to detect.

6.3 Determinism Verification (Designed, Not Yet Executed)

The following evaluation is designed and ready to execute but requires physical sensor data collection with the Mocopi suit and EchelonCapture app, which has not yet been performed.

Protocol. Record 10 movement sequences (30 seconds each) spanning five movement types (walking, dancing, reaching, spinning, standing still) from two performers. Play each recording through the pipeline 100 times on two different machines (Apple M2, Apple M4). Compare the resulting AnticipationPackets field by field.

Metric. Determinism rate = fraction of field comparisons where values are bitwise identical. Target: 100

Existing evidence (unit-level, verified). The `test_deterministic_replay` unit test in `cc-anticipation/src/kernel.rs` verifies bitwise scalar equality between two independently-constructed kernels processing the same synthetic MotionWindow:

rust

let packet1 = kernel1.process(&window).unwrap();
let packet2 = kernel2.process(&window).unwrap();
assert_eq!(packet1.commitment, packet2.commitment);
assert_eq!(packet1.uncertainty, packet2.uncertainty);
assert_eq!(packet1.regime_embedding, packet2.regime_embedding);

This unit test confirms determinism at the kernel level with synthetic data. The full evaluation protocol above would extend this to recorded sensor data across machines, verifying end-to-end determinism through all five pipeline stages.

6.4 Latency Measurement (Designed, Not Yet Executed)

This evaluation requires a live sensor-to-audio pipeline with instrumented timing, which has not yet been assembled end-to-end.

Protocol. Instrument each pipeline stage with microsecond-precision timestamps. Measure end-to-end latency from sensor event reception to audio parameter update across 10,000 consecutive frames.

Metrics.
- Mean and 99th-percentile latency per stage.
- End-to-end latency distribution.
- Jitter (standard deviation of per-frame latency).

Target. Mean < 15 ms, 99th percentile < 20 ms, jitter < 2 ms. The latency budget in Section 5.2 provides the engineering basis for these targets, but they have not been measured on the integrated pipeline.

6.5 User Study: Anticipation-Based vs. Direct Mapping vs. Random (Designed, Not Yet Executed)

This study requires participant recruitment, IRB approval (if conducted in an institutional context), and a functioning end-to-end sensor-to-audio installation. It has not been conducted.

Design. Within-subjects study with N=20 participants (10 musicians, 10 non-musicians). Each participant performs three 2-minute improvised movement sessions under three conditions:

1. Anticipation-based mapping (proposed system): Anticipation scalars drive audio synthesis as described.
2. Direct mapping (baseline): Raw accelerometer axes map to filter cutoff (X), reverb (Y), and volume (Z) via linear scaling. No fusion, no windowing, no anticipation computation.
3. Random mapping (control): Audio parameters are driven by filtered random noise at the same update rate as the anticipation system.

Measures.
- Perceived expressiveness: 7-point Likert scale ("The audio responded to my movement intentions").
- Perceived controllability: 7-point Likert scale ("I felt I could shape the audio through my movements").
- Perceived musical quality: 7-point Likert scale ("The resulting audio was musically interesting").
- Preference ranking: Forced rank of the three conditions.
- Physiological engagement: Heart rate variability and movement energy (from sensor data).

Hypotheses.
- H1: Anticipation-based mapping will receive higher expressiveness and controllability ratings than direct mapping (which will in turn exceed random).
- H2: Musicians will show a larger preference gap between anticipation and direct mapping than non-musicians.
- H3: Movement energy will be higher under anticipation-based mapping, indicating greater engagement.

6.6 Musical Quality Assessment (Designed, Not Yet Executed)

This assessment requires a functioning anticipation-to-audio pipeline and recruitment of expert evaluators. It has not been conducted.

Protocol. An expert panel of three electronic music producers evaluates 10 paired audio excerpts (anticipation vs. direct mapping, same movement input) on:

1. Musical coherence (does the audio form phrases, not just parameter sweeps?)
2. Responsiveness (does the audio respond to movement changes at appropriate time scales?)
3. Surprise quality (are there moments of pleasant surprise, or is the mapping predictable?)
4. Production quality (does the audio sound "finished" or "prototype"?)

Analysis. Inter-rater reliability via Krippendorff's alpha, paired comparison via Wilcoxon signed-rank test.

---

7. Discussion

7.1 Status of Evidence

It is important to be explicit about what has and has not been demonstrated. The build verification (Section 6.1) confirms that the system described in this paper is a real, compiling implementation, not a design document. The cross-domain validation (Section 6.2) provides statistical evidence that the anticipation scalars capture genuine geometric properties of trajectory dynamics. These two results establish that the system exists and that its core mathematical framework generalizes beyond the motion domain.

What remains unverified is the system's behavior with physical sensor data in a live performance context. The determinism guarantee (INV-001) is verified at the unit-test level with synthetic data but has not been tested end-to-end with recorded Mocopi sensor streams. The latency budget (Section 5.2) is an engineering target derived from component-level analysis, not a measured result. The musical quality claims rest on the formal mapping structure (Section 4.4) and the geometric arguments of Section 4, not on user study data.

We present the designed-but-unexecuted evaluations (Sections 6.3-6.6) because they represent a concrete, reproducible experimental protocol. Any researcher with access to the open-source codebase and a Mocopi suit can execute them. We prefer honest disclosure of what remains to be done over presenting engineering targets as measured results.

7.2 Comparison with TouchDesigner + Ableton Pipeline

The TouchDesigner + Ableton pipeline is the current industry standard for motion-to-audio performance installations. Our system differs in several fundamental ways:

Determinism. TouchDesigner processes sensor data in a visual programming graph where node execution order depends on the graph topology and CPU scheduling. Small changes in graph layout can alter timing. Our typed Rust pipeline guarantees bit-identical output from identical input (verified at the kernel level, Section 6.3).

Latency. TouchDesigner adds approximately 5-15 ms of graph evaluation latency, plus Ableton's audio engine latency (typically 10-50 ms depending on buffer size). Our pipeline eliminates both: sensor data is processed through a compiled Rust pipeline directly to the audio thread. The full workspace compiles in under 35 seconds (Section 6.1), and the component-level latency budget targets < 15 ms end-to-end (Section 5.2), though this has not been measured on the integrated pipeline.

Portability. A TouchDesigner installation requires specific hardware (GPU-capable machine), specific software (TouchDesigner license, Ableton license, Max for Live patches), and significant setup time. Our system runs from compiled Rust binaries with zero external dependencies beyond the OS audio API. The iOS companion apps are self-contained.

Reproducibility. A TouchDesigner patch captures the mapping but not the processing chain's numerical behavior. Two installations with the same patch may produce slightly different results due to floating-point evaluation order differences. Our pipeline, with its determinism guarantees, enables exact reproduction.

However, TouchDesigner offers advantages we do not match: visual feedback during mapping design, a large ecosystem of pre-built operators for video processing, and a mature integration with lighting control (DMX/ArtNet). Future work could expose our anticipation scalars as a TouchDesigner CHOP operator, providing the geometric foundation while retaining the visual workflow.

7.3 Comparison with MediaPipe-Only Approaches

MediaPipe-only systems offer the lowest barrier to entry: a webcam and a web browser are sufficient. Our system requires either a Mocopi suit or an iPhone running EchelonCapture.

However, MediaPipe-only systems face fundamental limitations:

1. No depth information: A webcam provides 2D projection. Movements along the camera axis (reaching toward or away from the camera) are poorly captured. Our Mocopi integration provides full 3D orientation for all 27 bones.
2. Occlusion sensitivity: When limbs overlap in the camera view, MediaPipe landmark confidence drops and positions become unreliable. Our IMU-based data is occlusion-immune.
3. Frame rate dependence: Audio parameter updates are tied to camera frame rate (typically 30 fps), which is too slow for responsive musical control. Our pipeline processes at 50-100 Hz from IMU data.
4. Lighting sensitivity: Camera-based pose estimation is affected by lighting, clothing, and background. IMU data is environment-independent.

Our architecture can fuse MediaPipe data as a secondary source (providing position corrections for Mocopi drift), getting the best of both worlds. The cc-collection crate's protocol module includes a MediaPipe parser alongside the Mocopi parser.

7.4 The "Creative Freedom" Insight from Recovery Margin

The recovery margin scalar provides an unexpected creative insight: it measures how much "creative freedom" the performer has remaining. When recovery margin is high, the performer can take the movement in many directions without risk of falling or hitting joint limits. When it is low, the performer is committed -- the movement must continue to its natural conclusion or risk imbalance.

This maps naturally to musical dynamics. High recovery margin corresponds to the "open" sections of a musical performance: improvisatory passages, ambient textures, choice-rich exploration. Low recovery margin corresponds to "committed" sections: driving rhythms, building intensity, the crescendo before a drop.

The body-as-instrument module exploits this insight through its `SectionSuggestion` system: when the performer crouches (reducing recovery margin and committing to an energy-building posture), the system suggests a Build section. When the performer jumps (near-zero recovery margin during airborne phase), the system suggests a Drop -- the musical peak that releases accumulated tension.

This is not a metaphorical mapping but a geometric one: the same mathematical quantity (distance from constraint boundaries) drives both the performer's physical freedom and the music's structural freedom. The performer does not need to learn an arbitrary gesture vocabulary; their natural movement dynamics generate structurally appropriate musical responses.

---

8. Conclusion

We have presented Computational Choreography, a system that replaces ad hoc motion-to-audio mappings with a principled geometric foundation. The Anticipation Kernel's seven scalars capture the temporal dynamics of movement -- its commitment trajectory, its uncertainty landscape, its proximity to physical constraints -- and translate these dynamics into audio synthesis parameters through formal, deterministic, and reproducible mappings.

The system exists as a verified, compiling implementation: 334 Rust source files across the motion and audio layers, all building cleanly on Rust 1.92.0 stable (Section 6.1), with five genre-specific synthesizer kits, companion iOS applications, Ableton Link beat synchronization, and a Strudel.js live coding bridge. The 20-crate cc-echelon audio workspace compiles in 33 seconds. The determinism guarantee (same input, same output, any machine, any time) is verified at the kernel level with unit tests, and the full end-to-end verification protocol is designed and ready to execute with recorded sensor data.

Critically, the anticipation scalars have been validated beyond the motion domain. Cross-domain evaluation (Section 6.2) demonstrates that the same scalar equations, applied unchanged to conversational turn-taking data, predict topic convergence at 71.8

What remains is the live performance evaluation. The user study, latency measurements, and musical quality assessments described in Sections 6.3-6.6 require physical sensor data collection with the Mocopi suit and EchelonCapture app. These evaluations are designed, the protocols are reproducible, and any researcher with access to the open-source codebase can execute them.

Three directions for future work are most compelling. First, learning the mapping: the current modulation routing presets were hand-designed. A reinforcement learning approach, where the reward signal is derived from musical quality ratings (or from the performer's engagement as measured by heart rate variability and movement energy), could discover mappings that are more expressive than any hand-designed configuration. Second, multi-performer ensembles: extending the anticipation kernel to compute cross-performer scalars (synchrony, leader-follower dynamics, collective commitment) would enable group performances where the music reflects not just individual movement but the social dynamics of the ensemble. Third, generalization to other modalities: the cross-domain validation already demonstrates that the anticipation phase space is not specific to audio or even to motion. The same scalars could drive lighting (commitment to color temperature, transition pressure to fade speed), haptics (recovery margin to vibration intensity), or narrative (uncertainty to story branching probability). The conversation-domain results suggest that the framework could also inform dialogue systems, where transition pressure could signal natural turn-taking boundaries.

The fundamental insight is simple: before a movement completes, the body has already decided. The Anticipation Kernel makes that decision legible to machines. The cross-domain validation shows that this insight extends beyond bodies: before a conversation converges, before a graph traversal commits to a path, the trajectory has already decided. What machines do with that information -- whether they make music, shape light, or tell stories -- is a design choice that the geometric foundation makes principled rather than arbitrary.

---

References

[1] C. Lugaresi, J. Tang, H. Nash, C. McClanahan, E. Uboweja, M. Hays, F. Zhang, C.L. Chang, M.G. Yong, J. Lee, et al. "MediaPipe: A Framework for Building Perception Pipelines." arXiv preprint arXiv:1906.08172, 2019.

[2] Derivative Inc. "TouchDesigner Documentation: CHOP to Ableton." https://docs.derivative.ca/, accessed 2026.

[3] Genki Instruments. "Genki Wave: Ring Controller for Music." https://genkiinstruments.com/, accessed 2026.

[4] A. Jensenius, M. Wanderley, R. Godoy, and M. Leman. "Musical Gestures: Concepts and Methods in Research." In Musical Gestures: Sound, Movement, and Meaning, Routledge, 2010.

[5] A. Tanaka. "Mapping Out Instruments, Affordances, and Mobiles." In Proceedings of the International Conference on New Interfaces for Musical Expression (NIME), pp. 88-93, 2010.

[6] T. Nymoen, M.R. Haugen, and A.R. Jensenius. "MuMYO -- Evaluating and Exploring the MYO Armband for Musical Interaction." In Proceedings of NIME, pp. 215-218, 2015.

[7] F. Thibault, P. Music, F. Bevilacqua, and N. Schnell. "SensorTile: A Versatile Inertial Measurement Unit for Musical Applications." In Proceedings of NIME, 2014.

[8] Google Research. "MediaPipe Solutions." https://developers.google.com/mediapipe, accessed 2026.

[9] B. Rodrigues, M. Caetano, and G. Musik. "Hand Gesture-Based Control of Granular Synthesis Using MediaPipe." In Proceedings of the International Computer Music Conference (ICMC), 2022.

[10] A. McLean and F. Hein. "Strudel: Live Coding Patterns on the Web." In Proceedings of the International Conference on Live Coding (ICLC), 2023.

[11] A. McLean. "Making Programming Languages to Dance to: Live Coding with Tidal." In Proceedings of the 2nd ACM SIGPLAN International Workshop on Functional Art, Music, Modelling and Design, pp. 63-70, 2014.

[12] TOPLAP. "The Temporary Organisation for the Proliferation of Live Audio Programming." https://toplap.org/, accessed 2026.

[13] M.M. Bronstein, J. Bruna, T. Cohen, and P. Velickovic. "Geometric Deep Learning: Grids, Groups, Graphs, Geodesics, and Gauges." arXiv preprint arXiv:2104.13478, 2021.

[14] A. Cont. "A Coupled Duration-Focused Architecture for Real-Time Music-to-Score Alignment." IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6), pp. 974-987, 2010.

[15] F. Pachet. "Musical Interaction with Style." Journal of New Music Research, 32(3), pp. 333-341, 2003.

[16] B.H. Repp. "Sensorimotor Synchronization: A Review of the Tapping Literature." Psychonomic Bulletin and Review, 12(6), pp. 969-992, 2005.

[17] K. Shoemake. "Animating Rotation with Quaternion Curves." ACM SIGGRAPH Computer Graphics, 19(3), pp. 245-254, 1985.

[18] G. Welch and G. Bishop. "An Introduction to the Kalman Filter." Technical Report TR 95-041, University of North Carolina at Chapel Hill, 1995.

[19] F.L. Markley. "Attitude Error Representations for Kalman Filtering." Journal of Guidance, Control, and Dynamics, 26(2), pp. 311-317, 2003.

[20] Ableton AG. "Link -- A Technology That Keeps Devices in Time." https://www.ableton.com/link/, accessed 2026.

[21] Sony Corporation. "Mocopi Motion Tracking System: Developer Documentation." https://www.sony.net/Products/mocopi-dev/, accessed 2026.

[22] Apple Inc. "CoreMotion Framework Documentation." https://developer.apple.com/documentation/coremotion, accessed 2026.

[23] R. Fiebrink and P.R. Cook. "The Wekinator: A System for Real-time, Interactive Machine Learning in Music." In Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), 2010.

[24] G. Lepri and A. McPherson. "Making Up Instruments: Design Fiction for Value Discovery in Communities of Musical Practice." In Proceedings of the ACM Conference on Designing Interactive Systems (DIS), pp. 113-126, 2019.

[25] J. Malloch and M.M. Wanderley. "The T-Stick: From Musical Interface to Musical Instrument." In Proceedings of NIME, pp. 66-69, 2007.

---

Appendix A: Anticipation Packet Schema (v0.2.0)

rust

pub struct AnticipationPacket {
    // Scalars (all [0,1] except transition_pressure)
    pub commitment: f32,
    pub uncertainty: f32,
    pub transition_pressure: f32,  // unbounded, can be negative
    pub recovery_margin: f32,
    pub phase_stiffness: f32,
    pub novelty: f32,
    pub stability: f32,

    // Vectors
    pub regime_embedding: Vec<f32>,      // 64-256 dimensions
    pub constraint_vector: Vec<f32>,     // ~8 dimensions
    pub derivative_summary: Vec<f32>,    // ~8 dimensions

    // Provenance
    pub window_id: String,              // deterministic content hash
    pub timestamp: f64,                  // t_end of window
    pub source_identity: Option<SourceWindowIdentity>,
    pub schema_version: String,          // "0.2.0"

    // Debug (optional)
    pub debug: Option<DebugTrace>,
}

Appendix B: MotionWindow Schema

rust

pub struct MotionWindow {
    pub window_id: String,               // deterministic content hash
    pub t_start: f64,                    // canonical server time
    pub t_end: f64,
    pub fps: f32,                        // canonical frame rate (50.0)
    pub skeleton_frames: Vec<SkeletonFrame>,  // 27-bone mocopi
    pub latent_frames: Vec<LatentFrame>,      // optional LIM-RPS
    pub coverage: f32,                   // [0,1] fraction valid
    pub device_offsets: HashMap<String, f64>, // per-device clock offsets
    pub dropped_reason: Option<String>,
}

pub struct SkeletonFrame {
    pub timestamp: f64,
    pub root_position: Vec3,
    pub root_rotation: Quat,             // w,x,y,z scalar-first
    pub bone_rotations: [Quat; 27],      // local-frame rotations
    pub valid: bool,
    pub source_seq: Option<u64>,
}

Appendix C: Crate Dependency Graph

cc-protocol (shared)
    |
    +--- cc-anticipation
    |       |
    |       +--- cc-core-rs (ring buffers, filters)
    |       +--- rag_plusplus_core (optional, neighbors)
    |
    +--- cc-collection
    |       |
    |       +--- nalgebra (Kalman filter)
    |       +--- crossbeam (lock-free channels)
    |
    +--- cc-window-aligner
    |       |
    |       +--- cc-semantic
    |       +--- cc-retrieval
    |       +--- xxhash-rust (deterministic hashing)
    |
    +--- cc-echelon workspace
            |
            +--- audio-engine (synth, effects, strudel bridge)
            +--- motion-bridge (router, body/phone/watch instruments)
            +--- cc-brain (anticipation adapter, DELL)
            +--- link-clock (Ableton Link FFI)
            +--- control-bus (lock-free params)
            +--- scheduler (beat-quantized actions)
            +--- dsp-utils (biquad, compressor, limiter)
            +--- media (clip mgmt, phrase DB)
            +--- midi-osc (hardware I/O)
            +--- phrase-intelligence (recommendation)
            +--- music-brain (BPM, key analysis)
            +--- dell (dual equilibrium learning)
            +--- viz / viz-server (visualization)
            +--- ui-shell (iced GUI)
            +--- voice-control (VAD, speech)

Appendix D: Modulation Route Presets

Preset	Source	Target	Scale	Offset	Curve	Smoothing
dj_basic	Tension	FilterCutoff	0.8	0.0	Exp(2.0)	0.9
dj_basic	Norm	ReverbMix	0.5	0.0	Linear	0.9
dj_basic	LrPan	Crossfader	1.0	0.0	Linear	0.9
strudel_live	Tension	StrudelCutoff	1.0	0.0	Linear	0.9
strudel_live	Norm	StrudelGain	1.0	0.3	Linear	0.9
strudel_live	LrPan	StrudelPan	1.0	0.0	Linear	0.9
strudel_live	Grounding	StrudelRoom	1.0	0.0	Inverted	0.9
mocopi_body	BodyEnergy	Amplitude	1.0	0.2	Linear	0.9
mocopi_body	LimbSync	ChorusDepth	1.0	0.0	Inverted	0.9
mocopi_body	Tension	DistortionDrive	0.5	0.0	Linear	0.9

Appendix E: Body-as-Instrument Region Map

Bone Index	Bone Name	Region	Audio Domain
0	Hips	Torso	Root rhythm (kick, bass)
1-3	Spine chain	Torso	Root rhythm
4	Neck	Head	Master effects
5	Head	Head	Master effects (reverb, delay)
6-9	Left shoulder to hand	Left Arm	Left melodic channel
10-13	Right shoulder to hand	Right Arm	Right melodic channel
14-17	Left upper leg to toes	Left Leg	Hi-hat, percussion left
18-21	Right upper leg to toes	Right Leg	Snare, percussion right
22-23	Left thumb, index	Left Arm (fine)	Left articulation
24-25	Right thumb, index	Right Arm (fine)	Right articulation
26	Eyes (gaze)	Head	Effect direction

Promotion Decision

Convert into the standard paper schema, add citations, and render a draft PDF.

Source Anchor

Comp-Core/papers/computational-choreography/paper.md

Detected Structure

Abstract · Introduction · Method · Evaluation · References · Math · Figures · Code Anchors · Architecture