Equilibrium Diffusion: LIM-RPS x Discrete Token Diffusion
> A unified framework for motion-conditioned music generation via coupled equilibrium systems.
1. Mathematical Foundation
1.1 Deep Equilibrium Models (DEQ)
A DEQ replaces L explicit layers with a single implicit layer defined by its fixed point:
z* = f(z*, x) where x is the input
**Banach Fixed-Point Theorem**: If f is a contraction mapping in z (Lipschitz constant L < 1), then:
1. A unique fixed point z* exists
2. The iteration z_{k+1} = f(z_k, x) converges to z* for any initial z_0
3. Convergence is exponential: ||z_k - z*|| <= L^k ||z_0 - z*||
Reference: Bai, Kolter & Koltun, "Deep Equilibrium Models" (NeurIPS 2019)
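The three claims of the theorem can be checked numerically on a toy map. The sketch below is only an illustration: the scalar function `f`, the input value 0.3, and the starting point 5.0 are assumptions chosen so that the Lipschitz constant in z is at most 0.5; none of them come from the CC codebase.

```python
import math

def f(z: float, x: float) -> float:
    """Toy contraction in z: |df/dz| = 0.5 * (1 - tanh(z)^2) <= 0.5."""
    return 0.5 * math.tanh(z) + x

def solve_fixed_point(x: float, z0: float = 0.0, tol: float = 1e-12, max_iter: int = 200):
    """Iterate z_{k+1} = f(z_k, x) until successive iterates agree to `tol`."""
    z = z0
    trajectory = [z]
    for _ in range(max_iter):
        z_next = f(z, x)
        trajectory.append(z_next)
        if abs(z_next - z) < tol:
            break
        z = z_next
    return trajectory[-1], trajectory

LIPSCHITZ = 0.5  # upper bound on the Lipschitz constant of the toy map
z_star, traj = solve_fixed_point(x=0.3, z0=5.0)

# Check the exponential bound ||z_k - z*|| <= L^k ||z_0 - z*|| along the trajectory.
errors = [abs(z - z_star) for z in traj]
bound_holds = all(
    err <= LIPSCHITZ ** k * errors[0] + 1e-9 for k, err in enumerate(errors)
)
```

Here `bound_holds` evaluates the inequality at every iterate, with a small slack because `z_star` itself is only known to solver tolerance.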
1.2 Score-Based Diffusion
A diffusion model defines a forward process that adds noise:
q(x_t | x_0) = N(x_t; sqrt(alpha_t) x_0, (1 - alpha_t) I)
The reverse process uses a learned score function s_theta(x_t, t) that approximates nabla_x log p(x_t):
x_{t-1} = x_t + epsilon * s_theta(x_t, t) + sqrt(2*epsilon) * noise
This is a Langevin dynamics step; iterated over decreasing noise levels, it converges toward the data distribution p(x_0).
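To make the update rule concrete, here is a minimal sketch of the same step applied to a target whose score is known in closed form. It assumes a fixed one-dimensional Gaussian target instead of a learned, time-dependent `s_theta`; the function names and the step size are illustrative choices, not part of the system.

```python
import math
import random

def score_gaussian(x: float, mu: float, sigma: float) -> float:
    """Exact score of N(mu, sigma^2): d/dx log p(x) = -(x - mu) / sigma^2."""
    return -(x - mu) / sigma ** 2

def langevin_sample(mu, sigma, eps=0.05, steps=300, x_init=0.0, rng=None):
    """Run x <- x + eps * s(x) + sqrt(2 * eps) * noise for `steps` iterations."""
    rng = rng or random.Random()
    x = x_init
    for _ in range(steps):
        noise = rng.gauss(0.0, 1.0)
        x = x + eps * score_gaussian(x, mu, sigma) + math.sqrt(2 * eps) * noise
    return x

rng = random.Random(0)
samples = [langevin_sample(mu=2.0, sigma=1.0, rng=rng) for _ in range(2000)]
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

With a finite step size the sampled variance is slightly inflated relative to the target, which is the usual discretization bias of unadjusted Langevin dynamics.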
1.3 The Structural Equivalence
| | LIM-RPS Fixed-Point | Diffusion Denoising |
|---|---|---|
| Iteration | z_{k+1} = z_k - gamma * B(z_k) + prox_pull | x_{t-1} = x_t + epsilon * s(x_t, t) + noise |
| Vector field | B: CrossModalOperator (96 -> 96) | s_theta: score network |
| Contraction | Spectral norm(B) <= 1 (1-Lipschitz) | Bounded Jacobian of s_theta |
| Attractor | Unique z* (deterministic) | Data manifold (statistical) |
| Convergence proof | Banach theorem (update map has Lipschitz constant < 1) | Score matching loss -> true score |
| Without noise | Deterministic equilibrium | Probability flow ODE |
The operator B in LIM-RPS plays the same mathematical role as -s_theta in diffusion.
Both are vector fields pushing the state toward an attractor. Spectral normalization
of B bounds its Jacobian, which is analogous to requiring the score function to have a
bounded Jacobian, a standard regularity condition in diffusion convergence analyses.
Key insight: The LIM-RPS iteration `z - gamma * B(z)` has the form of a single Euler step
of the probability flow ODE `dx/dt = -s(x, t)`, taken backward in diffusion time. If you
add noise (`z - gamma * B(z) + sqrt(2 * gamma) * epsilon`), you recover a stochastic
Langevin dynamics step, which is the update used in the reverse diffusion process of Section 1.2.
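The correspondence can be written out in two lines. This is a sketch under the identification B = -s stated above, with the proximal term omitted and the score treated as time-independent; both simplifications are assumptions made here for illustration.

```latex
\begin{aligned}
z_{k+1} &= z_k - \gamma\, B(z_k) \;=\; z_k + \gamma\, s(z_k)
  && \text{(forward Euler step of } \dot z = s(z) \text{ with step size } \gamma\text{)} \\
z_{k+1} &= z_k - \gamma\, B(z_k) + \sqrt{2\gamma}\;\xi_k, \quad \xi_k \sim \mathcal{N}(0, I)
  && \text{(unadjusted Langevin step for } s = \nabla \log p\text{)}
\end{aligned}
```

The first line is deterministic and settles on a stationary point of s; the second has the same drift but keeps sampling around it.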
2. The CC System Architecture (As Built)
2.1 DELL: Dual Equilibrium Latent Learning
Two coupled equilibrium solvers operating at different timescales:
FastEquilibrium (60Hz, K_fast=4 iterations):
h_{k+1} = tanh(W_in @ limbs + W_recurrent @ h_k + bias + y_slow * 0.3)
x_fast = W_out @ h*
Dimensions: 216 (27 limbs x 8) -> 128 hidden -> 32 output
SlowEquilibrium (2.5Hz at beat boundaries, K_slow=6 iterations):
x_accum = mean(x_fast over 24 frames)
y_{k+1} = tanh(W @ x_accum + U @ y_k + bias)
y_slow = y*
Dimensions: 32 -> 32
Coupling:
- Slow -> Fast: 0.3 (musical context biases motion perception)
- Fast -> Slow: 0.5 (motion accumulates into musical decisions)
Brain Latent Blend:
z = x_fast * 0.7 + y_slow * 0.3 (32-dimensional)
Plus extracted dynamics: velocity, acceleration, curvature, coherence, periodicity, phase, grounding, verticality.
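The two solvers and the blend can be sketched in plain Python as follows. Several things here are assumptions rather than the built system: the weights are random and scaled so each recurrent update is a contraction, `W_slow` is a hypothetical projection that lifts the 32-dimensional `y_slow` into the 128-dimensional hidden state (the equations above add it directly), and the 0.5 fast-to-slow coupling is treated as absorbed into `W` because it does not appear explicitly in the slow update.

```python
import math
import random

def matvec(W, v):
    """Dense matrix-vector product on plain lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rand_matrix(rows, cols, row_sum, rng):
    """Random matrix whose absolute row sums stay at or below `row_sum`."""
    s = row_sum / cols
    return [[rng.uniform(-s, s) for _ in range(cols)] for _ in range(rows)]

def make_params(n_in=216, hidden=128, out=32, seed=0):
    rng = random.Random(seed)
    return {
        "W_in": rand_matrix(hidden, n_in, 1.0, rng),
        "W_rec": rand_matrix(hidden, hidden, 0.5, rng),  # row sums <= 0.5: contraction in max norm
        "W_slow": rand_matrix(hidden, out, 1.0, rng),    # hypothetical 32 -> 128 projection
        "W_out": rand_matrix(out, hidden, 1.0, rng),
        "b_fast": [0.0] * hidden,
        "W": rand_matrix(out, out, 1.0, rng),
        "U": rand_matrix(out, out, 0.5, rng),            # row sums <= 0.5: contraction in max norm
        "b_slow": [0.0] * out,
    }

def fast_equilibrium(limbs, y_slow, p, k_fast=4):
    """h_{k+1} = tanh(W_in limbs + W_rec h_k + b + 0.3 * W_slow y_slow); x_fast = W_out h."""
    drive = matvec(p["W_in"], limbs)
    slow_bias = matvec(p["W_slow"], y_slow)
    h = [0.0] * len(p["b_fast"])
    for _ in range(k_fast):
        rec = matvec(p["W_rec"], h)
        h = [math.tanh(d + r + b + 0.3 * s)
             for d, r, b, s in zip(drive, rec, p["b_fast"], slow_bias)]
    return matvec(p["W_out"], h)

def slow_equilibrium(x_accum, y0, p, k_slow=6):
    """y_{k+1} = tanh(W x_accum + U y_k + b), run for k_slow iterations."""
    wx = matvec(p["W"], x_accum)
    y = y0
    for _ in range(k_slow):
        uy = matvec(p["U"], y)
        y = [math.tanh(a + c + b) for a, c, b in zip(wx, uy, p["b_slow"])]
    return y

def run_beat(frames, y_slow, p):
    """Process one beat (24 frames at 60 Hz) and return the blended latent z."""
    x_sum = [0.0] * len(p["b_slow"])
    for limbs in frames:
        x_fast = fast_equilibrium(limbs, y_slow, p)
        x_sum = [a + b for a, b in zip(x_sum, x_fast)]
    x_accum = [a / len(frames) for a in x_sum]
    y_slow = slow_equilibrium(x_accum, y_slow, p)
    z = [0.7 * xf + 0.3 * ys for xf, ys in zip(x_fast, y_slow)]
    return z, y_slow

params = make_params()
rng = random.Random(1)
frames = [[rng.uniform(-1, 1) for _ in range(216)] for _ in range(24)]
z, y_slow = run_beat(frames, [0.0] * 32, params)
```

The blend uses the most recent `x_fast` together with the freshly updated `y_slow`, which is one plausible reading of the timing; the real system may blend at every frame.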
2.2 LIM-RPS Processor (Motion Bridge)
Converts raw sensor data to rich LatentState through:
1. Feature extraction from IMU
2. Cross-device fusion
3. Proximal update: smooth convergence without jitter
4. Dynamics: curvature (kappa = |v x a| / |v|^3), tension (accumulated + decay), periodicity (autocorrelation)
5. Prediction: linear extrapolation + confidence from velocity stability
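The curvature formula in step 4 uses a cross product, which only exists in 3D. A sketch that also works for higher-dimensional latents can use the equivalent identity |v x a|^2 = |v|^2 |a|^2 - (v . a)^2. The function name and the zero-velocity fallback below are assumptions for illustration.

```python
import math

def curvature(v, a, eps=1e-12):
    """kappa = sqrt(|v|^2 |a|^2 - (v.a)^2) / |v|^3, equal to |v x a| / |v|^3 in 3D."""
    vv = sum(x * x for x in v)
    aa = sum(x * x for x in a)
    va = sum(x * y for x, y in zip(v, a))
    if vv < eps:
        return 0.0  # assumption: report zero curvature when the state is at rest
    cross_sq = max(vv * aa - va * va, 0.0)  # clamp tiny negatives from rounding
    return math.sqrt(cross_sq) / vv ** 1.5

# Circular motion with radius 2 and angular speed 3: |v| = 6, |a| = 18, v perpendicular to a.
kappa_circle = curvature([0.0, 6.0, 0.0], [-18.0, 0.0, 0.0])

# Straight-line motion: acceleration parallel to velocity.
kappa_line = curvature([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```

For the circle the result is 1/radius = 0.5, and for the straight line it is 0, which are the two sanity checks a curvature estimate should pass.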
2.3 Conductor (Rule-Based, Current)
Maps LatentTrajectory -> pattern edits via threshold rules:
- tension > 0.7 -> increase hihat density
- energy > 0.8 -> raise bass gain
- energy < 0.3 -> raise pad gain
- transition_intensity > 0.5 -> increase reverb
Musical section state machine: Intro -> Groove -> Build -> Climax -> Breakdown -> Outro
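The threshold rules amount to a handful of independent comparisons. The sketch below mirrors the four rules listed above; the function name and the edit labels are hypothetical and do not correspond to identifiers in conductor.rs.

```python
def conductor_edits(tension, energy, transition_intensity):
    """Return the pattern edits triggered by the threshold rules."""
    edits = []
    if tension > 0.7:
        edits.append("increase_hihat_density")
    if energy > 0.8:
        edits.append("raise_bass_gain")
    if energy < 0.3:
        edits.append("raise_pad_gain")
    if transition_intensity > 0.5:
        edits.append("increase_reverb")
    return edits

# A 0.01 change in tension flips the output from no edit to a full edit.
below = conductor_edits(tension=0.70, energy=0.5, transition_intensity=0.0)
above = conductor_edits(tension=0.71, energy=0.5, transition_intensity=0.0)
```

The last two calls show the discontinuity inherent in thresholding: the response is all-or-nothing on either side of 0.7.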
2.4 Strudel Bridge (Existing)
WebSocket commands to Strudel.js:
- MotionModulate { parameter, value, source } -- real-time parameter tweaking
- Eval { pattern } -- inject new pattern code
- Crossfade { pattern, duration_beats } -- smooth transition
- SetParam { name, value } -- direct parameter set
3. The Unified Architecture
3.1 Where Diffusion Enters
The Conductor's rule-based threshold system (tension > 0.7 -> hihat density) is the
weakest link. It's a hand-crafted lookup table pretending to be a creative decision-maker.
The diffusion model replaces EXACTLY this component:
CURRENT:
z* (32D) -> Conductor threshold rules -> StrudelCommand::SetParam
PROPOSED:
z* (32D) + dynamics -> Conditioning Encoder -> Discrete Diffusion -> Token Grid -> StrudelCommand::Eval
Everything upstream (DELL, LIM-RPS) and downstream (Strudel bridge, audio engine) stays.
3.2 Coupled Equilibrium System
EQUILIBRIUM 1: Motion Perception
(DELL FastEquilibrium, 60Hz)
h* = tanh(W_in @ limbs + W_rec @ h* + slow_bias)
x_fast = W_out @ h*
| 0.7 weight
v
z = blend(x_fast, y_slow)
+ dynamics: curvature, periodicity, tension, prediction
| conditioning signal
v
EQUILIBRIUM 2: Music Composition
(Discrete Diffusion, every 1-2 bars)
x_0 ~ p(tokens | z*, dynamics, genre)
Generated via T denoising steps:
x_{t-1} = denoise(x_t, z*, t)
| token grid (12 instruments x 32 steps)
v
StrudelEngine (existing renderer)
Eval { pattern } -> Audio Output
| sound fills room
v
EQUILIBRIUM 3: Performer-System Loop
(Emergent, not explicitly computed)
Performer hears music -> moves differently -> z* shifts -> music shifts
Joint attractor of the coupled system
3.3 The Conditioning Encoder
Maps DELL output to diffusion conditioning space:
Input (104-dim):
z [32] -- DELL blended latent
velocity [32] -- dz/dt
curvature [1] -- trajectory bending
curvature_rate [1] -- jerk in latent space
periodicity [1] -- autocorrelation peak
internal_tempo [1] -- embodied BPM
phase [1] -- beat phase [0,1)
tension [1] -- accumulated intensity
grounding [1] -- vertical stability
verticality [1] -- up/down tendency
coherence [1] -- fast-slow alignment
lr_pan [1] -- left-right balance
Encoder (MLP):
Linear(104, 256) + GELU
Linear(256, 512) + GELU
Linear(512, 768)
Output: 768-dim embedding (matches Roformer d_model)
3.4 Dynamics-Driven Noise Schedule
The prediction_confidence from LIM-RPS modulates the diffusion process itself:
High confidence (predictable motion):
T_effective = T_min (e.g., 4 steps)
temperature = 0.7 (conservative sampling)
Result: clear, committed musical phrases
Low confidence (novel motion):
T_effective = T_max (e.g., 50 steps)
temperature = 1.2 (exploratory sampling)
Result: ambiguous, searching patterns
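Only the two endpoints are specified above. One way to fill in the range between them is linear interpolation on the confidence value, sketched below; the interpolation itself, the clamping, and the function name are assumptions, not the implemented schedule.

```python
def noise_schedule(confidence, t_min=4, t_max=50, temp_lo=0.7, temp_hi=1.2):
    """Map prediction confidence in [0, 1] to (denoising steps, sampling temperature)."""
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range confidence values
    steps = round(c * t_min + (1.0 - c) * t_max)
    temperature = c * temp_lo + (1.0 - c) * temp_hi
    return steps, temperature

steps_hi, temp_hi_conf = noise_schedule(1.0)   # predictable motion
steps_lo, temp_lo_conf = noise_schedule(0.0)   # novel motion
steps_mid, temp_mid = noise_schedule(0.5)
```

At confidence 1.0 this returns the T_min and conservative-temperature endpoint, and at 0.0 the T_max and exploratory endpoint; any monotone curve between them would serve the same purpose.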
The commitment scalar IS the noise schedule.
3.5 Token Vocabulary
GETScore-style 2D grid: 12 instrument tracks x 32 timesteps (2 bars at 16th-note resolution).
Each cell contains a token from vocabulary V:
| Category | Tokens | Count |
|---|---|---|
| Rest (silence) | ~ | 1 |
| MIDI pitches | C2-C6 | 49 |
| Drum hits | bd, sd, hh, oh, cp, rim, tom, perc | 8 |
| Dynamics | pp, p, mp, mf, f, ff | 6 |
| Articulation | stac, legato, accent, ghost | 4 |
| Duration | 16th, 8th, dotted-8th, quarter, half | 5 |
| Effects | lpf, hpf, delay, reverb, dist | 5 |
| Structure | repeat, fill, break | 3 |
| Total | | 81 |
Each track maps to one of the 12 instruments:
KICK, SNARE, HH, CLAP, BASS, PAD, LEAD, STAB, ARP, FX, VOX, AMB
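The vocabulary table can be turned into an encode/decode pair directly. In the sketch below the token spellings for drums, dynamics, and so on are taken from the table; the sharp-based pitch names and the ordering of token ids are assumptions, and the shipped music_tokens.py may number things differently.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

TRACKS = ["KICK", "SNARE", "HH", "CLAP", "BASS", "PAD",
          "LEAD", "STAB", "ARP", "FX", "VOX", "AMB"]

def build_vocab():
    """Build the token list plus encode/decode lookups from the category table."""
    # C2..B5 is four full octaves (48 pitches); C6 closes the range at 49.
    pitches = [f"{n}{octave}" for octave in range(2, 6) for n in NOTE_NAMES] + ["C6"]
    groups = {
        "rest": ["~"],
        "pitch": pitches,
        "drum": ["bd", "sd", "hh", "oh", "cp", "rim", "tom", "perc"],
        "dynamics": ["pp", "p", "mp", "mf", "f", "ff"],
        "articulation": ["stac", "legato", "accent", "ghost"],
        "duration": ["16th", "8th", "dotted-8th", "quarter", "half"],
        "effects": ["lpf", "hpf", "delay", "reverb", "dist"],
        "structure": ["repeat", "fill", "break"],
    }
    tokens = [tok for group in groups.values() for tok in group]
    encode = {tok: i for i, tok in enumerate(tokens)}
    decode = {i: tok for tok, i in encode.items()}
    return tokens, encode, decode

tokens, encode, decode = build_vocab()

# An empty 12 x 32 grid: every cell holds the id of the rest token.
empty_grid = [[encode["~"]] * 32 for _ in TRACKS]
```

Counting the entries confirms the total of 81 and that no spelling is reused across categories.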
3.6 The Lipschitz Cascade
The 1-Lipschitz property of the CrossModalOperator B propagates through the system:
1. Small delta_motion -> small delta_z* (B is 1-Lipschitz, proximal step is contractive)
2. Small delta_z* -> small delta_embedding (MLP encoder is Lipschitz: finite weights, GELU has bounded slope)
3. Small delta_embedding -> small delta_p(tokens|z*) (conditional distribution shifts smoothly, provided the denoiser is Lipschitz in its conditioning)
4. Small delta_p -> small delta_x_0 (denoised tokens are close in distribution, or sample-wise when the sampling noise is shared)
Result: the performer can't "break" the music. However sudden the motion change,
the cascaded Lipschitz bounds tie the size of the musical change to the size of the change
in z*, which the proximal update keeps small per time step. This is the mathematical
argument that replaces hand-tuned smoothing.
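The first two links of the cascade can be stated as explicit bounds. The symbols below are introduced here for illustration: L_z is the Lipschitz constant of the map from motion input m to z*, W_1..W_3 are the encoder weight matrices, and c_GELU is the maximum slope of GELU (about 1.13). Steps 3 and 4 need a separate bound on the diffusion model, which is not derived here.

```latex
\begin{aligned}
\|e(z_1) - e(z_2)\| &\le L_{\mathrm{enc}}\, \|z_1 - z_2\|,
  \qquad L_{\mathrm{enc}} \le c_{\mathrm{GELU}}^{2} \prod_{i=1}^{3} \|W_i\|_2, \\
\|e(z^*(m_1)) - e(z^*(m_2))\| &\le L_{\mathrm{enc}}\, L_{z}\, \|m_1 - m_2\|.
\end{aligned}
```

The factor c_GELU appears squared because the encoder applies GELU after the first two linear layers only.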
4. Training Strategy
4.1 Phase 1: Synthetic Data (Immediate)
Generate (z* trajectory, token sequence) pairs from existing rules:
- Run LIM-RPS on recorded session data (stored in Desktop/MotionMix/sessions/)
- For each bar: compute z* + dynamics, record which patterns the Conductor chose
- Convert Conductor's choices to token grid format
- This gives supervised pairs without any new recordings
4.2 Phase 2: Performance Data (Ongoing)
Log real performance data (Task 21):
- Every bar boundary: z* (32D), dynamics (8 scalars), active Strudel pattern
- JSONL format, sidecar to video recording
- Target: 100 sessions = ~10,000 bar-level training examples
4.3 Phase 3: Joint Fine-Tuning (Future)
End-to-end training where the conditioning encoder and diffusion model are optimized jointly:
- Loss = denoising loss on tokens + regularization on z* smoothness
- The encoder learns WHAT to extract from z* for music generation
- The diffusion model learns HOW to generate coherent patterns from those features
5. Implementation Plan
5.1 Task 1: Wire LIM-RPS Dynamics to Strudel (No ML)
Replace Conductor threshold rules with continuous dynamics-driven MotionModulate commands.
Use existing Strudel bridge. Map:
- periodicity -> rhythmic density (pattern complexity)
- curvature -> melodic range (interval sizes)
- tension -> harmonic tension (dissonance level)
- grounding -> bass weight (low-frequency emphasis)
- prediction_confidence -> arrangement stability
5.2 Task 2: z* Trajectory Logger
In AudioEngine, at bar boundaries:
struct BarSnapshot: Codable {
    let timestamp: Double
    let z: [Float]                      // 32D DELL latent
    let dynamics: Dynamics              // curvature, periodicity, tension, etc.
    let activePattern: String           // Strudel pattern code
    let instrumentActivations: [Float]  // 12 values
    let genre: String
    let bpm: Float
}
Append to JSONL sidecar file per session.
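On the training side, the sidecar file can be read back one JSON object per line. The sketch below writes and reads records shaped like `BarSnapshot`; the keys inside `dynamics`, the example values, and the helper names are assumptions, since the `Dynamics` type is not spelled out here.

```python
import json
import os
import tempfile

def append_snapshot(path, snapshot):
    """Append one bar snapshot as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(snapshot) + "\n")

def read_snapshots(path):
    """Load every non-empty line of a JSONL sidecar file."""
    with open(path, "r", encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

snapshot = {
    "timestamp": 12.5,
    "z": [0.0] * 32,                      # 32D DELL latent
    "dynamics": {"curvature": 0.1, "periodicity": 0.8, "tension": 0.4},  # hypothetical keys
    "activePattern": 's("bd*4")',         # Strudel pattern code
    "instrumentActivations": [0.0] * 12,  # 12 values
    "genre": "house",
    "bpm": 124.0,
}

path = os.path.join(tempfile.mkdtemp(), "session.jsonl")
append_snapshot(path, snapshot)
append_snapshot(path, snapshot)
rows = read_snapshots(path)
```

Appending line by line means a session that ends abruptly still leaves every completed bar readable.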
5.3 Task 3: Token Vocabulary Definition
Shared definition in Rust (for CC crate), Swift (for iOS), Python (for training):
- Rust: enum MusicToken with 81 variants
- Swift: MusicToken enum mirroring Rust
- Python: vocabulary dict with encode/decode
5.4 Task 4: Conditioning Encoder
Python (MLX on Mac5):
- MLP: 104 -> 256 -> 512 -> 768
- Train on z* trajectories from logger
- Export to CoreML for on-device option
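The layer shapes are enough to sketch the forward pass and count parameters. The version below is plain Python rather than MLX, uses random weights, and assumes every Linear layer has a bias. The field list in Section 3.3 adds up to 74 values, so the sketch simply takes the stated 104-dimensional input as given.

```python
import math
import random

LAYER_SIZES = [104, 256, 512, 768]

def count_params(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

def gelu(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def init_mlp(sizes, seed=0):
    """Random weights in [-1/sqrt(n_in), 1/sqrt(n_in)] and zero biases."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes, sizes[1:]):
        bound = 1.0 / math.sqrt(n_in)
        W = [[rng.uniform(-bound, bound) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((W, b))
    return layers

def forward(layers, x):
    """Linear + GELU for every layer except the last, which stays linear."""
    for i, (W, b) in enumerate(layers):
        x = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(layers) - 1:
            x = [gelu(v) for v in x]
    return x

layers = init_mlp(LAYER_SIZES)
embedding = forward(layers, [0.1] * 104)
n_params = count_params(LAYER_SIZES)
```

Under these assumptions the count comes to 552,448 parameters, close to the figure reported in the implementation status table.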
5.5 Task 5: Discrete Diffusion Model
GETMusic-style on Mac5 (MLX):
- Roformer, 86M params
- Discrete categorical diffusion, T=100
- Input: 768D conditioning from encoder
- Output: 12 x 32 token grid
- Training: MLX distributed on Mac4+Mac5
5.6 Task 6: Wire to Strudel
Token grid -> Strudel mini-notation:
Token grid row (BASS track, 32 timesteps):
[C2, ~, ~, ~, C2, ~, ~, ~, E2, ~, ~, ~, G2, ~, ~, ~, ...]
Strudel pattern:
note("c2 ~ ~ ~ c2 ~ ~ ~ e2 ~ ~ ~ g2 ~ ~ ~").s("bass")
Ship via NATS from Mac5 -> iOS, consume in StrudelEngine at bar boundary.
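The conversion shown above is a join over one grid row. A minimal sketch, assuming the row holds only pitch tokens and rests (dynamics, effects, and structure tokens would need their own handling), with a hypothetical function name:

```python
def row_to_strudel(tokens, sound):
    """Render one track row as a Strudel note pattern in mini-notation."""
    steps = " ".join("~" if tok == "~" else tok.lower() for tok in tokens)
    return f'note("{steps}").s("{sound}")'

bass_row = ["C2", "~", "~", "~", "C2", "~", "~", "~",
            "E2", "~", "~", "~", "G2", "~", "~", "~"]
pattern = row_to_strudel(bass_row, "bass")
```

For the 16-step row above this reproduces the pattern string given in the example.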
6. Echelon Evolution Path
### Current (v1): Threshold Conductor
z* -> rules -> SetParam commands -> Strudel renders
### Next (v2): Equilibrium-Driven Patterns
z* dynamics -> continuous MotionModulate -> Strudel renders with smooth modulation
### Future (v3): Diffusion Token Generation
z* -> conditioning encoder -> discrete diffusion -> token grid -> Strudel Eval
### Research (v4): Joint Equilibrium
z and x co-evolve. B (CrossModalOperator) and s_theta (diffusion score) trained jointly.
The performer-system loop becomes a formal coupled dynamical system with provable convergence.
7. Key Equations Summary
DELL Fast Equilibrium:
h* = tanh(W_in @ limbs + W_rec @ h* + b + 0.3 * y_slow)
DELL Slow Equilibrium:
y* = tanh(W @ mean(x_fast) + U @ y* + b)
Brain Latent Blend:
z = 0.7 * x_fast + 0.3 * y_slow
LIM-RPS Fixed-Point:
z* = prox_tau(z* - gamma * B(z*))
Curvature:
kappa = |v x a| / |v|^3
Diffusion Reverse Step:
x_{t-1} = x_t + epsilon * s_theta(x_t, t, z*) + sqrt(2 * epsilon) * noise
Coupled Equilibrium (v4):
z* = argmin_z L_motion(z, body) + lambda * L_coupling(z, x*)
x* = argmin_x L_music(x) s.t. conditioning on z*
8. Research Grounding (from Deep Literature Survey)
8.1 Explicit DEQ+Diffusion Papers
DEQ-DDIM (Pokle, Geng, Kolter, NeurIPS 2022): Recasts the entire DDIM sampling chain
as a joint multivariate fixed-point system z = F(z). Anderson acceleration converges in
far fewer than T sequential steps, enabling 2x faster parallel sampling.
GET (Geng et al., NeurIPS 2023): Generative Equilibrium Transformer — infinite-depth
weight-tied transformer solving z = f(z, noise, class). Single-step generation via
DEQ fixed-point, trained with direct L1 reconstruction. Matches 5x larger ViT in FID.
Equilibrium Matching (arXiv 2510.02300, 2025): The cleanest statement. Replaces
time-conditional diffusion entirely with a time-invariant energy landscape E(x).
Sampling = gradient descent on E. Data points are fixed points where ||nabla E|| ≈ 0.
Convergence: O(1/K) for L-smooth energy.
8.2 Coupled Equilibrium Convergence
Bhaskar-Lakshmikantham Theorem (Nonlinear Analysis, 2006): For coupled operator
F: X × X → X with mixed monotone property, if the product-space contraction holds:
d(F(x1,y1), F(x2,y2)) <= (k/2) [d(x1,x2) + d(y1,y2)], k < 1
then a unique coupled fixed point (x*, y*) exists and iteration converges geometrically.
Application to CC: Define F_motion(z, m) and F_music(m, z) as the motion and music
equilibrium operators. Convergence requires:
||dF_motion/dz|| + ||dF_motion/dm|| < 1
||dF_music/dm|| + ||dF_music/dz|| < 1
The spectral radius of the block Jacobian rho(J) < 1 guarantees joint convergence.
The coupling strength (0.3 slow→fast, 0.5 fast→slow in DELL) must stay below the
contraction gap of each individual process.
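The convergence condition can be checked on a scalar stand-in for the coupled system. In the sketch below each operator is replaced by a linear map, the off-diagonal entries are the DELL coupling strengths (0.3 and 0.5), and the diagonal self-terms (0.4 and 0.3) and bias values are hypothetical numbers chosen so both row sums stay below 1.

```python
import math

def spectral_radius_2x2(a, b, c, d):
    """Largest eigenvalue magnitude of the matrix [[a, b], [c, d]]."""
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4.0 * det
    if disc >= 0:
        r = math.sqrt(disc)
        return max(abs((tr + r) / 2.0), abs((tr - r) / 2.0))
    return math.sqrt(det)  # complex-conjugate pair: |lambda|^2 = det

def iterate_coupled(J, bias, steps=200):
    """Iterate z <- a z + b m + bias_z and m <- c z + d m + bias_m simultaneously."""
    (a, b), (c, d) = J
    z, m = 0.0, 0.0
    for _ in range(steps):
        z, m = a * z + b * m + bias[0], c * z + d * m + bias[1]
    return z, m

# Rows: (dF_motion/dz, dF_motion/dm) and (dF_music/dz, dF_music/dm).
J = ((0.4, 0.3), (0.5, 0.3))
row_sums = [abs(J[0][0]) + abs(J[0][1]), abs(J[1][0]) + abs(J[1][1])]
rho = spectral_radius_2x2(0.4, 0.3, 0.5, 0.3)

z, m = iterate_coupled(J, bias=(1.0, -0.5))
residual = max(abs(0.4 * z + 0.3 * m + 1.0 - z),
               abs(0.5 * z + 0.3 * m - 0.5 - m))
```

With these numbers both row sums and the spectral radius are below 1, and the joint iteration settles on a point that satisfies both fixed-point equations to rounding error. Raising either coupling until a row sum reaches 1 removes that guarantee.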
8.3 Discrete Diffusion: SEDD over D3PM
SEDD (Lou et al., ICML 2024 Best Paper): Learns the "concrete score" — probability
ratios p_t(y)/p_t(x) — the exact discrete analogue of the continuous score function.
Reverse CTMC rates = forward rates × concrete score, directly paralleling how the
continuous reverse SDE = forward SDE + score. This makes SEDD the cleanest bridge
between the DEQ-diffusion equivalence and discrete token generation.
Recommendation: Use SEDD's concrete score parameterization for the music token
diffusion model rather than D3PM's transition matrices, as it preserves the
theoretical connection to the equilibrium framework.
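The concrete score is easiest to see on a small categorical distribution. The sketch below computes the ratios p(y)/p(x) exactly from a known distribution; in SEDD these ratios are what the network is trained to estimate. The function name and the four-token example are assumptions for illustration.

```python
def concrete_score(p, x):
    """Ratios p(y) / p(x) for every y != x in a categorical distribution p."""
    if p[x] <= 0.0:
        raise ValueError("p(x) must be positive")
    return {y: p_y / p[x] for y, p_y in enumerate(p) if y != x}

# Marginal over a four-token vocabulary at some noise level t.
p_t = [0.5, 0.25, 0.125, 0.125]
ratios = concrete_score(p_t, x=1)
```

A ratio above 1 marks a token that is more likely than the current one, so the reverse process is pushed toward it, mirroring how a continuous score points uphill in log-density.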
8.4 Existing Motion+Music Coupled Diffusion
MoMu-Diffusion (NeurIPS 2024): Two expert diffusion models (motion-to-music,
music-to-motion) combined via cross-guidance at sampling time. Stage 1: unconditional.
Stage 2 (after critical timestep T^c): classifier-free guidance with scale s.
Engineering solution without formal convergence proof.
9. Implementation Status
| Task | Status | Artifact |
|---|---|---|
| Wire LIM-RPS dynamics to Strudel | DONE | conductor.rs: 15 continuous params, strudel.rs: DynamicsInput + 16 default mappings |
| z* trajectory logger | DONE | TrajectoryLogger.swift: JSONL at bar boundaries |
| Token vocabulary | DONE | music_tokens.rs (81 tokens, 6 tests), MusicToken.swift, music_tokens.py |
| Conditioning encoder | DONE | conditioning_encoder.py: MLX, 554K params, 104→768 |
| Discrete diffusion model | PENDING | Needs recorded session data for training |
| Wire to Strudel | PENDING | TokenGrid.to_strudel() ready, needs diffusion output |
---
Document: Equilibrium Diffusion Theory for Computational Choreography
Author: Mohamed Diomande / Claude
Date: 2026-04-02
Status: Living document, updated as implementation progresses
Promotion Decision
Keep as idea/proposal unless evidence and implementation anchors exist.
Source Anchor
MotionMix/EQUILIBRIUM-DIFFUSION-THEORY.md
Detected Structure
Method · Code Anchors · Architecture