Grand Diomande Research · Full HTML Reader

Equilibrium Diffusion: LIM-RPS x Discrete Token Diffusion


> A unified framework for motion-conditioned music generation via coupled equilibrium systems.

1. Mathematical Foundation

1.1 Deep Equilibrium Models (DEQ)

A DEQ replaces a stack of explicit layers with a single implicit layer defined by its fixed point:

z* = f(z*, x)     where x is the input

Banach Fixed-Point Theorem: If f is a contraction mapping (Lipschitz constant L < 1), then:
1. A unique fixed point z* exists
2. The iteration z_{k+1} = f(z_k, x) converges to z* for any initial z_0
3. Convergence is exponential: ||z_k - z*|| <= L^k ||z_0 - z*||

Reference: Bai, Kolter & Koltun, "Deep Equilibrium Models" (NeurIPS 2019)
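The three Banach guarantees can be checked numerically on a toy layer. The sketch below is a minimal illustration, not the DEQ solver from the paper: it assumes a scalar map `f(z, x) = tanh(w * z + x)` with `w = 0.5`, chosen only because its Lipschitz constant in z is at most 0.5. The names `f`, `solve_fixed_point`, and `w` are hypothetical.

```python
import math

def f(z, x, w=0.5):
    # Toy scalar layer. |df/dz| = |w| * (1 - tanh^2) <= |w| < 1,
    # so f is a contraction in z with Lipschitz constant at most 0.5.
    return math.tanh(w * z + x)

def solve_fixed_point(x, z0=0.0, tol=1e-10, max_iter=200):
    """Plain fixed-point iteration z_{k+1} = f(z_k, x)."""
    z = z0
    for k in range(max_iter):
        z_next = f(z, x)
        if abs(z_next - z) < tol:
            return z_next, k + 1
        z = z_next
    return z, max_iter

def error_after(k, x, z0, z_star):
    """Distance to the fixed point after k iterations from z0."""
    z = z0
    for _ in range(k):
        z = f(z, x)
    return abs(z - z_star)

# Two very different starting points reach the same fixed point.
z_star, iters = solve_fixed_point(x=0.3, z0=0.0)
z_alt, _ = solve_fixed_point(x=0.3, z0=5.0)

# The error after k steps respects the bound L^k * |z_0 - z*| with L = 0.5.
bound_holds = all(
    error_after(k, 0.3, 5.0, z_star) <= 0.5 ** k * abs(5.0 - z_star) + 1e-12
    for k in range(1, 15)
)
```

Real DEQ layers are vector-valued and are usually solved with faster root finders, but the same three properties are what the contraction condition buys.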

1.2 Score-Based Diffusion

A diffusion model defines a forward process that adds noise:

q(x_t | x_0) = N(x_t; sqrt(alpha_t) x_0, (1 - alpha_t) I)

The reverse process uses a learned score function s_theta(x_t, t) ≈ nabla_x log p_t(x_t):

x_{t-1} = x_t + epsilon * s_theta(x_t, t) + sqrt(2*epsilon) * noise

This is a Langevin dynamics step. At a fixed noise level t it samples from p_t; annealing t toward 0 carries the samples toward the data distribution p(x_0).

1.3 The Structural Equivalence

Aspect	LIM-RPS Fixed-Point	Diffusion Denoising
Iteration	z_{k+1} = z_k - gamma * B(z_k) + prox_pull	x_{t-1} = x_t + epsilon * s(x_t, t) + noise
Vector field	B: CrossModalOperator (96 -> 96)	s_theta: score network
Contraction	Spectral norm(B) <= 1 (1-Lipschitz)	Bounded Jacobian of s_theta
Attractor	Unique z* (deterministic)	Data manifold (statistical)
Convergence proof	Banach theorem (update map with Lipschitz constant < 1)	Score matching loss -> true score
Without noise	Deterministic equilibrium	Probability flow ODE

The operator B in LIM-RPS plays the same mathematical role as -s_theta in diffusion:
both are vector fields that push the state toward an attractor. Spectral normalization
makes B 1-Lipschitz, which is analogous to assuming that the score function has a bounded
Jacobian (a Lipschitz score), a standard assumption in convergence analyses of diffusion samplers.

Key insight: With B = -s, the LIM-RPS iteration `z - gamma * B(z)` has the same form as
one explicit Euler step of the probability flow ODE `dx/dt = -s(x, t)`, integrated backward
in t with step size gamma (up to the time-dependent scaling of the score term). If you add
noise (`z - gamma * B(z) + sqrt(2 * gamma) * epsilon`), you recover the stochastic Langevin
update, which is the sampler behind the reverse diffusion process.
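The correspondence can be seen on a one-dimensional toy problem. The sketch below assumes a unit-variance Gaussian target with mean `MU`, whose negative score is `B(z) = z - MU`. It is an illustration of the update rule only, not the CrossModalOperator, and `MU`, `B`, and `step` are hypothetical names.

```python
import math
import random

MU = 2.0  # mean of the illustrative target N(MU, 1)

def B(z):
    # Negative score of N(MU, 1): -d/dz log p(z) = z - MU
    return z - MU

def step(z, gamma, rng=None):
    """One update z - gamma * B(z); Langevin noise is added when rng is given."""
    z_next = z - gamma * B(z)
    if rng is not None:
        z_next += math.sqrt(2.0 * gamma) * rng.gauss(0.0, 1.0)
    return z_next

# Without noise: the iteration contracts onto the single attractor z* = MU.
z = -3.0
for _ in range(200):
    z = step(z, gamma=0.1)
z_det = z

# With noise: the same update becomes a Langevin sampler for N(MU, 1).
rng = random.Random(0)
z = -3.0
samples = []
for i in range(60000):
    z = step(z, gamma=0.05, rng=rng)
    if i >= 1000:  # discard burn-in
        samples.append(z)

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

The deterministic run collapses to a point, while the noisy run spreads out to roughly the target's variance (with a small bias from the finite step size). That is the deterministic-equilibrium versus statistical-attractor distinction drawn in the table.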

2. The CC System Architecture (As Built)

2.1 DELL: Dual Equilibrium Latent Learning

Two coupled equilibrium solvers operating at different timescales:

FastEquilibrium (60Hz, K_fast=4 iterations):

h_{k+1} = tanh(W_in @ limbs + W_recurrent @ h_k + bias + y_slow * 0.3)
x_fast = W_out @ h*

Dimensions: 216 (27 limbs x 8) -> 128 hidden -> 32 output

SlowEquilibrium (2.5Hz at beat boundaries, K_slow=6 iterations):

x_accum = mean(x_fast over 24 frames)
y_{k+1} = tanh(W @ x_accum + U @ y_k + bias)
y_slow = y*

Dimensions: 32 -> 32

Coupling:
- Slow -> Fast: 0.3 (musical context biases motion perception)
- Fast -> Slow: 0.5 (motion accumulates into musical decisions)

Brain Latent Blend:

z = x_fast * 0.7 + y_slow * 0.3     (32-dimensional)

Plus extracted dynamics: velocity, acceleration, curvature, coherence, periodicity, phase, grounding, verticality.
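A minimal pure-Python sketch of the two solvers and the blend follows. The weights are small random placeholders rather than trained values, biases are taken as zero, and the matrix `W_ctx` is an assumption. The fast update adds `y_slow * 0.3` to a 128-dimensional hidden state, so the sketch projects the 32-dimensional `y_slow` up to hidden size first. All function names are hypothetical.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def rand_matrix(rows, cols, scale):
    """Small random weights; placeholders for trained parameters."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def add(*vectors):
    return [sum(parts) for parts in zip(*vectors)]

def tanh_vec(v):
    return [math.tanh(x) for x in v]

N_LIMBS, N_HID, N_LAT = 216, 128, 32
W_in = rand_matrix(N_HID, N_LIMBS, 0.05)
W_rec = rand_matrix(N_HID, N_HID, 0.02)
W_ctx = rand_matrix(N_HID, N_LAT, 0.05)  # assumption: lifts y_slow (32) to hidden size (128)
W_out = rand_matrix(N_LAT, N_HID, 0.05)
W_slow = rand_matrix(N_LAT, N_LAT, 0.05)
U_slow = rand_matrix(N_LAT, N_LAT, 0.05)

def fast_equilibrium(limbs, y_slow, k_fast=4):
    """K_fast iterations of h <- tanh(W_in limbs + W_rec h + 0.3 * context)."""
    context = [0.3 * c for c in matvec(W_ctx, y_slow)]
    drive = add(matvec(W_in, limbs), context)  # bias omitted (taken as zero)
    h = [0.0] * N_HID
    for _ in range(k_fast):
        h = tanh_vec(add(drive, matvec(W_rec, h)))
    return matvec(W_out, h)

def slow_equilibrium(x_fast_frames, k_slow=6):
    """K_slow iterations of y <- tanh(W x_accum + U y) on the frame average."""
    n = len(x_fast_frames)
    x_accum = [sum(col) / n for col in zip(*x_fast_frames)]
    drive = matvec(W_slow, x_accum)
    y = [0.0] * N_LAT
    for _ in range(k_slow):
        y = tanh_vec(add(drive, matvec(U_slow, y)))
    return y

def blend(x_fast, y_slow):
    return [0.7 * xf + 0.3 * ys for xf, ys in zip(x_fast, y_slow)]

# One beat: 24 fast frames, then one slow update at the beat boundary.
y_slow = [0.0] * N_LAT
frames = []
for _ in range(24):
    limbs = [rng.uniform(-1.0, 1.0) for _ in range(N_LIMBS)]
    frames.append(fast_equilibrium(limbs, y_slow))
y_slow = slow_equilibrium(frames)
z = blend(frames[-1], y_slow)
```

With K_fast = 4 and K_slow = 6 the solvers run a fixed number of iterations rather than iterating to a tolerance, so h* and y* are truncated approximations of the true fixed points.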

2.2 LIM-RPS Processor (Motion Bridge)

Converts raw sensor data to rich LatentState through:
1. Feature extraction from IMU
2. Cross-device fusion
3. Proximal update: smooth convergence without jitter
4. Dynamics: curvature (kappa = |v x a| / |v|^3), tension (accumulated + decay), periodicity (autocorrelation)
5. Prediction: linear extrapolation + confidence from velocity stability
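The curvature formula in step 4 can be sketched directly for 3D velocity and acceleration vectors. The zero-speed guard and the function names are assumptions, since the source does not say how the processor handles a stationary sensor.

```python
import math

def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )

def norm(v):
    return math.sqrt(sum(c * c for c in v))

def curvature(v, a, eps=1e-9):
    """kappa = |v x a| / |v|^3 for 3D velocity v and acceleration a."""
    speed = norm(v)
    if speed < eps:
        return 0.0  # assumption: report zero curvature when nearly stationary
    return norm(cross(v, a)) / speed ** 3

# Sanity check on uniform circular motion with radius r and angular rate omega:
# at t = 0, v = (0, r*omega, 0) and a = (-r*omega^2, 0, 0), so kappa should be 1/r.
r, omega = 2.0, 3.0
kappa = curvature((0.0, r * omega, 0.0), (-r * omega ** 2, 0.0, 0.0))
```

Because the denominator is cubic in speed, small velocity noise near rest can dominate the estimate, which is one reason a guard or smoothing step is usually needed in practice.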

2.3 Conductor (Rule-Based, Current)

Maps LatentTrajectory -> pattern edits via threshold rules:
- tension > 0.7 -> increase hihat density
- energy > 0.8 -> raise bass gain
- energy < 0.3 -> raise pad gain
- transition_intensity > 0.5 -> increase reverb
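The four rules above amount to a small function. The sketch below returns plain strings for readability; the real Conductor's types and command names are not shown in this document, so `conductor_edits` and its return format are hypothetical.

```python
def conductor_edits(tension, energy, transition_intensity):
    """Apply the four threshold rules and return the triggered edits in order."""
    edits = []
    if tension > 0.7:
        edits.append("increase hihat density")
    if energy > 0.8:
        edits.append("raise bass gain")
    if energy < 0.3:
        edits.append("raise pad gain")
    if transition_intensity > 0.5:
        edits.append("increase reverb")
    return edits

busy = conductor_edits(tension=0.8, energy=0.9, transition_intensity=0.1)
below = conductor_edits(tension=0.69, energy=0.5, transition_intensity=0.0)
above = conductor_edits(tension=0.71, energy=0.5, transition_intensity=0.0)
```

The `below` and `above` cases differ by 0.02 in tension yet produce different edits. That step discontinuity is the behavior a continuous mapping would smooth out.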

Musical section state machine: Intro -> Groove -> Build -> Climax -> Breakdown -> Outro

2.4 Strudel Bridge (Existing)

WebSocket commands to Strudel.js:
- MotionModulate { parameter, value, source } -- real-time parameter tweaking
- Eval { pattern } -- inject new pattern code
- Crossfade { pattern, duration_beats } -- smooth transition
- SetParam { name, value } -- direct parameter set

3. The Unified Architecture

3.1 Where Diffusion Enters

The Conductor's rule-based threshold system (tension > 0.7 -> hihat density) is the
weakest link. It's a hand-crafted lookup table pretending to be a creative decision-maker.
The diffusion model replaces EXACTLY this component:

CURRENT:
  z* (32D) -> Conductor threshold rules -> StrudelCommand::SetParam

PROPOSED:
  z* (32D) + dynamics -> Conditioning Encoder -> Discrete Diffusion -> Token Grid -> StrudelCommand::Eval

Everything upstream (DELL, LIM-RPS) and downstream (Strudel bridge, audio engine) stays.

3.2 Coupled Equilibrium System

        EQUILIBRIUM 1: Motion Perception
        (DELL FastEquilibrium, 60Hz)

        h* = tanh(W_in @ limbs + W_rec @ h* + slow_bias)
        x_fast = W_out @ h*

             |  0.7 weight
             v
        z = blend(x_fast, y_slow)
        + dynamics: curvature, periodicity, tension, prediction

             |  conditioning signal
             v

        EQUILIBRIUM 2: Music Composition
        (Discrete Diffusion, every 1-2 bars)

        x_0 ~ p(tokens | z*, dynamics, genre)
        Generated via T denoising steps:
        x_{t-1} = denoise(x_t, z*, t)

             |  token grid (12 instruments x 32 steps)
             v

        StrudelEngine (existing renderer)
        Eval { pattern } -> Audio Output

             |  sound fills room
             v

        EQUILIBRIUM 3: Performer-System Loop
        (Emergent, not explicitly computed)

        Performer hears music -> moves differently -> z* shifts -> music shifts
        Joint attractor of the coupled system

3.3 The Conditioning Encoder

Maps DELL output to diffusion conditioning space:

Input (104-dim; the fields itemized below account for 74 of these):
  z          [32]   -- DELL blended latent
  velocity   [32]   -- dz/dt
  curvature  [1]    -- trajectory bending
  curvature_rate [1] -- jerk in latent space
  periodicity [1]   -- autocorrelation peak
  internal_tempo [1] -- embodied BPM
  phase      [1]    -- beat phase [0,1)
  tension    [1]    -- accumulated intensity
  grounding  [1]    -- vertical stability
  verticality [1]   -- up/down tendency
  coherence  [1]    -- fast-slow alignment
  lr_pan     [1]    -- left-right balance

Encoder (MLP):
  Linear(104, 256) + GELU
  Linear(256, 512) + GELU
  Linear(512, 768)

Output: 768-dim embedding (matches Roformer d_model)
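A quick bookkeeping check of the encoder's shape is sketched below. It is not the MLX implementation. It assumes every Linear layer carries a bias term, and the names `FIELDS`, `LAYERS`, and `mlp_param_count` are hypothetical.

```python
# Field widths as itemized in the input listing.
FIELDS = {
    "z": 32, "velocity": 32, "curvature": 1, "curvature_rate": 1,
    "periodicity": 1, "internal_tempo": 1, "phase": 1, "tension": 1,
    "grounding": 1, "verticality": 1, "coherence": 1, "lr_pan": 1,
}

LAYERS = [104, 256, 512, 768]  # Linear(104, 256) -> Linear(256, 512) -> Linear(512, 768)

def mlp_param_count(sizes):
    """Weights plus biases for a chain of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))

listed_dim = sum(FIELDS.values())
n_params = mlp_param_count(LAYERS)
per_layer = [n_in * n_out + n_out for n_in, n_out in zip(LAYERS, LAYERS[1:])]
```

Under these assumptions the three layers hold 26,880 + 131,584 + 393,984 = 552,448 parameters, with the last layer accounting for most of them. GELU adds no parameters.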

3.4 Dynamics-Driven Noise Schedule

The prediction_confidence from LIM-RPS modulates the diffusion process itself:

High confidence (predictable motion):
  T_effective = T_min (e.g., 4 steps)
  temperature = 0.7 (conservative sampling)
  Result: clear, committed musical phrases

Low confidence (novel motion):
  T_effective = T_max (e.g., 50 steps)
  temperature = 1.2 (exploratory sampling)
  Result: ambiguous, searching patterns

In effect, prediction_confidence acts as a commitment scalar that IS the sampling schedule.
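One way to connect the two endpoints is linear interpolation in confidence. The source gives only the endpoint values (4 and 50 steps, temperatures 0.7 and 1.2), so the interpolation rule and the function name `sampling_schedule` are assumptions made for illustration.

```python
def sampling_schedule(confidence, t_min=4, t_max=50, temp_min=0.7, temp_max=1.2):
    """Map prediction confidence in [0, 1] to (denoising steps, temperature).

    Assumption: both quantities vary linearly between the documented endpoints.
    """
    c = min(1.0, max(0.0, confidence))  # clamp out-of-range inputs
    steps = round(t_max - c * (t_max - t_min))
    temperature = temp_max - c * (temp_max - temp_min)
    return steps, temperature

steps_hi, temp_hi = sampling_schedule(1.0)   # predictable motion
steps_lo, temp_lo = sampling_schedule(0.0)   # novel motion
steps_mid, temp_mid = sampling_schedule(0.5)
```

A nonlinear curve or hysteresis may feel better in performance. The linear rule is just the simplest choice that hits both documented endpoints.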

3.5 Token Vocabulary

GETScore-style 2D grid: 12 instrument tracks x 32 timesteps (2 bars at 16th-note resolution).

Each cell contains a token from vocabulary V:

Category	Tokens	Count
Rest (silence)	~	1
MIDI pitches	C2-C6	49
Drum hits	bd, sd, hh, oh, cp, rim, tom, perc	8
Dynamics	pp, p, mp, mf, f, ff	6
Articulation	stac, legato, accent, ghost	4
Duration	16th, 8th, dotted-8th, quarter, half	5
Effects	lpf, hpf, delay, reverb, dist	5
Structure	repeat, fill, break	3
Total		81

Each track maps to one of the 12 instruments:
KICK, SNARE, HH, CLAP, BASS, PAD, LEAD, STAB, ARP, FX, VOX, AMB
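The table can be turned into an encode/decode vocabulary in a few lines. The token spellings follow the table. The sharp-based pitch names (`C#2`, `D#2`, ...) and the ordering of ids are assumptions, since the source only gives the range C2-C6.

```python
NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

# C2..B5 is four full octaves (48 pitches); adding C6 gives 49.
PITCHES = [f"{name}{octave}" for octave in range(2, 6) for name in NOTE_NAMES] + ["C6"]

CATEGORIES = {
    "rest": ["~"],
    "pitch": PITCHES,
    "drum": ["bd", "sd", "hh", "oh", "cp", "rim", "tom", "perc"],
    "dynamics": ["pp", "p", "mp", "mf", "f", "ff"],
    "articulation": ["stac", "legato", "accent", "ghost"],
    "duration": ["16th", "8th", "dotted-8th", "quarter", "half"],
    "effects": ["lpf", "hpf", "delay", "reverb", "dist"],
    "structure": ["repeat", "fill", "break"],
}

VOCAB = [tok for tokens in CATEGORIES.values() for tok in tokens]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}

def encode(tokens):
    return [TOKEN_TO_ID[t] for t in tokens]

def decode(ids):
    return [VOCAB[i] for i in ids]

row = ["C2", "~", "~", "~", "E2", "~", "bd", "ff"]
roundtrip = decode(encode(row))
```

Token strings are case-sensitive here, which keeps the dynamics token `f` distinct from pitch names such as `F2`.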

3.6 The Lipschitz Cascade

The 1-Lipschitz property of the CrossModalOperator B propagates through the system:

1. Small delta_motion -> small delta_z* (B is 1-Lipschitz, proximal step is contractive)
2. Small delta_z* -> small delta_embedding (the MLP encoder is Lipschitz; its constant is bounded by the product of its layers' spectral norms)
3. Small delta_embedding -> small delta_p(tokens|z*) (the conditional distribution shifts smoothly, provided the denoiser is Lipschitz in its conditioning input)
4. Small delta_p -> small delta_x_0 (sampled token grids are close in distribution; individual samples are close when the sampling noise is shared)

Result: the performer can't "break" the music. A bounded change in motion per time step
yields a change in the music that is bounded by the product of the stage-wise Lipschitz
constants, however abrupt the gesture feels. This is the mathematical guarantee that
replaces hand-tuned smoothing.
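The first two links of the cascade can be written out as an explicit bound. The notation here is introduced for illustration and is not from the source: u is the motion input, e the encoder embedding, T(z, u) the LIM-RPS update map with contraction factor q < 1 in z and Lipschitz constant L_u in u, and W_1, W_2, W_3 the encoder's weight matrices.

```latex
\begin{aligned}
\|\Delta z^*\| &\le \frac{L_u}{1 - q}\,\|\Delta u\|
  && \text{(fixed point of a contraction } z^* = T(z^*, u)\text{)} \\
\|\Delta e\| &\le L_{\mathrm{enc}}\,\|\Delta z^*\|,
  \qquad L_{\mathrm{enc}} \le c_{\mathrm{GELU}}^{2}\,\|W_1\|_2\,\|W_2\|_2\,\|W_3\|_2
  && \text{(composition of Lipschitz maps)} \\
\|\Delta e\| &\le \frac{L_{\mathrm{enc}}\,L_u}{1 - q}\,\|\Delta u\|
\end{aligned}
```

Here c_GELU is the maximum slope of GELU, roughly 1.13. The first line follows from z1* - z2* = T(z1*, u1) - T(z2*, u2) and the triangle inequality. The bound degrades as q approaches 1, so a smaller contraction gap means a looser smoothness guarantee.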

4. Training Strategy

4.1 Phase 1: Synthetic Data (Immediate)

Generate (z* trajectory, token sequence) pairs from existing rules:
- Run LIM-RPS on recorded session data (stored in Desktop/MotionMix/sessions/)
- For each bar: compute z* + dynamics, record which patterns the Conductor chose
- Convert Conductor's choices to token grid format
- This gives supervised pairs without any new recordings

4.2 Phase 2: Performance Data (Ongoing)

Log real performance data (Task 21):
- Every bar boundary: z* (32D), dynamics (8 scalars), active Strudel pattern
- JSONL format, sidecar to video recording
- Target: 100 sessions = ~10,000 bar-level training examples
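On the training side, the JSONL sidecar can be written and read with the standard library alone. The record keys below (`timestamp`, `z`, `dynamics`, `activePattern`) are assumptions modeled on the fields named in this document. The file name and the helper functions are hypothetical.

```python
import json
import os
import tempfile

def append_snapshot(path, snapshot):
    """Append one bar-level record as a single JSON line."""
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(snapshot) + "\n")

def load_snapshots(path):
    """Read every non-empty line of a JSONL file back into dicts."""
    with open(path, encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "session.jsonl")  # hypothetical sidecar name
    for bar in range(2):
        append_snapshot(path, {
            "timestamp": 2.0 * bar,
            "z": [0.0] * 32,  # 32D latent at the bar boundary
            "dynamics": {"curvature": 0.1, "periodicity": 0.8, "tension": 0.4},
            "activePattern": 'note("c2 ~ ~ ~").s("bass")',
        })
    loaded = load_snapshots(path)
```

Appending one line per bar means a crash mid-session loses at most the record being written, and the file can be streamed during training without loading it whole.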

4.3 Phase 3: Joint Fine-Tuning (Future)

End-to-end training where the conditioning encoder and diffusion model are optimized jointly:
- Loss = denoising loss on tokens + regularization on z* smoothness
- The encoder learns WHAT to extract from z* for music generation
- The diffusion model learns HOW to generate coherent patterns from those features

5. Implementation Plan

5.1 Task 1: Wire LIM-RPS Dynamics to Strudel (No ML)

Replace Conductor threshold rules with continuous dynamics-driven MotionModulate commands.
Use existing Strudel bridge. Map:
- periodicity -> rhythmic density (pattern complexity)
- curvature -> melodic range (interval sizes)
- tension -> harmonic tension (dissonance level)
- grounding -> bass weight (low-frequency emphasis)
- prediction_confidence -> arrangement stability

5.2 Task 2: z* Trajectory Logger

In AudioEngine, at bar boundaries:

```swift
/// One record per bar boundary, serialized as a single JSONL line.
struct BarSnapshot: Codable {
    let timestamp: Double
    let z: [Float]                     // 32D DELL latent
    let dynamics: Dynamics             // curvature, periodicity, tension, etc. (must also be Codable)
    let activePattern: String          // Strudel pattern code
    let instrumentActivations: [Float] // 12 values
    let genre: String
    let bpm: Float
}
```

Append to JSONL sidecar file per session.

5.3 Task 3: Token Vocabulary Definition

Shared definition in Rust (for CC crate), Swift (for iOS), Python (for training):
- Rust: enum MusicToken with 81 variants
- Swift: MusicToken enum mirroring Rust
- Python: vocabulary dict with encode/decode

5.4 Task 4: Conditioning Encoder

Python (MLX on Mac5):
- MLP: 104 -> 256 -> 512 -> 768
- Train on z* trajectories from logger
- Export to CoreML for on-device option

5.5 Task 5: Discrete Diffusion Model

GETMusic-style on Mac5 (MLX):
- Roformer, 86M params
- Discrete categorical diffusion, T=100
- Input: 768D conditioning from encoder
- Output: 12 x 32 token grid
- Training: MLX distributed on Mac4+Mac5

5.6 Task 6: Wire to Strudel

Token grid -> Strudel mini-notation:

Token grid row (BASS track, 32 timesteps):
  [C2, ~, ~, ~, C2, ~, ~, ~, E2, ~, ~, ~, G2, ~, ~, ~, ...]

Strudel pattern:
  note("c2 ~ ~ ~ c2 ~ ~ ~ e2 ~ ~ ~ g2 ~ ~ ~").s("bass")

Ship via NATS from Mac5 -> iOS, consume in StrudelEngine at bar boundary.
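The row-to-pattern conversion for a pitched track can be sketched as a one-line join. This is a hypothetical stand-in for `TokenGrid.to_strudel()`, whose actual behavior is not shown here. It handles only rests and natural-note pitch tokens, and it leaves drum tracks, accidentals, and the dynamics and effects tokens out of scope.

```python
def row_to_strudel(tokens, sound):
    """Render one pitched track as a Strudel note() pattern string.

    Assumption: each token is either the rest "~" or a pitch name such as "C2".
    """
    steps = " ".join(tok.lower() for tok in tokens)
    return f'note("{steps}").s("{sound}")'

bass_row = ["C2", "~", "~", "~", "C2", "~", "~", "~",
            "E2", "~", "~", "~", "G2", "~", "~", "~"]
pattern = row_to_strudel(bass_row, "bass")
```

The 16 tokens above reproduce the first bar of the example row. A full 32-step row would render the same way, with all 32 steps inside one `note(...)` string.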

6. Echelon Evolution Path

### Current (v1): Threshold Conductor
z* -> rules -> SetParam commands -> Strudel renders

### Next (v2): Equilibrium-Driven Patterns
z* dynamics -> continuous MotionModulate -> Strudel renders with smooth modulation

### Future (v3): Diffusion Token Generation
z* -> conditioning encoder -> discrete diffusion -> token grid -> Strudel Eval

### Research (v4): Joint Equilibrium
z and x co-evolve. B (CrossModalOperator) and s_theta (diffusion score) trained jointly.
The performer-system loop becomes a formal coupled dynamical system with provable convergence.

7. Key Equations Summary

DELL Fast Equilibrium:
h* = tanh(W_in @ limbs + W_rec @ h* + b + 0.3 * y_slow)

DELL Slow Equilibrium:
y* = tanh(W @ mean(x_fast) + U @ y* + b)

Brain Latent Blend:
z = 0.7 x_fast + 0.3 y_slow

LIM-RPS Fixed-Point:
z* = prox_tau(z* - gamma B(z*))

Curvature:
kappa = |v x a| / |v|^3

Diffusion Reverse Step:
x_{t-1} = x_t + epsilon * s_theta(x_t, t, z*) + sqrt(2*epsilon) * noise

Coupled Equilibrium (v4):
z* = argmin_z L_motion(z, body) + lambda L_coupling(z, x*)
x* = argmin_x L_music(x), conditioned on z*

8. Research Grounding (from Deep Literature Survey)

8.1 Explicit DEQ+Diffusion Papers

DEQ-DDIM (Pokle, Geng, Kolter, NeurIPS 2022): Recasts the entire DDIM sampling chain
as a joint multivariate fixed-point system z = F(z). Anderson acceleration converges in
far fewer than T sequential steps, enabling 2x faster parallel sampling.

GET (Geng et al., NeurIPS 2023): Generative Equilibrium Transformer — infinite-depth
weight-tied transformer solving z = f(z, noise, class). Single-step generation via
DEQ fixed-point, trained with direct L1 reconstruction. Matches 5x larger ViT in FID.

Equilibrium Matching (arXiv 2510.02300, 2025): The cleanest statement. Replaces
time-conditional diffusion entirely with a time-invariant energy landscape E(x).
Sampling = gradient descent on E. Data points are fixed points where ||nabla E|| ≈ 0.
Convergence: O(1/K) for L-smooth energy.

8.2 Coupled Equilibrium Convergence

Bhaskar-Lakshmikantham Theorem (Nonlinear Analysis, 2006): Let X be a partially ordered
complete metric space and F: X × X → X a coupled operator with the mixed monotone property.
If the product-space contraction holds for comparable pairs:

d(F(x1,y1), F(x2,y2)) <= (k/2) [d(x1,x2) + d(y1,y2)],  k < 1

then a coupled fixed point (x*, y*) exists; it is unique under an additional comparability
condition, and the iteration converges geometrically.

Application to CC: Define F_motion(z, m) and F_music(m, z) as the motion and music
equilibrium operators. A sufficient condition for convergence is:

||dF_motion/dz|| + ||dF_motion/dm|| < 1
||dF_music/dm|| + ||dF_music/dz|| < 1

Both row sums bound the spectral radius of the block Jacobian, so rho(J) < 1, which
guarantees joint convergence near the fixed point. The coupling strength (0.3 slow→fast,
0.5 fast→slow in DELL) must stay below the contraction gap of each individual process,
that is, below 1 minus the norm of that process's own Jacobian.
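The row-sum test and the spectral radius it bounds can be checked on a 2 x 2 matrix of Jacobian norms. In the sketch below the coupling entries reuse the DELL gains (0.3 and 0.5) as stand-ins for the cross-Jacobian norms, and the self-Jacobian norms 0.4 and 0.3 are purely hypothetical. Measured values would have to come from the trained operators.

```python
import math

def row_sum_condition(self_z, cross_zm, self_m, cross_mz):
    """Sufficient condition: each process's own norm plus its incoming coupling is < 1."""
    return (self_z + cross_zm < 1.0) and (self_m + cross_mz < 1.0)

def spectral_radius_2x2(a, b, c, d):
    """Largest eigenvalue magnitude of [[a, b], [c, d]]."""
    trace, det = a + d, a * d - b * c
    disc = trace * trace - 4.0 * det
    if disc >= 0.0:
        root = math.sqrt(disc)
        return max(abs((trace + root) / 2.0), abs((trace - root) / 2.0))
    return math.sqrt(det)  # complex pair: |lambda|^2 equals the determinant

# Norm matrix N = [[self_z, cross_zm], [cross_mz, self_m]] (illustrative numbers).
self_z, cross_zm = 0.4, 0.3
self_m, cross_mz = 0.3, 0.5

ok = row_sum_condition(self_z, cross_zm, self_m, cross_mz)
rho = spectral_radius_2x2(self_z, cross_zm, cross_mz, self_m)
max_row_sum = max(self_z + cross_zm, self_m + cross_mz)

# A stiffer motion process (self norm 0.8) leaves a gap of only 0.2,
# which the 0.3 coupling exceeds.
too_stiff = row_sum_condition(0.8, 0.3, 0.3, 0.5)
```

The spectral radius comes out below the largest row sum, which is why the row-sum test is sufficient but not necessary. A system can fail it and still converge.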

8.3 Discrete Diffusion: SEDD over D3PM

SEDD (Lou et al., ICML 2024 Best Paper): Learns the "concrete score" — probability
ratios p_t(y)/p_t(x) — the exact discrete analogue of the continuous score function.
Reverse CTMC rates = forward rates × concrete score, directly paralleling how the
continuous reverse SDE = forward SDE + score. This makes SEDD the cleanest bridge
between the DEQ-diffusion equivalence and discrete token generation.

Recommendation: Use SEDD's concrete score parameterization for the music token
diffusion model rather than D3PM's transition matrices, as it preserves the
theoretical connection to the equilibrium framework.
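The "forward rates × concrete score" relation can be made concrete on a three-state toy distribution. The sketch assumes the exact marginal `p_t` is known and that the forward process jumps uniformly between distinct states. In SEDD the ratios are learned by a network, so this only illustrates the formula and is not the training method. All names are hypothetical.

```python
def concrete_score(p, x):
    """Ratios p(y) / p(x) for every state y other than x."""
    return {y: p[y] / p[x] for y in range(len(p)) if y != x}

def reverse_rates(p, forward_rate, x):
    """Reverse rate of jumping x -> y: forward rate of y -> x times p(y) / p(x)."""
    return {y: forward_rate(y, x) * ratio for y, ratio in concrete_score(p, x).items()}

def uniform_rate(src, dst):
    # Assumption: the forward process jumps between any two distinct states at rate 1.
    return 1.0 if src != dst else 0.0

p_t = [0.6, 0.3, 0.1]  # illustrative marginal at some noise level t

# Starting from the least likely state, the reverse process prefers likelier targets.
rates_from_rare = reverse_rates(p_t, uniform_rate, x=2)
rates_from_common = reverse_rates(p_t, uniform_rate, x=0)
```

The reverse rates out of the rare state are large and the rates out of the common state are small, so the reverse chain drifts toward high-probability tokens. This is the discrete counterpart of following the score uphill.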

8.4 Existing Motion+Music Coupled Diffusion

MoMu-Diffusion (NeurIPS 2024): Two expert diffusion models (motion-to-music,
music-to-motion) combined via cross-guidance at sampling time. Stage 1: unconditional.
Stage 2 (after critical timestep T^c): classifier-free guidance with scale s.
Engineering solution without formal convergence proof.

9. Implementation Status

Task	Status	Artifact
Wire LIM-RPS dynamics to Strudel	DONE	conductor.rs: 15 continuous params, strudel.rs: DynamicsInput + 16 default mappings
z* trajectory logger	DONE	TrajectoryLogger.swift: JSONL at bar boundaries
Token vocabulary	DONE	music_tokens.rs (81 tokens, 6 tests), MusicToken.swift, music_tokens.py
Conditioning encoder	DONE	conditioning_encoder.py: MLX, 554K params, 104→768
Discrete diffusion model	PENDING	Needs recorded session data for training
Wire to Strudel	PENDING	TokenGrid.to_strudel() ready, needs diffusion output

---

Document: Equilibrium Diffusion Theory for Computational Choreography
Author: Mohamed Diomande / Claude
Date: 2026-04-02
Status: Living document, updated as implementation progresses

Promotion Decision

Keep as idea/proposal unless evidence and implementation anchors exist.

Source Anchor

MotionMix/EQUILIBRIUM-DIFFUSION-THEORY.md
