Grand Diomande Research · Full HTML Reader

Path A: Flow Matching Architecture Upgrade

Embodied Trajectory Systems architecture technical paper candidate score 28 .md

## Core Thesis
Replace CC-MotionGen's DDPM/DDIM diffusion backbone with Optimal Transport Conditional Flow Matching (OT-CFM), targeting a projected 5-75x inference speedup (10 steps down to 1 step, versus 50-step DDIM) while preserving the dual-stage validation advantage.

## Architecture Design

### Current → New

CURRENT: Audio → AudioConditioner → U-Net 1D (DDPM, 50 steps) → 25D motion
NEW:     Audio → AudioConditioner → Motion DiT (Flow Matching, 1-10 steps) → 25D motion

### Motion DiT (Diffusion Transformer for Motion)
Replace U-Net 1D with a transformer-based architecture:

Input: x_t (B, T, 25) noised motion + t (scalar timestep) + c (B, T, 256) audio context

Layers (8 blocks):
  ├─ AdaLN-Zero (adaptive layer norm from timestep embedding)
  ├─ Multi-Head Self-Attention (8 heads, dim=256) over temporal axis
  ├─ Cross-Attention to audio context c (8 heads)
  ├─ FiLM modulation from audio (preserved from current system)
  └─ MLP (256 → 1024 → 256, GELU)

Output: v_θ(x_t, t, c) velocity field prediction (B, T, 25)

Parameter count: ~15M (vs current U-Net ~20M). Lighter due to no skip connections.
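To make the block layout above concrete, here is a minimal PyTorch sketch of a single block. It is an illustration under assumptions, not the project's implementation: the class name `MotionDiTBlock`, the position of the FiLM step between cross-attention and the MLP, the zero initialisation of the FiLM projection, and the requirement that the audio context has the same length T as the motion tokens are all choices made here for the example.

```python
import torch
import torch.nn as nn


class MotionDiTBlock(nn.Module):
    """One block: AdaLN-Zero -> self-attn -> cross-attn -> FiLM -> MLP (sketch)."""

    def __init__(self, dim: int = 256, heads: int = 8, mlp_dim: int = 1024):
        super().__init__()
        # Norms carry no learned affine; scale and shift come from the timestep.
        self.norm_sa = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_ca = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm_mlp = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )
        # AdaLN-Zero: shift, scale and gate for each of the three sub-layers,
        # zero-initialised so every residual branch starts switched off.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 9 * dim))
        nn.init.zeros_(self.ada[1].weight)
        nn.init.zeros_(self.ada[1].bias)
        # FiLM: per-frame scale and shift predicted from the audio context
        # (zero init is an assumption made for this sketch).
        self.film = nn.Linear(dim, 2 * dim)
        nn.init.zeros_(self.film.weight)
        nn.init.zeros_(self.film.bias)

    def forward(self, h, t_emb, c):
        # h: (B, T, dim) motion tokens, t_emb: (B, dim), c: (B, T, dim) audio.
        mods = self.ada(t_emb).unsqueeze(1).chunk(9, dim=-1)
        sh1, sc1, g1, sh2, sc2, g2, sh3, sc3, g3 = mods
        x = self.norm_sa(h) * (1 + sc1) + sh1
        h = h + g1 * self.self_attn(x, x, x, need_weights=False)[0]
        x = self.norm_ca(h) * (1 + sc2) + sh2
        h = h + g2 * self.cross_attn(x, c, c, need_weights=False)[0]
        gamma, beta = self.film(c).chunk(2, dim=-1)
        h = h * (1 + gamma) + beta
        x = self.norm_mlp(h) * (1 + sc3) + sh3
        return h + g3 * self.mlp(x)
```

With these layer sizes the sketch has about 1.8M parameters per block, so roughly 14M over 8 blocks before the input/output projections and the timestep embedding are added, which is in line with the ~15M estimate. Because all gates start at zero, a freshly initialised block is the identity map, the usual motivation for AdaLN-Zero.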

### Flow Matching Training

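```python
import torch
import torch.nn.functional as F

# Conditional flow matching loss on the linear (OT) path; replaces the DDPM MSE loss
x_1 = motion_data                           # clean motion, shape (B, T, 25)
x_0 = torch.randn_like(x_1)                 # standard Gaussian noise
B = x_1.shape[0]
t = torch.rand(B, 1, 1, device=x_1.device)  # uniform in [0, 1), broadcast over (T, 25)
x_t = (1 - t) * x_0 + t * x_1               # linear interpolation between noise and data
v_target = x_1 - x_0                        # constant velocity along each path

v_pred = model(x_t, t, audio_context)
loss = F.mse_loss(v_pred, v_target)  # + structure regularizers from current system
```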

Key advantage: straight-line interpolation paths → fewer steps needed for quality.
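For reference, the objective computed by the training snippet can be written out with the same symbols ($x_0$ noise, $x_1$ clean motion, $c$ audio context). This is the standard conditional flow matching form for the linear path, given here as a sketch rather than a quote from the project's code; the structure regularizers are omitted.

```latex
x_t = (1 - t)\,x_0 + t\,x_1, \qquad \frac{\mathrm{d}x_t}{\mathrm{d}t} = x_1 - x_0,

\mathcal{L}_{\mathrm{CFM}}(\theta)
  = \mathbb{E}_{\,t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\mathrm{data}}}
    \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 .
```

Along any single $(x_0, x_1)$ pair the target velocity does not depend on $t$, so one Euler step would be exact for that pair. The learned field $v_\theta$ averages over many pairs whose paths cross, which is why extra solver steps can still improve quality.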

### Inference (ODE Solver)

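```python
import torch

# Euler solver (1 step): integrate dx/dt = v from t=0 to t=1 in one jump
x_0 = torch.randn(B, T, 25)
t = torch.zeros(B, 1, 1)
v = model(x_0, t, audio_context)
x_1 = x_0 + v  # single step!

# Euler solver (4 steps, higher quality). A true midpoint solver would call
# the model twice per step, doubling the cost at the same step count.
x_t = torch.randn(B, T, 25)
for t_i in [0.0, 0.25, 0.5, 0.75]:
    t = torch.full((B, 1, 1), t_i)
    v = model(x_t, t, audio_context)
    x_t = x_t + 0.25 * v
```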
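The two solver variants named in this plan can be folded into one fixed-step integrator. The sketch below is a hedged illustration: `sample_ode` and its parameters are hypothetical names, and `velocity` stands in for the trained model, assumed here to accept a plain Python float for `t`. It only uses arithmetic on `x`, so it works on tensors, arrays, or scalars alike.

```python
from typing import Callable


def sample_ode(velocity: Callable, x, cond, steps: int = 4, method: str = "euler"):
    """Integrate dx/dt = velocity(x, t, cond) from t=0 to t=1 with fixed steps."""
    if method not in ("euler", "midpoint"):
        raise ValueError(f"unknown method: {method}")
    h = 1.0 / steps
    for i in range(steps):
        t = i * h
        v = velocity(x, t, cond)
        if method == "midpoint":
            # Re-evaluate at the half step; this costs two model calls per step.
            v = velocity(x + 0.5 * h * v, t + 0.5 * h, cond)
        x = x + h * v
    return x
```

Note the cost difference: `steps=4` with `method="midpoint"` makes eight model calls, so latency estimates that assume one call per step apply to the Euler variant only.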

### Speed Projections
| Steps | Quality (est. FID) | Latency (GPU) | vs Current |
|-------|-------------------|---------------|------------|
| 1 | ~0.3-0.5 | ~40ms | 50-75x faster |
| 4 | ~0.1-0.2 | ~160ms | 12-18x faster |
| 10 | ~0.05-0.1 | ~400ms | 5-7x faster |
| Current (DDIM 50) | baseline | ~2500ms | 1x |
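The latency column follows a simple linear model: about 40 ms per model call, one call per step. The short script below reproduces the arithmetic so the projections can be re-derived when either number changes. The constant and function names are illustrative, and the 40 ms per-step figure is inferred from the table rather than measured here.

```python
CURRENT_MS = 2500.0  # DDIM 50-step latency from the table
MS_PER_STEP = 40.0   # assumed cost of one Motion DiT forward pass


def projected(steps: int) -> tuple:
    """Return (latency in ms, speedup over the current system) for a step count."""
    latency = steps * MS_PER_STEP
    return latency, CURRENT_MS / latency


for steps in (1, 4, 10):
    latency, speedup = projected(steps)
    print(f"{steps:>2} steps: {latency:.0f} ms, {speedup:.2f}x faster")
```

The point estimates (62.5x, 15.625x and 6.25x for 1, 4 and 10 steps) sit inside the ranges quoted in the "vs Current" column.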

### Preservation Strategy
1. SanityChecker: Unchanged. Physics validation is architecture-agnostic.
2. MusicalityScorer: Unchanged. Scoring operates on output trajectories.
3. AudioConditioner: Reused. Conv1D encoder → 256-dim context feeds cross-attention.
4. MotionDecoder: Reused. 7 semantic heads decode from 25D regardless of generation method.
5. Speculative sampling: Now 10-50x cheaper per candidate. Can increase K from 4 to 16.
6. RAG++/MPMS: Selection scoring unchanged. Prior injection adapts to flow matching conditioning.
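The interaction between cheaper candidates and the validation stages can be sketched as a best-of-K loop. This assumes the dual-stage validation amounts to a pass/fail gate followed by a scalar score; the source does not spell out the interfaces, so `best_of_k`, `generate`, `passes_sanity` and `score` are hypothetical stand-ins for the sampler, SanityChecker and scorer.

```python
from typing import Callable, Optional, TypeVar

Traj = TypeVar("Traj")


def best_of_k(
    generate: Callable[[], Traj],
    passes_sanity: Callable[[Traj], bool],
    score: Callable[[Traj], float],
    k: int = 16,
) -> Optional[Traj]:
    """Draw k candidates, keep those that pass validation, return the top scorer."""
    best, best_score = None, float("-inf")
    for _ in range(k):
        candidate = generate()
        if not passes_sanity(candidate):
            continue  # stage 1: physics/sanity gate rejects the candidate
        s = score(candidate)  # stage 2: scalar quality score
        if s > best_score:
            best, best_score = candidate, s
    return best  # None if every candidate failed the gate
```

As a rough budget check using the projected latencies: 16 candidates at 4 steps cost about 16 × 160 ms ≈ 2.6 s when generated one after another, close to the ~2.5 s of a single DDIM-50 sample today. Batching candidates on the GPU would likely lower this further, though that is not measured here.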

### Migration Plan
1. Week 1: Implement MotionDiT in `model/dit.py`. Keep U-Net untouched.
2. Week 2: Implement flow matching loss in `training/flow_losses.py`.
3. Week 3: Train from scratch on existing phrase data. Compare against DDIM baseline.
4. Week 4: Distill U-Net teacher → DiT student if quality gap exists.
5. Week 5: Integrate into inference pipeline. A/B test with validation scores.

### Risks
- Flow matching on the custom 25D representation is untested (published motion flow matching work mostly targets the 263D HumanML3D features or SMPL parameters)
- DiT may need more data than U-Net for equivalent quality
- 1-step generation may produce artifacts in quaternion channels (discontinuities)
- Training from scratch required — no weight transfer from U-Net
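One possible mitigation for the quaternion risk is a post-processing pass that renormalises each quaternion and removes sign flips between frames (q and -q encode the same rotation, so a sign flip looks like a jump in the raw channels). The sketch below is an assumption-heavy illustration: the source does not specify where quaternions sit in the 25D vector, so the function takes an already-sliced `(T, 4)` array, and `fix_quaternions` is a hypothetical helper. It addresses only these two generic failure modes, not necessarily every artifact 1-step generation might produce.

```python
import numpy as np


def fix_quaternions(q: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """q: (T, 4) per-frame quaternions. Returns unit quaternions with consistent sign."""
    # Project each frame back onto the unit sphere.
    q = q / np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), eps)
    out = q.copy()
    for i in range(1, len(out)):
        # Pick the representative (q or -q) closer to the previous frame.
        if np.dot(out[i], out[i - 1]) < 0:
            out[i] = -out[i]
    return out
```

Whether this step belongs before or after SanityChecker is a design decision left open here; running it first would stop the checker from rejecting samples for a defect that is trivially repairable.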

### Files to Create/Modify
- NEW: `model/dit.py` — Motion DiT architecture
- NEW: `model/flow_matching.py` — OT-CFM forward/reverse process
- NEW: `training/flow_losses.py` — Flow matching loss with structure regularizers
- MODIFY: `inference/sampler.py` — Add Euler/midpoint ODE solvers
- MODIFY: `config.py` — Add flow matching config block
- MODIFY: `scripts/train.py` — Flow matching training entry point
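As a starting point for the flow matching config block, a small frozen dataclass could collect the values already fixed in this note. The class and field names below are hypothetical; the architecture defaults come from the Motion DiT description, while the sampling defaults (`num_steps`, `solver`) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowMatchingConfig:
    # Architecture values taken from the Motion DiT description.
    motion_dim: int = 25
    context_dim: int = 256
    model_dim: int = 256
    num_blocks: int = 8
    num_heads: int = 8
    mlp_dim: int = 1024
    # Sampling defaults are assumptions, not values from the source.
    num_steps: int = 4
    solver: str = "euler"  # or "midpoint"

    def __post_init__(self):
        # Fail early on settings the model or sampler could not honour.
        if self.model_dim % self.num_heads != 0:
            raise ValueError("model_dim must be divisible by num_heads")
        if self.solver not in ("euler", "midpoint"):
            raise ValueError(f"unknown solver: {self.solver}")
        if self.num_steps < 1:
            raise ValueError("num_steps must be >= 1")
```

Validating in `__post_init__` keeps a mistyped solver name or step count from surfacing only at inference time.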

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

omega-output/cc-motion-gen-20260321/02-evolution/stage1-path-a-flow-matching.md
