# Path A: Flow Matching Architecture Upgrade
## Core Thesis
Replace CC-MotionGen's DDPM/DDIM diffusion backbone with Optimal Transport Conditional Flow Matching (OT-CFM), targeting a projected 5-75x inference speedup (10 steps down to 1 step; see Speed Projections) while preserving the dual-stage validation advantage.
## Architecture Design

### Current → New

```
CURRENT: Audio → AudioConditioner → U-Net 1D (DDPM, 50 steps) → 25D motion
NEW:     Audio → AudioConditioner → Motion DiT (Flow Matching, 1-10 steps) → 25D motion
```

### Motion DiT (Diffusion Transformer for Motion)
Replace U-Net 1D with a transformer-based architecture:
```
Input:  x_t (B, T, 25) noised motion + t (scalar timestep) + c (B, T, 256) audio context
Layers (8 blocks):
├─ AdaLN-Zero (adaptive layer norm from timestep embedding)
├─ Multi-Head Self-Attention (8 heads, dim=256) over temporal axis
├─ Cross-Attention to audio context c (8 heads)
├─ FiLM modulation from audio (preserved from current system)
└─ MLP (256 → 1024 → 256, GELU)
Output: v_θ(x_t, t, c) velocity field prediction (B, T, 25)
```

Parameter count: ~15M (vs current U-Net ~20M). Lighter due to no skip connections.
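The block layout above can be sketched in PyTorch as follows. This is a minimal illustration, not the project's implementation: the class name `MotionDiTBlock`, the placement of FiLM just before the MLP, and the assumption that the timestep embedding already has width 256 are all guesses. The 25 → 256 input projection and 256 → 25 output head are not shown.

```python
import torch
import torch.nn as nn


class MotionDiTBlock(nn.Module):
    """Hypothetical block: AdaLN-Zero -> self-attn -> cross-attn -> FiLM -> MLP."""

    def __init__(self, dim=256, heads=8, mlp_dim=1024):
        super().__init__()
        # Norms carry no learned affine; scale/shift come from the timestep.
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.film = nn.Linear(dim, 2 * dim)  # per-frame scale/shift from audio
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_dim), nn.GELU(), nn.Linear(mlp_dim, dim)
        )
        # AdaLN-Zero: shift, scale and gate for each of the 3 sub-layers.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 9 * dim))
        nn.init.zeros_(self.ada[1].weight)  # zero init => block starts as identity
        nn.init.zeros_(self.ada[1].bias)

    def forward(self, h, t_emb, c):
        # h: (B, T, dim) motion tokens, t_emb: (B, dim), c: (B, T, dim) audio
        mods = self.ada(t_emb).unsqueeze(1).chunk(9, dim=-1)  # each (B, 1, dim)
        s1, k1, g1, s2, k2, g2, s3, k3, g3 = mods
        x = self.norm1(h) * (1 + k1) + s1
        h = h + g1 * self.self_attn(x, x, x, need_weights=False)[0]
        x = self.norm2(h) * (1 + k2) + s2
        h = h + g2 * self.cross_attn(x, c, c, need_weights=False)[0]
        x = self.norm3(h) * (1 + k3) + s3
        gamma, beta = self.film(c).chunk(2, dim=-1)
        x = x * (1 + gamma) + beta  # assumed FiLM placement: before the MLP
        return h + g3 * self.mlp(x)
```

Under these assumptions one block has roughly 1.8M parameters, so 8 blocks come to about 14M, which is in line with the ~15M estimate. Because the gates are zero-initialized, a freshly constructed block returns its input unchanged, which is the usual motivation for AdaLN-Zero.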
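### Flow Matching Training

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, motion_data, audio_context):
    """OT-CFM loss (replaces DDPM MSE loss)."""
    B = motion_data.shape[0]
    t = torch.rand(B, 1, 1, device=motion_data.device)  # uniform in [0, 1)
    x_0 = torch.randn_like(motion_data)  # standard Gaussian noise
    x_1 = motion_data                    # clean motion, (B, T, 25)
    x_t = (1 - t) * x_0 + t * x_1        # linear interpolation (optimal transport path)
    v_target = x_1 - x_0                 # constant velocity along the path
    v_pred = model(x_t, t, audio_context)
    return F.mse_loss(v_pred, v_target)  # + structure regularizers from current system
```

Key advantage: the interpolation paths are straight lines, so the ODE solver needs fewer steps to reach a given quality.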
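The regression target follows from differentiating the interpolation path with respect to t. Written out, as a restatement of the training code with the structure regularizers omitted and the data distribution written generically:

```latex
\begin{align}
x_t &= (1 - t)\,x_0 + t\,x_1,
  \qquad \frac{\mathrm{d}x_t}{\mathrm{d}t} = x_1 - x_0, \\
\mathcal{L}(\theta) &= \mathbb{E}_{t \sim \mathcal{U}[0,1],\; x_0 \sim \mathcal{N}(0, I),\; x_1 \sim p_{\text{data}}}
  \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 .
\end{align}
```

`F.mse_loss` averages over all elements rather than summing, so the code matches this objective up to a constant factor.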
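### Inference (ODE Solver)

```python
import torch

# Euler solver (1 step): integrate from t=0 to t=1 in a single update
x_0 = torch.randn(B, T, 25)
t = torch.zeros(B, 1, 1)
v = model(x_0, t, audio_context)
x_1 = x_0 + v

# Euler solver (4 steps, higher quality). One model call per step;
# a true midpoint solver would need two calls per step.
x_t = torch.randn(B, T, 25)
for t_val in [0.0, 0.25, 0.5, 0.75]:
    t = torch.full((B, 1, 1), t_val)
    v = model(x_t, t, audio_context)
    x_t = x_t + 0.25 * v
```

### Speed Projections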
| Steps | Quality (est. FID, lower is better) | Latency (GPU, est.) | Speedup vs current |
|-------|-------------------------------------|---------------------|--------------------|
| 1 | ~0.3-0.5 | ~40 ms | 50-75x |
| 4 | ~0.1-0.2 | ~160 ms | 12-18x |
| 10 | ~0.05-0.1 | ~400 ms | 5-7x |
| 50 (current, DDIM) | baseline | ~2500 ms | 1x |
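The latency column appears to assume a fixed cost of about 40 ms per model call, with one call per step. The short calculation below reproduces the table's numbers under that assumption. The function name `projected` and the `calls_per_step` parameter are illustrative only.

```python
BASELINE_MS = 2500   # current DDIM, 50 steps (from the table)
MODEL_CALL_MS = 40   # implied by the 1-step row


def projected(steps, calls_per_step=1):
    """Return (latency_ms, speedup), assuming latency scales with model calls."""
    latency_ms = steps * calls_per_step * MODEL_CALL_MS
    return latency_ms, BASELINE_MS / latency_ms


for steps in (1, 4, 10):
    latency_ms, speedup = projected(steps)
    print(f"{steps:>2} steps: ~{latency_ms} ms, ~{speedup:.1f}x faster")
```

The point estimates (about 62x, 16x and 6x) sit inside the ranges in the table. Under the same model, a solver that calls the network twice per step, such as midpoint, would cost about 320 ms at 4 steps.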
### Preservation Strategy
1. SanityChecker: Unchanged. Physics validation is architecture-agnostic.
2. MusalityScorer: Unchanged. Scoring operates on output trajectories.
3. AudioConditioner: Reused. Conv1D encoder → 256-dim context feeds cross-attention.
4. MotionDecoder: Reused. 7 semantic heads decode from 25D regardless of generation method.
5. Speculative sampling: Now 10-50x cheaper per candidate. Can increase K from 4 to 16.
6. RAG++/MPMS: Selection scoring unchanged. Prior injection adapts to flow matching conditioning.
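Because validation and scoring operate only on output trajectories, candidate selection can stay independent of the generator. A minimal sketch of best-of-K selection follows. The callables `generate`, `is_valid` and `score` are hypothetical stand-ins for the sampler, SanityChecker and MusalityScorer, whose real interfaces are not given here.

```python
def select_best(candidates, is_valid, score):
    """Return the highest-scoring candidate that passes validation, or None."""
    valid = [m for m in candidates if is_valid(m)]
    return max(valid, key=score) if valid else None


def best_of_k(generate, is_valid, score, k=16):
    """Draw k candidates from generate() and keep the best valid one."""
    return select_best([generate() for _ in range(k)], is_valid, score)
```

With the projected latencies, K=16 one-step candidates would cost about 16 × 40 ms = 640 ms of generation if sampled sequentially, compared with about 4 × 2500 ms = 10 s for K=4 under the current sampler.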
### Migration Plan
1. Week 1: Implement MotionDiT in `model/dit.py`. Keep U-Net untouched.
2. Week 2: Implement flow matching loss in `training/flow_losses.py`.
3. Week 3: Train from scratch on existing phrase data. Compare against DDIM baseline.
4. Week 4: Distill U-Net teacher → DiT student if quality gap exists.
5. Week 5: Integrate into inference pipeline. A/B test with validation scores.
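For the Week 4 distillation step, one possible approach (an assumption, not something the plan specifies) is to train the student on paired data. Run the U-Net teacher with a deterministic sampler, record each starting noise together with the motion it produces, and apply the flow matching loss to those pairs instead of independently sampled noise. The function name and tensor layout below are illustrative.

```python
import torch
import torch.nn.functional as F


def paired_distillation_loss(student, x_0, x_teacher, audio_context):
    """Flow matching loss on (noise, teacher sample) pairs.

    x_0:       (B, T, 25) noise that seeded the teacher's deterministic sampler
    x_teacher: (B, T, 25) motion the teacher produced from that same noise
    """
    B = x_0.shape[0]
    t = torch.rand(B, 1, 1, device=x_0.device)
    x_t = (1 - t) * x_0 + t * x_teacher   # straight path between the pair
    v_target = x_teacher - x_0            # velocity that maps noise to the sample
    return F.mse_loss(student(x_t, t, audio_context), v_target)
```

This only makes sense if the teacher's sampler is deterministic given its starting noise (for example DDIM without added noise), so that each noise maps to a single target.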
### Risks
- Flow matching on the custom 25D representation is uncharted (published work mostly targets the SMPL-based 263D representation)
- DiT may need more data than U-Net for equivalent quality
- 1-step generation may produce artifacts in quaternion channels (discontinuities)
- Training from scratch required — no weight transfer from U-Net
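One common mitigation for the quaternion risk is to post-process generated quaternion channels. Renormalize each quaternion to unit length, then remove sign flips between consecutive frames, since q and -q encode the same rotation. The sketch below assumes the quaternion occupies four contiguous channels. Which channels those are in the 25D layout is not stated here, so `quat_slice` is a placeholder.

```python
import torch


def fix_quaternions(motion, quat_slice):
    """Renormalize one quaternion block and remove sign flips over time.

    motion:     (B, T, D) generated motion
    quat_slice: slice selecting the 4 quaternion channels (hypothetical layout)
    """
    q = motion[..., quat_slice]
    q = q / q.norm(dim=-1, keepdim=True).clamp_min(1e-8)  # project to unit norm
    # Dot product of each frame with its predecessor; negative means a flip.
    dots = (q[:, 1:] * q[:, :-1]).sum(dim=-1)             # (B, T-1)
    flips = 1.0 - 2.0 * (dots < 0).to(q.dtype)            # -1 where flipped
    # Cumulative product propagates each flip to all later frames.
    signs = torch.cat([torch.ones_like(flips[:, :1]), flips.cumprod(dim=1)], dim=1)
    out = motion.clone()
    out[..., quat_slice] = q * signs.unsqueeze(-1)
    return out
```

This cleans up sign and norm artifacts only; it does not correct a rotation that is simply wrong.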
### Files to Create/Modify
- NEW: `model/dit.py` — Motion DiT architecture
- NEW: `model/flow_matching.py` — OT-CFM forward/reverse process
- NEW: `training/flow_losses.py` — Flow matching loss with structure regularizers
- MODIFY: `inference/sampler.py` — Add Euler/midpoint ODE solvers
- MODIFY: `config.py` — Add flow matching config block
- MODIFY: `scripts/train.py` — Flow matching training entry point
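For the `inference/sampler.py` change, a fixed-step integrator covering both Euler and midpoint might look like the sketch below. The function name `integrate` and its signature are assumptions. `velocity_fn(x, t)` is expected to wrap the model call, including building the timestep tensor and passing the audio context.

```python
def integrate(velocity_fn, x_0, num_steps=4, solver="euler"):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed steps."""
    if solver not in ("euler", "midpoint"):
        raise ValueError(f"unknown solver: {solver}")
    h = 1.0 / num_steps
    x = x_0
    for i in range(num_steps):
        t = i * h
        v = velocity_fn(x, t)
        if solver == "midpoint":
            # Take a half Euler step, then re-evaluate the field at the midpoint.
            v = velocity_fn(x + 0.5 * h * v, t + 0.5 * h)
        x = x + h * v
    return x
```

The function only relies on addition and scalar multiplication, so it works on tensors as well as plain floats. A wrapper such as `lambda x, t: model(x, torch.full((x.shape[0], 1, 1), t), audio_context)` would adapt it to the call convention used in the training code. Midpoint makes two model calls per step, so it should be compared against Euler at equal call counts rather than equal step counts.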
## Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.

## Source Anchor
`omega-output/cc-motion-gen-20260321/02-evolution/stage1-path-a-flow-matching.md`