Stage 4: FORGE — CC-MotionGen V2 Architecture
**CC-MotionGen V2 = Flow Matching DiT + Two-Tier Deployment + Multi-Modal Conditioning + Physics-Grounded Learned Validation + Sensor Capture Flywheel**
The system is built on 4 pillars:
1. 25D Motion Protocol (unchanged, competitive moat)
2. Flow Matching DiT (replaces DDPM, 100x faster)
3. Learned Quality System (validation advantage evolved)
4. iPhone Capture Studio (data flywheel, first-mover)
---
Architecture Overview

    ┌─────────────────────────────────────────────────────────────────────┐
    │                           CC-MotionGen V2                           │
    │                                                                     │
    │  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐ │
    │  │ CONDITIONING │     │  GENERATION  │     │      VALIDATION      │ │
    │  │              │     │              │     │                      │ │
    │  │ Text (CLIP)  │────→│ Motion DiT   │────→│ SanityChecker        │ │
    │  │ Audio (163ch)│────→│ (Flow Match) │     │ (deterministic)      │ │
    │  │ Motion (25D) │────→│              │     │          ↓           │ │
    │  │ Task Token   │────→│ 1-10 steps   │     │ Learned Critic       │ │
    │  │              │     │ ODE solve    │     │ (quality+confidence) │ │
    │  └──────────────┘     └──────┬───────┘     │          ↓           │ │
    │                              │             │ ProximalCorrector    │ │
    │                              ↓             │ (physics refinement) │ │
    │                     ┌────────────────┐     └──────────┬───────────┘ │
    │                     │ Motion Decoder │                │             │
    │                     │ (7 heads)      │←───────────────┘             │
    │                     └────────┬───────┘                              │
    │                              ↓                                      │
    │                     ┌────────────────┐                              │
    │                     │ 25D Motion     │                              │
    │                     │ Protocol       │                              │
    │                     └────────┬───────┘                              │
    │                              │                                      │
    │          ┌───────────────────┼───────────────────┐                  │
    │          ↓                   ↓                   ↓                  │
    │ ┌────────────────┐  ┌────────────────┐  ┌────────────────┐          │
    │ │ CompCoreBridge │  │ Echelon        │  │ USDZ Export    │          │
    │ │ (camera ctrl)  │  │ (music gen)    │  │ (AR/VR)        │          │
    │ └────────────────┘  └────────────────┘  └────────────────┘          │
    └─────────────────────────────────────────────────────────────────────┘

    ┌─────────────────────────────────────────────────────────────────────┐
    │                        SENSOR CAPTURE STUDIO                        │
    │                                                                     │
    │  iPhone Camera → Vision Pose → 3D Lifting → ┐                       │
    │  Apple Watch   → IMU (accel+gyro) ────────→ EKF Fusion → 25D        │
    │  AirPods Pro   → Head Tracking ───────────→ ┘                       │
    │                                                                     │
    │  25D → Cleanup → Beat Sync → Training Data Store                    │
    │                            → Real-time CompCoreBridge               │
    │                            → Auto-Caption → Text-Motion Pairs       │
    └─────────────────────────────────────────────────────────────────────┘

---
Component Specifications
1. Motion DiT (Diffusion Transformer)
Replaces: `model/unet.py` (U-Net 1D)
File: `model/dit.py` (new)
Architecture:

    Input:  x_t (B, T, 25) + t (B,) + conditions
    Output: v_θ (B, T, 25) velocity field

    Stem: Linear(25, 256) + sinusoidal pos encoding
    Blocks (N=8):
      ├─ AdaLN-Zero (timestep t → scale, shift, gate)
      ├─ Self-Attention (8 heads, dim=256, causal=False)
      ├─ Cross-Attention to conditioning (text + audio)
      │   ├─ Text: CLIP embedding (1, 768) → Linear(768, 256) → global token
      │   └─ Audio: AudioConditioner (B, T, 163) → Conv1D → (B, T, 256)
      ├─ FiLM modulation (preserved from V1 for backward compat)
      └─ MLP (256 → 1024 → 256, GELU)
    Head: Linear(256, 25)
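
Of the entries in the block list, AdaLN-Zero is the one that differs most from the V1 FiLM path, so a minimal sketch may help. It is written in plain Python on a single token vector rather than in PyTorch, and `layer_norm`, `adaln_zero_branch`, and `toy_sublayer` are illustrative names, not the contents of `model/dit.py`. It shows the property that gives the scheme its name: when the projection that produces (shift, scale, gate) is zero-initialized, the gate is zero and every block starts out as the identity.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def adaln_zero_branch(x, sublayer, shift, scale, gate):
    """One residual branch: x + gate * sublayer(LN(x) * (1 + scale) + shift)."""
    h = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(h)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]

def toy_sublayer(h):
    """Stand-in for attention or the MLP (hypothetical)."""
    return [2.0 * v + 1.0 for v in h]

x = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(x)

# Zero-initialized modulation: gate == 0, so the branch is an exact identity.
y_init = adaln_zero_branch(x, toy_sublayer, shift=zeros, scale=zeros, gate=zeros)

# Once training moves the gate away from zero, the sublayer starts to contribute.
y_open = adaln_zero_branch(x, toy_sublayer, shift=zeros, scale=zeros, gate=[0.1] * len(x))
```

In the real block the three modulation vectors would come from a linear projection of the timestep embedding, with one such branch wrapped around each of attention and the MLP.
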
Task conditioning (Full tier only):

    task_embedding = Embedding(6, 256)  # GEN, EDIT, INBETWEEN, STYLE, PREDICT, AUTOCOMPLETE
    Added to timestep embedding before AdaLN-Zero

Mask conditioning (Full tier only):

    mask (B, T, 25) binary → concatenated with x_t → stem becomes Linear(50, 256)

Parameters:

    Tiny tier: 4 blocks, dim=128 → ~2M params
    Full tier: 8 blocks, dim=256 → ~25M params

2. Flow Matching Engine
Replaces: `model/diffusion.py` (GaussianDiffusion)
File: `model/flow_matching.py` (new)
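
A small worked case may make the few-step claim concrete before the engine code. Assume, purely for illustration, that the data distribution is a single scalar `target`. Along the OT path x_t = (1 - t)·x_0 + t·target the exact velocity is v(x, t) = (target - x) / (1 - t), and an Euler solver on a uniform grid lands on the target for any step count, including one. Real motion data is not a point mass, so a learned field is only approximately straight; the sketch shows the mechanism, not a guarantee. The function names are hypothetical.

```python
import random

def exact_velocity(x, t, target):
    """Marginal OT-CFM velocity when all data mass sits at `target` (toy assumption)."""
    return (target - x) / (1.0 - t)

def euler_sample(x0, target, steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with `steps` uniform Euler steps."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * exact_velocity(x, i * dt, target)
    return x

random.seed(0)
target = 0.7
starts = [random.gauss(0.0, 1.0) for _ in range(5)]  # noise samples x_0
for steps in (1, 4, 10):
    ends = [euler_sample(x0, target, steps) for x0 in starts]
    worst = max(abs(e - target) for e in ends)
    print(f"steps={steps:2d}  worst error={worst:.2e}")
```

The error stays at floating-point level for every step count because each Euler step moves along the same straight line that the path itself follows.
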
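
    import torch
    import torch.nn.functional as F


    class OptimalTransportCFM:
        """Conditional Flow Matching with optimal transport paths."""

        def __init__(self, model, null_condition=None):
            self.model = model                    # v_θ(x_t, t, condition) → (B, T, 25)
            self.null_condition = null_condition  # "dropped" condition used for CFG

        def training_step(self, x_1, condition):
            """x_1 is clean motion data, shape (B, T, 25)."""
            # t has shape (B, 1, 1) so it broadcasts over frames and channels.
            t = torch.rand(x_1.shape[0], 1, 1, device=x_1.device)
            x_0 = torch.randn_like(x_1)
            x_t = (1 - t) * x_0 + t * x_1  # OT interpolation
            v_target = x_1 - x_0           # constant velocity
            v_pred = self.model(x_t, t, condition)
            loss = F.mse_loss(v_pred, v_target)
            return loss  # + structure regularizers from V1

        @torch.no_grad()
        def sample(self, condition, shape, steps=4, device=None):
            """Generate motion via ODE integration; shape is (B, T, 25)."""
            x = torch.randn(shape, device=device)
            dt = 1.0 / steps
            for i in range(steps):
                t = torch.full((shape[0], 1, 1), i * dt, device=device)
                v = self.model(x, t, condition)
                x = x + dt * v  # Euler step
            return x

        @torch.no_grad()
        def sample_cfg(self, condition, shape, steps=4, cfg_scale=3.0, device=None):
            """Classifier-free guided sampling; shape is (B, T, 25)."""
            x = torch.randn(shape, device=device)
            dt = 1.0 / steps
            for i in range(steps):
                t = torch.full((shape[0], 1, 1), i * dt, device=device)
                v_cond = self.model(x, t, condition)
                v_uncond = self.model(x, t, self.null_condition)
                v = v_uncond + cfg_scale * (v_cond - v_uncond)
                x = x + dt * v
            return x

3. Multi-Modal Conditioning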
Extends: `model/conditioning.py`
Text Path:

    CLIP ViT-L/14 (frozen) → (B, 768) → Linear(768, 256) → global context token
    Dropout: 10% during training (for CFG)

Audio Path (preserved from V1):

    163-channel features → AudioConditioner (Conv1D stack) → (B, T, 256)
    Dropout: 10% during training (for CFG)
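
The 10% dropout on both paths is what makes classifier-free guidance possible at sampling time: one network learns both a conditional and an unconditional velocity. A minimal sketch in plain Python follows. `NULL`, `drop_condition`, and `guided_velocity` are hypothetical names, and dropping the two modalities independently is an assumption; the source states only the 10% rate.

```python
import random

NULL = None  # stand-in for the learned null/empty condition (assumption)

def drop_condition(text, audio, p=0.1, rng=random):
    """Independently replace each modality with NULL with probability p."""
    text = NULL if rng.random() < p else text
    audio = NULL if rng.random() < p else audio
    return text, audio

def guided_velocity(v_cond, v_uncond, cfg_scale=3.0):
    """CFG combination: v_uncond + cfg_scale * (v_cond - v_uncond), elementwise."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]

rng = random.Random(0)
trials = 20000
dropped_text = sum(
    drop_condition("text_emb", "audio_feat", rng=rng)[0] is NULL for _ in range(trials)
)
print(f"text dropped in {dropped_text / trials:.1%} of training samples")

v_c, v_u = [1.0, 2.0], [0.5, 0.0]
print(guided_velocity(v_c, v_u, cfg_scale=1.0))  # scale 1 reduces to the conditional field
print(guided_velocity(v_c, v_u, cfg_scale=0.0))  # scale 0 reduces to the unconditional field
```

Scales above 1 extrapolate past the conditional prediction, which is the regime the default `cfg_scale=3.0` operates in.
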
Motion Path (for editing/prediction tasks):

    Input motion (B, T, 25) → Linear(25, 256) → temporal encoding
    Mask applied: known positions attend, unknown positions learn

Fusion (in DiT cross-attention):

    Q = motion representation (from self-attention)
    K, V = concat([text_token, audio_sequence])  # variable-length key/value
    Multi-modal attention learns which modality to attend to per frame

4. Two-Tier Deployment
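
The split between the two tiers specified in this section amounts to a routing rule. A sketch follows, assuming a hypothetical `select_tier` helper and string task names. The task sets are taken from the tier specs; how requests are actually dispatched, and the choice to prefer on-device execution when both tiers support a task, are assumptions.

```python
TINY_TASKS = {"PREDICT", "CLEANUP", "BEAT_SYNC"}
FULL_TASKS = {"GEN", "EDIT", "INBETWEEN", "STYLE", "PREDICT", "AUTOCOMPLETE"}

def select_tier(task, full_tier_reachable=True):
    """Pick a tier for a request; prefer on-device when the Tiny tier supports the task."""
    if task in TINY_TASKS:
        return "tiny"
    if task in FULL_TASKS:
        if not full_tier_reachable:
            raise RuntimeError(f"{task} needs the Full tier, which is unreachable")
        return "full"
    raise ValueError(f"unknown task: {task}")

for task in ("PREDICT", "BEAT_SYNC", "GEN", "STYLE"):
    print(task, "->", select_tier(task))
```
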
Tiny Tier (On-Device)

    Model: MotionDiT-Tiny (4 blocks, dim=128, 2M params)
    Training: Consistency distillation from Full teacher
    Inference: 1-step, <100ms on iPhone 15 Pro
    Format: CoreML (.mlpackage), 6-bit palettized
    Tasks: PREDICT, CLEANUP (subset of EDIT), BEAT_SYNC
    Validation: SanityChecker only (deterministic, 5ms)
    Integration: CompCoreBridge.swift direct model call

Full Tier (Cloud/Mac)

    Model: MotionDiT-Full (8 blocks, dim=256, 25M params)
    Training: Flow matching + multi-task + DPO alignment
    Inference: 4-10 steps, 160-400ms on GPU
    Format: PyTorch (cloud), MLX (Mac4/Mac5)
    Tasks: All 6 (GEN, EDIT, INBETWEEN, STYLE, PREDICT, AUTOCOMPLETE)
    Validation: SanityChecker + Learned Critic + ProximalCorrector
    Integration: FastAPI/WebSocket server (existing choreo_server.py pattern)

5. Validation System V2
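
The three layers in this section form a short-circuiting cascade: a deterministic gate, a learned score, and a correction step that runs only when the score is low. The control flow can be sketched in plain Python. All names (`validate`, `range_check`, `toy_critic`, `proximal_correct`), the [-1, 1] range, the 0.5 threshold, and the penalty weights are assumptions for illustration. Only the 50 iterations and lr=0.01 come from the spec, and the critic term of the corrector objective is omitted so the toy gradient stays analytic.

```python
import math

def range_check(motion, lo=-1.0, hi=1.0, slack=0.5):
    """Layer 1 stand-in: reject NaN/Inf and values far outside [lo, hi]."""
    return all(math.isfinite(v) and lo - slack <= v <= hi + slack for v in motion)

def violation(motion, lo=-1.0, hi=1.0):
    """Sum of squared distances outside [lo, hi]; a toy 'physics' penalty."""
    return sum(max(0.0, v - hi) ** 2 + max(0.0, lo - v) ** 2 for v in motion)

def toy_critic(motion):
    """Layer 2 stand-in: quality in (0, 1], lower when the penalty is larger."""
    return 1.0 / (1.0 + 10.0 * violation(motion))

def proximal_correct(motion, iters=50, lr=0.01, w_phys=10.0, w_prox=1.0, lo=-1.0, hi=1.0):
    """Layer 3 stand-in: gradient descent on w_phys*violation + w_prox*||x - x_orig||^2."""
    x = list(motion)
    for _ in range(iters):
        for i, (v, orig) in enumerate(zip(x, motion)):
            g_phys = 2.0 * (max(0.0, v - hi) - max(0.0, lo - v))
            g_prox = 2.0 * (v - orig)
            x[i] = v - lr * (w_phys * g_phys + w_prox * g_prox)
    return x

def validate(motion, threshold=0.5):
    """Return (status, motion). The deterministic gate always runs first."""
    if not range_check(motion):
        return "rejected", None
    if toy_critic(motion) >= threshold:
        return "accepted", motion
    return "corrected", proximal_correct(motion)

print(validate([0.1, float("nan")])[0])
print(validate([0.2, -0.4, 0.9])[0])
status, fixed = validate([0.2, 1.4, -0.3])
print(status, [round(v, 3) for v in fixed])
```

The proximity term keeps in-range values untouched and pulls the out-of-range value only as far as the penalty weights demand, which is the behavior the "deviation from original" term in the spec is meant to enforce.
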

    ┌────────────────────────────────────────────────────┐
    │ Layer 1: SanityChecker (UNCHANGED, deterministic)  │
    │   NaN/Inf, range, jerk, quaternion, phase, vel     │
    │   → Binary PASS/FAIL                               │
    │   → On-device: 5ms                                 │
    └───────────────────┬────────────────────────────────┘
                        ↓ (only if PASS)
    ┌────────────────────────────────────────────────────┐
    │ Layer 2: LearnedCritic (NEW)                       │
    │   Transformer: (motion, audio) → quality (0-1)     │
    │   + per-dimension confidence (25D)                 │
    │   Bootstrapped from MusicalityScorer, then human   │
    │   preference DPO                                   │
    │   → Scalar quality score                           │
    │   → Full tier only: 20ms                           │
    └───────────────────┬────────────────────────────────┘
                        ↓ (if quality < threshold)
    ┌────────────────────────────────────────────────────┐
    │ Layer 3: ProximalCorrector (NEW)                   │
    │   Gradient-based refinement (50 iters, lr=0.01)    │
    │   Minimizes: physics violations + deviation from   │
    │   original - critic score                          │
    │   → Corrected motion                               │
    │   → Full tier only: 50-100ms                       │
    └────────────────────────────────────────────────────┘

6. Sensor Capture Studio
Pipeline (on-device, real-time):

    Camera 30fps → Vision BodyPose 17j 2D → PoseLiftingNet 3D → ┐
    Watch 60Hz   → IMU (accel, gyro) ─────────────────────────→ EKF → 25D @ 30fps
    AirPods      → Head quaternion ───────────────────────────→ ┘
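
The fusion stage combines a camera-derived estimate with IMU and head-tracking measurements. The sketch below shows only the scalar measurement update that a Kalman-style filter is built from, in plain Python; an EKF applies the same update after linearizing its motion and measurement models. The single-angle state, the variances, and the function name are assumptions chosen for illustration, not the production filter.

```python
def kalman_update(mean, var, z, z_var):
    """Fuse a prior N(mean, var) with a measurement z of variance z_var."""
    gain = var / (var + z_var)           # Kalman gain: how much to trust the measurement
    new_mean = mean + gain * (z - mean)  # move the estimate toward the measurement
    new_var = (1.0 - gain) * var         # fused estimate is more certain than the prior
    return new_mean, new_var

# Prior: a joint angle from the lifted camera pose (noisy under occlusion).
mean, var = 0.80, 0.04
# Measurement: the same angle from the watch IMU (assumed more precise here).
mean, var = kalman_update(mean, var, z=1.00, z_var=0.01)
print(mean, var)
```

The fused variance is smaller than either input variance, which is why adding the watch and AirPods streams can tighten the estimate even when the camera view is poor.
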
Augmentation modes:
1. Cleanup: Savitzky-Golay smooth + jerk limit + ground snap
2. Occluded infill: mask low-confidence joints → Full model INBETWEEN
3. Style amplify: raw motion → Full model STYLE task
4. Beat sync: DTW alignment to audio beat grid
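
Of the four modes, beat sync is the least self-explanatory. A minimal dynamic time warping sketch in plain Python follows; it aligns detected motion accent times to a beat grid. The function name `dtw_align`, the use of accent timestamps as the feature, and the absolute-difference cost are assumptions, since the source names only the technique.

```python
def dtw_align(accents, beats):
    """Return (total_cost, path) aligning accent times to beat times with classic DTW."""
    n, m = len(accents), len(beats)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(accents[i - 1] - beats[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    # Backtrack from the end to recover which accent maps to which beat.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path

beats = [0.0, 0.5, 1.0, 1.5]        # 120 BPM grid, in seconds
accents = [0.04, 0.47, 1.08, 1.52]  # motion accents drifting around the grid
total, path = dtw_align(accents, beats)
snapped = [beats[j] for _, j in path]  # beat time assigned to each accent
print(path, snapped)
```

In a real pipeline the warp path would drive a time remapping of the 25D frames rather than a hard snap, so that motion between accents stays smooth.
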
Data collection (consent-gated):
25D motion + audio features + metadata → on-device storage
Auto-caption via LLM → text-motion pairs
Upload to training pipeline when on WiFi

---
Anti-Patterns
1. Never replace SanityChecker with learned physics — Deterministic physics checks are trustworthy. Learned validation supplements, never replaces.
2. Never train on SMPL and retarget to 25D — Train on 25D natively. Retarget FROM SMPL for benchmarking only.
3. Never unify Tiny and Full into one model — Different deployment targets need different architectures. One-size-fits-all fails on-device.
4. Never skip the Week 2 flow matching checkpoint — If 25D flow matching doesn't converge, pivot to DDIM consistency distillation immediately.
5. Never drop cc-anticipation (Rust) before learned prediction is validated — Keep it as production fallback.
6. Never collect motion data without user consent — 25D is privacy-preserving but consent is mandatory.
---
File Structure (New/Modified)

    cc_motiongen/
    ├── model/
    │   ├── unet.py  # PRESERVED (fallback)
    │   ├── diffusion.py  # PRESERVED (fallback)
    │   ├── dit.py  # NEW: MotionDiT (Tiny + Full)
    │   ├── flow_matching.py  # NEW: OT-CFM engine
    │   ├── conditioning.py  # MODIFIED: add CLIP text encoder, multi-modal fusion
    │   └── decoder.py  # PRESERVED (7 semantic heads)
    ├── training/
    │   ├── trainer.py  # MODIFIED: flow matching training loop
    │   ├── losses.py  # MODIFIED: add physics losses (momentum, joint, contact)
    │   ├── flow_losses.py  # NEW: flow matching loss
    │   ├── physics.py  # NEW: physics-in-the-loop losses
    │   ├── distill.py  # NEW: consistency distillation (Full→Tiny)
    │   └── dpo.py  # NEW: DPO alignment trainer
    ├── inference/
    │   ├── sampler.py  # MODIFIED: add flow matching ODE solvers
    │   ├── mpms_sampler.py  # PRESERVED
    │   ├── selection.py  # PRESERVED
    │   └── postprocess.py  # PRESERVED
    ├── validation/
    │   ├── sanity.py  # PRESERVED (deterministic)
    │   ├── musicality.py  # PRESERVED (hand-tuned, bootstrap source)
    │   ├── critic.py  # NEW: learned quality critic
    │   ├── corrector.py  # NEW: proximal correction
    │   └── biomechanics.py  # NEW: joint limits, self-penetration, CoM
    ├── capture/  # NEW: sensor pipeline
    │   ├── lifting.py  # 2D→3D pose lifting network
    │   ├── fusion.py  # EKF multi-device fusion (Python bridge to Rust)
    │   ├── cleanup.py  # Motion cleanup (smooth, jerk limit, ground snap)
    │   ├── beat_sync.py  # DTW beat alignment
    │   └── collector.py  # Training data collection with consent
    ├── export/  # NEW: deployment
    │   ├── coreml.py  # PyTorch → CoreML conversion
    │   ├── mlx_model.py  # Native MLX implementation
    │   └── onnx_export.py  # ONNX intermediate
    ├── bridge/  # NEW: 25D ↔ SMPL
    │   ├── smpl_to_25d.py  # SMPL retargeting to 25D
    │   └── twentyfive_to_smpl.py  # 25D expansion to SMPL (for benchmarking)
    ├── config.py  # MODIFIED: add flow matching, DiT, capture configs
    ├── types.py  # MODIFIED: add TaskType enum, CaptureData types
    └── __init__.py  # MODIFIED: export new types

---
Execution Gate (3-lens check)
1. Can this be built with available resources?
YES. Training: Mac4/Mac5 for prototyping, cloud GPU for production runs. iOS: Mac1 for builds. All tooling exists.
2. Are all dependencies satisfiable?
YES. PyTorch (existing), CLIP (pip install), CoreML tools (pip install), cc-collection Rust EKF (exists). No new infrastructure needed.
3. Can Autopilot execute without human intervention?
PARTIALLY. P0 (flow matching) and P4 (unified model) are fully automatable. P1 (CoreML profiling), P2 (sensor testing), P5 (human labels) need human involvement at specific points.
Gate: PASS — proceeding to Stage 5: RAIL.
Promotion Decision
Promote into a technical note or architecture paper with implementation anchors.
Source Anchor
omega-output/cc-motion-gen-20260321/04-architecture.md