Grand Diomande Research · Full HTML Reader

Stage 4: FORGE — CC-MotionGen V2 Architecture

**CC-MotionGen V2 = Flow Matching DiT + Two-Tier Deployment + Multi-Modal Conditioning + Physics-Grounded Learned Validation + Sensor Capture Flywheel**

The system is built on 4 pillars:
1. 25D Motion Protocol (unchanged, competitive moat)
2. Flow Matching DiT (replaces DDPM; samples in 1-10 ODE steps, targeting roughly 100x faster inference)
3. Learned Quality System (validation advantage evolved)
4. iPhone Capture Studio (data flywheel, first-mover)

---

Architecture Overview

┌─────────────────────────────────────────────────────────────────────┐
│                        CC-MotionGen V2                              │
│                                                                     │
│  ┌─────────────┐     ┌──────────────┐     ┌──────────────────────┐ │
│  │ CONDITIONING │     │  GENERATION  │     │    VALIDATION        │ │
│  │              │     │              │     │                      │ │
│  │ Text (CLIP)  │────→│ Motion DiT   │────→│ SanityChecker        │ │
│  │ Audio (163ch)│────→│ (Flow Match) │     │ (deterministic)      │ │
│  │ Motion (25D) │────→│              │     │          ↓           │ │
│  │ Task Token   │────→│ 1-10 steps   │     │ Learned Critic       │ │
│  │              │     │ ODE solve    │     │ (quality+confidence) │ │
│  └─────────────┘     └──────┬───────┘     │          ↓           │ │
│                              │             │ ProximalCorrector    │ │
│                              ↓             │ (physics refinement) │ │
│                     ┌────────────────┐     └──────────┬───────────┘ │
│                     │ Motion Decoder │                 │             │
│                     │ (7 heads)      │←────────────────┘             │
│                     └────────┬───────┘                               │
│                              ↓                                       │
│                     ┌────────────────┐                               │
│                     │ 25D Motion     │                               │
│                     │ Protocol       │                               │
│                     └────────┬───────┘                               │
│                              │                                       │
│              ┌───────────────┼───────────────┐                       │
│              ↓               ↓               ↓                       │
│    ┌──────────────┐ ┌──────────────┐ ┌──────────────┐               │
│    │ CompCoreBridge│ │ Echelon     │ │ USDZ Export  │               │
│    │ (camera ctrl)│ │ (music gen) │ │ (AR/VR)      │               │
│    └──────────────┘ └──────────────┘ └──────────────┘               │
└─────────────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────────────┐
│                    SENSOR CAPTURE STUDIO                            │
│                                                                     │
│  iPhone Camera → Vision Pose → 3D Lifting → ┐                      │
│  Apple Watch  → IMU (accel+gyro) ──────────→ EKF Fusion → 25D     │
│  AirPods Pro  → Head Tracking ─────────────→ ┘                     │
│                                                                     │
│  25D → Cleanup → Beat Sync → Training Data Store                   │
│                           → Real-time CompCoreBridge                │
│                           → Auto-Caption → Text-Motion Pairs       │
└─────────────────────────────────────────────────────────────────────┘

---

Component Specifications

1. Motion DiT (Diffusion Transformer)

Replaces: `model/unet.py` (U-Net 1D)
File: `model/dit.py` (new)

Architecture (Full-tier dimensions shown; the Tiny tier uses dim=128):
  Input:  x_t (B, T, 25) + t (B,) + conditions
  Output: v_θ (B, T, 25) velocity field

  Stem: Linear(25, 256) + sinusoidal pos encoding
  Blocks (N=8):
    ├─ AdaLN-Zero (timestep t → scale, shift, gate)
    ├─ Self-Attention (8 heads, dim=256, causal=False)
    ├─ Cross-Attention to conditioning (text + audio)
    │   ├─ Text: CLIP embedding (B, 768) → Linear(768, 256) → global token (B, 1, 256)
    │   └─ Audio: AudioConditioner (B, T, 163) → Conv1D → (B, T, 256)
    ├─ FiLM modulation (preserved from V1 for backward compat)
    └─ MLP (256 → 1024 → 256, GELU)
  Head: Linear(256, 25)

  Task conditioning (Full tier only):
    task_embedding = Embedding(6, 256)  # GEN, EDIT, INBETWEEN, STYLE, PREDICT, AUTOCOMPLETE
    Added to timestep embedding before AdaLN-Zero

  Mask conditioning (Full tier only):
    mask (B, T, 25) binary → concatenated with x_t → stem becomes Linear(50, 256)

Parameters:
  Tiny tier: 4 blocks, dim=128 → ~2M params
  Full tier: 8 blocks, dim=256 → ~25M params
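
The AdaLN-Zero step in each block can be illustrated without a deep-learning framework. The sketch below is a minimal plain-Python version of one residual branch, assuming the usual DiT formulation in which the timestep embedding produces a scale, a shift and a gate, all zero-initialized. The names `layer_norm`, `adaln_zero_residual` and `toy_sublayer` are hypothetical and do not come from `model/dit.py`.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a feature vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def adaln_zero_residual(x, sublayer, scale, shift, gate):
    """One AdaLN-Zero branch: x + gate * sublayer(LN(x) * (1 + scale) + shift)."""
    modulated = [n * (1.0 + s) + b for n, s, b in zip(layer_norm(x), scale, shift)]
    out = sublayer(modulated)
    return [xi + g * oi for xi, g, oi in zip(x, gate, out)]


def toy_sublayer(h):
    """Stand-in for self-attention or the MLP (assumption: any function of h works here)."""
    return [2.0 * v + 1.0 for v in h]


x = [0.5, -1.0, 2.0, 0.0]
zeros = [0.0] * len(x)

# At initialization scale, shift and gate are all zero, so the branch is the identity.
at_init = adaln_zero_residual(x, toy_sublayer, zeros, zeros, zeros)

# Once the gate has moved away from zero, the sublayer output is mixed in.
trained = adaln_zero_residual(x, toy_sublayer, [0.1] * 4, zeros, [0.5] * 4)
```

Starting every block as an identity map is the reason the "Zero" variant is usually preferred for deep transformer stacks: early training behaves like a shallow network and depth is switched on gradually.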

2. Flow Matching Engine

Replaces: `model/diffusion.py` (GaussianDiffusion)
File: `model/flow_matching.py` (new)

python

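import torch
import torch.nn.functional as F


class OptimalTransportCFM:
    """Conditional Flow Matching with optimal transport paths."""

    def __init__(self, model, null_condition, motion_dim=25):
        self.model = model                    # v_θ(x_t, t, condition) -> (B, T, motion_dim)
        self.null_condition = null_condition  # "no conditioning" input used for CFG
        self.motion_dim = motion_dim

    def training_step(self, x_1, condition):
        """x_1 is clean motion data, shape (B, T, 25)."""
        B = x_1.shape[0]
        t = torch.rand(B, device=x_1.device)  # t ~ U(0, 1), shape (B,) as the DiT expects
        t_b = t.view(B, 1, 1)                 # broadcast over time and feature axes
        x_0 = torch.randn_like(x_1)           # noise endpoint of the path
        x_t = (1 - t_b) * x_0 + t_b * x_1     # OT interpolation
        v_target = x_1 - x_0                  # constant velocity

        v_pred = self.model(x_t, t, condition)
        loss = F.mse_loss(v_pred, v_target)
        return loss  # + structure regularizers from V1

    @torch.no_grad()
    def sample(self, condition, B, T, steps=4, device="cpu"):
        """Generate motion via ODE integration from noise (t=0) to data (t=1)."""
        x = torch.randn(B, T, self.motion_dim, device=device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((B,), i * dt, device=device)
            v = self.model(x, t, condition)
            x = x + dt * v  # Euler step
        return x

    @torch.no_grad()
    def sample_cfg(self, condition, B, T, steps=4, cfg_scale=3.0, device="cpu"):
        """Classifier-free guided sampling."""
        x = torch.randn(B, T, self.motion_dim, device=device)
        dt = 1.0 / steps
        for i in range(steps):
            t = torch.full((B,), i * dt, device=device)
            v_cond = self.model(x, t, condition)
            v_uncond = self.model(x, t, self.null_condition)
            v = v_uncond + cfg_scale * (v_cond - v_uncond)  # extrapolate toward the condition
            x = x + dt * v
        return x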
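
To see why a handful of Euler steps can be enough on straight OT paths, the following toy example integrates an idealized velocity field toward a single known target. This is a sketch under a strong assumption: a real learned field averages over many targets and is not perfectly straight, which is presumably why the Full tier budgets 4-10 steps. `ideal_velocity` and `euler_sample` are hypothetical names, written in plain Python so the example runs without PyTorch.

```python
def ideal_velocity(x, t, x_1):
    """Velocity of the straight path x_t = (1 - t) * x_0 + t * x_1, rewritten in terms of x_t.

    Since x_1 - x_t = (1 - t) * (x_1 - x_0), the constant velocity x_1 - x_0
    equals (x_1 - x_t) / (1 - t).
    """
    return [(target - xi) / (1.0 - t) for xi, target in zip(x, x_1)]


def euler_sample(x_0, x_1, steps):
    """Integrate from t=0 to t=1 with fixed-size Euler steps, as in the sampler above."""
    x = list(x_0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t never reaches 1, so the division in ideal_velocity is safe
        v = ideal_velocity(x, t, x_1)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


noise = [0.3, -1.2, 0.8]    # stands in for x_0 ~ N(0, I)
target = [1.0, 0.0, -0.5]   # stands in for one clean motion frame x_1

# Because the velocity is constant along a straight path, Euler is exact
# (up to floating-point rounding) for any step count, including a single step.
results = {steps: euler_sample(noise, target, steps) for steps in (1, 4, 10)}
```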

3. Multi-Modal Conditioning

Extends: `model/conditioning.py`

Text Path:
  CLIP ViT-L/14 (frozen) → (B, 768) → Linear(768, 256) → global context token
  Dropout: 10% during training (for CFG)

Audio Path (preserved from V1):
  163-channel features → AudioConditioner (Conv1D stack) → (B, T, 256)
  Dropout: 10% during training (for CFG)

Motion Path (for editing/prediction tasks):
  Input motion (B, T, 25) → Linear(25, 256) → temporal encoding
  Mask applied: known positions are held fixed as context; the model generates only the masked (unknown) positions

Fusion (in DiT cross-attention):
  Q = motion representation (from self-attention)
  K, V = concat([text_token, audio_sequence])  # variable-length key/value
  Multi-modal attention learns which modality to attend to per frame
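
The 10% conditioning dropout and the guided combination used at sampling time fit together as sketched below. This is a minimal illustration, assuming each modality is dropped independently and that a dropped modality is replaced by a null placeholder; neither detail is confirmed by the spec above. `NULL`, `drop_conditions` and `cfg_velocity` are hypothetical names.

```python
import random

NULL = None  # hypothetical stand-in for a learned "no condition" embedding


def drop_conditions(text, audio, p_drop=0.1, rng=random):
    """Independently replace each modality with NULL with probability p_drop."""
    text_out = NULL if rng.random() < p_drop else text
    audio_out = NULL if rng.random() < p_drop else audio
    return text_out, audio_out


def cfg_velocity(v_cond, v_uncond, cfg_scale):
    """Guided velocity: start at the unconditional prediction and move toward the conditional one."""
    return [u + cfg_scale * (c - u) for c, u in zip(v_cond, v_uncond)]


# Empirical check of the training-time drop rate with a seeded generator.
rng = random.Random(0)
trials = 10_000
dropped_text = sum(
    drop_conditions("clip_embedding", "audio_features", rng=rng)[0] is NULL
    for _ in range(trials)
)
drop_rate = dropped_text / trials

# cfg_scale = 0 ignores the condition, 1 reproduces it, larger values extrapolate past it.
unguided = cfg_velocity([1.0, 2.0], [0.0, 0.0], 0.0)
plain = cfg_velocity([1.0, 2.0], [0.0, 0.0], 1.0)
guided = cfg_velocity([1.0, 2.0], [0.0, 0.0], 3.0)
```

Training with dropped conditions is what gives the model a usable unconditional prediction, which the guided sampler then needs at every step.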

4. Two-Tier Deployment

Tiny Tier (On-Device)

Model: MotionDiT-Tiny (4 blocks, dim=128, 2M params)
Training: Consistency distillation from Full teacher
Inference: 1-step, <100ms on iPhone 15 Pro
Format: CoreML (.mlpackage), 6-bit palettized
Tasks: PREDICT, CLEANUP (subset of EDIT), BEAT_SYNC
Validation: SanityChecker only (deterministic, 5ms)
Integration: CompCoreBridge.swift direct model call

Full Tier (Cloud/Mac)

Model: MotionDiT-Full (8 blocks, dim=256, 25M params)
Training: Flow matching + multi-task + DPO alignment
Inference: 4-10 steps, 160-400ms on GPU
Format: PyTorch (cloud), MLX (Mac4/Mac5)
Tasks: All 6 (GEN, EDIT, INBETWEEN, STYLE, PREDICT, AUTOCOMPLETE)
Validation: SanityChecker + Learned Critic + ProximalCorrector
Integration: FastAPI/WebSocket server (existing choreo_server.py pattern)
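
One way to encode the two tiers and route a request between them is sketched here. The field values are copied from the specifications above; the `TierSpec` type, the `select_tier` function and the `prefer_on_device` flag are assumptions made for illustration, not part of the existing codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TierSpec:
    name: str
    blocks: int
    dim: int
    steps: tuple        # (min, max) ODE steps at inference
    tasks: frozenset    # task names the tier accepts
    validators: tuple   # validation layers run on the output


TINY = TierSpec(
    name="tiny", blocks=4, dim=128, steps=(1, 1),
    tasks=frozenset({"PREDICT", "CLEANUP", "BEAT_SYNC"}),
    validators=("SanityChecker",),
)
FULL = TierSpec(
    name="full", blocks=8, dim=256, steps=(4, 10),
    tasks=frozenset({"GEN", "EDIT", "INBETWEEN", "STYLE", "PREDICT", "AUTOCOMPLETE"}),
    validators=("SanityChecker", "LearnedCritic", "ProximalCorrector"),
)


def select_tier(task, prefer_on_device=True):
    """Pick the on-device tier when it supports the task, otherwise fall back to Full."""
    if prefer_on_device and task in TINY.tasks:
        return TINY
    if task in FULL.tasks:
        return FULL
    raise ValueError(f"no tier supports task {task!r}")


predict_tier = select_tier("PREDICT")   # both tiers support it; on-device wins
gen_tier = select_tier("GEN")           # only the Full tier supports generation
```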

5. Validation System V2

┌──────────────────────────────────────────────────┐
│ Layer 1: SanityChecker (UNCHANGED — deterministic)│
│   NaN/Inf, range, jerk, quaternion, phase, vel   │
│   → Binary PASS/FAIL                             │
│   → On-device: 5ms                               │
└───────────────────┬──────────────────────────────┘
                    ↓ (only if PASS)
┌──────────────────────────────────────────────────┐
│ Layer 2: LearnedCritic (NEW)                      │
│   Transformer: (motion, audio) → quality (0-1)   │
│   + per-dimension confidence (25D)                │
│   Bootstrapped from MusicalityScorer, then human  │
│   preference DPO                                  │
│   → Scalar quality score                          │
│   → Full tier only: 20ms                          │
└───────────────────┬──────────────────────────────┘
                    ↓ (if quality < threshold)
┌──────────────────────────────────────────────────┐
│ Layer 3: ProximalCorrector (NEW)                  │
│   Gradient-based refinement (50 iters, lr=0.01)  │
│   Minimizes: physics violations + deviation from  │
│   original - critic score                         │
│   → Corrected motion                              │
│   → Full tier only: 50-100ms                      │
└──────────────────────────────────────────────────┘
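
The three layers can be read as a short cascade. The sketch below uses toy stand-ins for each layer and keeps the 50 iterations and lr=0.01 from the diagram; the hard limit, the soft limit, the loss weights, the quality threshold and the toy critic are all assumptions. The corrector here omits the "minus critic score" term, since that would need a differentiable critic.

```python
import math


def sanity_check(motion, hard_limit=1.0):
    """Layer 1 stand-in: binary PASS/FAIL on NaN/Inf and value range."""
    return all(math.isfinite(v) and abs(v) <= hard_limit for frame in motion for v in frame)


def toy_critic(motion, soft_limit=0.8):
    """Layer 2 stand-in: quality in [0, 1], lower when values exceed a soft limit."""
    worst = max(max(abs(v) - soft_limit, 0.0) for frame in motion for v in frame)
    return max(0.0, 1.0 - 5.0 * worst)


def proximal_correct(motion, soft_limit=0.8, iters=50, lr=0.01, w_phys=10.0, w_dev=1.0):
    """Layer 3 stand-in: gradient descent on w_phys * excess^2 + w_dev * (x - original)^2."""
    original = [list(frame) for frame in motion]
    x = [list(frame) for frame in motion]
    for _ in range(iters):
        for f, frame in enumerate(x):
            for d, v in enumerate(frame):
                excess = max(abs(v) - soft_limit, 0.0)
                grad_phys = 2.0 * w_phys * excess * math.copysign(1.0, v)
                grad_dev = 2.0 * w_dev * (v - original[f][d])
                frame[d] = v - lr * (grad_phys + grad_dev)
    return x


def validate(motion, critic=toy_critic, threshold=0.5):
    """Run the cascade: hard checks first, critic second, correction only when needed."""
    if not sanity_check(motion):
        return None, "FAIL"
    if critic(motion) >= threshold:
        return motion, "PASS"
    return proximal_correct(motion), "CORRECTED"


corrected, status = validate([[0.1, 0.95], [0.2, -0.3]])
clean_status = validate([[0.1, 0.2]])[1]
broken = validate([[float("nan"), 0.0]])
```

Values already inside the soft limit receive a zero gradient and are left untouched, which is the "deviation from original" term doing its job: only the offending dimensions move.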

6. Sensor Capture Studio

Pipeline (on-device, real-time):
  Camera 30fps → Vision BodyPose 17j 2D → PoseLiftingNet 3D → ┐
  Watch 60Hz   → IMU (accel, gyro) ────────────────────────────→ EKF → 25D @ 30fps
  AirPods      → Head quaternion ──────────────────────────────→ ┘

Augmentation modes:
  1. Cleanup: Savitzky-Golay smooth + jerk limit + ground snap
  2. Occluded infill: mask low-confidence joints → Full model INBETWEEN
  3. Style amplify: raw motion → Full model STYLE task
  4. Beat sync: DTW alignment to audio beat grid
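
For the beat sync mode, a textbook dynamic time warping routine is enough to show the idea. The sketch assumes the inputs are two short lists of timestamps in seconds, motion accents and audio beats, which is one plausible reading of "DTW alignment to audio beat grid"; the actual features used by `capture/beat_sync.py` are not specified here, and `dtw` is a hypothetical helper.

```python
def dtw(a, b):
    """Classic O(len(a) * len(b)) DTW; returns (total_cost, path of (i, j) index pairs)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the end, always moving to the cheapest predecessor cell.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


accents = [0.05, 0.55, 0.98, 1.52]   # detected motion accents (s), illustrative values
beats = [0.0, 0.5, 1.0, 1.5]         # audio beat grid (s), illustrative values

total_cost, path = dtw(accents, beats)

# Snap each accent to the beat it was matched with.
snapped = [beats[j] for _, j in path]
```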

Data collection (consent-gated):
  25D motion + audio features + metadata → on-device storage
  Auto-caption via LLM → text-motion pairs
  Upload to training pipeline when on WiFi

---

Anti-Patterns

1. Never replace SanityChecker with learned physics — Deterministic physics checks are trustworthy. Learned validation supplements, never replaces.
2. Never train on SMPL and retarget to 25D — Train on 25D natively. Retarget FROM SMPL for benchmarking only.
3. Never unify Tiny and Full into one model — Different deployment targets need different architectures. One-size-fits-all fails on-device.
4. Never skip the Week 2 flow matching checkpoint — If 25D flow matching doesn't converge, pivot to DDIM consistency distillation immediately.
5. Never drop cc-anticipation (Rust) before learned prediction is validated — Keep it as production fallback.
6. Never collect motion data without user consent — 25D is privacy-preserving but consent is mandatory.

---

File Structure (New/Modified)

cc_motiongen/
├── model/
│   ├── unet.py          # PRESERVED (fallback)
│   ├── diffusion.py     # PRESERVED (fallback)
│   ├── dit.py           # NEW — MotionDiT (Tiny + Full)
│   ├── flow_matching.py # NEW — OT-CFM engine
│   ├── conditioning.py  # MODIFIED — add CLIP text encoder, multi-modal fusion
│   └── decoder.py       # PRESERVED (7 semantic heads)
├── training/
│   ├── trainer.py       # MODIFIED — flow matching training loop
│   ├── losses.py        # MODIFIED — add physics losses (momentum, joint, contact)
│   ├── flow_losses.py   # NEW — flow matching loss
│   ├── physics.py       # NEW — physics-in-the-loop losses
│   ├── distill.py       # NEW — consistency distillation (Full→Tiny)
│   └── dpo.py           # NEW — DPO alignment trainer
├── inference/
│   ├── sampler.py       # MODIFIED — add flow matching ODE solvers
│   ├── mpms_sampler.py  # PRESERVED
│   ├── selection.py     # PRESERVED
│   └── postprocess.py   # PRESERVED
├── validation/
│   ├── sanity.py        # PRESERVED (deterministic)
│   ├── musicality.py    # PRESERVED (hand-tuned, bootstrap source)
│   ├── critic.py        # NEW — learned quality critic
│   ├── corrector.py     # NEW — proximal correction
│   └── biomechanics.py  # NEW — joint limits, self-penetration, CoM
├── capture/             # NEW — sensor pipeline
│   ├── lifting.py       # 2D→3D pose lifting network
│   ├── fusion.py        # EKF multi-device fusion (Python bridge to Rust)
│   ├── cleanup.py       # Motion cleanup (smooth, jerk limit, ground snap)
│   ├── beat_sync.py     # DTW beat alignment
│   └── collector.py     # Training data collection with consent
├── export/              # NEW — deployment
│   ├── coreml.py        # PyTorch → CoreML conversion
│   ├── mlx_model.py     # Native MLX implementation
│   └── onnx_export.py   # ONNX intermediate
├── bridge/              # NEW — 25D ↔ SMPL
│   ├── smpl_to_25d.py   # SMPL retargeting to 25D
│   └── twentyfive_to_smpl.py  # 25D expansion to SMPL (for benchmarking)
├── config.py            # MODIFIED — add flow matching, DiT, capture configs
├── types.py             # MODIFIED — add TaskType enum, CaptureData types
└── __init__.py          # MODIFIED — export new types
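
The `TaskType` enum planned for `types.py` could look like the sketch below. The index order follows the comment on the task embedding in the Motion DiT spec; that ordering, and the choice of `IntEnum` so that members can index an embedding table directly, are assumptions.

```python
from enum import IntEnum


class TaskType(IntEnum):
    """Task tokens for the Full tier; values double as embedding indices."""
    GEN = 0
    EDIT = 1
    INBETWEEN = 2
    STYLE = 3
    PREDICT = 4
    AUTOCOMPLETE = 5


NUM_TASKS = len(TaskType)            # matches Embedding(6, 256) in the DiT spec
task_names = [task.name for task in TaskType]
style_index = int(TaskType["STYLE"])
```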

---

Execution Gate (3-lens check)

1. Can this be built with available resources?
YES. Training: Mac4/Mac5 for prototyping, cloud GPU for production runs. iOS: Mac1 for builds. All tooling exists.

2. Are all dependencies satisfiable?
YES. PyTorch (existing), CLIP (pip install), CoreML tools (pip install), cc-collection Rust EKF (exists). No new infrastructure needed.

3. Can Autopilot execute without human intervention?
PARTIALLY. P0 (flow matching) and P4 (unified model) are fully automatable. P1 (CoreML profiling), P2 (sensor testing), P5 (human labels) need human involvement at specific points.

Gate: PASS — proceeding to Stage 5: RAIL.

Promotion Decision

Promote into a technical note or architecture paper with implementation anchors.

Source Anchor

omega-output/cc-motion-gen-20260321/04-architecture.md
