CC-MotionGen Model Architecture Reference
Technical reference for the CC-MotionGen model components
---
Table of Contents
1. [Architecture Overview](#architecture-overview)
2. [GaussianDiffusion](#gaussiandiffusion)
3. [UNet1D](#unet1d)
4. [MotionDecoder](#motiondecoder)
5. [Conditioning System](#conditioning-system)
6. [Motion Representation](#motion-representation)
7. [Inference Pipeline](#inference-pipeline)
---
Architecture Overview
CC-MotionGen uses a three-stage architecture:
Audio Features → Conditioning → UNet1D (Diffusion) → MotionDecoder → Motion Trajectory
↑
Optional MPMS Priors
Component Summary
| Component | Parameters | Location | Purpose |
|---|---|---|---|
| GaussianDiffusion | - | `model/diffusion.py` | Noise scheduling, DDPM/DDIM sampling |
| UNet1D | 116M | `model/unet.py` | Denoising network |
| MotionDecoder | 2M | `model/decoder.py` | Semantic motion mapping |
| AudioConditioner | ~2M | `model/conditioning.py` | FiLM conditioning |
| MPMSConditioner | ~0.5M | `model/conditioning.py` | Memory context blending |
---
GaussianDiffusion
Location: `model/diffusion.py`
The diffusion process that defines the forward (noising) and reverse (denoising) processes.
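As a minimal sketch of the forward (noising) direction, assuming the standard DDPM closed form x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε: the source does not show the body of `q_sample`, so this is an illustration in plain Python on a single scalar, and `q_sample_scalar` is a hypothetical helper name.

```python
import math
import random

def q_sample_scalar(x0, alpha_bar_t, noise=None, rng=random):
    """Noise one scalar value: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps."""
    if noise is None:
        noise = rng.gauss(0.0, 1.0)  # eps ~ N(0, 1), sampled when not supplied
    return math.sqrt(alpha_bar_t) * x0 + math.sqrt(1.0 - alpha_bar_t) * noise

# alpha_bar near 1 (early timestep) keeps most of the signal;
# alpha_bar near 0 (late timestep) returns almost pure noise.
x0, eps = 2.0, 0.5
early = q_sample_scalar(x0, 0.99, noise=eps)
late = q_sample_scalar(x0, 0.01, noise=eps)
```

In the tensor version the same formula would be applied element-wise, with ᾱ_t looked up per batch item from the timestep `t`.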
Configuration
@dataclass
class DiffusionConfig:
    num_timesteps: int = 1000       # Total diffusion steps
    beta_start: float = 0.0001      # Starting noise level
    beta_end: float = 0.02          # Ending noise level
    beta_schedule: str = "cosine"   # "linear" or "cosine"
    # Loss settings
    loss_type: str = "mse"          # "mse" or "l1"
    predict_epsilon: bool = True    # Predict noise vs x0
Key Methods
Forward Process (Noising):
def q_sample(self, x_0: Tensor, t: Tensor, noise: Tensor = None) -> Tensor:
    """
    Sample x_t from q(x_t | x_0).
    Args:
        x_0: Clean motion [B, 25, T]
        t: Timesteps [B]
        noise: Optional noise (sampled if None)
    Returns:
        x_t: Noised motion [B, 25, T]
    """
Loss Computation:
def training_loss(self, model: nn.Module, x_0: Tensor, condition: Tensor) -> Tensor:
    """
    Compute training loss (simplified diffusion loss).
    Args:
        model: UNet1D denoiser
        x_0: Target motion [B, 25, T]
        condition: Audio conditioning [B, C, T]
    Returns:
        loss: MSE between predicted and actual noise
    """
Reverse Process (Sampling):
def ddim_sample(
    self,
    model: nn.Module,
    shape: Tuple[int, ...],
    condition: Tensor,
    num_steps: int = 50,
    eta: float = 0.0,
) -> Tensor:
    """
    DDIM sampling for fast inference.
    Args:
        model: UNet1D denoiser
        shape: Output shape (B, 25, T)
        condition: Audio conditioning
        num_steps: DDIM steps (50-100 typical)
        eta: Stochasticity (0 = deterministic)
    Returns:
        x_0: Generated motion [B, 25, T]
    """
Beta Schedules
Linear:
betas = torch.linspace(beta_start, beta_end, num_timesteps)
Cosine (recommended):
# Smoother noise schedule that preserves more signal
import math
s = 0.008 # Small offset to prevent singularity
steps = num_timesteps + 1
x = torch.linspace(0, num_timesteps, steps)
alphas_cumprod = torch.cos(((x / num_timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]  # Normalize so alphas_cumprod[0] == 1
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
---
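The cosine schedule from the GaussianDiffusion section can be checked numerically without PyTorch. The sketch below reproduces the same formula in plain Python. The `max_beta` clipping is an assumption borrowed from common implementations of this schedule and is not shown in the source, and `cosine_betas` is a hypothetical helper name.

```python
import math

def cosine_betas(num_timesteps=1000, s=0.008, max_beta=0.999):
    """Cosine schedule; the max_beta clipping is an assumption, not from the source."""
    def f(i):
        return math.cos(((i / num_timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2

    # Cumulative signal retention, normalized so that the first entry is 1.
    alphas_cumprod = [f(i) / f(0) for i in range(num_timesteps + 1)]
    # Per-step noise variance recovered from consecutive cumulative products.
    betas = [
        min(1.0 - alphas_cumprod[i + 1] / alphas_cumprod[i], max_beta)
        for i in range(num_timesteps)
    ]
    return alphas_cumprod, betas

alphas_cumprod, betas = cosine_betas()
print(len(betas), betas[0], betas[-1])
```

Without clipping, the final beta approaches 1 because the cumulative product reaches (numerically) zero at the last step, which is why implementations of this schedule usually cap it.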
UNet1D
Location: `model/unet.py`
1D UNet for temporal motion denoising.
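To make the encoder shapes concrete, here is a small sketch that traces channel width and sequence length through the three resolution levels, using the default `model_channels` and `channel_mult` from the configuration in this section. It assumes each level halves the temporal length, as the architecture listing indicates. The divisibility check is an assumption about what skip connections would need, not a documented requirement, and `unet_shape_trace` is a hypothetical helper.

```python
def unet_shape_trace(num_frames, model_channels=128, channel_mult=(1, 2, 4)):
    """Return (channels, length) after each encoder level, halving length per level."""
    if num_frames % (2 ** len(channel_mult)) != 0:
        # Assumption: lengths must halve cleanly so skip connections line up.
        raise ValueError("num_frames must be divisible by 2**len(channel_mult)")
    shapes = []
    length = num_frames
    for mult in channel_mult:
        length //= 2  # each DownBlock ends with a downsample
        shapes.append((model_channels * mult, length))
    return shapes

print(unet_shape_trace(240))  # [(128, 120), (256, 60), (512, 30)]
```

For the 240-frame (8 second) clips used in the inference examples, the bottleneck therefore works on 30 time steps with 512 channels.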
Configuration
@dataclass
class UNetConfig:
    in_channels: int = 25                   # Motion dimension
    out_channels: int = 25                  # Same as input
    model_channels: int = 128               # Base channel width
    channel_mult: Tuple = (1, 2, 4)         # Channel multipliers per resolution
    num_res_blocks: int = 2                 # ResBlocks per resolution
    attention_resolutions: Tuple = (4, 8)   # Resolutions with attention
    dropout: float = 0.1
    num_heads: int = 4                      # Attention heads
Architecture
Input: x_t [B, 25, T], t [B], condition [B, C, T]
Timestep Embedding:
    t → sinusoidal → MLP → [B, 512]
Encoder:
    [B, 25, T] → Conv → [B, 128, T]
    DownBlock(128 → 128):
        ResBlock × 2 (+ timestep + condition)
        Downsample → [B, 128, T/2]
    DownBlock(128 → 256):
        ResBlock × 2 (+ Attention at res 4, 8)
        Downsample → [B, 256, T/4]
    DownBlock(256 → 512):
        ResBlock × 2 (+ Attention)
        Downsample → [B, 512, T/8]
Middle:
    ResBlock → Attention → ResBlock
    [B, 512, T/8]
Decoder:
    UpBlock(512 → 256):
        Upsample → [B, 256, T/4]
        ResBlock × 2 (+ skip connection + Attention)
    UpBlock(256 → 128):
        Upsample → [B, 128, T/2]
        ResBlock × 2 (+ skip connection)
    UpBlock(128 → 128):
        Upsample → [B, 128, T]
        ResBlock × 2 (+ skip connection)
Output:
    [B, 128, T] → Conv → [B, 25, T]
ResBlock
class ResBlock(nn.Module):
    """
    Residual block with timestep and condition injection.
    h = input
    h = GroupNorm → SiLU → Conv
    h = h + time_emb # Timestep injection
    h = GroupNorm → SiLU → Dropout → Conv
    if in_channels != out_channels:
        h = h + Conv(input) # Skip projection
    else:
        h = h + input
    """
FiLM Conditioning
Conditioning is injected via Feature-wise Linear Modulation:
class FiLMLayer(nn.Module):
    def forward(self, h: Tensor, condition: Tensor) -> Tensor:
        """
        h: [B, C, T] feature map
        condition: [B, C_cond, T] conditioning
        gamma, beta = MLP(condition) # [B, C, T]
        return gamma * h + beta
        """
---
MotionDecoder
Location: `model/decoder.py`
Maps raw UNet output to physically valid motion semantics.
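Two of the constraints named in this section's architecture diagram, the `tanh` position bound and the monotonic phase, can be sketched in a few lines of plain Python. Reading `cumulative_softplus` as a running sum of softplus increments is an assumption, since the source gives only the name. `bound`, `softplus`, and `monotonic_phase` are hypothetical helpers.

```python
import math

def bound(x, scale):
    """Smoothly squash x into [-scale, scale]: tanh(x / scale) * scale."""
    return math.tanh(x / scale) * scale

def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def monotonic_phase(raw):
    """Running sum of softplus increments, so the phase never decreases."""
    phase, total = [], 0.0
    for r in raw:
        total += softplus(r)
        phase.append(total)
    return phase

print(bound(1000.0, 50.0))                # saturates at the scale, never above it
print(monotonic_phase([-3.0, 0.0, 2.0]))  # strictly increasing values
```

Near zero, `bound` is close to the identity (a raw value of 1.0 stays near 1.0 with `scale=50.0`), so the squashing mostly affects outliers.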
Configuration
@dataclass
class MotionDecoderConfig:
    input_dim: int = 25
    hidden_dim: int = 256
    output_dim: int = 25
    num_layers: int = 3
    # Output constraints
    position_scale: float = 50.0            # Max |position|
    velocity_scale: float = 50.0            # Max |velocity| (increased from 20)
    acceleration_scale: float = 100.0       # Max |acceleration|
    angular_velocity_scale: float = 10.0
    # Derivation settings
    derive_velocity: bool = True            # v = dp/dt
    derive_acceleration: bool = True        # a = dv/dt
    fps: float = 30.0                       # Frame rate for derivatives
Architecture
Raw UNet Output [B, T, 25]
│
▼
┌───────────────────────────────────────────────────────────┐
│ MotionDecoder │
│ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Temporal Encoder │ │
│ │ 3× (Conv1D → LayerNorm → GELU) │ │
│ │ Input: [B, 25, T] → Output: [B, 256, T] │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Semantic Heads │ │
│ │ │ │
│ │ position_head: Linear(256 → 3) │ │
│ │ quaternion_head: Linear(256 → 4) │ │
│ │ phase_head: Linear(256 → 1) │ │
│ │ style_head: Linear(256 → 8) │ │
│ └──────────────────────────────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────┐ │
│ │ Semantic Constraints │ │
│ │ │ │
│ │ position: tanh(x/scale) * scale │ │
│ │ velocity: DERIVED from position (dp/dt) │ │
│ │ acceleration: DERIVED from velocity (dv/dt) │ │
│ │ quaternion: normalize(q) + hemisphere fix │ │
│ │ phase: cumulative_softplus (monotonic) │ │
│ │ style: L2 normalize │ │
│ └──────────────────────────────────────────────────┘ │
│ │
└───────────────────────────────────────────────────────────┘
│
▼
Motion Trajectory [B, T, 25]
Critical: Derived Quantities
Velocity and acceleration are computed from position, not predicted directly:
def _derive_velocity(self, position: Tensor) -> Tensor:
    """
    Derive velocity from position using finite differences.
    v[t] = (p[t] - p[t-1]) * fps
    v[0] = v[1] # First frame copied from second
    Then clamped to [-velocity_scale, velocity_scale]
    """
    B, T, _ = position.shape
    dt = 1.0 / self.config.fps # 1/30 = 0.0333
    # Finite difference
    velocity = torch.zeros_like(position)
    velocity[:, 1:] = (position[:, 1:] - position[:, :-1]) / dt
    velocity[:, 0] = velocity[:, 1]
    # Clamp to physical limits
    velocity = torch.clamp(velocity, -self.velocity_scale, self.velocity_scale)
    return velocity
Quaternion Processing
def _process_quaternion(self, raw_quat: Tensor) -> Tensor:
    """
    Ensure valid unit quaternions with hemisphere consistency.
    1. Normalize to unit length
    2. Fix hemisphere (ensure w >= 0)
    3. Ensure temporal consistency (q·q_prev > 0)
    """
    # Normalize
    quat = F.normalize(raw_quat, p=2, dim=-1)
    # Hemisphere consistency (w component >= 0)
    sign = torch.sign(quat[..., 0:1])
    sign = torch.where(sign == 0, torch.ones_like(sign), sign)
    quat = quat * sign
    # Step 3 (temporal consistency) is not shown in this excerpt
    return quat
---
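The `_process_quaternion` docstring lists temporal consistency (q·q_prev > 0) as a third step, but the code shown stops after the hemisphere fix. One way that step could look is sketched below in plain Python. This is an illustration under stated assumptions, not the project's implementation, and `normalize` and `make_continuous` are hypothetical helpers.

```python
import math

def normalize(q):
    """Scale a (w, x, y, z) tuple to unit length; assumes a non-zero input."""
    n = math.sqrt(sum(c * c for c in q))
    return tuple(c / n for c in q)

def make_continuous(quats):
    """Normalize each quaternion and flip signs so consecutive dot products are >= 0."""
    out = []
    for q in quats:
        q = normalize(q)
        if out and sum(a * b for a, b in zip(out[-1], q)) < 0.0:
            q = tuple(-c for c in q)  # q and -q encode the same rotation
        out.append(q)
    return out

frames = [(1.0, 0.0, 0.0, 0.0), (-0.9, 0.1, 0.0, 0.0), (0.8, -0.2, 0.0, 0.0)]
fixed = make_continuous(frames)
```

Note that a per-frame `w >= 0` rule and a frame-to-frame sign rule can disagree when `w` is close to zero, so an implementation has to decide which of the two takes precedence.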
Conditioning System
Location: `model/conditioning.py`
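The conditioning tensors produced in this module are consumed by FiLM layers, which scale and shift each feature as `gamma * h + beta`. The sketch below shows only that modulation on a small `[C][T]` feature map in plain Python. The MLP that produces `gamma` and `beta` from the condition is left out, and `film` is a hypothetical helper name.

```python
def film(h, gamma, beta):
    """Apply gamma * h + beta element-wise to a [C][T] feature map."""
    return [
        [g * v + b for v, g, b in zip(h_row, g_row, b_row)]
        for h_row, g_row, b_row in zip(h, gamma, beta)
    ]

h = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # C=2 channels, T=3 frames

# gamma = 1 and beta = 0 leave the features unchanged.
identity = film(h, [[1.0] * 3, [1.0] * 3], [[0.0] * 3, [0.0] * 3])

# A per-channel scale and shift, constant over time in this example.
scaled = film(h, [[2.0] * 3, [0.5] * 3], [[1.0] * 3, [-1.0] * 3])
print(scaled)  # [[3.0, 5.0, 7.0], [1.0, 1.5, 2.0]]
```

Because `gamma` and `beta` have a time axis, the audio condition can modulate the features differently at every frame.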
AudioConditioner
Encodes audio features for FiLM injection:
class AudioConditioner(nn.Module):
    """
    Combines mel, mfcc, and chroma into conditioning tensors.
    Input:
        mel: [B, 128, T]
        mfcc: [B, 20, T]
        chroma: [B, 12, T]
    Output:
        condition: [B, 256, T]
    """
    def __init__(self, mel_dim=128, mfcc_dim=20, chroma_dim=12, out_dim=256):
        super().__init__()  # Required before registering submodules
        # The three projections together produce out_dim channels
        self.mel_proj = nn.Conv1d(mel_dim, out_dim // 2, 1)
        self.mfcc_proj = nn.Conv1d(mfcc_dim, out_dim // 4, 1)
        self.chroma_proj = nn.Conv1d(chroma_dim, out_dim // 4, 1)
        self.combine = nn.Conv1d(out_dim, out_dim, 3, padding=1)
MPMSConditioner
Blends transformer context with MPMS-retrieved context:
class MPMSConditioner(nn.Module):
    """
    Blends CC-MotionGen context with MPMS priors.
    Input:
        transformer_context: [B, 256] from ContextTransformer
        mpms_context: [B, 256] from MPMS retrieval
        prior_curves: [B, 4, T] energy/density/tension/stability
    Output:
        blended_context: [B, 256]
        encoded_priors: [B, 64, T]
    """
    def forward(self, transformer_context, mpms_context, prior_curves):
        # Blend contexts (learned weighting)
        alpha = self.blend_gate(torch.cat([transformer_context, mpms_context], -1))
        blended = alpha * transformer_context + (1 - alpha) * mpms_context
        # Encode prior curves
        encoded_priors = self.prior_encoder(prior_curves)
        return blended, encoded_priors
---
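The context blending in `MPMSConditioner.forward` is a convex combination controlled by a gate. The plain-Python sketch below illustrates it, assuming the gate output is passed through a sigmoid so that `alpha` lies in (0, 1). The source does not show how `blend_gate` is built, and `sigmoid` and `blend` are hypothetical helpers.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def blend(transformer_ctx, mpms_ctx, gate_logits):
    """alpha * transformer + (1 - alpha) * mpms, with alpha = sigmoid(logit)."""
    out = []
    for t, m, g in zip(transformer_ctx, mpms_ctx, gate_logits):
        alpha = sigmoid(g)  # Assumption: gate values are squashed into (0, 1)
        out.append(alpha * t + (1.0 - alpha) * m)
    return out

print(blend([1.0], [3.0], [0.0]))    # [2.0]: equal weighting at logit 0
print(blend([1.0], [3.0], [20.0]))   # close to the transformer context
print(blend([1.0], [3.0], [-20.0]))  # close to the MPMS context
```

With a gate bounded this way, each blended value stays between the two inputs, so retrieved memory can shift the context without overshooting either source.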
Motion Representation
25-Dimensional Motion Vector
| Index | Name | Dimension | Description | Range |
|---|---|---|---|---|
| 0-2 | position | 3 | World-space position (x, y, z) | [-50, 50] |
| 3-5 | velocity | 3 | Linear velocity (derived) | [-50, 50] |
| 6-8 | acceleration | 3 | Linear acceleration (derived) | [-100, 100] |
| 9-12 | quaternion | 4 | Orientation (w, x, y, z) | Unit sphere |
| 13-15 | angular_velocity | 3 | Rotational velocity | [-10, 10] |
| 16 | phase | 1 | Beat-aligned phase | [0, 1] per beat |
| 17-24 | style | 8 | Learned style embedding | Unit sphere |
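The index layout in the table above maps directly onto slices. The following is a small sketch for splitting one 25-dimensional frame into named components. `MOTION_FIELDS` and `split_motion` are hypothetical names used for illustration, not part of the documented API.

```python
# Index ranges copied from the motion vector table (end index exclusive).
MOTION_FIELDS = {
    "position": slice(0, 3),
    "velocity": slice(3, 6),
    "acceleration": slice(6, 9),
    "quaternion": slice(9, 13),
    "angular_velocity": slice(13, 16),
    "phase": slice(16, 17),
    "style": slice(17, 25),
}

def split_motion(frame):
    """Split one 25-dimensional frame into named components."""
    if len(frame) != 25:
        raise ValueError("expected a 25-dimensional motion vector")
    return {name: list(frame[s]) for name, s in MOTION_FIELDS.items()}

parts = split_motion(list(range(25)))
print(parts["quaternion"])  # [9, 10, 11, 12]
```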
Temporal Coherence
For a valid motion trajectory at 30fps:
| Property | Constraint | Threshold |
|---|---|---|
| Velocity coherence | \|v - dp/dt\| | < 5.0 |
| Acceleration coherence | \|a - dv/dt\| | < 50.0 |
| Jerk bound | \|d³p/dt³\| | < 50000 |
| Quaternion continuity | q_t · q_{t+1} | > 0.95 |
| Phase monotonicity | phase_{t+1} >= phase_t | Always |
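The coherence thresholds above can be checked with the same backward difference that the decoder uses for velocity. The sketch below does this for one position coordinate in plain Python, assuming at least two frames and 30 fps. `finite_difference` and `max_abs_error` are hypothetical helpers, and the stored velocity and acceleration values are made up for the example.

```python
def finite_difference(values, fps=30.0):
    """Backward difference per frame; the first frame copies the second."""
    diffs = [(b - a) * fps for a, b in zip(values, values[1:])]
    return [diffs[0]] + diffs

def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))

# One position coordinate moving at a constant 3 units per second.
position = [3.0 * t / 30.0 for t in range(10)]
stored_velocity = [3.0] * 10      # what the trajectory claims
stored_acceleration = [0.0] * 10

derived_velocity = finite_difference(position)
derived_acceleration = finite_difference(derived_velocity)

velocity_ok = max_abs_error(stored_velocity, derived_velocity) < 5.0
acceleration_ok = max_abs_error(stored_acceleration, derived_acceleration) < 50.0

phase = [0.0, 0.1, 0.1, 0.4]
phase_ok = all(b >= a for a, b in zip(phase, phase[1:]))
```

Because the decoder derives velocity and acceleration from position, a decoded trajectory should pass the first two checks by construction. They are mainly useful for validating data from other sources.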
---
Inference Pipeline
Standard Inference
from cc_motiongen.inference import MotionSampler
from cc_motiongen.model import GaussianDiffusion, create_motion_decoder
# Create components
diffusion = GaussianDiffusion.from_pretrained("path/to/model")
sampler = MotionSampler(diffusion, device="cuda")
# Sample motion
result = sampler.sample(
    audio_condition=audio_features,
    num_frames=240,         # 8 seconds at 30fps
    num_samples=8,          # Candidate count
    guidance_scale=3.0,     # CFG scale
    num_steps=50,           # DDIM steps
)
# result.trajectory: [T, 25]
# result.all_candidates: [N, T, 25] if return_all=True
MPMS-Enhanced Inference
from cc_motiongen.inference import MPMSMotionSampler
from cc_core.policy.rag_motionphrase import MPMService
# Note: the awaits below must run inside an async function (for example, one driven by asyncio.run)
# Initialize MPMS service
mpms = await MPMService.from_config(config)
await mpms.initialize()
# Create MPMS sampler
sampler = MPMSMotionSampler(diffusion, mpms_service=mpms)
# Sample with memory conditioning
result = await sampler.sample_with_mpms(
    audio_condition=audio_features,
    num_frames=240,
    num_samples=8,
    audio_embedding=audio_embed,    # For MPMS query
    beat_phase=0.5,                 # Current phase
)
---
Last updated: December 2025
Source Anchor
Comp-Core/core/ml/cc-ml/cc_motiongen/docs/technical/MODEL_ARCHITECTURE.md