CC-MotionGen Model Architecture Reference

Technical reference for the CC-MotionGen model components

---

Table of Contents

1. [Architecture Overview](#architecture-overview)
2. [GaussianDiffusion](#gaussiandiffusion)
3. [UNet1D](#unet1d)
4. [MotionDecoder](#motiondecoder)
5. [Conditioning System](#conditioning-system)
6. [Motion Representation](#motion-representation)
7. [Inference Pipeline](#inference-pipeline)

---

Architecture Overview

CC-MotionGen uses a three-stage architecture:

Audio Features → Conditioning → UNet1D (Diffusion) → MotionDecoder → Motion Trajectory
                                    ↑
                             Optional MPMS Priors

Component Summary

| Component | Parameters | Location | Purpose |
|---|---|---|---|
| GaussianDiffusion | - | `model/diffusion.py` | Noise scheduling, DDPM/DDIM sampling |
| UNet1D | 116M | `model/unet.py` | Denoising network |
| MotionDecoder | 2M | `model/decoder.py` | Semantic motion mapping |
| AudioConditioner | ~2M | `model/conditioning.py` | FiLM conditioning |
| MPMSConditioner | ~0.5M | `model/conditioning.py` | Memory context blending |

---

GaussianDiffusion

Location: `model/diffusion.py`

Defines the forward (noising) and reverse (denoising) processes, including noise scheduling and DDPM/DDIM sampling.

Configuration

python
@dataclass
class DiffusionConfig:
    num_timesteps: int = 1000    # Total diffusion steps
    beta_start: float = 0.0001   # Starting noise level
    beta_end: float = 0.02       # Ending noise level
    beta_schedule: str = "cosine" # "linear" or "cosine"

    # Loss settings
    loss_type: str = "mse"       # "mse" or "l1"
    predict_epsilon: bool = True  # Predict noise vs x0

Key Methods

Forward Process (Noising):

python
def q_sample(self, x_0: Tensor, t: Tensor, noise: Optional[Tensor] = None) -> Tensor:
    """
    Sample x_t from q(x_t | x_0).

    Args:
        x_0: Clean motion [B, 25, T]
        t: Timesteps [B]
        noise: Optional noise (sampled if None)

    Returns:
        x_t: Noised motion [B, 25, T]
    """

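For the standard DDPM parameterization, `q_sample` has the closed form x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, with eps drawn from a unit Gaussian. The sketch below applies that formula to one channel of a motion sequence using plain Python lists so that it runs without PyTorch. `q_sample_scalar` and its arguments are illustrative names, not the method in `model/diffusion.py`.

```python
import math
import random

def q_sample_scalar(x0, alpha_bar_t, noise=None, rng=None):
    """Noise a clean sequence x0 (list of floats) at cumulative signal level alpha_bar_t."""
    if not 0.0 <= alpha_bar_t <= 1.0:
        raise ValueError("alpha_bar_t must lie in [0, 1]")
    if noise is None:
        rng = rng or random.Random()
        noise = [rng.gauss(0.0, 1.0) for _ in x0]  # eps ~ N(0, 1), one value per frame
    signal = math.sqrt(alpha_bar_t)        # weight of the clean signal
    sigma = math.sqrt(1.0 - alpha_bar_t)   # weight of the noise
    return [signal * x + sigma * n for x, n in zip(x0, noise)]
```

At `alpha_bar_t = 1.0` the clean signal is returned unchanged, and at `0.0` the output is pure noise, which are the two extremes the schedule moves between.
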
Loss Computation:

python
def training_loss(self, model: nn.Module, x_0: Tensor, condition: Tensor) -> Tensor:
    """
    Compute training loss (simplified diffusion loss).

    Args:
        model: UNet1D denoiser
        x_0: Target motion [B, 25, T]
        condition: Audio conditioning [B, C, T]

    Returns:
        loss: MSE between predicted and actual noise
    """

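The simplified loss described in the docstring can be sketched as follows for `predict_epsilon=True` and `loss_type="mse"`: draw a timestep and noise, form x_t, and score the denoiser's noise prediction. This is a dependency-free, single-sequence illustration. `epsilon_mse_loss`, the `model(x_t, t, condition)` call signature, and the list-based stand-ins for tensors are assumptions made for the example.

```python
import math
import random

def epsilon_mse_loss(model, x0, condition, alphas_cumprod, rng: random.Random) -> float:
    """Simplified diffusion loss for one sequence: MSE between true and predicted noise."""
    t = rng.randrange(len(alphas_cumprod))       # uniformly sampled timestep
    noise = [rng.gauss(0.0, 1.0) for _ in x0]    # eps ~ N(0, 1)
    a = alphas_cumprod[t]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * n for x, n in zip(x0, noise)]
    predicted = model(x_t, t, condition)         # hypothetical denoiser call
    return sum((p - n) ** 2 for p, n in zip(predicted, noise)) / len(x0)
```

A denoiser that recovers the injected noise exactly drives this value to zero, while one that always predicts zero scores the mean squared noise.
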
Reverse Process (Sampling):

python
def ddim_sample(
    self,
    model: nn.Module,
    shape: Tuple[int, ...],
    condition: Tensor,
    num_steps: int = 50,
    eta: float = 0.0,
) -> Tensor:
    """
    DDIM sampling for fast inference.

    Args:
        model: UNet1D denoiser
        shape: Output shape (B, 25, T)
        condition: Audio conditioning
        num_steps: DDIM steps (50-100 typical)
        eta: Stochasticity (0 = deterministic)

    Returns:
        x_0: Generated motion [B, 25, T]
    """

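With `eta = 0`, each DDIM update first estimates x_0 from the predicted noise and then re-noises that estimate to the previous, less noisy level. A scalar sketch of the timestep subsampling and a single update follows. `ddim_timesteps` and `ddim_step` are hypothetical helpers, and the evenly strided schedule is one common choice rather than a confirmed detail of `ddim_sample`.

```python
import math
from typing import List

def ddim_timesteps(num_train_steps: int, num_steps: int) -> List[int]:
    """Evenly strided subset of the training timesteps, in descending order."""
    if not 1 <= num_steps <= num_train_steps:
        raise ValueError("num_steps must be in [1, num_train_steps]")
    stride = num_train_steps // num_steps
    return list(range(0, num_train_steps, stride))[:num_steps][::-1]

def ddim_step(x_t: float, eps: float, alpha_bar_t: float, alpha_bar_prev: float) -> float:
    """Deterministic (eta = 0) DDIM update for a single value."""
    # Estimate the clean sample implied by the predicted noise
    x0_pred = (x_t - math.sqrt(1.0 - alpha_bar_t) * eps) / math.sqrt(alpha_bar_t)
    # Move to the previous noise level, reusing the same noise direction
    return math.sqrt(alpha_bar_prev) * x0_pred + math.sqrt(1.0 - alpha_bar_prev) * eps
```

With `num_train_steps=1000` and `num_steps=50`, the sampler in this sketch visits every 20th timestep, so the denoiser is evaluated 50 times instead of 1000.
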
Beta Schedules

Linear:

python
betas = torch.linspace(beta_start, beta_end, num_timesteps)

Cosine (recommended):

python
import math

import torch

# Smoother noise schedule that preserves more signal
s = 0.008  # Small offset to prevent singularity
steps = num_timesteps + 1
x = torch.linspace(0, num_timesteps, steps)
alphas_cumprod = torch.cos(((x / num_timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
alphas_cumprod = alphas_cumprod / alphas_cumprod[0]  # Normalize so alphas_cumprod[0] == 1
betas = 1 - (alphas_cumprod[1:] / alphas_cumprod[:-1])
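
The claim that the cosine schedule preserves more signal can be checked numerically by comparing the cumulative product of (1 - beta) under both schedules. The following dependency-free sketch mirrors the two snippets above. The helper names and the `max_beta=0.999` clip (a common safeguard for the final step, where the unclipped cosine beta approaches 1) are assumptions, not confirmed settings of `GaussianDiffusion`.

```python
import math
from typing import List

def linear_betas(num_timesteps: int, beta_start: float = 1e-4, beta_end: float = 0.02) -> List[float]:
    """Evenly spaced betas, the list equivalent of torch.linspace."""
    if num_timesteps < 2:
        raise ValueError("num_timesteps must be at least 2")
    step = (beta_end - beta_start) / (num_timesteps - 1)
    return [beta_start + i * step for i in range(num_timesteps)]

def cosine_betas(num_timesteps: int, s: float = 0.008, max_beta: float = 0.999) -> List[float]:
    """Betas derived from a squared-cosine cumulative signal curve."""
    def f(i: int) -> float:
        return math.cos(((i / num_timesteps) + s) / (1 + s) * math.pi * 0.5) ** 2
    # The normalization by f(0) cancels in the ratio of consecutive terms
    return [min(1.0 - f(i + 1) / f(i), max_beta) for i in range(num_timesteps)]

def cumulative_alphas(betas: List[float]) -> List[float]:
    """Running product of (1 - beta): the fraction of signal variance left at each step."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out
```

With the defaults from `DiffusionConfig`, comparing `cumulative_alphas(cosine_betas(1000))[499]` with the linear equivalent shows the cosine schedule retaining more of the clean signal at the midpoint.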

---

UNet1D

Location: `model/unet.py`

1D UNet for temporal motion denoising.

Configuration

python
@dataclass
class UNetConfig:
    in_channels: int = 25           # Motion dimension
    out_channels: int = 25          # Same as input
    model_channels: int = 128       # Base channel width
    channel_mult: Tuple = (1, 2, 4) # Channel multipliers per resolution
    num_res_blocks: int = 2         # ResBlocks per resolution
    attention_resolutions: Tuple = (4, 8)  # Resolutions with attention
    dropout: float = 0.1
    num_heads: int = 4              # Attention heads

Architecture

Input: x_t [B, 25, T], t [B], condition [B, C, T]

Timestep Embedding:
  t → sinusoidal → MLP → [B, 512]

Encoder:
  [B, 25, T] → Conv → [B, 128, T]

  DownBlock(128 → 128):
    ResBlock × 2 (+ timestep + condition)
    Downsample → [B, 128, T/2]

  DownBlock(128 → 256):
    ResBlock × 2 (+ Attention at res 4, 8)
    Downsample → [B, 256, T/4]

  DownBlock(256 → 512):
    ResBlock × 2 (+ Attention)
    Downsample → [B, 512, T/8]

Middle:
  ResBlock → Attention → ResBlock
  [B, 512, T/8]

Decoder:
  UpBlock(512 → 256):
    Upsample → [B, 256, T/4]
    ResBlock × 2 (+ skip connection + Attention)

  UpBlock(256 → 128):
    Upsample → [B, 128, T/2]
    ResBlock × 2 (+ skip connection)

  UpBlock(128 → 128):
    Upsample → [B, 128, T]
    ResBlock × 2 (+ skip connection)

Output:
  [B, 128, T] → Conv → [B, 25, T]
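
The shape bookkeeping in the diagram can be reproduced from `model_channels` and `channel_mult`. The sketch below assumes each stage halves the temporal length exactly (stride-2 downsampling), which would require T to be divisible by 2^3 = 8 for the default configuration. `encoder_shapes` and `pad_length` are hypothetical helpers, and whether UNet1D pads internally is not stated in this reference.

```python
from typing import List, Sequence, Tuple

def encoder_shapes(seq_len: int, model_channels: int = 128,
                   channel_mult: Sequence[int] = (1, 2, 4)) -> List[Tuple[int, int]]:
    """(channels, length) after each encoder stage, assuming exact halving per stage."""
    factor = 2 ** len(channel_mult)
    if seq_len % factor != 0:
        raise ValueError(f"seq_len must be divisible by {factor}, got {seq_len}")
    shapes, length = [], seq_len
    for mult in channel_mult:
        length //= 2
        shapes.append((model_channels * mult, length))
    return shapes

def pad_length(seq_len: int, factor: int = 8) -> int:
    """Smallest multiple of `factor` that is >= seq_len."""
    return -(-seq_len // factor) * factor
```

For the 240-frame clips used in the inference examples, this gives lengths 120, 60, and 30 at the three encoder stages, matching T/2, T/4, and T/8 in the diagram.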

ResBlock

python
class ResBlock(nn.Module):
    """
    Residual block with timestep and condition injection.

    h = input
    h = GroupNorm → SiLU → Conv
    h = h + time_emb  # Timestep injection
    h = GroupNorm → SiLU → Dropout → Conv

    if in_channels != out_channels:
        h = h + Conv(input)  # Skip projection
    else:
        h = h + input
    """

FiLM Conditioning

Conditioning is injected via Feature-wise Linear Modulation:

python
class FiLMLayer(nn.Module):
    def forward(self, h: Tensor, condition: Tensor) -> Tensor:
        """
        h: [B, C, T] feature map
        condition: [B, C_cond, T] conditioning

        gamma, beta = MLP(condition)  # [B, C, T]
        return gamma * h + beta
        """

---

MotionDecoder

Location: `model/decoder.py`

Maps raw UNet output to physically valid motion semantics.

Configuration

python
@dataclass
class MotionDecoderConfig:
    input_dim: int = 25
    hidden_dim: int = 256
    output_dim: int = 25
    num_layers: int = 3

    # Output constraints
    position_scale: float = 50.0       # Max |position|
    velocity_scale: float = 50.0       # Max |velocity| (increased from 20)
    acceleration_scale: float = 100.0  # Max |acceleration|
    angular_velocity_scale: float = 10.0

    # Derivation settings
    derive_velocity: bool = True       # v = dp/dt
    derive_acceleration: bool = True   # a = dv/dt
    fps: float = 30.0                  # Frame rate for derivatives

Architecture

Raw UNet Output [B, T, 25]
        │
        ▼
┌───────────────────────────────────────────────────────────┐
│                    MotionDecoder                          │
│                                                           │
│  ┌──────────────────────────────────────────────────┐    │
│  │            Temporal Encoder                       │    │
│  │  3× (Conv1D → LayerNorm → GELU)                  │    │
│  │  Input: [B, 25, T] → Output: [B, 256, T]         │    │
│  └──────────────────────────────────────────────────┘    │
│                          │                                │
│                          ▼                                │
│  ┌──────────────────────────────────────────────────┐    │
│  │            Semantic Heads                         │    │
│  │                                                   │    │
│  │  position_head: Linear(256 → 3)                  │    │
│  │  quaternion_head: Linear(256 → 4)                │    │
│  │  phase_head: Linear(256 → 1)                     │    │
│  │  style_head: Linear(256 → 8)                     │    │
│  └──────────────────────────────────────────────────┘    │
│                          │                                │
│                          ▼                                │
│  ┌──────────────────────────────────────────────────┐    │
│  │            Semantic Constraints                   │    │
│  │                                                   │    │
│  │  position: tanh(x/scale) * scale                 │    │
│  │  velocity: DERIVED from position (dp/dt)         │    │
│  │  acceleration: DERIVED from velocity (dv/dt)     │    │
│  │  quaternion: normalize(q) + hemisphere fix       │    │
│  │  phase: cumulative_softplus (monotonic)          │    │
│  │  style: L2 normalize                             │    │
│  └──────────────────────────────────────────────────┘    │
│                                                           │
└───────────────────────────────────────────────────────────┘
        │
        ▼
Motion Trajectory [B, T, 25]
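
The constraint names in the diagram correspond to simple element-wise operations. A dependency-free sketch of three of them follows. Reading `cumulative_softplus` as a running sum of softplus-transformed increments is an assumption based on the name, and all function names here are illustrative.

```python
import math
from typing import List, Sequence

def soft_bound(x: float, scale: float) -> float:
    """tanh(x / scale) * scale: near-identity for small |x|, saturating at +/- scale."""
    return math.tanh(x / scale) * scale

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x)); never negative."""
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def cumulative_softplus(raw: Sequence[float]) -> List[float]:
    """Running sum of non-negative increments, so the result never decreases."""
    phase, total = [], 0.0
    for r in raw:
        total += softplus(r)
        phase.append(total)
    return phase

def l2_normalize(vec: Sequence[float], eps: float = 1e-12) -> List[float]:
    """Scale a vector to unit Euclidean length (eps guards the zero vector)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / max(norm, eps) for v in vec]
```

The tanh bound is smooth, so gradients still flow near the limit, whereas a hard clamp would zero them out once the limit is exceeded.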

Critical: Derived Quantities

Velocity and acceleration are computed from position, not predicted directly:

python
def _derive_velocity(self, position: Tensor) -> Tensor:
    """
    Derive velocity from position using backward finite differences.

    v[t] = (p[t] - p[t-1]) * fps
    v[0] = v[1]  # First frame copied from second

    Then clamped to [-velocity_scale, velocity_scale]
    """
    B, T, _ = position.shape
    dt = 1.0 / self.config.fps  # 1/30 ≈ 0.0333 s at the default 30 fps

    # Backward difference; frame 0 has no predecessor
    velocity = torch.zeros_like(position)
    if T > 1:  # A single-frame input keeps zero velocity
        velocity[:, 1:] = (position[:, 1:] - position[:, :-1]) / dt
        velocity[:, 0] = velocity[:, 1]

    # Clamp to physical limits (read from the config, like fps)
    limit = self.config.velocity_scale
    velocity = torch.clamp(velocity, -limit, limit)

    return velocity
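
`derive_acceleration` applies the same kind of difference a second time (a = dv/dt). The sketch below runs the two-stage derivation on a single coordinate with plain Python lists. `backward_difference` is a hypothetical helper that follows the `_derive_velocity` logic above, including the copied first frame and the clamp.

```python
from typing import List, Sequence

def backward_difference(series: Sequence[float], fps: float, limit: float) -> List[float]:
    """d/dt via (x[t] - x[t-1]) * fps, first frame copied from the second, then clamped."""
    if len(series) < 2:
        return [0.0 for _ in series]
    diff = [(b - a) * fps for a, b in zip(series, series[1:])]
    diff.insert(0, diff[0])  # out[0] = out[1]
    return [max(-limit, min(limit, d)) for d in diff]

fps = 30.0
position = [(i / fps) ** 2 for i in range(10)]                    # p(t) = t^2
velocity = backward_difference(position, fps, limit=50.0)        # v = dp/dt
acceleration = backward_difference(velocity, fps, limit=100.0)   # a = dv/dt
```

Because the first velocity frame is a copy, the first two acceleration frames come out as zero. From the third frame on, the result matches the analytic value of 2 for p(t) = t².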

Quaternion Processing

python
def _process_quaternion(self, raw_quat: Tensor) -> Tensor:
    """
    Ensure valid unit quaternions with hemisphere consistency.

    1. Normalize to unit length
    2. Fix hemisphere (ensure w >= 0)

    Temporal consistency (q·q_prev > 0) is a stated goal but is not
    enforced by the code in this excerpt.
    """
    # Normalize to unit length (F is torch.nn.functional)
    quat = F.normalize(raw_quat, p=2, dim=-1)

    # Hemisphere consistency: flip q to -q where w < 0 (same rotation)
    sign = torch.sign(quat[..., 0:1])
    sign = torch.where(sign == 0, torch.ones_like(sign), sign)  # Leave w == 0 unchanged
    quat = quat * sign

    return quat
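
The temporal-consistency rule from the docstring (q·q_prev > 0) is not part of the code shown above. One way such a pass could look is sketched here on plain tuples in (w, x, y, z) order: since q and -q encode the same rotation, a frame is negated whenever it points away from its predecessor. `enforce_temporal_consistency` is a hypothetical helper, not the decoder's implementation.

```python
from typing import List, Sequence, Tuple

Quat = Tuple[float, ...]  # (w, x, y, z)

def enforce_temporal_consistency(quats: Sequence[Quat]) -> List[Quat]:
    """Negate any quaternion whose dot product with the previous output frame is negative."""
    out: List[Quat] = []
    for q in quats:
        if out and sum(a * b for a, b in zip(out[-1], q)) < 0.0:
            q = tuple(-c for c in q)  # Same rotation, opposite hemisphere
        out.append(tuple(q))
    return out
```

Note that this pass can leave frames with w < 0, so it trades the per-frame hemisphere rule for continuity between consecutive frames.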

---

Conditioning System

Location: `model/conditioning.py`

AudioConditioner

Encodes audio features for FiLM injection:

python
class AudioConditioner(nn.Module):
    """
    Combines mel, mfcc, and chroma into conditioning tensors.

    Input:
        mel: [B, 128, T]
        mfcc: [B, 20, T]
        chroma: [B, 12, T]

    Output:
        condition: [B, 256, T]
    """

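    def __init__(self, mel_dim=128, mfcc_dim=20, chroma_dim=12, out_dim=256):
        super().__init__()  # Required before registering submodules on an nn.Module
        # Per-stream 1x1 projections: 128 + 64 + 64 = 256 channels at the defaults
        self.mel_proj = nn.Conv1d(mel_dim, out_dim // 2, 1)
        self.mfcc_proj = nn.Conv1d(mfcc_dim, out_dim // 4, 1)
        self.chroma_proj = nn.Conv1d(chroma_dim, out_dim // 4, 1)
        # Input width out_dim matches the three projections concatenated channel-wise
        self.combine = nn.Conv1d(out_dim, out_dim, 3, padding=1)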

MPMSConditioner

Blends transformer context with MPMS-retrieved context:

python
class MPMSConditioner(nn.Module):
    """
    Blends CC-MotionGen context with MPMS priors.

    Input:
        transformer_context: [B, 256] from ContextTransformer
        mpms_context: [B, 256] from MPMS retrieval
        prior_curves: [B, 4, T] energy/density/tension/stability

    Output:
        blended_context: [B, 256]
        encoded_priors: [B, 64, T]
    """

    def forward(self, transformer_context, mpms_context, prior_curves):
        # Blend contexts (learned weighting)
        alpha = self.blend_gate(torch.cat([transformer_context, mpms_context], -1))
        blended = alpha * transformer_context + (1 - alpha) * mpms_context

        # Encode prior curves
        encoded_priors = self.prior_encoder(prior_curves)

        return blended, encoded_priors
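
The blend is a convex combination whenever `alpha` lies in [0, 1]. The sketch below assumes `blend_gate` ends in a sigmoid and emits a single scalar per sample. Both are assumptions, since the gate's definition is not shown, and `gated_blend` is an illustrative name.

```python
import math
from typing import List, Sequence

def sigmoid(x: float) -> float:
    """Logistic function, written to avoid overflow for large |x|."""
    if x >= 0.0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gated_blend(transformer_ctx: Sequence[float], mpms_ctx: Sequence[float],
                weights: Sequence[float], bias: float = 0.0) -> List[float]:
    """alpha * transformer_ctx + (1 - alpha) * mpms_ctx with a scalar learned gate."""
    features = list(transformer_ctx) + list(mpms_ctx)  # Mirrors torch.cat(..., -1)
    alpha = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(transformer_ctx, mpms_ctx)]
```

With all-zero gate weights the gate sits at 0.5 and the output is the plain average of the two contexts. Training moves it toward whichever source is more useful.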

---

Motion Representation

25-Dimensional Motion Vector

| Index | Name | Dimension | Description | Range |
|---|---|---|---|---|
| 0-2 | position | 3 | World-space position (x, y, z) | [-50, 50] |
| 3-5 | velocity | 3 | Linear velocity (derived) | [-50, 50] |
| 6-8 | acceleration | 3 | Linear acceleration (derived) | [-100, 100] |
| 9-12 | quaternion | 4 | Orientation (w, x, y, z) | Unit sphere |
| 13-15 | angular_velocity | 3 | Rotational velocity | [-10, 10] |
| 16 | phase | 1 | Beat-aligned phase | [0, 1] per beat |
| 17-24 | style | 8 | Learned style embedding | Unit sphere |
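
The index table translates directly into slices for splitting a frame into named fields. A small sketch follows, with `MOTION_LAYOUT` and `split_motion_frame` as illustrative names:

```python
from typing import Dict, List, Sequence

# Slice boundaries taken from the index table (end index is exclusive)
MOTION_LAYOUT: Dict[str, slice] = {
    "position": slice(0, 3),
    "velocity": slice(3, 6),
    "acceleration": slice(6, 9),
    "quaternion": slice(9, 13),        # (w, x, y, z)
    "angular_velocity": slice(13, 16),
    "phase": slice(16, 17),
    "style": slice(17, 25),
}
MOTION_DIM = 25

def split_motion_frame(frame: Sequence[float]) -> Dict[str, List[float]]:
    """Split one 25-dimensional motion vector into its named components."""
    if len(frame) != MOTION_DIM:
        raise ValueError(f"expected {MOTION_DIM} values, got {len(frame)}")
    return {name: list(frame[s]) for name, s in MOTION_LAYOUT.items()}
```

The same slices apply along the last axis of a `[B, T, 25]` tensor.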

Temporal Coherence

For a valid motion trajectory at 30fps:

| Property | Constraint | Threshold |
|---|---|---|
| Velocity coherence | \|v - dp/dt\| | < 5.0 |
| Acceleration coherence | \|a - dv/dt\| | < 50.0 |
| Jerk bound | \|d³p/dt³\| | < 50000 |
| Quaternion continuity | q_t · q_{t+1} | > 0.95 |
| Phase monotonicity | phase_{t+1} >= phase_t | Always |
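
The thresholds in the table can be turned into simple per-trajectory checks. The sketch below covers three of the rows on plain Python sequences. Measuring the velocity error as a Euclidean norm over (x, y, z) and skipping the first frame are assumptions, as the table does not specify either, and the function names are illustrative.

```python
import math
from typing import Sequence

Vec = Sequence[float]

def velocity_is_coherent(position: Sequence[Vec], velocity: Sequence[Vec],
                         fps: float = 30.0, tol: float = 5.0) -> bool:
    """|v - dp/dt| < tol for every frame that has a predecessor."""
    for t in range(1, len(position)):
        dp_dt = [(b - a) * fps for a, b in zip(position[t - 1], position[t])]
        if math.dist(velocity[t], dp_dt) >= tol:
            return False
    return True

def quaternions_are_continuous(quats: Sequence[Vec], threshold: float = 0.95) -> bool:
    """q_t · q_{t+1} > threshold for every consecutive pair."""
    return all(sum(a * b for a, b in zip(q0, q1)) > threshold
               for q0, q1 in zip(quats, quats[1:]))

def phase_is_monotonic(phase: Sequence[float]) -> bool:
    """phase_{t+1} >= phase_t for every consecutive pair."""
    return all(b >= a for a, b in zip(phase, phase[1:]))
```

Checks like these are cheap enough to run on every generated candidate before a trajectory is returned.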

---

Inference Pipeline

Standard Inference

python
from cc_motiongen.inference import MotionSampler
from cc_motiongen.model import GaussianDiffusion, create_motion_decoder

# Create components
diffusion = GaussianDiffusion.from_pretrained("path/to/model")
sampler = MotionSampler(diffusion, device="cuda")

# Sample motion
result = sampler.sample(
    audio_condition=audio_features,
    num_frames=240,            # 8 seconds at 30fps
    num_samples=8,             # Candidate count
    guidance_scale=3.0,        # CFG scale
    num_steps=50,              # DDIM steps
)

# result.trajectory: [T, 25]
# result.all_candidates: [N, T, 25] if return_all=True
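
`guidance_scale` is described as the CFG scale. In classifier-free guidance, the conditional and unconditional noise predictions are commonly combined as eps = eps_uncond + s * (eps_cond - eps_uncond). A list-based sketch follows. How `MotionSampler` obtains the unconditional prediction is not shown in this reference, and `apply_cfg` is a hypothetical name.

```python
from typing import List, Sequence

def apply_cfg(eps_uncond: Sequence[float], eps_cond: Sequence[float],
              guidance_scale: float) -> List[float]:
    """Extrapolate from the unconditional prediction toward the conditional one."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]
```

A scale of 1.0 reproduces the conditional prediction. Larger values, such as the 3.0 used above, push the sample further toward the audio condition.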

MPMS-Enhanced Inference

python
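import asyncio

from cc_motiongen.inference import MPMSMotionSampler
from cc_core.policy.rag_motionphrase import MPMService

async def sample_with_memory(diffusion, config, audio_features, audio_embed):
    # `await` needs an async context; inputs are prepared as in Standard Inference
    # Initialize MPMS service
    mpms = await MPMService.from_config(config)
    await mpms.initialize()

    # Create MPMS sampler
    sampler = MPMSMotionSampler(diffusion, mpms_service=mpms)

    # Sample with memory conditioning
    return await sampler.sample_with_mpms(
        audio_condition=audio_features,
        num_frames=240,
        num_samples=8,
        audio_embedding=audio_embed,    # For MPMS query
        beat_phase=0.5,                 # Current phase
    )

# result = asyncio.run(sample_with_memory(diffusion, config, audio_features, audio_embed))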

---

Last updated: December 2025

Source Anchor

Comp-Core/core/ml/cc-ml/cc_motiongen/docs/technical/MODEL_ARCHITECTURE.md
