Phrase-Conditioned Spectrogram Diffusion System

"Embodied Memory Synthesis"

This module implements a diffusion-based generative audio system that learns from your music library and generates new audio conditioned on:
1. Phrase embeddings from your existing tracks
2. Motion embeddings from your body movement

---

Architecture Overview

┌─────────────────────────────────────────────────────────────────┐
│                     OFFLINE TRAINING PHASE                       │
└─────────────────────────────────────────────────────────────────┘

Music Library (WAV files)
    ↓
┌───────────────────────────────────────┐
│ 1. PHRASE DATABASE BUILDER             │
│ - Beat tracking (madmom)               │
│ - Segmentation (novelty + structure)   │
│ - Feature extraction:                  │
│   * Rhythm (tempogram, onsets)         │
│   * Harmony (chroma, key)              │
│   * Timbre (MFCCs, spectral)           │
└───────────────────────────────────────┘
    ↓
Phrase Database (SQLite + FAISS index)
    ├─ audio_segments/ (WAV chunks)
    ├─ features/ (mel-spectrograms)
    └─ embeddings/ (e_phrase vectors)

    ↓
┌───────────────────────────────────────┐
│ 2. VQ-VAE TOKENIZER                    │
│ - Encoder: waveform → latent          │
│ - Codebook: 2048 tokens               │
│ - Decoder: HiFi-GAN vocoder           │
└───────────────────────────────────────┘
    ↓
Trained Tokenizer (checkpoints/vqvae/)

    ↓
┌───────────────────────────────────────┐
│ 3. DIFFUSION MODEL                     │
│ - U-Net (mel-spectrogram space)       │
│ - Conditioning:                        │
│   * e_phrase (256-D)                  │
│   * e_motion (104-D from RPS)         │
│   * bar_position (positional)         │
│ - DDIM sampling (20 steps)            │
└───────────────────────────────────────┘
    ↓
Trained Diffusion Model (checkpoints/diffusion/)


┌─────────────────────────────────────────────────────────────────┐
│                     REAL-TIME INFERENCE PHASE                    │
└─────────────────────────────────────────────────────────────────┘

iPhone Motion
    ↓
RPS Encoder → Normalizer → [motion_latent (104-D)]
    ↓
┌───────────────────────────────────────┐
│ PHRASE MAPPER                          │
│ - Project motion → phrase space       │
│ - k-NN search in phrase database      │
│ - Retrieve top-5 similar phrases      │
└───────────────────────────────────────┘
    ↓
[e_phrase (256-D)] + [e_motion (104-D)]
    ↓
┌───────────────────────────────────────┐
│ DIFFUSION CONDUCTOR                    │
│ - 4-bar look-ahead buffer             │
│ - DDIM sampling (20 steps, ~200ms)    │
│ - Cross-fade & streaming              │
└───────────────────────────────────────┘
    ↓
Vocoder → Audio Stream → Speakers
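
The phrase-mapper stage above can be illustrated with a brute-force cosine k-NN search. This is only a sketch in pure Python: the function names (`project_motion`, `top_k_phrases`), the random projection weights, and the toy database are assumptions made for illustration. The design above calls for a trained mapper and a FAISS index instead.

```python
import math
import random

MOTION_DIM = 104   # e_motion size from the diagram
PHRASE_DIM = 256   # e_phrase size from the diagram

def project_motion(motion, weights):
    """Linear map from motion space to phrase space (stand-in for the trained mapper)."""
    return [sum(w * m for w, m in zip(row, motion)) for row in weights]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def top_k_phrases(query, phrase_db, k=5):
    """Return the k (phrase_id, similarity) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query, emb)) for pid, emb in phrase_db.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]

# Toy data: random weights and embeddings, purely hypothetical.
rng = random.Random(0)
weights = [[rng.gauss(0, 1) for _ in range(MOTION_DIM)] for _ in range(PHRASE_DIM)]
phrase_db = {f"phrase_{i}": [rng.gauss(0, 1) for _ in range(PHRASE_DIM)] for i in range(50)}
motion_latent = [rng.gauss(0, 1) for _ in range(MOTION_DIM)]

query = project_motion(motion_latent, weights)
neighbours = top_k_phrases(query, phrase_db, k=5)
```

A linear scan like this is fine for a few thousand phrases. An approximate index mainly pays off once the database grows well beyond that.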

---

Module Structure

diffusion/
├── README.md                      # This file
├── __init__.py
│
├── data/                          # Data processing
│   ├── __init__.py
│   ├── audio_loader.py           # Load and normalize audio
│   ├── beat_tracker.py           # Beat detection (madmom)
│   ├── segmenter.py              # Phrase segmentation
│   ├── feature_extractor.py     # Extract rhythm/harmony/timbre
│   └── phrase_database.py        # SQLite + FAISS index
│
├── models/                        # Neural network architectures
│   ├── __init__.py
│   ├── vqvae/                    # Audio tokenizer
│   │   ├── __init__.py
│   │   ├── encoder.py
│   │   ├── decoder.py
│   │   ├── codebook.py
│   │   └── vocoder.py            # HiFi-GAN
│   │
│   ├── diffusion/                # Diffusion models
│   │   ├── __init__.py
│   │   ├── unet.py              # U-Net architecture
│   │   ├── conditioning.py      # FiLM/cross-attention
│   │   ├── noise_schedule.py    # Cosine schedule
│   │   └── sampler.py           # DDIM/DDPM sampling
│   │
│   └── phrase_mapper.py          # Motion → phrase space
│
├── training/                      # Training loops
│   ├── __init__.py
│   ├── train_vqvae.py
│   ├── train_diffusion.py
│   ├── train_phrase_mapper.py
│   └── losses.py                 # Custom loss functions
│
├── inference/                     # Real-time generation
│   ├── __init__.py
│   ├── conductor.py              # Streaming controller
│   ├── phrase_retrieval.py      # k-NN search
│   └── audio_buffer.py           # Circular buffer + crossfade
│
├── configs/                       # Configuration files
│   ├── vqvae.yaml
│   ├── diffusion.yaml
│   └── inference.yaml
│
└── scripts/                       # Utilities
    ├── build_phrase_database.py  # Process music library
    ├── train_pipeline.py         # Full training pipeline
    └── live_demo.py              # Real-time performance demo
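
As an illustration of what `inference/audio_buffer.py` is described as doing (circular buffer + crossfade), here is a minimal equal-power crossfade between two consecutive audio chunks. The function name, the mono list-of-floats representation, and the sine/cosine gain curve are assumptions. The actual module may use a different curve or operate on tensors.

```python
import math

def equal_power_crossfade(tail, head, overlap):
    """Join two mono sample lists, blending the last `overlap` samples of `tail`
    with the first `overlap` samples of `head` using cosine/sine gains."""
    if overlap <= 0 or overlap > min(len(tail), len(head)):
        raise ValueError("overlap must be between 1 and the shorter chunk length")
    blended = []
    for i in range(overlap):
        # Phase sweeps 0 -> pi/2 so that gain_out**2 + gain_in**2 == 1 throughout.
        phase = (i / (overlap - 1)) * (math.pi / 2) if overlap > 1 else math.pi / 4
        gain_out, gain_in = math.cos(phase), math.sin(phase)
        blended.append(gain_out * tail[len(tail) - overlap + i] + gain_in * head[i])
    return tail[:-overlap] + blended + head[overlap:]

chunk_a = [1.0] * 8
chunk_b = [-1.0] * 8
mixed = equal_power_crossfade(chunk_a, chunk_b, overlap=4)
```

The output is `overlap` samples shorter than the two chunks laid end to end, so a streaming caller has to account for that when scheduling the next chunk.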

---

Key Design Decisions

### 1. Why Spectrogram Space (Not Raw Waveform)?
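- Faster: A mel-spectrogram has ~256× fewer time steps than the raw waveform (assuming a hop length of 256 samples), so the model works on much shorter sequences
- Stable: Diffusion tends to converge faster on spectrograms than on raw waveforms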
- Controllable: Frequency bins map naturally to musical features
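
To make the size comparison concrete, the sketch below counts frames for an 8-second mono clip. The hop length (256) and number of mel bins (80) are assumed values, not settings taken from this project's configs. The frame formula matches a centered STFT.

```python
def mel_shape(num_samples, hop_length=256, n_mels=80):
    """(n_mels, num_frames) for a centered STFT-style mel-spectrogram."""
    num_frames = 1 + num_samples // hop_length
    return n_mels, num_frames

sample_rate = 44_100
num_samples = 8 * sample_rate                 # 8 s of mono audio
n_mels, num_frames = mel_shape(num_samples)

# Ratio of waveform samples to spectrogram frames (time axis only).
time_compression = num_samples / num_frames
# Ratio of waveform samples to total spectrogram values (frames x bins).
value_compression = num_samples / (n_mels * num_frames)
```

Under these assumptions the time axis shrinks by roughly 256×, while the total number of values shrinks by only about 3×, because every frame carries 80 mel bins.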

### 2. Why VQ-VAE + Vocoder?
- Compression: 4 bars @ 120 BPM ≈ 8 s; at 44.1kHz that is 352,800 samples per channel (705,600 in stereo) → ~4096 tokens
- Quality: HiFi-GAN vocoder produces high-fidelity audio
- Modularity: Can swap vocoders (BigVGAN, UnivNet) later
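
The tokenization step at the heart of a VQ-VAE is a nearest-neighbour lookup into the codebook. The sketch below shows that lookup in pure Python. The codebook size (2048) comes from the architecture overview, while the latent dimension, the random codebook, and the `quantize` name are assumptions for illustration.

```python
import random

def quantize(latent, codebook):
    """Return (index, code vector) of the codebook entry nearest to `latent`
    under squared Euclidean distance."""
    def sq_dist(code):
        return sum((a - b) ** 2 for a, b in zip(latent, code))
    index = min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))
    return index, codebook[index]

rng = random.Random(0)
CODEBOOK_SIZE = 2048   # from the architecture overview
LATENT_DIM = 8         # assumed; kept small so the example runs quickly
codebook = [[rng.uniform(-1, 1) for _ in range(LATENT_DIM)] for _ in range(CODEBOOK_SIZE)]

# A latent sitting very close to entry 7 should be assigned token 7.
noisy = [v + 1e-4 for v in codebook[7]]
token, code = quantize(noisy, codebook)
```

During training, a real VQ-VAE also needs a straight-through gradient and commitment loss around this lookup. Those are omitted here.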

### 3. Why DDIM Sampling (Not DDPM)?
- Speed: 20 steps vs. 1000 steps (50× fewer denoising passes)
- Quality: Output is typically close to full DDPM sampling at this step count
- Latency: ~200ms generation time (acceptable for live)
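
The step reduction can be sketched as deterministic DDIM sampling (eta = 0) over a subsampled cosine schedule. This is a minimal pure-Python illustration, not the project's `sampler.py`: the function names, the clipping constant, and the "oracle" noise predictor that stands in for the trained U-Net are all assumptions.

```python
import math
import random

def cosine_alpha_bar(num_train_steps, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..T (cosine schedule)."""
    def f(t):
        return math.cos((t / num_train_steps + s) / (1 + s) * math.pi / 2) ** 2
    # Clip away from zero so the division in the DDIM step stays finite at t = T.
    return [max(f(t) / f(0), 1e-5) for t in range(num_train_steps + 1)]

def ddim_timesteps(num_train_steps, num_sample_steps):
    """Evenly spaced, descending timesteps from T down to 0."""
    stride = num_train_steps / num_sample_steps
    return [round(num_train_steps - i * stride) for i in range(num_sample_steps + 1)]

def ddim_sample(eps_model, x, alpha_bar, timesteps):
    """Deterministic DDIM (eta = 0) over a list of floats."""
    for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
        a_t, a_prev = alpha_bar[t], alpha_bar[t_prev]
        eps = eps_model(x, t)
        # Estimate the clean signal, then re-noise it to the previous level.
        x0_pred = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(a_prev) * x0 + math.sqrt(1 - a_prev) * ei for x0, ei in zip(x0_pred, eps)]
    return x

rng = random.Random(0)
clean = [rng.gauss(0, 1) for _ in range(16)]
noise = [rng.gauss(0, 1) for _ in range(16)]
alpha_bar = cosine_alpha_bar(1000)
steps = ddim_timesteps(1000, 20)          # 20 updates instead of 1000

a_T = alpha_bar[steps[0]]
x_T = [math.sqrt(a_T) * c + math.sqrt(1 - a_T) * n for c, n in zip(clean, noise)]

def oracle(x, t):
    """Hypothetical perfect noise predictor, used only to check the update rule."""
    return noise

recovered = ddim_sample(oracle, x_T, alpha_bar, steps)
```

With a perfect predictor the sampler recovers the clean signal exactly, which is a useful sanity check before plugging in a real model.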

### 4. Why 4-Bar Buffer?
- Musical: 4 bars = 1 phrase in most music
- Lead time: 4 bars ≈ 8 seconds @ 120 BPM (in 4/4), a comfortable margin over the ~200ms generation time
- Coherence: Long enough for harmonic/rhythmic structure
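
The timing arithmetic behind the buffer size is small enough to check directly. The sketch assumes 4/4 time (`beats_per_bar=4`) and reuses the ~200ms generation estimate from the DDIM section. The helper name is hypothetical.

```python
def bars_to_seconds(bars, bpm, beats_per_bar=4):
    """Duration of `bars` bars at tempo `bpm`, assuming `beats_per_bar` beats per bar."""
    return bars * beats_per_bar * 60.0 / bpm

lead_time = bars_to_seconds(4, 120)          # 8.0 seconds
samples = round(lead_time * 44_100)          # 352800 samples per channel
generation_time = 0.2                        # ~200 ms estimate, in seconds
headroom = lead_time / generation_time       # buffer length relative to generation time
```

At faster tempos the lead time shrinks (4 bars at 174 BPM is about 5.5 seconds), so the margin should be re-checked for the fastest material in the library.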

---

Dependencies

```text
# Audio Processing
librosa>=0.10.0
soundfile>=0.12.0
resampy>=0.4.0
madmom>=0.17.0          # Beat tracking
aubio>=0.4.9            # Onset detection

# Feature Extraction
essentia>=2.1b6         # Music information retrieval
pyAudioAnalysis>=0.3.14

# Vector Search
faiss-cpu>=1.7.4        # or faiss-gpu for faster search
hnswlib>=0.7.0          # Alternative lightweight index

# Neural Networks
torch>=2.0.0
torchaudio>=2.0.0
einops>=0.6.0
accelerate>=0.20.0      # Multi-GPU training

# Vocoder
hifi-gan>=0.1.0         # Will install separately
vocos>=0.0.1            # Alternative vocoder

# Diffusion
diffusers>=0.21.0       # Hugging Face diffusion library
k-diffusion>=0.0.16     # Alternative samplers

# Database
# sqlite3 ships with the Python standard library (no install needed)
sqlalchemy>=2.0.0

# Utilities
pydub>=0.25.0
tqdm>=4.65.0
wandb>=0.15.0           # Training visualization
tensorboard>=2.13.0
```

---

Training Data Requirements

### Minimum Viable
- 20-50 tracks from your library
- 2-4 hours of audio total
- Consistent genre/style (for coherence)

### Recommended
- 100-200 tracks
- 8-12 hours of audio
- Diverse but related (e.g., techno + house + minimal)

### Optimal
- 500+ tracks
- 30+ hours of audio
- Your complete archive

---

Hardware Requirements

### Training
- GPU: NVIDIA RTX 3090/4090 or A100 (24GB+ VRAM)
- RAM: 32GB+ system RAM
- Storage: 100GB+ for dataset + checkpoints
- Time:
  - VQ-VAE: 2-3 days
  - Diffusion: 5-7 days
  - Phrase Mapper: 1 day

### Inference (Live Performance)
- GPU: RTX 3060 or Mac M1 Pro/Max
- Latency: < 300ms (motion → audio)
- Buffer: 2GB RAM for 4-bar lookahead

---

Quick Start

```bash
# 1. Build phrase database from your music library
python scripts/build_phrase_database.py \
    --input_dir [home-path] \
    --output_dir data/phrase_db \
    --min_phrase_bars 4 \
    --max_phrase_bars 16

# 2. Train VQ-VAE tokenizer
python training/train_vqvae.py \
    --config configs/vqvae.yaml \
    --data_dir data/phrase_db \
    --output_dir checkpoints/vqvae

# 3. Train diffusion model
python training/train_diffusion.py \
    --config configs/diffusion.yaml \
    --vqvae_checkpoint checkpoints/vqvae/best.pt \
    --output_dir checkpoints/diffusion

# 4. Train phrase mapper (optional, if you have motion data)
python training/train_phrase_mapper.py \
    --config configs/phrase_mapper.yaml \
    --motion_sessions data/motion_recordings \
    --phrase_db data/phrase_db

# 5. Run live performance
python scripts/live_demo.py \
    --diffusion_checkpoint checkpoints/diffusion/best.pt \
    --vqvae_checkpoint checkpoints/vqvae/best.pt \
    --phrase_db data/phrase_db \
    --motion_input iphone
```
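
Step 1 writes phrase metadata to SQLite. As a rough picture of what such a store might look like, here is a sketch using the standard-library `sqlite3` module. The table name, column names, and the bar-length filter are hypothetical. They mirror the `--min_phrase_bars` / `--max_phrase_bars` flags but are not taken from `phrase_database.py`.

```python
import sqlite3

def create_phrase_db(path=":memory:"):
    """Open (or create) a phrase metadata store with a single hypothetical table."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS phrases (
            phrase_id   INTEGER PRIMARY KEY,
            track_path  TEXT NOT NULL,
            start_bar   INTEGER NOT NULL,
            num_bars    INTEGER NOT NULL,
            bpm         REAL NOT NULL
        )
        """
    )
    return conn

def add_phrase(conn, track_path, start_bar, num_bars, bpm, min_bars=4, max_bars=16):
    """Insert a phrase if its length is within bounds; return its id, else None."""
    if not min_bars <= num_bars <= max_bars:
        return None
    cur = conn.execute(
        "INSERT INTO phrases (track_path, start_bar, num_bars, bpm) VALUES (?, ?, ?, ?)",
        (track_path, start_bar, num_bars, bpm),
    )
    conn.commit()
    return cur.lastrowid

conn = create_phrase_db()
kept = add_phrase(conn, "track_a.wav", start_bar=0, num_bars=8, bpm=124.0)
rejected = add_phrase(conn, "track_a.wav", start_bar=8, num_bars=2, bpm=124.0)
count = conn.execute("SELECT COUNT(*) FROM phrases").fetchone()[0]
```

In a design like this the integer `phrase_id` can double as the row id in the vector index, so a k-NN hit maps straight back to its metadata.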

---

Performance Tuning

### Reduce Latency
1. Use fp16 inference (up to ~2× speedup on GPUs with fast half-precision support)
2. Reduce DDIM steps (20 → 10 steps, slight quality loss)
3. Smaller U-Net (fewer channels/layers)
4. Distillation (train 1-step model from 20-step teacher)

### Improve Quality
1. More training data (50 → 200 tracks)
2. Longer training (100K → 500K steps)
3. Better vocoder (HiFi-GAN → BigVGAN)
4. Guidance scale tuning (1.0 → 3.0)
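
The guidance scale in item 4 is not tied to a specific method in this README. Assuming classifier-free guidance, the scale blends an unconditional and a conditional noise prediction as sketched below. The function name and the toy values are illustrative only.

```python
def guided_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and past) the conditional one by `guidance_scale`."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]   # prediction with conditioning dropped
eps_cond = [1.0, 3.0]     # prediction with e_phrase / e_motion supplied

plain = guided_noise(eps_uncond, eps_cond, 1.0)    # [1.0, 3.0], same as eps_cond
strong = guided_noise(eps_uncond, eps_cond, 3.0)   # [3.0, 7.0], conditioning amplified
```

A scale of 1.0 reproduces the conditional prediction. Larger values follow the conditioning more strongly, usually at some cost in diversity, and each sampling step then needs two forward passes unless they are batched together.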

### Scale Up
1. Multi-GPU training (DDP, FSDP)
2. Mixed precision (fp16/bf16)
3. Gradient checkpointing (lower memory)
4. Model parallelism (for very large models)

---

Roadmap

- [x] Architecture design
- [ ] Phase 0: Setup (this week)
- [ ] Phase 1: Phrase database builder (week 2)
- [ ] Phase 2: VQ-VAE tokenizer (weeks 3-4)
- [ ] Phase 3: Diffusion model (weeks 5-7)
- [ ] Phase 4: Motion integration (week 8)
- [ ] Phase 5: Real-time conductor (weeks 9-10)
- [ ] Phase 6: Live performance testing (weeks 11-12)

---

References

### Research Papers
- Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- Denoising Diffusion Implicit Models (Song et al., 2020)
- Stable Diffusion (Rombach et al., 2022) - Latent diffusion
- Dance Diffusion (Harmonai, 2022) - Audio diffusion
- AudioLDM (Liu et al., 2023) - Text-to-audio diffusion

### Code References
- [Harmonai-org/sample-generator](https://github.com/Harmonai-org/sample-generator) - Dance Diffusion
- [lucidrains/audio-diffusion-pytorch](https://github.com/lucidrains/audio-diffusion-pytorch)
- [jiaaro/pydub](https://github.com/jiaaro/pydub) - Audio manipulation
- [CPJKU/madmom](https://github.com/CPJKU/madmom) - Beat tracking

---

Contact & Support

For questions about this module:
- Check `docs/diffusion_system_guide.md` for detailed explanations
- See `examples/diffusion_demos/` for usage examples
- Open an issue with the `diffusion` tag

---

Last Updated: October 30, 2025
Status: Phase 0 - Setup
Next Milestone: Phrase Database Builder

Source Anchor

Comp-Core/core/ml/cc-ml/diffusion/README.md