Phrase-Conditioned Spectrogram Diffusion System
"Embodied Memory Synthesis"
This module implements a diffusion-based generative audio system that learns from your music library and generates new audio conditioned on:
1. Phrase embeddings from your existing tracks
2. Motion embeddings from your body movement
---
Architecture Overview
┌─────────────────────────────────────────────────────────────────┐
│ OFFLINE TRAINING PHASE │
└─────────────────────────────────────────────────────────────────┘
Music Library (WAV files)
↓
┌───────────────────────────────────────┐
│ 1. PHRASE DATABASE BUILDER │
│ - Beat tracking (madmom) │
│ - Segmentation (novelty + structure) │
│ - Feature extraction: │
│ * Rhythm (tempogram, onsets) │
│ * Harmony (chroma, key) │
│ * Timbre (MFCCs, spectral) │
└───────────────────────────────────────┘
↓
Phrase Database (SQLite + FAISS index)
├─ audio_segments/ (WAV chunks)
├─ features/ (mel-spectrograms)
└─ embeddings/ (e_phrase vectors)
↓
┌───────────────────────────────────────┐
│ 2. VQ-VAE TOKENIZER │
│ - Encoder: waveform → latent │
│ - Codebook: 2048 tokens │
│ - Decoder: HiFi-GAN vocoder │
└───────────────────────────────────────┘
↓
Trained Tokenizer (checkpoints/vqvae/)
↓
┌───────────────────────────────────────┐
│ 3. DIFFUSION MODEL │
│ - U-Net (mel-spectrogram space) │
│ - Conditioning: │
│ * e_phrase (256-D) │
│ * e_motion (104-D from RPS) │
│ * bar_position (positional) │
│ - DDIM sampling (20 steps) │
└───────────────────────────────────────┘
↓
Trained Diffusion Model (checkpoints/diffusion/)
┌─────────────────────────────────────────────────────────────────┐
│ REAL-TIME INFERENCE PHASE │
└─────────────────────────────────────────────────────────────────┘
iPhone Motion
↓
RPS Encoder → Normalizer → [motion_latent (104-D)]
↓
┌───────────────────────────────────────┐
│ PHRASE MAPPER │
│ - Project motion → phrase space │
│ - k-NN search in phrase database │
│ - Retrieve top-5 similar phrases │
└───────────────────────────────────────┘
↓
[e_phrase (256-D)] + [e_motion (104-D)]
↓
┌───────────────────────────────────────┐
│ DIFFUSION CONDUCTOR │
│ - 4-bar look-ahead buffer │
│ - DDIM sampling (20 steps, ~200ms) │
│ - Cross-fade & streaming │
└───────────────────────────────────────┘
↓
Vocoder → Audio Stream → Speakers
---
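The phrase mapper stage in the diagram (project motion into phrase space, then retrieve the top-5 neighbours) can be sketched as below. This is a minimal illustration, not the module's implementation: it uses a brute-force cosine search in plain Python in place of the FAISS index, and the linear projection, the function names, and the random test data are all assumptions made for the example. How the five retrieved phrases are combined into a single `e_phrase` is not specified here, so the sketch stops at retrieval.

```python
import heapq
import math
import random

MOTION_DIM = 104   # motion latent size from the diagram
PHRASE_DIM = 256   # phrase embedding size from the diagram


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def project_motion(motion, weights):
    """Hypothetical linear projection: weights holds PHRASE_DIM rows of MOTION_DIM values."""
    return [sum(w * m for w, m in zip(row, motion)) for row in weights]


def top_k_phrases(query, phrase_embeddings, k=5):
    """Return indices of the k phrases most similar to query (cosine), best first."""
    q = l2_normalize(query)
    scored = (
        (sum(a * b for a, b in zip(q, l2_normalize(emb))), idx)
        for idx, emb in enumerate(phrase_embeddings)
    )
    return [idx for _, idx in heapq.nlargest(k, scored)]


# Usage with random stand-in data (a trained mapper and a real database would replace these).
rng = random.Random(0)
phrase_db = [[rng.gauss(0, 1) for _ in range(PHRASE_DIM)] for _ in range(50)]
weights = [[rng.gauss(0, 1) for _ in range(MOTION_DIM)] for _ in range(PHRASE_DIM)]
motion_latent = [rng.gauss(0, 1) for _ in range(MOTION_DIM)]
neighbours = top_k_phrases(project_motion(motion_latent, weights), phrase_db, k=5)
```

For a large library, the brute-force loop would be swapped for an approximate index such as the FAISS index named in the diagram.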
Module Structure
diffusion/
├── README.md # This file
├── __init__.py
│
├── data/ # Data processing
│ ├── __init__.py
│ ├── audio_loader.py # Load and normalize audio
│ ├── beat_tracker.py # Beat detection (madmom)
│ ├── segmenter.py # Phrase segmentation
│ ├── feature_extractor.py # Extract rhythm/harmony/timbre
│ └── phrase_database.py # SQLite + FAISS index
│
├── models/ # Neural network architectures
│ ├── __init__.py
│ ├── vqvae/ # Audio tokenizer
│ │ ├── __init__.py
│ │ ├── encoder.py
│ │ ├── decoder.py
│ │ ├── codebook.py
│ │ └── vocoder.py # HiFi-GAN
│ │
│ ├── diffusion/ # Diffusion models
│ │ ├── __init__.py
│ │ ├── unet.py # U-Net architecture
│ │ ├── conditioning.py # FiLM/cross-attention
│ │ ├── noise_schedule.py # Cosine schedule
│ │ └── sampler.py # DDIM/DDPM sampling
│ │
│ └── phrase_mapper.py # Motion → phrase space
│
├── training/ # Training loops
│ ├── __init__.py
│ ├── train_vqvae.py
│ ├── train_diffusion.py
│ ├── train_phrase_mapper.py
│ └── losses.py # Custom loss functions
│
├── inference/ # Real-time generation
│ ├── __init__.py
│ ├── conductor.py # Streaming controller
│ ├── phrase_retrieval.py # k-NN search
│ └── audio_buffer.py # Circular buffer + crossfade
│
├── configs/ # Configuration files
│ ├── vqvae.yaml
│ ├── diffusion.yaml
│ ├── phrase_mapper.yaml # Used by training/train_phrase_mapper.py
│ └── inference.yaml
│
└── scripts/ # Utilities
├── build_phrase_database.py # Process music library
├── train_pipeline.py # Full training pipeline
└── live_demo.py # Real-time performance demo
---
Key Design Decisions
### 1. Why Spectrogram Space (Not Raw Waveform)?
- Faster: With a hop length of 256 samples (a common choice, assumed here), a mel-spectrogram has ~256× fewer time steps than the raw waveform
- Stable: Diffusion converges faster on spectrograms
- Controllable: Frequency bins map naturally to musical features
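How much smaller a mel-spectrogram is depends on the hop length and the number of mel bands, neither of which is fixed in this README. The sketch below works through the arithmetic for 8 seconds of 44.1 kHz audio, assuming a hop length of 256 and 80 mel bands (both illustrative values) and the usual centered framing of `1 + num_samples // hop_length` frames.

```python
def mel_shape(num_samples, hop_length=256, n_mels=80):
    """Return (n_mels, n_frames) for a centered, STFT-based mel-spectrogram."""
    n_frames = 1 + num_samples // hop_length
    return n_mels, n_frames


num_samples = 8 * 44_100                        # 8 s of mono audio = 352,800 samples
n_mels, n_frames = mel_shape(num_samples)       # (80, 1379)

time_reduction = num_samples / n_frames         # ~256x fewer steps along the time axis
value_reduction = num_samples / (n_mels * n_frames)  # ~3.2x fewer values overall
```

Under these assumptions the large saving is in sequence length, which is what the U-Net has to model over time; the total number of values shrinks by a much smaller factor.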
### 2. Why VQ-VAE + Vocoder?
- Compression: 4 bars of 4/4 at 120 BPM = 8 s; at 44.1 kHz that is 352,800 samples per channel (705,600 in stereo) → ~4096 tokens
- Quality: HiFi-GAN vocoder produces high-fidelity audio
- Modularity: Can swap vocoders (BigVGAN, UnivNet) later
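The compression figure can be checked with a few lines of arithmetic. The sketch below assumes 4/4 time at 120 BPM and 16-bit PCM for the byte comparison (both assumptions for illustration); the 2048-entry codebook and the ~4096-token target come from this README.

```python
import math


def phrase_samples(bars, bpm, sample_rate=44_100, beats_per_bar=4, channels=1):
    """Number of audio samples in a phrase of the given length and tempo."""
    seconds = bars * beats_per_bar * 60.0 / bpm
    return round(seconds * sample_rate) * channels


mono_samples = phrase_samples(4, 120)                  # 352800
stereo_samples = phrase_samples(4, 120, channels=2)    # 705600

num_tokens = 4096
bits_per_token = math.ceil(math.log2(2048))            # 11 bits index a 2048-entry codebook
token_bytes = num_tokens * bits_per_token // 8         # 5632
pcm_bytes = mono_samples * 2                           # 16-bit mono PCM

samples_per_token = mono_samples / num_tokens          # ~86 samples per token
byte_ratio = pcm_bytes / token_bytes                   # ~125x smaller than 16-bit PCM
```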
### 3. Why DDIM Sampling (Not DDPM)?
- Speed: 20 sampling steps instead of 1000 (50× fewer network evaluations)
- Quality: Output quality stays close to full DDPM sampling
- Latency: ~200ms generation time (acceptable for live)
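To make the 20-step claim concrete, here is a minimal sketch of deterministic DDIM sampling (eta = 0) over a strided subset of 1000 training timesteps, using the cosine schedule of Nichol and Dhariwal with `s = 0.008`. It operates on plain Python lists rather than tensors, and `predict_noise` stands in for the trained U-Net; the function names and the stride-based timestep choice are assumptions for illustration, not this module's `sampler.py`.

```python
import math


def cosine_alpha_bar(t, num_train_steps=1000, s=0.008):
    """Cumulative signal level alpha_bar(t) under the cosine schedule."""

    def f(u):
        return math.cos((u / num_train_steps + s) / (1 + s) * math.pi / 2) ** 2

    return f(t) / f(0)


def ddim_timesteps(num_train_steps=1000, num_sample_steps=20):
    """Evenly strided timesteps in descending order: 999, 949, ..., 49."""
    stride = num_train_steps // num_sample_steps
    return [stride * (i + 1) - 1 for i in reversed(range(num_sample_steps))]


def ddim_sample(x_T, predict_noise, num_train_steps=1000, num_sample_steps=20):
    """Run deterministic DDIM updates from noise x_T down to a clean sample."""
    x = list(x_T)
    steps = ddim_timesteps(num_train_steps, num_sample_steps)
    for i, t in enumerate(steps):
        ab_t = cosine_alpha_bar(t, num_train_steps)
        # After the final step the sample is treated as fully denoised.
        has_next = i + 1 < len(steps)
        ab_prev = cosine_alpha_bar(steps[i + 1], num_train_steps) if has_next else 1.0
        eps = predict_noise(x, t)
        # Estimate the clean signal, then re-noise it to the previous noise level.
        x0_pred = [(xi - math.sqrt(1 - ab_t) * ei) / math.sqrt(ab_t) for xi, ei in zip(x, eps)]
        x = [math.sqrt(ab_prev) * x0 + math.sqrt(1 - ab_prev) * ei for x0, ei in zip(x0_pred, eps)]
    return x
```

Each loop iteration costs one network call, so the sampler above makes 20 calls where ancestral DDPM sampling over the same schedule would make 1000.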
### 4. Why 4-Bar Buffer?
- Musical: 4 bars is a typical phrase length in much 4/4 music
- Latency: 4 bars of 4/4 at 120 BPM = 8 seconds of lead time for generation
- Coherence: Long enough for harmonic/rhythmic structure
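Consecutive 4-bar buffers have to be joined without clicks, which is where the cross-fade listed for the conductor comes in. A minimal sketch of an equal-power crossfade between two buffers follows; the function names and the choice of an equal-power curve (usually suited to uncorrelated material) are assumptions, since the actual `audio_buffer.py` behaviour is not described here.

```python
import math


def equal_power_crossfade(tail, head):
    """Blend two equal-length sample lists with cos/sin gains (gain_out^2 + gain_in^2 = 1)."""
    if len(tail) != len(head):
        raise ValueError("tail and head must have the same length")
    n = len(tail)
    blended = []
    for i in range(n):
        theta = (i + 0.5) / n * math.pi / 2
        blended.append(tail[i] * math.cos(theta) + head[i] * math.sin(theta))
    return blended


def join_buffers(prev, nxt, overlap):
    """Concatenate two buffers, crossfading the last `overlap` samples of prev into nxt."""
    if overlap < 0 or overlap > min(len(prev), len(nxt)):
        raise ValueError("overlap must fit inside both buffers")
    if overlap == 0:
        return list(prev) + list(nxt)
    faded = equal_power_crossfade(prev[-overlap:], nxt[:overlap])
    return list(prev[:-overlap]) + faded + list(nxt[overlap:])
```

The joined buffer is `overlap` samples shorter than the two inputs combined, so the generator needs to produce each buffer slightly longer than 4 bars if the output is to stay on the beat grid.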
---
Dependencies
# Audio Processing
librosa>=0.10.0
soundfile>=0.12.0
resampy>=0.4.0
madmom>=0.16.1 # Beat tracking
aubio>=0.4.9 # Onset detection
# Feature Extraction
essentia>=2.1b6 # Music information retrieval
pyAudioAnalysis>=0.3.14
# Vector Search
faiss-cpu>=1.7.4 # or faiss-gpu for faster search
hnswlib>=0.7.0 # Alternative lightweight index
# Neural Networks
torch>=2.0.0
torchaudio>=2.0.0
einops>=0.6.0
accelerate>=0.20.0 # Multi-GPU training
# Vocoder
hifi-gan>=0.1.0 # Will install separately
vocos>=0.0.1 # Alternative vocoder
# Diffusion
diffusers>=0.21.0 # Hugging Face diffusion library
k-diffusion>=0.0.16 # Alternative samplers
# Database
# sqlite3 is part of the Python standard library (nothing to install)
sqlalchemy>=2.0.0
# Utilities
pydub>=0.25.0
tqdm>=4.65.0
wandb>=0.15.0 # Training visualization
tensorboard>=2.13.0
---
Training Data Requirements
### Minimum Viable
- 20-50 tracks from your library
- 2-4 hours of audio total
- Consistent genre/style (for coherence)
### Recommended
- 100-200 tracks
- 8-12 hours of audio
- Diverse but related (e.g., techno + house + minimal)
### Optimal
- 500+ tracks
- 30+ hours of audio
- Your complete archive
---
Hardware Requirements
### Training
- GPU: NVIDIA RTX 3090/4090 or A100 (24GB+ VRAM)
- RAM: 32GB+ system RAM
- Storage: 100GB+ for dataset + checkpoints
- Time:
  - VQ-VAE: 2-3 days
  - Diffusion: 5-7 days
  - Phrase Mapper: 1 day
### Inference (Live Performance)
- GPU: RTX 3060 or Mac M1 Pro/Max
- Latency: < 300ms (motion → audio)
- Buffer: 2GB RAM for 4-bar lookahead
---
Quick Start
# 1. Build phrase database from your music library
python scripts/build_phrase_database.py \
--input_dir [home-path] \
--output_dir data/phrase_db \
--min_phrase_bars 4 \
--max_phrase_bars 16
# 2. Train VQ-VAE tokenizer
python training/train_vqvae.py \
--config configs/vqvae.yaml \
--data_dir data/phrase_db \
--output_dir checkpoints/vqvae
# 3. Train diffusion model
python training/train_diffusion.py \
--config configs/diffusion.yaml \
--vqvae_checkpoint checkpoints/vqvae/best.pt \
--output_dir checkpoints/diffusion
# 4. Train phrase mapper (optional, if you have motion data)
python training/train_phrase_mapper.py \
--config configs/phrase_mapper.yaml \
--motion_sessions data/motion_recordings \
--phrase_db data/phrase_db
# 5. Run live performance
python scripts/live_demo.py \
--diffusion_checkpoint checkpoints/diffusion/best.pt \
--vqvae_checkpoint checkpoints/vqvae/best.pt \
--phrase_db data/phrase_db \
--motion_input iphone
---
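The scripts themselves are not shown in this README. As a rough guide to what step 1 expects, the sketch below declares a command-line interface with `argparse` using only the flags that appear in that command. The defaults (taken from the example values) and the validation rule are assumptions, not the real behaviour of `build_phrase_database.py`.

```python
import argparse


def build_parser():
    """Hypothetical CLI for the phrase database builder, mirroring the flags in step 1."""
    parser = argparse.ArgumentParser(description="Build a phrase database from a music library")
    parser.add_argument("--input_dir", required=True, help="Directory containing WAV files")
    parser.add_argument("--output_dir", default="data/phrase_db", help="Where to write the database")
    parser.add_argument("--min_phrase_bars", type=int, default=4, help="Shortest phrase to keep, in bars")
    parser.add_argument("--max_phrase_bars", type=int, default=16, help="Longest phrase to keep, in bars")
    return parser


def parse_args(argv=None):
    """Parse arguments and reject an inverted bar range."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.min_phrase_bars > args.max_phrase_bars:
        parser.error("--min_phrase_bars must not exceed --max_phrase_bars")
    return args
```

Calling `parse_args(["--input_dir", "music"])` would yield the default 4 to 16 bar range used in the command above.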
Performance Tuning
### Reduce Latency
1. Use fp16 inference (up to ~2× speedup, depending on hardware)
2. Reduce DDIM steps (20 → 10 steps, slight quality loss)
3. Smaller U-Net (fewer channels/layers)
4. Distillation (train 1-step model from 20-step teacher)
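A back-of-the-envelope estimate helps when choosing among these options. The sketch below assumes sampling time scales linearly with the number of DDIM steps, starting from the ~200 ms for 20 steps quoted in this README, and treats any precision speedup as a simple divisor. It ignores the vocoder and I/O, so treat the numbers as rough lower bounds; the function name is illustrative.

```python
def estimate_sampling_ms(steps, baseline_steps=20, baseline_ms=200.0, precision_speedup=1.0):
    """Estimate sampling latency, assuming cost is linear in the number of steps."""
    per_step_ms = baseline_ms / baseline_steps
    return steps * per_step_ms / precision_speedup


full = estimate_sampling_ms(20)                                   # 200.0
fewer_steps = estimate_sampling_ms(10)                            # 100.0
fewer_steps_fp16 = estimate_sampling_ms(10, precision_speedup=2.0)  # 50.0
```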
### Improve Quality
1. More training data (50 → 200 tracks)
2. Longer training (100K → 500K steps)
3. Better vocoder (HiFi-GAN → BigVGAN)
4. Guidance scale tuning (1.0 → 3.0)
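The guidance scale in item 4 is assumed here to mean classifier-free guidance, which this README does not state explicitly. Under that assumption, each sampling step runs the model with and without conditioning and extrapolates between the two noise predictions, as in this minimal list-based sketch (the function name is illustrative):

```python
def apply_guidance(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 3.0]

no_guidance = apply_guidance(eps_uncond, eps_cond, 1.0)   # equals eps_cond
strong = apply_guidance(eps_uncond, eps_cond, 3.0)        # pushed past eps_cond
```

A scale of 1.0 reproduces the conditional prediction; larger values follow the phrase and motion conditioning more strongly, typically at some cost in diversity, and double the number of network calls per step.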
### Scale Up
1. Multi-GPU training (DDP, FSDP)
2. Mixed precision (fp16/bf16)
3. Gradient checkpointing (lower memory)
4. Model parallelism (for very large models)
---
Roadmap
- [x] Architecture design
- [ ] Phase 0: Setup (this week)
- [ ] Phase 1: Phrase database builder (week 2)
- [ ] Phase 2: VQ-VAE tokenizer (weeks 3-4)
- [ ] Phase 3: Diffusion model (weeks 5-7)
- [ ] Phase 4: Motion integration (week 8)
- [ ] Phase 5: Real-time conductor (weeks 9-10)
- [ ] Phase 6: Live performance testing (weeks 11-12)
---
References
### Research Papers
- Denoising Diffusion Probabilistic Models (Ho et al., 2020)
- Denoising Diffusion Implicit Models (Song et al., 2020)
- High-Resolution Image Synthesis with Latent Diffusion Models (Rombach et al., 2022) - Latent diffusion (Stable Diffusion)
- Dance Diffusion (Harmonai, 2022) - Audio diffusion
- AudioLDM (Liu et al., 2023) - Text-to-audio diffusion
### Code References
- [Harmonai-org/sample-generator](https://github.com/Harmonai-org/sample-generator) - Dance Diffusion
- [archinetai/audio-diffusion-pytorch](https://github.com/archinetai/audio-diffusion-pytorch)
- [jiaaro/pydub](https://github.com/jiaaro/pydub) - Audio manipulation
- [CPJKU/madmom](https://github.com/CPJKU/madmom) - Beat tracking
---
Contact & Support
For questions about this module:
- Check `docs/diffusion_system_guide.md` for detailed explanations
- See `examples/diffusion_demos/` for usage examples
- Open an issue with the `diffusion` tag
---
Last Updated: October 30, 2025
Status: Phase 0 - Setup
Next Milestone: Phrase Database Builder
Source Anchor
Comp-Core/core/ml/cc-ml/diffusion/README.md