Phrase-Conditioned Spectrogram Diffusion System

Extracted abstract or opening context

This module implements a diffusion-based generative audio system that learns from your music library and generates new audio conditioned on:

1. **Phrase embeddings** from your existing tracks
2. **Motion embeddings** from your body movement

### 1. **Why Spectrogram Space (Not Raw Waveform)?**

- **Faster**: Mel-spectrogram is ~256× smaller than raw audio
- **Stable**: Diffusion converges faster on spectrograms
- **Controllable**: Frequency bins map naturally to musical features

### 2. **Why VQ-VAE + Vocoder?**

- **Compression**: 44.1 kHz × 4 bars = 705,600 samples → ~4096 tokens
- **Quality**: HiFi-GAN vocoder produces high-fidelity audio
- **Modularity**: Can swap vocoders (BigVGAN, UnivNet) later

### 3. **Why DDIM Sampling (Not DDPM)?**

- **Speed**: 20 steps vs. 1000 steps (50× faster)
- **Quality**: Nearly identical output quality
- **Latency**: ~200 ms generation time (acceptable for live)

### 4. **Why 4-Bar Buffer?**

- **Musical**: 4 bars = 1 phrase in most music
- **Latency**: ~8 seconds @ 120 BPM = comfortable lead time
- **Coherence**: Long enough for harmonic/rhythmic structure
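The figures quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper names (`bars_to_seconds`, `sample_count`, `ddim_timesteps`) are hypothetical and not taken from the module, 4/4 time is assumed, and the evenly strided timestep schedule is one common way to subsample DDIM steps, not necessarily the one used here. Note that 8 seconds of 44.1 kHz audio is 352,800 samples per channel, so the 705,600 figure matches a two-channel (stereo) buffer; the source does not state the channel count, so that reading is an assumption.

```python
def bars_to_seconds(bars: int, bpm: float, beats_per_bar: int = 4) -> float:
    """Duration of `bars` bars at `bpm`, assuming `beats_per_bar` beats per bar."""
    return bars * beats_per_bar * 60.0 / bpm


def sample_count(seconds: float, sample_rate: int = 44_100, channels: int = 1) -> int:
    """Total number of PCM samples across all channels."""
    return round(seconds * sample_rate * channels)


def ddim_timesteps(num_train_steps: int = 1000, num_sample_steps: int = 20) -> list[int]:
    """Evenly strided subset of training timesteps, ordered from noisiest to cleanest.

    Assumption: a uniform stride; other spacings are possible.
    """
    stride = num_train_steps // num_sample_steps
    return list(range(0, num_train_steps, stride))[::-1]


buffer_seconds = bars_to_seconds(bars=4, bpm=120)        # 8.0 seconds
mono_samples = sample_count(buffer_seconds)               # 352800
stereo_samples = sample_count(buffer_seconds, channels=2)  # 705600

tokens = 4096  # approximate token count quoted in the text
print(f"buffer: {buffer_seconds:.1f} s")
print(f"samples: {mono_samples} mono, {stereo_samples} stereo")
print(f"samples per token (stereo): {stereo_samples / tokens:.1f}")

steps = ddim_timesteps()
print(f"DDIM visits {len(steps)} of 1000 timesteps: {steps[:3]} ... {steps[-1]}")

# If the ~200 ms budget is dominated by denoising, each step gets about 10 ms.
print(f"per-step budget: {200 / len(steps):.0f} ms")
```

Under these assumptions the 4-bar buffer (8 s) is much longer than the ~200 ms generation time, which is what makes the lead time comfortable for live use.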

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.