
Diffusion System Setup Guide

Status: Phase 0 Complete ✅
Next: Ready to install dependencies and build Phase 1

---

What We've Built So Far

✅ Phase 0: Project Setup (COMPLETE)

1. Project Structure
- Created `diffusion/` module with proper package structure
- Set up configuration system (`configs/`)
- Organized into logical subdirectories (data, models, training, inference)

2. Documentation
- Main README with full architecture overview
- Configuration files with extensive comments
- This setup guide

3. Dependencies
- Updated `requirements.txt` with all necessary packages:
  - Audio processing (librosa, soundfile, madmom, aubio)
  - Vector search (FAISS, hnswlib)
  - Diffusion models (diffusers, k-diffusion)
  - Training utilities (accelerate, wandb)

4. Configuration Files
- `phrase_database.yaml`: Full configuration for building phrase DB
- `vqvae.yaml`: Complete VQ-VAE tokenizer setup

---

Installation Steps

1. Install Base Dependencies

```bash
cd "[home]/Desktop/Computational Choreography"

# Activate your virtual environment
source venv/bin/activate

# Install new diffusion dependencies
pip install --upgrade pip
pip install -r requirements.txt
```

Note: On Apple Silicon Macs (M1/M2), some packages may conflict or need specific versions:
- `faiss-cpu` → works on Mac
- `madmom` → if the default install fails, try `pip install madmom --no-binary madmom`
- `aubio` → run `brew install aubio` first, then `pip install aubio`

2. Verify Installation

```bash
python -c "import librosa, soundfile, faiss, torch, torchaudio; print('✅ All imports successful!')"
```
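
If the one-liner fails, it only reports the first import that broke. A minimal sketch of a per-module check is shown here; the helper name `find_missing` is hypothetical, and the module list is copied from the command above. Note that `importlib.util.find_spec` only locates a module without importing it, so a package that is present but broken would still need the real import to surface the error.

```python
import importlib.util

# Module names taken from the verification one-liner above.
REQUIRED_MODULES = ["librosa", "soundfile", "faiss", "torch", "torchaudio"]


def find_missing(modules):
    """Return the top-level module names that cannot be located in this environment."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not installed.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    missing = find_missing(REQUIRED_MODULES)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All modules located.")
```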

3. Install Optional Tools (Recommended)

```bash
# For better beat tracking
brew install aubio

# For audio playback testing
brew install portaudio
pip install sounddevice

# For experiment tracking (optional)
pip install wandb
wandb login  # If you want to use Weights & Biases
```

---

Next Steps: Phase 1 Implementation

What Phase 1 Will Build

Phrase Database Builder - The foundation of the entire system:

1. Audio Loader (`diffusion/data/audio_loader.py`)
- Load audio files from your music library
- Normalize loudness (LUFS)
- Resample to 44.1 kHz
- Quality filtering
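
As a rough illustration of the normalization and filtering steps, here is a NumPy-only sketch. It uses plain RMS level as a stand-in for LUFS (true LUFS needs the K-weighting and gating defined in ITU-R BS.1770, which this does not implement). The function names and the threshold values are assumptions for illustration, not the planned `audio_loader.py` API.

```python
import numpy as np


def rms_dbfs(audio):
    """Root-mean-square level in dB relative to full scale (amplitude 1.0)."""
    rms = np.sqrt(np.mean(np.square(audio)))
    # Floor the value so silent input does not produce log10(0).
    return 20.0 * np.log10(max(float(rms), 1e-12))


def normalize_rms(audio, target_dbfs=-20.0):
    """Scale audio so its RMS level matches target_dbfs (simplified stand-in for LUFS)."""
    gain_db = target_dbfs - rms_dbfs(audio)
    return audio * (10.0 ** (gain_db / 20.0))


def passes_quality_filter(audio, min_dbfs=-50.0, max_clip_fraction=0.001):
    """Reject near-silent signals and signals with too many clipped samples."""
    clip_fraction = float(np.mean(np.abs(audio) >= 0.999))
    return rms_dbfs(audio) > min_dbfs and clip_fraction <= max_clip_fraction
```

Resampling is left out here on purpose: `librosa.load(path, sr=44100)` already resamples on load, as the environment test later in this guide relies on.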

2. Beat Tracker (`diffusion/data/beat_tracker.py`)
- Detect beats and downbeats using madmom
- Compute tempo curves
- Align to bar grid (4/4 time)
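
To make the bar-grid step concrete, the sketch below groups beat timestamps into 4/4 bars and estimates a single tempo from the median beat interval. It assumes beat times (in seconds) and the index of the first downbeat are already available, for example from madmom; the function names are hypothetical and no madmom calls are shown.

```python
import statistics


def estimate_tempo_bpm(beat_times):
    """Estimate tempo in BPM from the median interval between consecutive beats."""
    intervals = [b - a for a, b in zip(beat_times[:-1], beat_times[1:])]
    return 60.0 / statistics.median(intervals)


def group_beats_into_bars(beat_times, first_downbeat_index=0, beats_per_bar=4):
    """Group beat timestamps into complete bars as (start, end) time pairs.

    Each bar ends at the downbeat of the next bar, so a trailing
    incomplete bar is dropped.
    """
    downbeats = beat_times[first_downbeat_index::beats_per_bar]
    return list(zip(downbeats[:-1], downbeats[1:]))
```

A tempo curve, as opposed to one global value, could apply the same interval calculation over a sliding window of beats.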

3. Segmenter (`diffusion/data/segmenter.py`)
- Detect structural boundaries (intro, verse, chorus, etc.)
- Split into phrases (4-16 bars)
- Handle overlapping windows for continuity
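
The overlapping-window idea can be sketched in a few lines. This assumes fixed-length phrases measured in bars with a fixed hop; `phrase_bars=8` and `hop_bars=4` are illustrative defaults inside the 4-16 bar range, not values taken from `phrase_database.yaml`. Boundary detection (intro, verse, chorus) is not covered here.

```python
def phrase_windows(num_bars, phrase_bars=8, hop_bars=4):
    """Return (start_bar, end_bar) index pairs for overlapping phrases.

    end_bar is exclusive. Windows that would run past the last bar are skipped.
    """
    if phrase_bars <= 0 or hop_bars <= 0:
        raise ValueError("phrase_bars and hop_bars must be positive")
    last_start = num_bars - phrase_bars
    return [(start, start + phrase_bars) for start in range(0, last_start + 1, hop_bars)]
```

With a hop smaller than the phrase length, consecutive phrases share bars, which is what gives continuity between neighbouring entries.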

4. Feature Extractor (`diffusion/data/feature_extractor.py`)
- Mel-spectrograms: Time-frequency representation
- Chroma: Harmonic/pitch content
- MFCCs: Timbral fingerprint
- Tempogram: Rhythmic patterns
- Spectral features: Brightness, rolloff, etc.
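
For intuition about the "brightness" and "rolloff" features, here is a NumPy-only sketch that computes the spectral centroid and rolloff frequency per frame. It is a simplified stand-in: the actual extractor would more likely use librosa (already in `requirements.txt`), and the frame sizes and the 85% rolloff fraction are assumed defaults.

```python
import numpy as np


def spectral_features(audio, sr, frame_length=2048, hop_length=512, rolloff_fraction=0.85):
    """Return per-frame spectral centroid and rolloff, both in Hz."""
    window = np.hanning(frame_length)
    freqs = np.fft.rfftfreq(frame_length, d=1.0 / sr)
    centroids, rolloffs = [], []
    for start in range(0, len(audio) - frame_length + 1, hop_length):
        frame = audio[start:start + frame_length] * window
        magnitude = np.abs(np.fft.rfft(frame))
        total = magnitude.sum()
        if total == 0.0:
            # Silent frame: report zeros instead of dividing by zero.
            centroids.append(0.0)
            rolloffs.append(0.0)
            continue
        # Centroid: magnitude-weighted mean frequency ("brightness").
        centroids.append(float((freqs * magnitude).sum() / total))
        # Rolloff: lowest frequency below which rolloff_fraction of the magnitude lies.
        cumulative = np.cumsum(magnitude)
        index = int(np.searchsorted(cumulative, rolloff_fraction * total))
        rolloffs.append(float(freqs[min(index, len(freqs) - 1)]))
    return np.array(centroids), np.array(rolloffs)
```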

5. Phrase Database (`diffusion/data/phrase_database.py`)
- SQLite database for metadata
- FAISS vector index for similarity search
- Efficient storage and retrieval
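
The split between metadata and vectors can be sketched as follows. To stay dependency-free this uses an in-memory SQLite table plus brute-force NumPy search in place of FAISS; the squared L2 distance it computes is the same metric `faiss.IndexFlatL2` reports. The class name `TinyPhraseStore` and the table schema are hypothetical.

```python
import sqlite3

import numpy as np


class TinyPhraseStore:
    """Metadata in SQLite, embeddings in a NumPy array (stand-in for a FAISS index)."""

    def __init__(self, dim, db_path=":memory:"):
        self.dim = dim
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS phrases ("
            "id INTEGER PRIMARY KEY, track TEXT, start_bar INTEGER, num_bars INTEGER)"
        )
        self.vectors = np.empty((0, dim), dtype=np.float32)

    def add(self, track, start_bar, num_bars, embedding):
        """Store one phrase; the row id equals the vector's position in self.vectors."""
        embedding = np.asarray(embedding, dtype=np.float32).reshape(1, self.dim)
        phrase_id = self.vectors.shape[0]
        self.conn.execute(
            "INSERT INTO phrases VALUES (?, ?, ?, ?)",
            (phrase_id, track, start_bar, num_bars),
        )
        self.vectors = np.vstack([self.vectors, embedding])
        return phrase_id

    def search(self, query, k=3):
        """Return up to k (metadata_row, squared_l2_distance) pairs, nearest first."""
        query = np.asarray(query, dtype=np.float32).reshape(1, self.dim)
        distances = np.sum((self.vectors - query) ** 2, axis=1)
        results = []
        for idx in np.argsort(distances)[:k]:
            row = self.conn.execute(
                "SELECT track, start_bar, num_bars FROM phrases WHERE id = ?",
                (int(idx),),
            ).fetchone()
            results.append((row, float(distances[idx])))
        return results
```

In this sketch the vectors live only in memory, so a persistent version would need to save the array (or a FAISS index file) alongside the SQLite file.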

6. Build Script (`diffusion/scripts/build_phrase_database.py`)
- Main entry point to process your library
- Progress tracking
- Resume capability
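
One simple way to get resume capability is to record each finished track in a checkpoint file and skip those tracks on the next run. The sketch below assumes a JSON checkpoint and a caller-supplied `process_track` function; both are illustrative choices, and the real build script may track progress differently (for example in the SQLite database).

```python
import json
import os
import tempfile


def load_done(checkpoint_path):
    """Return the set of track paths already processed, or an empty set."""
    if not os.path.exists(checkpoint_path):
        return set()
    with open(checkpoint_path, "r", encoding="utf-8") as fh:
        return set(json.load(fh))


def save_done(checkpoint_path, done):
    """Write the checkpoint via a temp file so an interrupted write cannot corrupt it."""
    directory = os.path.dirname(os.path.abspath(checkpoint_path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(sorted(done), fh)
    os.replace(tmp_path, checkpoint_path)


def process_library(track_paths, checkpoint_path, process_track):
    """Process tracks not yet in the checkpoint; return the paths handled in this run."""
    done = load_done(checkpoint_path)
    processed_now = []
    for index, path in enumerate(track_paths, start=1):
        if path in done:
            continue
        process_track(path)
        done.add(path)
        save_done(checkpoint_path, done)
        processed_now.append(path)
        print(f"[{index}/{len(track_paths)}] processed {path}")
    return processed_now
```

Saving after every track costs a small write per file but means an interruption loses at most the track in progress.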

---

Your Music Library Requirements

### Minimum Viable
- 20-50 tracks (2-4 hours of music)
- All from similar genre/style
- Good audio quality (no lo-fi)

### Recommended
- 100-200 tracks (8-12 hours)
- Consistent production style
- Mix of different energy levels

### Optimal
- Your full library (500+ tracks, 30+ hours)
- Diverse but coherent
- High-quality source files

---

Estimated Timeline

### Phase 1: Phrase Database Builder
- Implementation: 3-5 days (code + testing)
- Processing your library (100 tracks): ~2-4 hours
- Total: ~1 week

### Phase 2: VQ-VAE Tokenizer
- Implementation: 3-5 days
- Training: 2-3 days on GPU
- Total: ~1.5 weeks

### Phase 3: Diffusion Model
- Implementation: 5-7 days
- Training: 5-7 days on GPU
- Total: ~2-3 weeks

### Phase 4 & 5: Integration & Real-time
- Implementation: 5-7 days
- Testing & tuning: 3-5 days
- Total: ~2 weeks

TOTAL ESTIMATE: 6-8 weeks for fully functional system
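
As a quick check, summing the per-phase totals listed above gives 6.5 to 7.5 weeks, which the 6-8 week figure rounds outward. The snippet just does that arithmetic; the dictionary is a transcription of the totals in this section.

```python
# (low, high) duration in weeks for each phase, copied from the totals above.
PHASE_WEEKS = {
    "Phase 1": (1.0, 1.0),
    "Phase 2": (1.5, 1.5),
    "Phase 3": (2.0, 3.0),
    "Phases 4 & 5": (2.0, 2.0),
}


def total_weeks(phase_weeks):
    """Sum the low and high ends of every phase estimate."""
    low = sum(lo for lo, _ in phase_weeks.values())
    high = sum(hi for _, hi in phase_weeks.values())
    return low, high


print(total_weeks(PHASE_WEEKS))  # (6.5, 7.5)
```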

---

Hardware Considerations

### For Training (Phases 2-3)
- Ideal: NVIDIA RTX 3090/4090 or A100
- Minimum: RTX 3060 (12GB VRAM)
- Mac Alternative: Rent cloud GPU (Lambda Labs, RunPod, Paperspace)
  - Cost: ~$0.50-1.50/hour
  - Budget: ~$100-200 for full training
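
To see how the hourly rate and the training-time estimates relate to the budget, the sketch below multiplies them out. It assumes the GPU is billed around the clock for the Phase 2 (2-3 days) and Phase 3 (5-7 days) training estimates in this guide; that billing assumption is mine, not something the guide states.

```python
def training_cost_range(day_ranges, rate_range, hours_per_day=24):
    """Return (low, high) cost in dollars for the given training-day and hourly-rate ranges."""
    low_days = sum(lo for lo, _ in day_ranges)
    high_days = sum(hi for _, hi in day_ranges)
    low_rate, high_rate = rate_range
    return low_days * hours_per_day * low_rate, high_days * hours_per_day * high_rate


# Phase 2: 2-3 days, Phase 3: 5-7 days; rate: $0.50-1.50/hour.
print(training_cost_range([(2, 3), (5, 7)], (0.50, 1.50)))  # (84.0, 360.0)
```

Under these assumptions the ~$100-200 budget corresponds to the cheaper GPUs or the shorter end of the training estimates; the most expensive combination would exceed it.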

### For Inference (Live Performance)
- Your Mac M1/M2 should be sufficient
- Target: inference latency under 300 ms
- No GPU rental needed for performance
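
Once an inference function exists, the 300 ms figure can be checked rather than assumed. Here is a minimal timing harness; `fn` stands for whatever callable runs one inference step (hypothetical at this stage), and judging against the 95th percentile instead of the mean is a design choice, since occasional slow calls matter in live performance.

```python
import statistics
import time


def measure_latency_ms(fn, runs=20, warmup=3):
    """Time repeated calls to fn and return median, p95, and max latency in milliseconds."""
    for _ in range(warmup):
        fn()  # Warm-up calls are not timed (caches, lazy initialization).
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    p95_index = min(len(samples) - 1, int(round(0.95 * (len(samples) - 1))))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }


def within_budget(stats, budget_ms=300.0):
    """True when the 95th-percentile latency fits inside the budget."""
    return stats["p95_ms"] <= budget_ms
```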

---

Cloud GPU Options (If Needed)

If training locally is too slow, here are good options:

### 1. Lambda Labs (Recommended)
- RTX 3090: $0.50/hour
- A100 (40GB): $1.10/hour
- Easy setup, PyTorch pre-installed
- https://lambdalabs.com/

### 2. RunPod
- RTX 4090: $0.69/hour
- A100 (80GB): $1.89/hour
- More GPU options
- https://runpod.io/

### 3. Paperspace Gradient
- Free tier available (limited hours)
- RTX 4000: $0.51/hour
- Jupyter notebook interface
- https://www.paperspace.com/

### 4. Google Colab Pro+
- $50/month for A100 access
- Good for experimentation
- https://colab.research.google.com/

---

Testing the Setup

Before building Phase 1, let's test your environment:

```bash
# Test 1: Audio processing
python -c "
import librosa
import soundfile as sf
import numpy as np

# Generate a 440 Hz test tone (endpoint=False keeps the sample spacing at exactly 1/sr)
sr = 44100
duration = 2.0
t = np.linspace(0, duration, int(sr * duration), endpoint=False)
audio = 0.5 * np.sin(2 * np.pi * 440 * t)

# Save and load
sf.write('test.wav', audio, sr)
loaded, sr_loaded = librosa.load('test.wav', sr=sr)
print(f'✅ Audio I/O works! Shape: {loaded.shape}, SR: {sr_loaded}Hz')
"

# Test 2: Beat tracking
python -c "
import madmom
print(f'✅ Madmom version: {madmom.__version__}')
print('Beat tracking library ready!')
"

# Test 3: Vector search
python -c "
import faiss
import numpy as np

# Create test index
dim = 256
vectors = np.random.randn(100, dim).astype('float32')
index = faiss.IndexFlatL2(dim)
index.add(vectors)
print(f'✅ FAISS works! Indexed {index.ntotal} vectors')
"

# Test 4: PyTorch + audio
python -c "
import torch
import torchaudio
print(f'✅ PyTorch: {torch.__version__}')
print(f'✅ torchaudio: {torchaudio.__version__}')
print(f'✅ MPS (Mac GPU) available: {torch.backends.mps.is_available()}')
"
```

---

What Comes After Setup

Once Phase 0 is complete and dependencies are installed:

1. You provide a sample of your music library
- Even 10-20 tracks is enough to start
- Should be representative of your style

2. I build Phase 1 (Phrase Database Builder)
- Complete implementation (~3-5 days of dev time)
- Test on your sample library
- Generate first phrase embeddings

3. We verify the phrase database works
- Can search for similar phrases
- Embeddings make musical sense
- Quality metrics look good

4. Move to Phase 2 (VQ-VAE)
- Train the audio tokenizer
- Verify reconstruction quality
- Move to Phase 3 (Diffusion)

---

Questions to Consider

Before we continue to Phase 1, think about:

1. Music Library
- Where is it stored?
- How many tracks?
- What genres/styles?

2. GPU Access
- Will you train locally or use cloud?
- If cloud, which service?
- Budget for GPU time?

3. Training Data
- Do you have any recordings of yourself dancing to your music?
- This is optional but super valuable for motion conditioning

4. Goal Timeline
- Is this a 2-month sprint or a longer project?
- Any upcoming performances to target?

---

Current Status Summary

Phase 0: Project Setup - COMPLETE
- Directory structure created
- Dependencies documented
- Configuration files ready

Phase 1: Phrase Database Builder - READY TO START
- Architecture designed
- Configs written
- Awaiting implementation

---

Next Action

Ready to start Phase 1!

I can begin implementing the Phrase Database Builder right now. This will take several function calls to create all the modules, but I'll build them systematically:

1. Audio loader
2. Beat tracker
3. Segmenter
4. Feature extractor
5. Database manager
6. Build script

Should I proceed? 🚀


Source Anchor

Comp-Core/core/ml/cc-ml/diffusion/SETUP_GUIDE.md
