research note | experiment writeup candidate | score 32
S2O vs ASR+Retrieval - Technical Deep Dive
Extracted abstract or opening context
**The problem**: No off-the-shelf pre-trained model embeds **spoken commands** directly into a **text embedding space**.
1. **Wav2Vec2**: Audio → Text (via CTC decoding)
   - ✅ Pre-trained on speech
   - ❌ Outputs text; its encoder hidden states are not aligned with a text embedding space
2. **Whisper**: Audio → Text (via decoder)
   - ✅ Pre-trained on speech
   - ❌ Outputs text; its encoder hidden states are not aligned with a text embedding space
3. **CLAP** (Contrastive Language-Audio Pretraining): Audio + Text in shared space
   - ✅ Pre-trained
   - ❌ Audio embeddings are for **sounds** (music, environmental sounds)
   - ❌ Not trained on **speech commands**
4. **AudioCLIP**: Audio + Text in shared space
   - ✅ Audio and text in same space!
   - ❌ Trained on **music/sounds**, not **speech**
   - ❌ Not fine-tuned for commands
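To make the missing piece concrete, here is a minimal sketch of what retrieval would look like if a speech encoder did place audio in the same space as text embeddings of commands. Everything in it is an illustrative assumption: the 3-dimensional vectors are toy values rather than model output, and `nearest_command`, `COMMAND_EMBEDDINGS`, and `aligned_audio` are hypothetical names that do not come from the note.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_command(audio_embedding, command_embeddings):
    """Return (command, score) for the text embedding closest to the audio embedding."""
    scored = [
        (command, cosine(audio_embedding, embedding))
        for command, embedding in command_embeddings.items()
    ]
    return max(scored, key=lambda pair: pair[1])


# Hypothetical text embeddings of three commands (toy values, not model output).
COMMAND_EMBEDDINGS = {
    "turn on the lights": [0.9, 0.1, 0.0],
    "play music": [0.0, 0.8, 0.6],
    "set a timer": [0.1, 0.0, 0.95],
}

# Hypothetical output of a speech encoder trained to land in the same space.
aligned_audio = [0.85, 0.2, 0.05]

best, score = nearest_command(aligned_audio, COMMAND_EMBEDDINGS)
print(best, round(score, 3))
```

The lookup itself is trivial. The hard part, and the gap the list above describes, is obtaining an encoder whose output for spoken commands is comparable to the text embeddings at all.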
Promotion decision
What has to happen next
Attach run IDs, datasets, metrics, and reproduction commands.