Mohamed Diomande

Extracted abstract or opening context

**The problem**: There's no pre-trained model that embeds **speech audio** into **text embedding space**!

1. **Wav2Vec2**: Audio → Text (via CTC decoding)
   - ✅ Pre-trained on speech
   - ❌ Outputs text, not embeddings
2. **Whisper**: Audio → Text (via decoder)
   - ✅ Pre-trained on speech
   - ❌ Outputs text, not embeddings
3. **CLAP** (Contrastive Language-Audio Pretraining): Audio → Audio Embedding
   - ✅ Pre-trained
   - ❌ Audio embeddings are for **sounds** (music, environmental sounds)
   - ❌ Not trained on **speech commands**
4. **AudioCLIP**: Audio + Text in shared space
   - ✅ Audio and text in same space!
   - ❌ Trained on **music/sounds**, not **speech**
   - ❌ Not fine-tuned for commands
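To make the target capability concrete, here is a minimal sketch of what "audio in text embedding space" would allow: once a spoken command is embedded in the same space as text, it can be matched to a command by cosine similarity. Everything here is an assumption for illustration: the three-dimensional vectors are toy values, `COMMAND_VECS` stands in for the output of some text embedding model, and `audio_vec` stands in for the output of the audio encoder that the comparison above says does not yet exist for speech commands.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest_command(audio_vec, command_vecs):
    """Return the command whose text embedding is closest to audio_vec."""
    return max(command_vecs, key=lambda name: cosine(audio_vec, command_vecs[name]))


# Hypothetical toy text embeddings (a real text encoder would produce
# vectors with hundreds of dimensions).
COMMAND_VECS = {
    "turn on the lights": [0.9, 0.1, 0.0],
    "play music": [0.0, 0.8, 0.6],
    "stop": [0.1, 0.0, 0.95],
}

# Hypothetical output of an audio encoder that maps speech into the
# same space as the text embeddings above.
audio_vec = [0.85, 0.2, 0.05]

print(nearest_command(audio_vec, COMMAND_VECS))
```

The matching step is trivial; the hard part, and the gap this note identifies, is obtaining an `audio_vec` for speech that actually lands near the text embedding of the spoken command.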

Promotion decision

What has to happen next

Attach run IDs, datasets, metrics, and reproduction commands.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.