HF Paper Batch -> LUME Evaluation - 2026-05-24
User supplied:
- `https://huggingface.co/papers/2605.22809`
- `https://huggingface.co/papers/2605.22717`
- `https://huggingface.co/papers/2605.17991`
- `https://huggingface.co/papers/2605.18714`
Local Staging
Cloned code/project repos under:
- `Desktop/MotionMix/research/external/audio-ai/live-music-diffusion-models`
  - Git: `ab74346` (`2026-05-22 fix readme`)
- `Desktop/MotionMix/research/external/audio-ai/stable-audio-3`
  - Git: `fa5ee84` (`2026-05-21 Merge pull request #36 from Stability-AI/apg-float64-mps`)
- `Desktop/MotionMix/research/external/audio-ai/sgt-project-page`
  - Git: `dfb0941` (`2026-05-19 add links`)
No heavy installs or model-weight downloads were run. Stable Audio 3 weights
are gated on Hugging Face and require accepting Stability/Gemma terms. LMDM
(Live Music Diffusion Models) needs a pretrained audio checkpoint before it can
run useful inference.
2605.22809 - Sensor2Sensor
Title: `Sensor2Sensor: Cross-Embodiment Sensor Conversion for Autonomous Driving`
Core idea:
- Convert unstructured monocular dashcam video into a structured multi-sensor
suite using 4D Gaussian Splatting plus diffusion.
- The important abstraction for LUME is not autonomous driving; it is
cross-embodiment sensor conversion.
LUME relevance:
- Treat K11 Bolt, MotionMix iPhones, Insta360, and future room cameras as
different embodiments of the same body-state event.
- Use a canonical latent/evidence schema first, then learn conversion/fusion
between sources later.
- This supports the current design: one accepted live-control source, many
record-only evidence sources.
Action:
- Do not chase a Sensor2Sensor implementation today.
- Add a future experiment: train source-to-source reconstruction from recorded
sessions, e.g. phone-side torso evidence -> expected Bolt hand/body geometry,
as an offline confidence/fusion model.
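The "canonical schema first, conversion later" idea can be sketched as a shared sample record plus a time-pairing step that would feed an offline reconstruction model. This is a minimal sketch, not the project's schema: `EvidenceSample`, `SourceRole`, `pair_by_time`, the source ids, and the feature names are all hypothetical, and it assumes every source is already stamped on a shared session clock.

```python
from dataclasses import dataclass
from enum import Enum


class SourceRole(Enum):
    LIVE_CONTROL = "live_control"   # the single accepted control source
    RECORD_ONLY = "record_only"     # evidence that is logged, never acted on


@dataclass
class EvidenceSample:
    source: str          # hypothetical id, e.g. "k11_bolt" or "motionmix_iphone"
    role: SourceRole
    t_seconds: float     # timestamp on an assumed shared session clock
    features: dict       # named body-state features, e.g. {"torso_yaw": 0.1}
    confidence: float = 1.0


def pair_by_time(inputs, targets, max_gap_s=0.05):
    """Pair each input sample with the nearest-in-time target sample.

    Pairs further apart than max_gap_s are dropped, so the result only
    contains samples that plausibly describe the same body-state event.
    """
    pairs = []
    for sample in inputs:
        nearest = min(
            targets,
            key=lambda t: abs(t.t_seconds - sample.t_seconds),
            default=None,
        )
        if nearest is not None and abs(nearest.t_seconds - sample.t_seconds) <= max_gap_s:
            pairs.append((sample, nearest))
    return pairs
```

The resulting `(input, target)` pairs are the training rows for a phone-to-Bolt reconstruction model; the model itself is left out because the notes do not specify one.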
2605.22717 - Live Music Diffusion Models
Title: `Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators`
Local code:
- `Desktop/MotionMix/research/external/audio-ai/live-music-diffusion-models`
Core idea:
- Streaming autoregressive music diffusion.
- Generates audio block by block over a sliding context window.
- `generate_diffusion_cond_blockar` supports block-wise AR generation and a KV
cache for faster streaming.
LUME relevance:
- This is the strongest match for Mo's real-time DJ direction.
- It is not a drop-in replacement for Rekordbox today; it is the lane for
gesture-conditioned accompaniment and live loop continuation.
Action:
- Keep Rekordbox/loopMIDI as the stable live surface.
- Build an offline-first spike:
  - use K11 motion DB windows as prompts/control curves;
  - generate short continuation/accompaniment blocks;
  - save WAV stems for Rekordbox or LUME stem playback.
- Avoid live inference in the bar loop until checkpoint access and latency are
proven.
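The offline-first spike can be prototyped before any checkpoint is available by separating the block loop from the model. The sketch below shows block-by-block generation over a sliding context window and a 16-bit mono WAV writer using only the standard library. It does not call `generate_diffusion_cond_blockar`, whose signature is not documented in these notes; `generate_block` is a hypothetical stand-in callable, and `generate_blocks` and `write_wav_mono16` are illustrative names.

```python
import struct
import wave


def generate_blocks(generate_block, n_blocks, block_len, context_len):
    """Generate audio block by block.

    generate_block(context, block_len) is a stand-in for the real model call.
    It only ever sees the trailing context_len samples, which is the
    sliding-window behaviour described above.
    """
    if context_len <= 0 or block_len <= 0:
        raise ValueError("context_len and block_len must be positive")
    audio = []
    for _ in range(n_blocks):
        context = audio[-context_len:]          # empty on the first block
        block = list(generate_block(context, block_len))
        if len(block) != block_len:
            raise ValueError("generator returned the wrong block length")
        audio.extend(block)
    return audio


def write_wav_mono16(target, samples, sample_rate=44100):
    """Write float samples in [-1.0, 1.0] as 16-bit mono PCM.

    target may be a file path or a writable, seekable file object.
    """
    frames = b"".join(
        struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in samples
    )
    with wave.open(target, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)
        wf.setframerate(sample_rate)
        wf.writeframes(frames)
```

Swapping the stand-in for a real model call later should not change the loop, which also makes it easy to time each block and check whether latency fits inside a bar.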
2605.17991 - Stable Audio 3
Title: `Stable Audio 3`
Local code:
- `Desktop/MotionMix/research/external/audio-ai/stable-audio-3`
Core idea:
- Open platform for generated audio/music.
- Small Music/SFX models are CPU-capable; Medium is CUDA/GPU-oriented.
- Supports text-to-audio, audio-to-audio editing, inpainting, continuation, and
LoRA fine-tuning.
Model access caveat:
- HF model pages require accepting model access conditions.
- Model license is `stable-audio-community`; repo code has MIT license, but
weights and model use are governed separately.
LUME relevance:
- Best immediate route for pre-generating gesture-reactive loops, fills, and
transition material from motion labels.
- Small Music is the likely local/Mac first target. Medium is for a GPU box.
Action:
- Do not attempt to download the gated weights from automation before access
has been accepted.
- Once access is accepted, create a small batch generator:
  - inputs: motion label segment, BPM, energy/arms/spread curves;
  - output: normalized WAV loops/fills;
  - destination: `C:\lume\stems\generated\...` or MotionMix artifact folder.
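The model-independent parts of that batch generator can be sketched now: bar-aligned loop length from BPM, a text prompt derived from the motion label and energy curve, and peak normalization of the rendered samples. This is a sketch under assumptions: the function names, the 4/4 default, the energy thresholds, and the prompt wording are all hypothetical, and the Stable Audio 3 call itself is deliberately omitted because its API is not described in these notes.

```python
def loop_seconds(bpm, bars, beats_per_bar=4):
    """Duration of a bar-aligned loop, so generated audio lines up on the grid."""
    if bpm <= 0:
        raise ValueError("bpm must be positive")
    return bars * beats_per_bar * 60.0 / bpm


def build_prompt(label, bpm, energy_curve):
    """Turn a motion label and an energy curve (values in 0..1) into a text prompt.

    The thresholds and wording are placeholders, not tuned values.
    """
    mean_energy = sum(energy_curve) / len(energy_curve)
    if mean_energy < 0.33:
        level = "low"
    elif mean_energy < 0.66:
        level = "medium"
    else:
        level = "high"
    return f"{label} loop, {bpm} BPM, {level} energy"


def peak_normalize(samples, target_peak=0.9):
    """Scale samples so the loudest one sits at target_peak; silence is returned unchanged."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]
```

For example, a 4-bar loop at 120 BPM in 4/4 is `loop_seconds(120, 4)`, which is 8.0 seconds, so that is the duration to request from the generator.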
2605.18714 - Semantic Generative Tuning
Title: `Semantic Generative Tuning`
Local project page:
- `Desktop/MotionMix/research/external/audio-ai/sgt-project-page`
Core idea:
- Uses segmentation as a generative proxy for unified multimodal models.
- High-level semantic structure beats low-level pixel reconstruction as an
alignment target.
LUME relevance:
- Directly supports using segmentation/body masks as coaching and training
targets.
- For K11, "what part of Mo is visible?" is more useful than only raw pixels or
pose points.
Action:
- Shipped the first perception lane:
  - person segmentation overlay in the Pose Coach;
  - pose-derived torso and hand ROI segmentation evidence;
  - segmentation quality features recorded to `body_motion.sqlite3`;
  - report tool at `C:\temp\lume_segmentation_report.py`.
- Next step: use segmentation confidence to decide whether hand/nod gestures are
trusted.
- This is a better next AI-coach primitive than calling a large model every
frame.
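One way to turn segmentation confidence into a trust decision is a two-threshold gate with hysteresis, so a gesture does not flicker between trusted and untrusted when confidence hovers near a single cutoff. The sketch below is an assumption about how that next step could look, not the shipped code: `GestureTrustGate` and both threshold values are hypothetical, and the confidence is assumed to be a per-frame ROI score in 0..1.

```python
class GestureTrustGate:
    """Trust gestures only while ROI segmentation confidence is good enough.

    Trust is granted when confidence reaches enter_at and revoked only when
    it falls below exit_at, which avoids flapping around one threshold.
    """

    def __init__(self, enter_at=0.7, exit_at=0.5):
        if exit_at > enter_at:
            raise ValueError("exit_at must not exceed enter_at")
        self.enter_at = enter_at
        self.exit_at = exit_at
        self.trusted = False

    def update(self, roi_confidence):
        """Feed one frame's confidence and return whether gestures are trusted."""
        if self.trusted:
            if roi_confidence < self.exit_at:
                self.trusted = False
        elif roi_confidence >= self.enter_at:
            self.trusted = True
        return self.trusted
```

A hand gesture would then be forwarded to the control surface only when `gate.update(confidence)` returns `True` for the hand ROI on that frame.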
Architecture Decision
Keep three lanes:
1. Live control lane:
   - K11 Bolt Pose Coach + MediaPipe body/hands.
   - Deterministic gestures into Rekordbox (`Z` play/pause).
   - No diffusion model inside the live control loop yet.
2. Evidence/training lane:
   - MotionMix iPhone SAN + camera pose data.
   - K11 body/hand pose DB.
   - Future segmentation confidence and source-to-source reconstruction.
3. Generative/offline lane:
   - Stable Audio 3 for generating loops/fills.
   - LMDM for streaming/accompaniment research once checkpoints are available.
   - Generated audio is rendered to WAV/stems first, then brought into Rekordbox
     or LUME stem playback.
This preserves the working hand gesture system while opening the path toward AI
audio and learned multi-sensor body understanding.
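The lane split can be kept honest with a small configuration check that fails when a generative component is registered in the live lane. This is only a sketch of the rule stated above: `Component`, `check_lanes`, and the lane names are hypothetical and not taken from the LUME codebase.

```python
from dataclasses import dataclass

LANES = {"live", "evidence", "offline"}


@dataclass(frozen=True)
class Component:
    name: str
    lane: str                 # one of LANES
    generative: bool = False  # True for diffusion/audio-generation models


def check_lanes(components):
    """Return a list of human-readable problems; an empty list means the layout is valid."""
    problems = []
    for c in components:
        if c.lane not in LANES:
            problems.append(f"{c.name}: unknown lane {c.lane!r}")
        elif c.lane == "live" and c.generative:
            problems.append(f"{c.name}: generative model in the live control lane")
    return problems
```

Running the check at startup or in a test would flag the change that moves a generative model into the live loop before checkpoint access and latency have been proven.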
Promotion Decision
Attach run IDs, datasets, metrics, and reproduction commands.
Source Anchor
MotionMix/research/hf-paper-batch-lume-eval-2026-05-24.md