working paper, preprint structure candidate, score 100

Recursive Polymodal Synthesis for Real-Time Embodied Interaction: A Contraction-Based Framework with Provable Convergence

Extracted abstract or opening context

We present a mathematically rigorous framework for multi-modal sensor fusion in real-time embodied interaction systems, coupled to a phrase-conditioned spectrogram diffusion backend for direct audio generation. Our approach, termed Recursive Polymodal Synthesis (RPS), addresses the fundamental challenge of fusing heterogeneous sensor modalities, which differ in noise characteristics, sampling rates, and semantic meaning, into a coherent internal representation suitable for generative control. The key innovation is a proximal fixed-point iteration scheme that enforces cross-modal coherence through spectral-norm-constrained relational operators, providing theoretical guarantees of convergence to a unique fixed point. We establish conditions under which the update operator is a contraction mapping on the latent representation space and prove that $\epsilon$-accuracy is reached in $\mathcal{O}(\log(1/\epsilon))$ iterations. The framework processes sensor inputs through modality-specific encoders $\{E_m\}_{m=1}^M$, learns cross-modal predictors $\{T_m\}_{m=1}^M$ with spectral norm $\|T_m\|_2 \leq \sigma_{\max} < 1$, and iteratively refines representations via the proximal operator $\mathcal{P}_\alpha(z^{(t)}) = (1-\alpha)E(x) + \alpha T(z^{(t)})$. For audio generation, a bar-rate conductor transformer provides phrase-level conditioning to a U-Net spectrogram diffusion model, enabling library-faithful, structurally coherent synthesis with controllable guidance. We report synthetic-fusion metrics (99.94% cross-modal coherence) alongside corpus-grounded audio metrics (Fréchet Audio Distance, beat alignment error, key stability, bandwidth/flatness) and runtime characteristics (15–40 ms control latency; 0.5–1.0 s bar-ahead prebuffer). The framework combines mathematical rigor with deployable performance, making it suitable for latency-critical applications including live performance, human-robot interaction, and adaptive interfaces.
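To make the convergence claim concrete, the following is a minimal sketch of the iteration $z^{(t+1)} = (1-\alpha)E(x) + \alpha T(z^{(t)})$, not the authors' implementation. It assumes, for illustration only, that $T$ is a fixed linear map given as a small matrix, that $\alpha \in (0, 1)$, and that the encoder output $E(x)$ is a precomputed vector `e`. Under those assumptions the update is a contraction with factor at most $\alpha\,\sigma_{\max}$, so the error shrinks geometrically and the iteration count grows like $\log(1/\epsilon)$. The names `rps_fixed_point` and `iteration_bound`, and all numeric values, are hypothetical.

```python
import math


def matvec(A, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def rps_fixed_point(e, T, alpha, eps=1e-9, max_iter=1000):
    """Iterate z <- (1 - alpha) * e + alpha * T z until the step is below eps.

    e     : encoder output E(x), assumed precomputed (hypothetical input)
    T     : linear cross-modal predictor, assumed to have spectral norm < 1
    alpha : mixing weight in (0, 1)
    Returns the final iterate and the number of iterations used.
    """
    z = list(e)  # start from the encoder output
    for t in range(1, max_iter + 1):
        Tz = matvec(T, z)
        z_next = [(1 - alpha) * ei + alpha * ti for ei, ti in zip(e, Tz)]
        step = norm([a - b for a, b in zip(z_next, z)])
        z = z_next
        if step < eps:
            return z, t
    return z, max_iter


def iteration_bound(eps, rate):
    """Iterations needed to shrink the initial error by a factor eps
    when each step contracts the error by at most `rate` < 1."""
    return math.ceil(math.log(1.0 / eps) / math.log(1.0 / rate))


# Hypothetical 2-D example. T is symmetric with eigenvalues 0.7 and 0.3,
# so its spectral norm is 0.7 and the contraction factor is at most 0.35.
T = [[0.5, 0.2], [0.2, 0.5]]
e = [1.0, -1.0]
alpha = 0.5

z_star, iters = rps_fixed_point(e, T, alpha)
bound = iteration_bound(1e-9, alpha * 0.7)
print(z_star, iters, bound)
```

Because the contraction factor is $\alpha\,\sigma_{\max}$ rather than $\sigma_{\max}$ alone, a smaller $\alpha$ converges faster in this sketch but keeps the result closer to the raw encoder output, which is the trade-off the proximal form exposes.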

Promotion decision

What has to happen next

Convert into the standard paper schema, add citations, and render a draft PDF.

Why this is not always a full paper yet

Corpus pages are public-safe readers for discovered workspace artifacts. They are not automatically final papers. A corpus item becomes a polished paper only after the editable source, evidence checkpoints, references, figures, render path, and release status are attached through the paper schema.