Grand Diomande Research · Full HTML Reader

RESEARCH PAPER TECHNICAL LATEX


Embodied Trajectory Systems working paper preprint render candidate score 100 .md


Abstract

We present a mathematically rigorous framework for multi-modal sensor fusion in real-time embodied interaction systems. Our approach, Recursive Polymodal Synthesis (RPS), fuses heterogeneous modalities with disparate sampling rates and noise statistics into a coherent latent representation suitable for generative control. The core mechanism is a proximal fixed-point iteration using spectral-norm-constrained relational operators, yielding contraction guarantees and a unique fixed point. We prove geometric convergence in $\mathcal{O}(\log(1/\epsilon))$ iterations to $\epsilon$-accuracy when $\alpha\lVert T\rVert_2<1$. The system uses modality encoders $\{E_m\}_{m=1}^M$, linear translators $\{T_m\}_{m=1}^M$ with $\lVert T_m\rVert_2\le \sigma_{\max}<1$, and an update $\Pprox_\alpha(\vz)=(1-\alpha)\Emap(\vx)+\alpha\,T(\vz)$. On synthetic data, the approach attains $99.94\%$ cross-modal coherence, validation loss $1.93\times 10^{-4}$, and $15$--$40$ ms CPU latency. Ablations confirm spectral constraints are necessary and reveal low effective rank structure in learned translators. RPS provides a principled, low-latency alternative to attention-based fusion with explicit robustness to missing modalities and tight stability bounds, enabling live performance, human--robot interaction, and adaptive interfaces.

Introduction

Problem formulation

Let $M$ heterogeneous sensors produce observations $\vx_m(t)\in\R^{d_m}$ at times $t\in\Zp$. Typical examples include high-rate IMUs ($100$--$200$ Hz, $d_{\text{mot}}=6$), low-rate physiology ($1$--$2$ Hz, $d_{\text{hr}}=2$), musical rhythm features (beat-synchronous, $d_{\text{aud}}=2$), and scene context ($d_{\text{ctx}}=1$). We seek a mapping \[ \Phi:\prod_{m=1}^{M}\R^{d_m}\longrightarrow \R^{D},\qquad \vz(t)=\Phi\!\big(\vx_1(t),\ldots,\vx_M(t)\big), \] that satisfies:

- P1: Cross-modal coherence. Latents respect empirical dependencies so that $p(\vz_m\mid \vz_{-m})$ aligns with learned relationships.

- P2: Robustness to missing data. For any subset $S\subseteq[M]$, the partial map $\Phi(\cdot\mid S)$ is well-defined.

- P3: Real-time compute. End-to-end latency $L \le 50$ ms for embodied agency.

Naive concatenation fails P1; many attention-based schemes lack convergence guarantees and become unstable when modalities are missing (P2); and some iterative refinements exceed the latency budget (P3). RPS addresses all three.

Contributions


- Contraction-theoretic fusion. A proximal fixed-point update with spectral constraints ensures existence, uniqueness, and geometric convergence (see the theorem on contraction and convergence).

- Architectural modularity. Encoders + linear relational translators + proximal blending enable learned cross-modal structure, graceful hallucination under dropout, and bounded iteration counts.

- Stable training pipeline. Alternating optimization with spectral projection; a multi-objective control loss balances accuracy, smoothness, range, velocity, and inter-channel diversity.

- Empirical validation and analysis. High coherence at low latency; ablations, sensitivity, spectral analysis, and information-theoretic diagnostics clarify what the model learns and why it works.

Preliminaries and notation

Bold lowercase letters denote vectors $\vx\in\R^{d}$, and bold uppercase letters denote matrices $\W\in\R^{m\times n}$. We write $\lVert \W\rVert_2$ for the spectral norm, $\lVert \W\rVert_F$ for the Frobenius norm, and $\Lip(f)$ for the Lipschitz constant of an operator $f$. Concatenation is written $[\vz_1;\dots;\vz_M]$, and $[M]\defeq\{1,\dots,M\}$. For any linear map $A$, $\spec(A)$ denotes its eigenvalues; for non-square $A$, $\spec(A^\top A)$ gives the squared singular values.

Recursive Polymodal Synthesis (RPS)

Modality encoders

Each modality $m$ has an encoder $\Enc{m}:\R^{d_m}\to\R^{D_m}$ defined by a two-layer residual MLP: \begin{equation} \Enc{m}(\vx_m)=\W_m^{(2)} \sigma(\W_m^{(1)}\vx_m+\vb_m^{(1)}) + \W_m^{(r)}\vx_m+\vb_m^{(2)}, \end{equation} with ReLU $\sigma(\cdot)$. Spectral normalization yields \begin{equation} \widetilde{\W}_m^{(i)} = \frac{\W_m^{(i)}}{\max\{1,\lVert \W_m^{(i)}\rVert_2\}},\qquad i\in\{1,2,r\}, \end{equation} so $\Lip(\Enc{m})\le \lVert \W_m^{(2)}\rVert_2\,\lVert \W_m^{(1)}\rVert_2 + \lVert \W_m^{(r)}\rVert_2 \le 2$ under unit spectral caps. The composite encoder $\Emap(\vx)\defeq[\Enc{1}(\vx_1);\ldots;\Enc{M}(\vx_M)]\in\R^{D}$ with $D\defeq\sum_m D_m$.
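The encoder and its spectral cap can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the weights are random placeholders, and the names `spectral_cap`, `make_encoder`, and `encode` are hypothetical. It checks empirically that the capped residual MLP never stretches a pair of inputs by more than the bound of 2.

```python
import numpy as np

def spectral_cap(W):
    """Divide W by max(1, ||W||_2) so its spectral norm is at most 1."""
    return W / max(1.0, np.linalg.norm(W, 2))

def make_encoder(d_in, d_out, hidden, rng):
    """Random placeholder weights for a two-layer residual MLP encoder."""
    return {
        "W1": spectral_cap(rng.normal(size=(hidden, d_in))),
        "b1": np.zeros(hidden),
        "W2": spectral_cap(rng.normal(size=(d_out, hidden))),
        "Wr": spectral_cap(rng.normal(size=(d_out, d_in))),
        "b2": np.zeros(d_out),
    }

def encode(p, x):
    """E_m(x) = W2 relu(W1 x + b1) + Wr x + b2."""
    h = np.maximum(p["W1"] @ x + p["b1"], 0.0)
    return p["W2"] @ h + p["Wr"] @ x + p["b2"]

rng = np.random.default_rng(0)
enc = make_encoder(d_in=6, d_out=64, hidden=128, rng=rng)  # motion-sized encoder
x, y = rng.normal(size=6), rng.normal(size=6)
# Empirical stretch factor for one input pair; the theory caps it at 2.
ratio = np.linalg.norm(encode(enc, x) - encode(enc, y)) / np.linalg.norm(x - y)
```

The bound follows because ReLU is 1-Lipschitz and each capped matrix has spectral norm at most 1, so the MLP branch and the residual branch each contribute at most 1.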

Implementation sizes.
We use $M=4$ modalities with $(D_1,D_2,D_3,D_4)=(64,16,16,8)$, so $D=104$; each encoder has hidden width $H_m=128$.

Linear relational translators

Each translator $\Trans{m}:\R^{D}\to\R^{D_m}$ is linear: \begin{equation} \Trans{m}(\vz) = \W_m \vz,\qquad \lVert \W_m\rVert_2 \le \sigma_{\max}<1, \end{equation} enforced via spectral normalization or projection. The composite $T(\vz)\defeq[\Trans{1}(\vz);\dots;\Trans{M}(\vz)]$ is a linear map $\R^{D}\to\R^{D}$.

lemma: Composite spectral bound. If $\lVert \W_m\rVert_2\le \sigma_{\max}$ for all $m$, then \[ \lVert T\rVert_2 \le \Big(\sum_{m=1}^{M}\lVert \W_m\rVert_2^2\Big)^{1/2} \le \sqrt{M}\,\sigma_{\max}. \]

proof.
For $\lVert \vz\rVert_2=1$, $\lVert T(\vz)\rVert_2^2=\sum_m \lVert \W_m\vz\rVert_2^2 \le \sum_m \lVert \W_m\rVert_2^2$.
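A quick numerical check of the lemma, as a sketch with random placeholder blocks (the block sizes come from the implementation note; everything else is illustrative): stack four translator blocks, each rescaled to spectral norm $\sigma_{\max}=0.9$, and compare $\lVert T\rVert_2$ with the bound.

```python
import numpy as np

rng = np.random.default_rng(0)
dims = (64, 16, 16, 8)          # block sizes D_m from the implementation note
D, sigma_max = sum(dims), 0.9

blocks = []
for D_m in dims:
    W = rng.normal(size=(D_m, D))
    # Rescale so that ||W_m||_2 equals sigma_max exactly.
    blocks.append(W * (sigma_max / np.linalg.norm(W, 2)))

T = np.vstack(blocks)           # composite translator, shape (D, D)
norm_T = np.linalg.norm(T, 2)
bound = np.sqrt(sum(np.linalg.norm(W, 2) ** 2 for W in blocks))  # sqrt(M) * sigma_max
```

Here `bound` equals $\sqrt{4}\cdot 0.9=1.8$, while `norm_T` lies between $0.9$ and $1.8$; the bound is tight only when all blocks share the same top right-singular vector.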

Proximal fixed point

Given $\vz^{(0)}=\Emap(\vx)$, define \begin{equation} \Pprox_\alpha(\vz;\vx) \defeq (1-\alpha)\Emap(\vx) + \alpha\,T(\vz),\qquad \alpha\in(0,1), \end{equation} and iterate $\vz^{(t+1)}=\Pprox_\alpha(\vz^{(t)};\vx)$. A fixed point $\vz^*$ satisfies \begin{equation}\label{eq:fp} \vz^*=(1-\alpha)\Emap(\vx)+\alpha\,T(\vz^*). \end{equation}

theorem: Contraction and convergence. If $\alpha\lVert T\rVert_2<1$, then $\Pprox_\alpha$ is a contraction with constant $\lambda\defeq \alpha\lVert T\rVert_2<1$. The iteration converges to the unique fixed point $\vz^*$ and \[ \norm{\vz^{(t)}-\vz^*}_2 \le \lambda^t\norm{\vz^{(0)}-\vz^*}_2. \] Thus $\epsilon$-accuracy holds for $t\ge \left\lceil\frac{\log(\epsilon/C)}{\log(\lambda)}\right\rceil$, $C\defeq\norm{\vz^{(0)}-\vz^*}_2$.

proof.
For any $\vz,\vz'$,
$\norm{\Pprox_\alpha(\vz;\vx)-\Pprox_\alpha(\vz';\vx)}_2
= \alpha\norm{T(\vz)-T(\vz')}_2\le \alpha\lVert T\rVert_2\norm{\vz-\vz'}_2$.
Banach's theorem on $(\R^{D},\norm{\cdot}_2)$ gives existence, uniqueness, and geometric convergence.
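The geometric rate can be observed directly. The following sketch uses a random placeholder translator scaled so that $\lambda=0.36$ and a random vector standing in for $\Emap(\vx)$; it records the error at each step next to the bound $\lambda^t\norm{\vz^{(0)}-\vz^*}_2$.

```python
import numpy as np

rng = np.random.default_rng(2)
D, alpha = 104, 0.2
T = rng.normal(size=(D, D))
T *= 1.8 / np.linalg.norm(T, 2)            # ||T||_2 = 1.8
lam = alpha * np.linalg.norm(T, 2)         # contraction constant, 0.36
e = rng.normal(size=D)                     # stand-in for the encoder output E(x)

# Reference fixed point from the linear system (I - alpha T) z = (1 - alpha) e.
z_star = np.linalg.solve(np.eye(D) - alpha * T, (1 - alpha) * e)

z, errors = e.copy(), []
for t in range(8):
    errors.append(np.linalg.norm(z - z_star))
    z = (1 - alpha) * e + alpha * (T @ z)  # one proximal step

bounds = [lam ** t * errors[0] for t in range(8)]
```

Each entry of `errors` stays at or below the matching entry of `bounds`; in practice the observed decay is usually faster than the worst case because a random error vector is not aligned with the top singular direction of $T$.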

corollary: Iteration budget.
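With $\alpha=0.2$, $\sigma_{\max}=0.9$, $M=4$, the composite spectral bound lemma gives $\lambda\le \alpha\sqrt{M}\,\sigma_{\max}=0.36$, so $t\ge 7$ iterations suffice for $\epsilon=10^{-3}$ relative error (that is, $\epsilon/C=10^{-3}$). Empirically, $t=3$--$5$ iterations are enough.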
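The arithmetic behind the iteration budget can be reproduced in a few lines, using only the values quoted in the corollary:

```python
import math

alpha, sigma_max, M = 0.2, 0.9, 4
lam = alpha * math.sqrt(M) * sigma_max            # worst-case contraction constant
eps = 1e-3                                        # target relative error
t_min = math.ceil(math.log(eps) / math.log(lam))  # smallest t with lam**t <= eps
```

With these values `lam` is $0.36$ and `t_min` is $7$, since $0.36^6\approx 2.2\times 10^{-3}$ is still above the target while $0.36^7\approx 7.8\times 10^{-4}$ is below it.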

remark: Closed form for linear $T$. Since $T$ is linear and $\alpha\lVert T\rVert_2<1$, $(\I - \alpha T)$ is invertible and \[ \vz^* = (\I-\alpha T)^{-1}(1-\alpha)\Emap(\vx) = (1-\alpha)\sum_{k=0}^\infty \alpha^k T^k\,\Emap(\vx), \] with the Neumann series converging absolutely.
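As a sanity check on the closed form, the sketch below compares a truncated Neumann series with a direct linear solve. The translator and the encoder output are random placeholders, and the truncation length of 60 terms is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)
D, alpha = 104, 0.2
T = rng.normal(size=(D, D))
T *= 1.8 / np.linalg.norm(T, 2)        # ||T||_2 = 1.8, so alpha * ||T||_2 = 0.36
e = rng.normal(size=D)                 # stand-in for the encoder output E(x)

# Direct solve of (I - alpha T) z = (1 - alpha) e.
z_solve = np.linalg.solve(np.eye(D) - alpha * T, (1 - alpha) * e)

# Truncated Neumann series (1 - alpha) * sum_k alpha^k T^k e.
z_series = np.zeros(D)
term = (1 - alpha) * e
for _ in range(60):
    z_series += term
    term = alpha * (T @ term)          # next term: multiply by alpha T

gap = np.linalg.norm(z_series - z_solve)
```

The direct solve costs $O(D^3)$ once per translator, whereas the iteration costs $O(D^2)$ per step, which is why a handful of proximal steps is attractive when $T$ changes during training.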

Missing modalities and hallucination

For absent $m$, set $\vx_m=\vzero$ (or mask). The $m$-block of the fixed-point equation \eqref{eq:fp} gives \[ \vz_m^*=(1-\alpha)\Enc{m}(\vzero) + \alpha\,\Trans{m}(\vz^*), \] so zero-centered encoders ($\Enc{m}(\vzero)=\vzero$) fully delegate to $T_m$.

proposition: Hallucination error. Let $\vz^*(S)$ be the fixed point with only modalities in $S\subseteq [M]$, and put $\lambda=\alpha\norm{T}_2$. Then for any $m\notin S$, \[ \norm{\vz_m^*(S)-\vz_m^*([M])}_2 \le \frac{\alpha}{1-\lambda}\,\norm{\Trans{m}(\vz^*(S))-\Trans{m}(\vz^*([M]))}_2 . \]

proof sketch.
Subtract the two fixed-point equations, unroll the contraction, and sum the geometric series bounded by $(1-\lambda)^{-1}$.

Latent normalization and mapping to controls

Maintain EMA statistics $(\bm{\mu}_m,\bm{\sigma}_m^2)$ and z-score each block: \[ \tilde{\vz}_m = \frac{\vz_m-\bm{\mu}_m}{\sqrt{\bm{\sigma}_m^2+\epsilon}}, \qquad \epsilon=10^{-5}. \] A two-layer GRU produces $K$-dimensional controls $\vu_t$ with standard gating; we use hidden width $H=256$.
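One way the running z-score could be implemented is sketched below. The update rule and the `momentum` value are assumptions (the text only states that EMA statistics are maintained), and `EmaNormalizer` is a hypothetical name.

```python
import numpy as np

class EmaNormalizer:
    """Running z-score for one latent block; `momentum` is an assumed hyperparameter."""

    def __init__(self, dim, momentum=0.99, eps=1e-5):
        self.mu = np.zeros(dim)
        self.var = np.ones(dim)
        self.momentum = momentum
        self.eps = eps

    def update(self, z):
        """Blend the new sample into the running mean and variance."""
        m = self.momentum
        delta = z - self.mu                      # deviation from the current mean
        self.mu = m * self.mu + (1 - m) * z
        self.var = m * self.var + (1 - m) * delta ** 2

    def normalize(self, z):
        """Return (z - mu) / sqrt(var + eps)."""
        return (z - self.mu) / np.sqrt(self.var + self.eps)

rng = np.random.default_rng(0)
norm = EmaNormalizer(dim=16)
for _ in range(5000):
    norm.update(3.0 + 2.0 * rng.normal(size=16))   # samples with mean 3, std 2
z_tilde = norm.normalize(np.full(16, 3.0))          # a sample at the true mean
```

After the warm-up loop the running mean sits near 3 and the running variance near 4, so a sample at the true mean maps close to zero.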

Training Objectives and Procedures

Alternating optimization

Approximate $\vz^*$ by $T_{\text{iter}}$ proximal steps. Optimize encoders to match translator predictions at the fixed point: \begin{equation} \mathcal{L}_{\text{enc}} = \E_{\vx}\Big[\sum_{m=1}^{M}\norm{\vz_m^* - \Trans{m}(\vz^*)}_2^2\Big]. \end{equation} Optimize translators to predict encoder outputs: \begin{equation} \mathcal{L}_{\text{trans}} = \E_{\vx}\Big[\sum_{m=1}^{M}\norm{\Enc{m}(\vx_m) - \Trans{m}(\Emap(\vx))}_2^2\Big] \quad \text{s.t. } \lVert \W_m\rVert_2\le \sigma_{\max}. \end{equation} Project $\W_m\leftarrow (\sigma_{\max}/\max\{\sigma_{\max},\lVert \W_m\rVert_2\})\W_m$ each update.
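The projection step is a one-line rescaling. A minimal sketch follows, using the exact spectral norm rather than a power-iteration estimate; `project_spectral` is a hypothetical name and the matrices are random placeholders.

```python
import numpy as np

def project_spectral(W, sigma_max=0.9):
    """Rescale W so that ||W||_2 <= sigma_max; leave it unchanged if already inside."""
    norm = np.linalg.norm(W, 2)
    return W * (sigma_max / max(sigma_max, norm))

rng = np.random.default_rng(0)
W_big = rng.normal(size=(16, 104))            # spectral norm well above 0.9
W_small = 0.01 * rng.normal(size=(16, 104))   # spectral norm well below 0.9

P_big = project_spectral(W_big)               # pulled back onto the boundary
P_small = project_spectral(W_small)           # untouched
```

Because the scale factor is $\sigma_{\max}/\max\{\sigma_{\max},\lVert \W_m\rVert_2\}$, matrices already inside the ball are multiplied by exactly 1, so the projection is idempotent.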

Coherence metric. \begin{equation} \rho = 1-\frac{1}{M}\sum_{m=1}^{M}\frac{\E[\norm{\vz_m^*- \Trans{m}(\vz^*)}_2^2]}{\E[\norm{\vz_m^*-\E[\vz_m^*]}_2^2]}\in[0,1]. \end{equation}
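The coherence metric can be estimated from samples as follows. This is a sketch on synthetic Gaussian latents; the function name `coherence` and the test data are illustrative, with expectations replaced by sample means.

```python
import numpy as np

def coherence(z_blocks, pred_blocks):
    """rho = 1 - mean over modalities of (residual energy / latent variance).

    z_blocks[m] and pred_blocks[m] have shape (N, D_m): N samples of
    z_m^* and of the translator prediction T_m(z^*).
    """
    ratios = []
    for z, pred in zip(z_blocks, pred_blocks):
        resid = np.mean(np.sum((z - pred) ** 2, axis=1))
        var = np.mean(np.sum((z - z.mean(axis=0)) ** 2, axis=1))
        ratios.append(resid / var)
    return 1.0 - float(np.mean(ratios))

rng = np.random.default_rng(0)
z_blocks = [rng.normal(size=(1000, d)) for d in (64, 16, 16, 8)]
perfect = coherence(z_blocks, z_blocks)   # predictions equal the latents
noisy = coherence(z_blocks, [z + 0.1 * rng.normal(size=z.shape) for z in z_blocks])
```

Perfect predictions give $\rho=1$, and prediction noise with one percent of the latent variance gives $\rho\approx 0.99$. Note that the estimate only stays within $[0,1]$ when the residual energy does not exceed the latent variance.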

Control mapper loss

For sequences $\{\tilde{\vz}_t\}_{t=1}^{T}$ and targets $\{\vu_t^*\}$, \begin{equation} \mathcal{L}_{\text{total}} =\underbrace{\frac{1}{T}\sum_{t=1}^{T}\norm{\vu_t-\vu_t^*}_2^2}_{\mathcal{L}_{\text{MSE}}} +\lambda_s\underbrace{\frac{1}{T-1}\sum_{t=1}^{T-1}\norm{\vu_{t+1}-\vu_t}_2^2}_{\mathcal{L}_{\text{smooth}}} +\lambda_r\mathcal{L}_{\text{range}} +\lambda_v\mathcal{L}_{\text{vel}} +\lambda_d\underbrace{\norm{\mat{C}-\I}_F^2}_{\mathcal{L}_{\text{diversity}}}, \end{equation} where $\mat{C}$ is the empirical control correlation, $\mathcal{L}_{\text{range}}$ is negative entropy of discretized marginals, and $\mathcal{L}_{\text{vel}}$ penalizes steps exceeding $v_{\max}$.
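The three fully specified terms of the loss (MSE, smoothness, diversity) can be sketched as below. The range and velocity terms are omitted because their exact form is not given here, and the weights `lam_s` and `lam_d` are hypothetical values rather than the paper's settings.

```python
import numpy as np

def control_loss_terms(u, u_target):
    """Return the MSE, smoothness, and diversity terms for controls of shape (T, K)."""
    mse = np.mean(np.sum((u - u_target) ** 2, axis=1))
    smooth = np.mean(np.sum(np.diff(u, axis=0) ** 2, axis=1))   # averaged over T-1 steps
    C = np.corrcoef(u, rowvar=False)                            # (K, K) empirical correlation
    diversity = np.sum((C - np.eye(u.shape[1])) ** 2)           # squared Frobenius norm
    return mse, smooth, diversity

rng = np.random.default_rng(0)
T_len, K = 50, 4
u_target = rng.normal(size=(T_len, K))
u = u_target + 0.1 * rng.normal(size=(T_len, K))
mse, smooth, diversity = control_loss_terms(u, u_target)

lam_s, lam_d = 0.1, 0.01          # hypothetical weights, not taken from the paper
partial_total = mse + lam_s * smooth + lam_d * diversity
```

The diversity term is zero only when the control channels are pairwise uncorrelated; two perfectly correlated channels contribute $2$ to it, one for each off-diagonal entry.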

Optimization hyperparameters

Encoders/Translators: Adam, $\eta=10^{-3}$, weight decay $10^{-4}$, clip norm $1.0$, warmup $5$ epochs, cosine decay, early stop patience $15$, $\alpha=0.2$, $\sigma_{\max}=0.9$, $T_{\text{iter}}=5$. Mapper: AdamW, $\eta=5\times 10^{-4}$, weight decay $10^{-5}$, sequence length $T=50$, batch $32$.

Algorithms

Algorithm: Proximal Fixed-Point Inference with Modality Mask

\Require Input $\vx=\{\vx_m\}_{m=1}^M$, mask $\mu\in\{0,1\}^M$, encoders $\{E_m\}$, translator $T$, steps $T_{\text{iter}}$, blend $\alpha$, tolerance $\epsilon$
\State $\vz^{(0)} \gets [E_1(\mu_1\vx_1);\dots;E_M(\mu_M\vx_M)]$
\For{$t=0$ to $T_{\text{iter}}-1$}
\State $\hat{\vz} \gets T(\vz^{(t)})$
\State $\vz^{(t+1)} \gets (1-\alpha)\vz^{(0)}+\alpha\,\hat{\vz}$
\If{$\norm{\vz^{(t+1)}-\vz^{(t)}}_2 \le \epsilon$} \textbf{break} \EndIf
\EndFor
\State \Return $\vz^{(t+1)}$
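A runnable NumPy rendering of the inference loop is sketched below. It is a toy setup, not the authors' code: the encoders are linear stand-ins (hence zero-centered), the translator is a random matrix scaled to $\lVert T\rVert_2=0.9$, and `prox_inference` is a hypothetical name.

```python
import numpy as np

def prox_inference(xs, mask, encoders, T, alpha=0.2, steps=5, eps=1e-6):
    """Proximal fixed-point inference; returns (z, iterations_used).

    xs: list of per-modality inputs; mask: 0/1 flag per modality;
    encoders: list of callables E_m; T: (D, D) composite translator matrix.
    """
    z0 = np.concatenate([E(m * x) for E, m, x in zip(encoders, mask, xs)])
    z = z0
    for t in range(steps):
        z_next = (1 - alpha) * z0 + alpha * (T @ z)
        converged = np.linalg.norm(z_next - z) <= eps
        z = z_next
        if converged:
            break
    return z, t + 1

# Toy setup: linear stand-in encoders and a translator with ||T||_2 = 0.9.
rng = np.random.default_rng(0)
d_in, d_out = (6, 2, 2, 1), (64, 16, 16, 8)
mats = [rng.normal(size=(o, i)) / np.sqrt(i) for i, o in zip(d_in, d_out)]
encoders = [lambda x, A=A: A @ x for A in mats]
T = rng.normal(size=(104, 104))
T *= 0.9 / np.linalg.norm(T, 2)

xs = [rng.normal(size=i) for i in d_in]
z_full, n_full = prox_inference(xs, [1, 1, 1, 1], encoders, T, steps=50)
z_drop, n_drop = prox_inference(xs, [1, 0, 1, 1], encoders, T, steps=50)  # modality 2 absent
```

With the second modality masked, its latent block `z_drop[64:80]` is filled in entirely by the translator, matching the delegation described for zero-centered encoders.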

Algorithm: Alternating Spectral-Constrained Training

\Require Dataset $\mathcal{D}$, steps $T_{\text{iter}}$, $\alpha$, $\sigma_{\max}$
\While{not converged}
\State // Encoder step
\State Freeze $T_m$; for minibatches $\vx\sim\mathcal{D}$:
\State $\vz^{(0)}\gets \Emap(\vx)$; iterate the proximal fixed-point inference algorithm to get $\vz^*$
\State Descend $\nabla_\theta\sum_m\norm{\vz_m^* - T_m(\vz^*)}_2^2$
\State // Translator step
\State Freeze $\Enc{m}$; for minibatches $\vx\sim\mathcal{D}$:
\State Compute $\Emap(\vx)$; descend $\nabla_{\W}\sum_m\norm{\Enc{m}(\vx_m)-T_m(\Emap(\vx))}_2^2$
\State Project $\W_m\gets (\sigma_{\max}/\max\{\sigma_{\max},\lVert \W_m\rVert_2\})\W_m$
\EndWhile

Theoretical Analysis

Stability, sensitivity, and Lipschitz bounds

assumption: Lipschitz encoders.
Each encoder satisfies $\Lip(\Enc{m})\le L_m$, so $\Lip(\Emap)\le (\sum_m L_m^2)^{1/2}\defeq L_E$.

theorem: Input sensitivity of fixed point. Under $\alpha\norm{T}_2<1$ and the Lipschitz assumption on the encoders, the mapping $\vx\mapsto \vz^*(\vx)$ is Lipschitz with \[ \Lip(\vz^*) \le \frac{(1-\alpha)\,L_E}{1-\alpha\norm{T}_2}. \]

proof.
Write the two fixed points for $\vx,\vx'$; subtract and bound using triangle inequality and contraction factor, then rearrange.
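Written out, the rearrangement in this proof reads as follows; it uses only the fixed-point equation, the linearity of $T$, and the Lipschitz constant $L_E$ of $\Emap$:

```latex
\begin{align*}
\vz^*(\vx)-\vz^*(\vx')
 &= (1-\alpha)\big(\Emap(\vx)-\Emap(\vx')\big)
    + \alpha\,T\big(\vz^*(\vx)-\vz^*(\vx')\big),\\
\norm{\vz^*(\vx)-\vz^*(\vx')}_2
 &\le (1-\alpha)\,L_E\,\norm{\vx-\vx'}_2
    + \alpha\norm{T}_2\,\norm{\vz^*(\vx)-\vz^*(\vx')}_2,\\
\big(1-\alpha\norm{T}_2\big)\,\norm{\vz^*(\vx)-\vz^*(\vx')}_2
 &\le (1-\alpha)\,L_E\,\norm{\vx-\vx'}_2 .
\end{align*}
```

Dividing by $1-\alpha\norm{T}_2>0$ gives the stated Lipschitz constant.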

proposition: BIBO stability under additive noise.
If inputs are perturbed by $\delta\vx$ with $\norm{\delta\vx}\le \eta$, then the fixed point perturbation satisfies $\norm{\delta\vz^*}\le \frac{(1-\alpha)L_E}{1-\alpha\norm{T}_2}\,\eta$.

Complexity and memory

Per step: encoders $O\big(\sum_m (H_m d_m + D_m H_m)\big)$; translator $O(\text{nnz}(T))$ (dense: $O(D^2)$). The encoders run once and the translator runs $T_{\text{iter}}$ times, so the total cost is $O(\text{Enc} + T_{\text{iter}}\cdot \text{Trans})$. The GRU dominates latency at larger $H$; see the end-to-end latency table.
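Plugging in the reported sizes gives a rough multiply-accumulate count. This is a back-of-the-envelope sketch: it assumes the input dimensions $(6,2,2,1)$ pair with the latent sizes $(64,16,16,8)$ in that order, assumes a dense translator, and counts the residual path $\W_m^{(r)}$ separately.

```python
d_in = (6, 2, 2, 1)        # d_m: motion, heart rate, audio, context (assumed order)
d_out = (64, 16, 16, 8)    # D_m
H, T_iter = 128, 5
D = sum(d_out)

enc_macs = sum(H * d + Dm * H for d, Dm in zip(d_in, d_out))   # two MLP layers per encoder
res_macs = sum(Dm * d for d, Dm in zip(d_in, d_out))           # residual path W^(r)
trans_macs = D * D                                             # one dense translator step
total_macs = enc_macs + res_macs + T_iter * trans_macs
```

Under these assumptions the encoders cost about 15k multiply-accumulates and five translator steps about 54k, so the fusion stage totals roughly 69k per frame.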

Relation to contractive residual maps

$\Pprox_\alpha(\vz)=\vz - \underbrace{\big(\vz-(1-\alpha)\Emap(\vx)-\alpha T(\vz)\big)}_{\text{residual } R(\vz)}$.
For $\alpha\norm{T}_2<1$, the update $\Pprox_\alpha$ is a contraction and the residual $R$ is strongly monotone with modulus $1-\alpha\norm{T}_2$, akin to contractive ResNets; the Neumann expansion in the closed-form remark parallels resolvent expansions in monotone operator theory.

Experimental Results

Setup

Synthetic sessions: $f_s=\SI{100}{Hz}$, $T_{\text{session}}=6000$ frames; $N_{\text{train}}=10^5$, $N_{\text{val}}=2\times 10^4$. Encoders: $H_m=128$, outputs $(64,16,16,8)$. Translators: $\sigma_{\max}=0.9$. GRU: $H=256$, 2 layers, dropout $0.1$.

Encoder--translator behavior

Training loss decreases from $8.47\times 10^{-2}$ to $4.67\times 10^{-2}$; validation loss reaches $1.93\times 10^{-4}$. Coherence $\rho=0.9994$. Translators saturate the bound: $\lVert \W_m\rVert_2=0.900$.

Caption: Fixed-point convergence probability with threshold $\epsilon$ on validation.

Iterations	$\Prob\big(\norm{\vz^{(t)}-\vz^*}_2<\epsilon\big)$
1	0.12
2	0.58
3	0.89
4	0.97
5	0.995

Control mapper

Validation MSE $\approx 6.0\times 10^{-2}$; temporal autocorrelation decays at $\approx\SI{150}{ms}$; coverage of control range $\ge 82\%$ per channel.

End-to-end latency

Caption: Latency on Intel i7-1165G7 (single thread).

Component	Time (ms)	Fraction
Encoders	$3.2\pm 0.8$	15\%
Proximal (5 iters)	$4.1\pm 1.2$	19\%
Normalization	$0.8\pm 0.2$	4\%
GRU Mapper	$13.5\pm 3.5$	62\%
Total	$21.6\pm 5.7$	\textbf{100\%}

Robustness and ablations

Performance degrades gracefully under modality dropout, additive noise, and temporal dropout. Removing the spectral constraints leads to exploding $\lVert \W_m\rVert_2$ and eventual divergence; increasing $\alpha$ reduces the iteration count but raises validation loss once $\alpha\ge 0.4$.

Caption: Effect of $\alpha$ on convergence and validation loss.

$\alpha$	Iterations (avg)	Coherence $\rho$	$\mathcal{L}_{\text{val}}$
0.05	$11.2\pm 2.3$	0.9989	$2.1\times 10^{-4}$
0.10	$6.8\pm 1.5$	0.9992	$1.9\times 10^{-4}$
0.20	$3.4\pm 0.8$	0.9994	$1.9\times 10^{-4}$
0.40	$2.1\pm 0.4$	0.9991	$2.3\times 10^{-4}$
0.60	$1.6\pm 0.3$	0.9985	$3.1\times 10^{-4}$

Representation structure

SVD of $\W_{\text{mot}}$ shows $\epsilon$-rank $\approx 24$ at $\epsilon=0.01$; PCA of $\{\vz^*\}$: top $40$ components explain $95\%$ variance; axes correlate with energy, heart-rate, and beat-phase proxies.
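An $\epsilon$-rank of this kind can be computed from the singular values. The sketch below assumes one common convention (count singular values above $\epsilon$ times the largest one), which may differ from the definition used for the reported figure, and it runs on a synthetic rank-24 matrix rather than the learned translator.

```python
import numpy as np

def eps_rank(W, eps=0.01):
    """Count singular values above eps * (largest singular value); one common convention."""
    s = np.linalg.svd(W, compute_uv=False)   # singular values in descending order
    return int(np.sum(s > eps * s[0]))

rng = np.random.default_rng(0)
# Synthetic rank-24 matrix plus tiny noise, shaped like the motion translator (64 x 104).
low_rank = rng.normal(size=(64, 24)) @ rng.normal(size=(24, 104))
W = low_rank + 1e-6 * rng.normal(size=(64, 104))
r = eps_rank(W)
```

The noise makes `W` numerically full rank, but the relative threshold recovers the planted rank of 24, which is the kind of structure the reported $\epsilon$-rank is meant to expose.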

Discussion

Comparison to baselines

Caption: Comparison summary. Higher coherence and lower latency are better.

Method	Coherence	Missing Mod.	Latency (ms)	Theory
Concat	0.72	Fails	8.2	None
Attention	0.85	Unstable	24.5	None
CCA	0.88	Fails	35.7	Exists
RPS	0.9994	Robust	21.6	Rigorous

Synthetic-to-real

Architectural priors and spectral constraints temper distribution shift; EMA normalization adapts scale drift. We anticipate $70$--$85\%$ zero-shot retention moving to $85$--$95\%$ with short adaptation by encoder fine-tuning while freezing $T$.

Limitations and next steps

Pending real-sensor evaluation; fixed modality and control cardinality; unidirectional control; lack of explicit multi-timescale hierarchy; CPU-only kernels. Future work: variable modality graphs, bidirectional couplings, multi-scale fixed points, GPU/quantized kernels, tighter sample complexity bounds.

Conclusion

RPS treats fusion as a contraction-mapped fixed-point problem with explicit spectral control, delivering unique coherent latents at low latency with robustness to missing data. The analysis explains stability and sensitivity; experiments support the design and reveal structured cross-modal dependencies. The framework generalizes beyond choreography to any real-time fusion task with strict latency budgets.

Appendix

Glossary of Symbols

Symbol	Meaning
$\vx_m\in\R^{d_m}$	Observation of modality $m$
$\Enc{m}:\R^{d_m}\to\R^{D_m}$	Encoder for modality $m$
$T_m:\R^{D}\to\R^{D_m}$	Translator predicting $\vz_m$
$\Emap(\vx)\in\R^{D}$	Concatenated encoder output
$T(\vz)\in\R^{D}$	Concatenated translator output
$\Pprox_\alpha$	Proximal update $(1-\alpha)\Emap(\vx)+\alpha T(\vz)$
$\alpha\in(0,1)$	Blend parameter
$\sigma_{\max}$	Spectral bound on $T_m$
$\lambda$	Contraction constant $\alpha\norm{T}_2$
$\rho$	Coherence metric in $[0,1]$

Expanded Proofs

lemma: Invertibility of $\I-\alpha T$.
If $\alpha\norm{T}_2<1$, then $\I-\alpha T$ is nonsingular and $\norm{(\I-\alpha T)^{-1}}_2 \le 1/(1-\alpha\norm{T}_2)$.

proof.
Neumann series: $(\I-\alpha T)^{-1}=\sum_{k=0}^\infty \alpha^k T^k$ with geometric bound on partial sums.

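proposition: Error after $t$ iterations. Let $\vz^{(t)}$ be the iterate and $\vz^*$ the fixed point. Then \[ \norm{\vz^{(t)}-\vz^*}_2 \le \lambda^t \norm{\vz^{(0)}-\vz^*}_2 \le \frac{\lambda^t}{1-\lambda}\,\alpha\,\norm{(\I-T)\Emap(\vx)}_2 . \]

proof.
The first inequality is the contraction theorem. For the second, the closed-form remark and $\vz^{(0)}=\Emap(\vx)$ give $\vz^{(0)}-\vz^*=(\I-\alpha T)^{-1}\big[(\I-\alpha T)-(1-\alpha)\I\big]\Emap(\vx)=\alpha(\I-\alpha T)^{-1}(\I-T)\Emap(\vx)$; bound the inverse by $1/(1-\lambda)$ using the invertibility lemma.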

Synthetic Data Generation Details

Motion energy combines low-frequency Perlin-like noise with sinusoidal drive; HR uses an exponential lag filter of energy with Gaussian innovations; rhythm uses BPM drift and phase accumulation. Bounds ensure physical plausibility: hip angles within $\pm 45^\circ$, HR within $[60,190]$ BPM after clipping.

Reproducibility Checklist

- Full hyperparameters for all components (see Optimization hyperparameters).

- Random seeds fixed per run; per-epoch EMA statistics checkpointed.

- Spectral normalization via $K=5$ power iterations; projection every step.

- Hardware: Intel i7-1165G7, single thread; Python 3.11, PyTorch 2.x.

Safety and Ethics

Though designed for consensual performance settings, any embodied sensing introduces privacy risks. We recommend on-device processing, opt-in consent logging, and auditable data retention policies. Spectral constraints improve stability, reducing erratic actuator outputs that could harm equipment or users.

References


- Banach, S. Théorie des opérations linéaires. (1932).

- Miyato, T., et al. Spectral Normalization for GANs. ICLR (2018).

- Baltrušaitis, T., et al. Multimodal ML: A Survey and Taxonomy. TPAMI (2019).

- Parikh, N., Boyd, S. Proximal Algorithms. FnT in Optimization (2014).

- Cho, K., et al. Learning Phrase Representations using RNN Encoder–Decoder. EMNLP (2014).


Source Anchor

projects/Documentation/05-research/RESEARCH_PAPER_TECHNICAL_LATEX.md
