# Retrieval-Centric Voice Control for DJ Performance with Rekordbox and Gemini Live
Abstract
This paper presents a retrieval-centric architecture for voice-controlled DJ performance that adapts the Speech-to-Order (S2O) streaming pipeline to the domain of professional
DJ software, specifically Rekordbox. Instead of parsing transcribed text into intents via a conventional automatic speech recognition (ASR) and natural language understanding
stack, the system learns a direct mapping between spoken commands and a catalog of DJ actions derived from Rekordbox’s performance preset mappings. The design combines a
streaming audio front end with 320-millisecond chunking, voice activity detection, denoising, and log-Mel features; a dual-encoder embedding space for audio and text; Gemini
Live as an optional streaming front end for transcripts and text embeddings; and a symbolic constraint layer that enforces deck-aware safety and performance rules before
triggering Rekordbox commands.
The proposed system operates in both online and offline regimes. In online mode, Gemini Live provides low-latency transcripts and embeddings that feed the retrieval pipeline.
In offline mode, a local audio encoder produces command embeddings directly from microphone audio. In both cases, a shared vector index over the Rekordbox command catalog
provides fast nearest-neighbor search, and a constraint solver mediates between retrieved candidates and the current DJ state to prevent destructive operations such as loading
tracks onto a playing deck. The paper describes the command catalog derived from Rekordbox mapping files, the streaming and embedding infrastructure, the safety and constraint
design, and a data and training strategy that evolves from text-only retrieval to a fully trained audio–text dual encoder. Evaluation considerations and practical deployment
notes round out the proposal.
1. Introduction
Voice-controlled systems increasingly move beyond passive assistants toward tightly coupled control of specialized creative environments. In DJ performance, latency, safety,
and expressiveness all matter simultaneously: a command must land on the beat, avoid interrupting a playing track, and yet be flexible enough to cover transport, looping, hot
cues, grid editing, sampler triggering, and more. Conventional ASR-driven pipelines treat speech as a sequence of words and optimize word error rate, a metric only loosely
correlated with whether the intended DJ action is taken. Misrecognizing even a single word can flip “load next on right” into a dangerous command.
The central idea in this work is to treat voice control for DJing as a retrieval problem rather than a transcription problem. We embed spoken commands directly into the same
low-dimensional space as a catalog of Rekordbox commands, then retrieve the nearest commands and filter them with a symbolic constraint solver. This approach is an adaptation
of the S2O speech-to-order architecture developed for noisy café ordering, where utterances map directly to a menu document space. Here, the “menu” consists of Rekordbox
actions compiled from its Performance 1 preset mappings, including shift-modified keys, deck-specific variants, and sampler controls.
Gemini Live serves as a complementary component rather than a central dependency. When network connectivity and latency budgets allow, Gemini’s streaming transcripts and text
embeddings can drive the retriever. When offline or in harsh acoustic conditions, a local embedding model trained on DJ command audio takes over. Both paths share the same
vector index, constraints, and execution layer, enabling a unified system that degrades gracefully to local operation.
The remainder of this paper describes the command catalog and its semantics, the streaming and embedding architecture, the constraint and safety layer, the data and training
plan, and the implementation phases that connect the existing S2O codebase, Gemini Live capabilities, and the Rekordbox environment into a coherent whole.
2. Background and Motivation
The S2O system was originally conceived for contactless ordering in cafés. Customers spoke over grinders and milk steamers, amid heavy background noise, varied accents, and
frequent disfluencies. Rather than chasing perfect transcripts, S2O treated each utterance as an acoustic query in a semantic space of menu documents. A dual encoder mapped
audio and text into a shared vector space; a vector index produced candidate menu items; and a constraint solver enforced menu rules and inventory, asking clarification
questions only when necessary. A streaming audio front end with 320-millisecond chunking and stability detection enabled early retrieval before an utterance finished, improving
responsiveness during busy periods.
DJ performance presents an analogous but distinct challenge. Commands are short, often consisting of two or three words, but they must land tightly on the musical grid. The
environment is loud and dynamic; the DJ cannot afford to stare at a screen or operate a mouse while mixing. At the same time, the consequence of an incorrect command can be
severe: loading over a playing deck, clearing a hot cue, or mismanaging loops can disrupt a set.
Rekordbox encodes a rich command space through keyboard mappings, MIDI, and HID. The performance preset mapping files, which enumerate command identifiers, human-readable
descriptions, and keyboard shortcuts, provide a structured description of this space. They include transport controls, loops, hot cues, grid editing, sampler controls, library
navigation, and recording. This mapping is an ideal candidate for conversion into a retrieval corpus.
Gemini Live, a multimodal, streaming-capable large language model service, introduces a complementary front end. It can provide partial transcripts and text embeddings in real
time. Used naively, it would be another ASR front end with the usual pitfalls. Used carefully, it becomes one of several embedding providers, feeding into a retrieval-centric
system that remains grounded in a curated, deck-aware command catalog and a symbolic safety layer.
The motivation of this work is to combine these elements: the S2O streaming and retrieval stack, Gemini Live’s streaming text embeddings, and the Rekordbox mapping corpus, into
a DJ-focused voice control system that is expressive, fast, robust, and safe.
3. Command Space and Catalog Construction
The first ingredient in a retrieval-centric system is the catalog: the set of documents against which queries are compared. In a café, each document describes a drink and its
modifiers. In the DJ setting, each document describes a specific Rekordbox command.
The Rekordbox Performance 1 preset provides the raw material. Each line in the mapping file specifies a command identifier, a human-readable description, and a keyboard
shortcut. For example, the preset includes entries for playing and cueing tracks on Deck 1 and Deck 2, setting and clearing hot cues, different loop sizes, beatgrid
adjustments, sampler slot playback, and recorder control. Some commands are duplicated for the two decks with distinct identifiers and key bindings; others are global, such as
search or layout toggling.
To become retrieval documents, these mappings must be enriched. Each command is assigned a stable catalog identifier, usually mirroring the Rekordbox commandId. A canonical
phrase is created for the command, such as “play pause left” for the Deck 1 play/pause toggle or “load track to deck two” for track loading on Deck 2. Synonyms and paraphrases
are added to capture the variety of natural speech: “start left,” “play left deck,” “resume deck one,” “prep right deck,” and so on. The catalog entry also records derived
metadata: deck (left, right, both, or context), high-level category (transport, loop, hotcue, grid, library, library_load, sampler, recorder, navigation, sync_tempo), specific
action type (play_pause, load, instant_double, loop_size, clear_hotcue), shortcut key, and safety attributes such as whether the action is destructive, whether the target deck
must be idle, and what cooldown to apply between successive activations.
This yields a structured command catalog that can be serialized as YAML or JSON and loaded by downstream components. Each catalog entry is conceptually similar to a menu item
document in S2O: it has a canonical description, alternative phrasings, and a rich set of attributes. The difference lies in the semantics of those attributes: decks instead of
drink sizes, destructive flags instead of allergens, loop sizes instead of shot counts. The retrieval corpus is then generated by expanding each command into multiple document
rows, one per canonical phrase or synonym, each carrying a copy of the underlying metadata.
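
The catalog entry and its expansion into retrieval rows can be sketched as follows. This is a minimal illustration, not the project's actual schema: the field names, the `deck1_play_pause` identifier, the `Z` shortcut, and the default cooldown are all assumptions chosen for the example. Only the phrases and the attribute list come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    """One Rekordbox command with its phrases and safety metadata."""
    command_id: str   # hypothetical id; a real catalog would mirror the commandId
    canonical: str
    synonyms: tuple
    deck: str         # "left", "right", "both", or "context"
    category: str     # e.g. "transport", "loop", "library_load"
    action: str       # e.g. "play_pause", "load", "clear_hotcue"
    shortcut: str     # placeholder key binding
    destructive: bool = False
    requires_idle_deck: bool = False
    cooldown_ms: int = 300  # assumed default, not taken from the source


def expand_to_documents(entry):
    """Emit one retrieval row per phrase, each with its own copy of the metadata."""
    meta = {
        "command_id": entry.command_id,
        "deck": entry.deck,
        "category": entry.category,
        "action": entry.action,
        "shortcut": entry.shortcut,
        "destructive": entry.destructive,
        "requires_idle_deck": entry.requires_idle_deck,
        "cooldown_ms": entry.cooldown_ms,
    }
    phrases = (entry.canonical, *entry.synonyms)
    return [{"text": phrase, **meta} for phrase in phrases]


PLAY_LEFT = CatalogEntry(
    command_id="deck1_play_pause",
    canonical="play pause left",
    synonyms=("start left", "play left deck", "resume deck one"),
    deck="left",
    category="transport",
    action="play_pause",
    shortcut="Z",
)
```

Calling `expand_to_documents(PLAY_LEFT)` yields four rows, one for the canonical phrase and one per synonym, all pointing back to the same command identifier.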
4. System Architecture Overview
The proposed system follows the same logical structure as the S2O architecture: a streaming front end, an embedding layer, a vector index, and a constraint and execution layer.
The key difference is the nature of the catalog and some of the constraint rules.
At the front end, a streaming audio pipeline ingests microphone audio at 16 kHz, segments it into overlapping 320-millisecond chunks, applies voice activity detection and
optional spectral denoising, and extracts log-Mel features using a 25-millisecond window and a 10-millisecond hop. The pipeline maintains an internal state of recent chunks,
and once enough speech has been observed, it produces provisional embeddings for the utterance.
The embedding layer is modular. In purely local mode, a dual-encoder audio tower maps pooled log-Mel features into normalized vectors in a shared command embedding space. In
Gemini-integrated mode, a text encoder produces embeddings from Gemini Live transcripts. In hybrid mode, the system chooses between these providers on a per-utterance basis
based on network availability and quality.
A vector index over the command catalog embeddings provides approximate or exact nearest-neighbor search. For each provisional or final query embedding, the index returns a
ranked list of candidate commands with similarity scores.
A constraint and safety layer then mediates between the retrieved candidates and the current DJ environment. It maintains a shadow state for each deck, including whether
a track is loaded, whether it is playing, whether a loop is active, and what hot cues are set, as well as global state such as whether recording is active. Symbolic rules
inspect this state, the retrieved command’s metadata, and the current operating mode (ghost, assist, auto) to decide whether to execute the command, require confirmation, ask a
clarifying question, or reject the proposal. Only after passing these checks does the system emit keystrokes or MIDI messages that Rekordbox interprets as control actions.
The composition of these layers yields a closed-loop control system: speech or text enters at the front end, semantic retrieval proposes an action, symbolic logic vets it
against safety and context constraints, and Rekordbox executes the resulting keystrokes or MIDI messages.
5. Audio and Text Front Ends
The audio front end builds directly on the S2O streaming implementation. Audio is sampled at 16 kHz from a near-field microphone, ideally a headset or gooseneck to
reduce bleed from the main speakers. The pipeline maintains a buffer of raw samples and processes it whenever the buffer contains at least one chunk’s worth of data.
For each chunk, voice activity detection decides whether speech is present. In a DJ booth, the noise profile differs from a café; music playback and crowd noise dominate. Voice
activity must therefore be tuned with higher aggressiveness to avoid spurious speech detection in non-vocal segments, possibly relying on spectral signatures of human speech
as opposed to broadband music. When speech is detected, optional spectral denoising suppresses stationary noise components such as air conditioning and some components of the
music, while preserving transient consonants and vowels.
Log-Mel features are extracted for each chunk with 80 Mel frequency bins, a 25-millisecond analysis window, and a 10-millisecond hop. These features are the
same as those used in the S2O audio tower, enabling reuse of the model architecture. A small audio encoder, which may be a convolutional front end followed by a transformer or
conformer stack, maps variable-length sequences of log-Mel frames into fixed-length embeddings. In the streaming configuration, the pipeline pools features from the most recent
frames and emits a new provisional embedding whenever sufficient context has accumulated.
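
The buffering and framing parameters above translate into concrete sample counts. The sketch below works through that arithmetic and shows one way to slice overlapping chunks from a sample buffer. The 160-millisecond chunk stride (50% overlap) is an assumption, since the text only says the chunks overlap, and `ChunkBuffer` is a hypothetical name rather than the S2O class.

```python
SAMPLE_RATE_HZ = 16_000
CHUNK_MS = 320
WINDOW_MS = 25
HOP_MS = 10
CHUNK_STRIDE_MS = 160  # assumption: 50% overlap between consecutive chunks


def ms_to_samples(ms, rate_hz=SAMPLE_RATE_HZ):
    """Convert a duration in milliseconds to a whole number of samples."""
    return rate_hz * ms // 1000


def frames_per_chunk():
    """Number of complete analysis windows in one chunk (no padding)."""
    chunk = ms_to_samples(CHUNK_MS)    # 5120 samples
    window = ms_to_samples(WINDOW_MS)  # 400 samples
    hop = ms_to_samples(HOP_MS)        # 160 samples
    return 1 + (chunk - window) // hop


class ChunkBuffer:
    """Accumulates raw samples and emits overlapping fixed-size chunks."""

    def __init__(self):
        self._samples = []
        self._chunk = ms_to_samples(CHUNK_MS)
        self._stride = ms_to_samples(CHUNK_STRIDE_MS)

    def push(self, samples):
        """Append samples and return every chunk that is now complete."""
        self._samples.extend(samples)
        chunks = []
        while len(self._samples) >= self._chunk:
            chunks.append(self._samples[: self._chunk])
            # Drop only one stride so the next chunk overlaps this one.
            del self._samples[: self._stride]
        return chunks
```

Under these assumptions each chunk holds 5120 samples and yields 30 log-Mel frames when only complete windows are counted. A front end that pads the final window would produce a slightly different count.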
Parallel to the audio path, Gemini Live offers a text front end. In this mode, the microphone audio is sent to Gemini Live, which returns partial transcripts in real time. Each
transcript snippet is normalized and fed into a text embedding model, either Gemini’s own text embeddings or a local equivalent aligned to the same command embedding space.
Stability detection operates on these text embeddings over time, similarly to how it operates on audio embeddings.
The dual front ends are not mutually exclusive. A system can run Gemini Live as the primary front end when network conditions permit, while maintaining the local audio pipeline
as a low-latency fallback. Both paths produce embeddings in the same space and take advantage of the same retrieval and safety layers.
6. Embedding Space and Retrieval
The core semantic machinery is a shared embedding space where both queries and commands reside. The catalog entries built from the Rekordbox mappings are encoded with a text
tower, producing normalized command embeddings. At inference time, a spoken utterance or a Gemini transcript is similarly mapped into this space, and semantic similarity is
approximated by inner product or cosine similarity.
In the initial phase, before training a bespoke audio encoder, the system can operate entirely in text space. Commands are embedded using a pre-trained sentence-level model,
such as an embedding model in the Gemini family or a compact transformer-based encoder. Spoken queries are transcribed by Gemini Live; the transcripts are embedded and compared
to the catalog embeddings. This configuration is sufficient to demonstrate end-to-end functionality and to validate the catalog design, synonyms, and constraints.
As data accumulates, a dual-encoder model is trained to align audio and text representations. The audio tower consumes variable-length sequences of log-Mel features and outputs
a fixed-dimensional vector; the text tower consumes canonical command phrases and synonyms. A contrastive loss such as InfoNCE encourages paired audio and text embeddings to be
close, while pushing apart embeddings of mismatched pairs. Training is performed on a dataset of recorded spoken commands with labels drawn from the catalog.
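
To make the contrastive objective concrete, the following is a minimal pure-Python sketch of an audio-to-text InfoNCE loss over one batch. It assumes unit-norm embeddings and in-batch negatives. The temperature of 0.07 is a commonly used default rather than a value from this project, and a real training loop would use an autodiff framework and typically average this loss with its text-to-audio counterpart.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def info_nce(audio, text, temperature=0.07):
    """Mean audio-to-text InfoNCE loss; row i of each list is a matched pair."""
    if len(audio) != len(text) or not audio:
        raise ValueError("audio and text must be non-empty and the same length")
    total = 0.0
    for i, a in enumerate(audio):
        # Similarity of this audio embedding to every text embedding in the batch.
        logits = [dot(a, t) / temperature for t in text]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
        # Negative log-probability of the matching text under the softmax.
        total += log_z - logits[i]
    return total / len(audio)
```

The loss approaches zero when each audio embedding is far closer to its own text than to the other texts in the batch, and grows when a mismatched text scores higher.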
At runtime, retrieval is handled by a vector index, which can be an exact cosine search over a few thousand documents or an approximate index such as FAISS’s HNSW when scaling
to larger catalogs or multi-system deployments. The index stores command embeddings and returns top-k candidates with similarity scores. These scores, combined with stability
metrics from the streaming pipeline, provide a confidence signal that, once calibrated, estimates how likely a given spoken command is to refer to a particular Rekordbox action.
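
For a catalog of a few thousand rows, exact search is simple enough to sketch directly. The example below is an illustration only: it assumes the index is a list of `(command_id, unit_vector)` rows, with several rows per command because each synonym is its own document, and it keeps the best-scoring row per command. The function names are hypothetical.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so inner product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def top_k(query, index, k=3):
    """Return up to k (command_id, score) pairs, best first.

    index: iterable of (command_id, unit_vector) document rows.
    """
    q = normalize(query)
    best = {}
    for command_id, vec in index:
        score = sum(a * b for a, b in zip(q, vec))
        # Several phrases map to one command; keep only its best match.
        if score > best.get(command_id, float("-inf")):
            best[command_id] = score
    ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]
```

Collapsing rows by command identifier means the candidate list handed to the constraint layer never contains the same action twice, even when several of its synonyms score highly.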
This retrieval-centric design has several advantages. It is robust to paraphrase: different phrasings of “load left deck” cluster around the same catalog entries. It naturally
supports expansion of the catalog: adding a new action amounts to adding a new document and its embeddings. It also facilitates calibrating confidence via temperature scaling
of similarity scores, which is crucial for downstream decisions about executing or asking for confirmation.
7. Constraint and Safety Layer
A defining feature of this architecture, inherited from S2O and from the DJ Agent’s safety design, is the explicit constraint and safety layer that stands between retrieval and
execution. In a creative performance context, safety is not only about preventing catastrophic failures but also about preserving musicality and predictability.
The constraint layer maintains a shadow state of the DJ environment. For each deck, it tracks whether a track is loaded, whether it is currently playing, whether a loop
is active, and which hot cues are defined. Globally, it tracks whether recording is active, when different categories of actions were last executed, and potentially other
contextual variables such as tempo and beat phase from external tempo analysis.
Constraints are expressed as simple rules. Some rules enforce prerequisites. For example, loop exit commands should have no effect unless a loop is active. Others enforce
deck-aware safety. Loading a track onto a deck that is actively playing is generally undesirable without explicit intent; therefore, load and instant-double commands for a playing
deck must trigger a confirmation dialogue or be redirected to another deck if policy allows.
Cooldown constraints mitigate repeated triggering due to echo, background speech, or misrecognition. Each category of command has an associated minimum inter-activation
interval. Transport actions such as play and cue may allow activation every few hundred milliseconds, whereas destructive actions such as clearing hot cues or starting and
stopping recording require longer intervals.
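
A per-category cooldown can be sketched as a small gate that remembers when each category last fired. The class name and the interval values in the usage note are illustrative assumptions. The source only states that transport actions tolerate intervals of a few hundred milliseconds and destructive actions need longer ones.

```python
class CooldownGate:
    """Rejects a command if its category fired too recently."""

    def __init__(self, min_interval_ms):
        # Mapping of category name to minimum milliseconds between activations.
        self._min_interval_ms = dict(min_interval_ms)
        self._last_fired_ms = {}

    def allow(self, category, now_ms):
        """Return True and record the activation if the cooldown has elapsed."""
        last = self._last_fired_ms.get(category)
        interval = self._min_interval_ms.get(category, 0)
        if last is not None and now_ms - last < interval:
            # A rejected attempt does not restart the cooldown window.
            return False
        self._last_fired_ms[category] = now_ms
        return True
```

For example, with `CooldownGate({"transport": 300, "recorder": 3000})` a second "play" arriving 200 ms after the first is suppressed, while one arriving at 300 ms passes.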
The layer also handles deck ambiguity and partial recognition. If a retrieval result suggests a deck-agnostic command such as “play” when both decks are candidates and context
is ambiguous, the constraint logic can ask a short clarification question rather than guessing. These clarifications mirror the targeted questions in S2O’s dialogue manager,
which only intervenes when necessary to disambiguate.
Modes of operation further shape the behavior of the constraint layer. In ghost mode, no commands are executed; the system only logs what it would have done, a crucial
stage for validation. In assist mode, low- and medium-risk commands execute automatically, while high-risk ones require confirmation. In auto mode, the system behaves more
autonomously but still enforces core safety rules such as protecting the currently playing deck for load operations.
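
The rules and modes described in this section can be combined into a single decision function. The sketch below is one possible arrangement under stated assumptions: "high-risk" is approximated by a `destructive` flag, auto mode lets destructive actions through, and a `deck` of `None` stands for an unresolved deck. All names are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    EXECUTE = "execute"
    CONFIRM = "confirm"
    CLARIFY = "clarify"
    REJECT = "reject"
    LOG_ONLY = "log_only"


@dataclass
class DeckState:
    """Shadow state for one deck."""
    loaded: bool = False
    playing: bool = False
    loop_active: bool = False


def decide(action, deck, decks, mode, destructive=False):
    """Vet one retrieved command against shadow state and operating mode."""
    if deck is None:
        # Deck could not be resolved from the utterance or context.
        return Decision.CLARIFY
    state = decks[deck]
    if action == "loop_exit" and not state.loop_active:
        # Prerequisite rule: nothing to exit.
        return Decision.REJECT
    if action in ("load", "instant_double") and state.playing:
        # Core safety rule, enforced in every mode.
        verdict = Decision.CONFIRM
    elif destructive and mode != "auto":
        # Assumption: destructive actions count as high-risk outside auto mode.
        verdict = Decision.CONFIRM
    else:
        verdict = Decision.EXECUTE
    if mode == "ghost":
        # Ghost mode never executes; a fuller version would also log `verdict`.
        return Decision.LOG_ONLY
    return verdict
```

Because the policy lives in plain conditionals, tightening it (for instance, requiring confirmation for destructive actions in auto mode as well) is a one-line change that needs no retraining.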
By keeping this logic symbolic and rule-based, the system remains interpretable and adjustable. A DJ or engineer can change policies and constraints without retraining any
model, just as a café operator can adjust menu rules in S2O.
8. Streaming, Stability, and Latency Considerations
Latency is critical for DJ performance. A command such as “loop out” or “play” must align with the beat grid to sound intentional. The streaming design from S2O, built around
overlapping 320-millisecond chunks and embedding stability detection, transfers well to this domain.
As the audio pipeline ingests speech, it maintains a window of recent embeddings. After each chunk, it computes the cosine similarity between the current embedding and the
embeddings for the last few chunks. If the similarity stays above a threshold, indicating that the semantic representation has stabilized, the embedding is marked as stable
and can trigger retrieval even before the utterance ends. For many short DJ commands, stability will typically occur after two or three chunks, yielding reaction times under
a second.
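
The stability check described above can be sketched as a small detector that compares each new embedding with the previous few. The threshold of 0.9 and window of two chunks are placeholder values, not tuned settings from S2O, and the class name is hypothetical.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den


class StabilityDetector:
    """Flags an embedding as stable once it agrees with the last `window` chunks."""

    def __init__(self, threshold=0.9, window=2):
        self.threshold = threshold
        self._recent = deque(maxlen=window)

    def update(self, embedding):
        """Record one provisional embedding and report whether it is stable."""
        full = len(self._recent) == self._recent.maxlen
        stable = full and all(
            cosine(embedding, prev) >= self.threshold for prev in self._recent
        )
        self._recent.append(embedding)
        return stable
```

With a window of two, the earliest possible stable signal arrives on the third chunk. Per-category tuning then amounts to constructing detectors with different `threshold` and `window` values.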
The stability threshold can be tuned per command category. For high-risk actions, the system may require more consecutive stable embeddings or a larger margin in similarity
scores before acting. For low-risk, timing-critical actions, it may accept stability after a shorter window. The presence of Gemini Live transcripts adds another dimension:
when text is available and reliable, stability can be inferred from the convergence of partial transcripts as well as from embedding similarity.
Backpressure and resource usage are also managed carefully. The pipeline uses bounded queues for inter-thread communication and maintains statistics on dropped chunks and
backpressure events. In high-load situations, the system can prioritize final embeddings and commands over provisional ones, ensuring that retrieval and execution remain
responsive.
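
One way to realize bounded queues that favor final embeddings is sketched below using the standard-library `queue` module. It assumes a single producer thread, and its eviction policy (drop the oldest queued item to admit a final one) is a simplification chosen for the example rather than the S2O behavior.

```python
import queue


class BoundedChunkQueue:
    """Bounded queue that drops provisional items under backpressure."""

    def __init__(self, maxsize=8):
        self._queue = queue.Queue(maxsize=maxsize)
        self.dropped = 0  # count of items discarded because the queue was full

    def offer(self, item, final=False):
        """Try to enqueue; return True if the item was accepted."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.dropped += 1
            if not final:
                # Provisional items are simply discarded when the queue is full.
                return False
            # Final items evict the oldest entry (single-producer assumption).
            try:
                self._queue.get_nowait()
            except queue.Empty:
                pass
            self._queue.put_nowait(item)
            return True

    def get(self):
        """Remove and return the oldest item without blocking."""
        return self._queue.get_nowait()
```

The `dropped` counter gives the pipeline a direct measure of backpressure events that can be logged alongside latency statistics.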
9. Data and Training Strategy
A research-grade system must be grounded in data, not just architecture. The data plan for this project mirrors S2O’s careful mix of synthetic and real recordings while
respecting the uniqueness of the DJ domain.
The initial phase operates purely in text space. The command catalog, with canonical phrases and synonyms, is embedded with a pre-trained text model. This enables immediate
experimentation and debugging of retrieval quality against typed queries, as well as scaffolding the constraint layer before any audio is recorded.
Next, a seed audio dataset is collected. A prompting script derived from the catalog lists each canonical phrase and several of its synonyms. The DJ or other speakers record
these phrases in both quiet and DJ-like environments, first in a controlled setting, then in a booth or room with music playback and crowd noise. Each recording is labeled with
the corresponding command identifier and metadata such as deck, category, and environment.
Augmentation amplifies this dataset without fabricating labels. Conventional audio augmentations such as tempo perturbation, pitch shifting, dynamic range compression, and
mixing with background music and crowd noise simulate a diverse range of performance conditions. Optional text-to-speech generation can contribute additional voices and
prosodic variations, especially for frequent commands.
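
Mixing a clean recording with background music or crowd noise is usually done at a chosen signal-to-noise ratio. The following is a minimal sketch of that step on plain lists of samples. It assumes the two signals are already the same length and sample rate, and the function names are hypothetical.

```python
import math


def power(signal):
    """Mean squared amplitude of a signal."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so the mixture has the requested SNR in dB."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    # Solve power(speech) / (gain**2 * power(noise)) == 10 ** (snr_db / 10).
    gain = math.sqrt(power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Sweeping `snr_db` from comfortable values down toward 0 dB produces progressively harder training examples while leaving the command label unchanged.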
With this data in hand, a dual-encoder training regime begins. Batches of audio and text pairs pass through the audio and text towers; a contrastive loss drives the alignment
of paired embeddings and separation of mismatched pairs. The model is evaluated on held-out commands and paraphrases, using retrieval metrics such as top-one and top-three
accuracy, and on targeted confusion sets where deck, load, play, and loop commands are easily conflated.
Hard negative mining further sharpens the model. Logged interactions from live or simulated use identify commands that are frequently confused. These pairs are emphasized as
negative examples in retraining, encouraging the model to carve clearer boundaries in embedding space.
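
Identifying frequently confused commands from logs is a counting exercise. The sketch below assumes each log record reduces to an `(intended_id, retrieved_id)` pair, which in practice would come from overrides or corrections. The record format and the `min_count` cutoff are assumptions.

```python
from collections import Counter


def mine_hard_negatives(log, min_count=2):
    """Return ((intended, retrieved), count) pairs for recurring confusions.

    log: iterable of (intended_id, retrieved_id) tuples.
    """
    confusions = Counter(
        (intended, retrieved) for intended, retrieved in log if intended != retrieved
    )
    # most_common() orders pairs from most to least frequent.
    return [(pair, n) for pair, n in confusions.most_common() if n >= min_count]
```

Each returned pair can then be added as an explicit negative for its intended command in the next training round.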
Finally, calibration techniques such as temperature scaling are applied to the similarity scores to yield reliable confidence estimates, which the constraint layer uses to
decide when to act autonomously and when to ask for clarification.
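
Temperature scaling can be illustrated with a softmax over the similarity scores of the retrieved candidates. This sketch assumes the temperature has already been fitted on held-out data (typically by minimizing negative log-likelihood). The value 0.05 is a placeholder, and the resulting probabilities are relative to the candidate list rather than to the whole catalog.

```python
import math


def calibrated_confidences(scores, temperature=0.05):
    """Convert similarity scores into probabilities via temperature-scaled softmax."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(v - peak) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

A lower temperature sharpens the distribution and a higher one flattens it, so the fitted value directly controls how readily the constraint layer sees a candidate as confident enough to act on.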
10. Implementation Phases and Integration
Implementing the full system in practice is best understood as a sequence of phases that incrementally add complexity while maintaining a working system at each step.
In the first phase, the command catalog and text index are built from the Rekordbox mappings. The result is a text-based retriever that can be queried programmatically or via a
simple user interface, allowing engineers and DJs to inspect which actions are returned for different phrasings.
The second phase integrates Gemini Live as a front end. Microphone audio is sent to Gemini Live, partial transcripts are embedded and fed into a stability detector, and the
resulting stabilized embeddings drive text-based retrieval. A simple keyboard or MIDI bridge translates selected catalog entries into Rekordbox control actions. Constraints and
safety policies are introduced in this phase, although they may initially be conservative.
The third phase adds the S2O-style streaming audio front end and a local dual-encoder audio tower. This enables fully offline operation, removes dependence on external ASR
quality, and potentially lowers end-to-end latency. The same catalog, index, constraints, and execution layer are reused.
In the fourth phase, hybrid logic selects between Gemini text and local audio embeddings, depending on network and acoustic conditions. This phase also includes refinement of
safety policies, fine-tuning of stability thresholds, and user experience improvements such as voice commands for switching modes or querying system state.
Throughout these phases, logs of interactions, retrieval candidates, decisions, and overrides inform both model training and rule adjustments. The architecture deliberately
separates learned components from symbolic ones so that changing risk tolerance, decks, or available commands does not require retraining.
11. Evaluation and Future Work
Evaluating such a system spans both offline metrics and live performance observations. Offline, the primary metrics are retrieval accuracy, stability detection latency, and
calibration quality. Retrieval accuracy is measured by how often the intended command is ranked first or among the top few candidates given a spoken query. Stability latency
is measured from utterance onset to the first stable embedding, and then to command execution. Calibration is evaluated via reliability diagrams comparing similarity-based
confidence estimates to actual correctness rates.
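
The offline metrics can be computed with a few lines of code. The sketch below shows top-k accuracy and the per-bin statistics behind a reliability diagram. The equal-width binning and the function names are assumptions made for illustration.

```python
def top_k_accuracy(ranked, truth, k):
    """Fraction of queries whose intended command is among the top k candidates."""
    hits = sum(1 for candidates, target in zip(ranked, truth) if target in candidates[:k])
    return hits / len(truth)


def reliability_bins(confidences, correct, n_bins=5):
    """Per-bin (mean confidence, accuracy, count); None for empty bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins over [0, 1]; a confidence of 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    summary = []
    for members in bins:
        if not members:
            summary.append(None)
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        summary.append((mean_conf, accuracy, len(members)))
    return summary
```

In a well-calibrated system the mean confidence and accuracy in each bin are close. Large gaps in the high-confidence bins are the ones that matter most for automatic execution.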
Online, qualitative performance matters as much as numbers. Does the system feel like an extension of the DJ’s intent, or does it constantly require confirmation and
correction? How often do safety rules prevent potentially damaging actions, and do those interventions feel justified? How does the system behave under real sound system
conditions with stage monitors and loud crowds?
Future work can extend the approach in several directions. Personalized command catalogs could learn user-specific synonyms and macro commands. Contextual signals such as beat
phase and phrase structure could inform when commands should be quantized to the next beat or bar. Multi-modal inputs, such as pointing gestures or controller states, could
be integrated into the constraint solver to narrow down target decks or tracks. At the model level, one could explore larger or more specialized audio encoders for improved
robustness in extreme noise.
12. Conclusion
This paper has outlined a retrieval-centric architecture for voice control of Rekordbox, rooted in the S2O streaming and dual-encoder design and enriched with Gemini Live as an
optional front end. By treating Rekordbox’s performance preset mapping as a semantic command catalog, mapping speech into a shared embedding space, and interposing a symbolic
constraint layer before execution, the system aligns its optimization target with what matters to the DJ: the right action, on the right deck, at the right time, with minimal
cognitive load.
The design is modular and pragmatic. It admits a purely text-based initial deployment, then grows into a fully offline audio-driven controller as data becomes available. It
respects safety and musicality through explicit constraints rather than opaque heuristics, making it amenable to inspection and adjustment by practitioners. Most importantly,
it leverages the lessons of retrieval-centric speech interfaces in a new domain, demonstrating that the same principles that enabled robust ordering in noisy cafés can also
support expressive, reliable control in the equally noisy but far more creative context of live DJ performance.
Source Anchor
Comp-Core/apps/web/cc-studio/configs/mapping/docs/research.md