LoRA in Stable Audio 3
LoRA fine-tuning lets you adapt a Stable Audio 3 model to a specific style, sound, or domain without retraining the whole model. The result is a small `.safetensors` file (~50–200 MB) that you load on top of any base checkpoint at inference time — stackable, adjustable in strength, and swappable without touching the base weights.
Full Public Reader
LoRA in Stable Audio 3
LoRA fine-tuning lets you adapt a Stable Audio 3 model to a specific style, sound, or domain without retraining the whole model. The result is a small `.safetensors` file (~50–200 MB) that you load on top of any base checkpoint at inference time — stackable, adjustable in strength, and swappable without touching the base weights.
What You Need
- A dataset of audio files with matching text descriptions (at minimum ~20–50 clips; more is better)
- A CUDA GPU with sufficient VRAM:
| Model | Standard | With `--base_precision bf16 --adapter_type lora-xs` |
|---|---|---|
| `medium` | ~6.5 GB | ~5.5 GB |
| `small` | ~2.5 GB | ~2 GB |
Quick Start
We don't claim these are optimal settings, LoRA behavior varies a lot with dataset size, style, and hardware. But these are the configurations we've found work well for most datasets and are good starting points before tuning.
Standard (recommended starting point)
uv run python scripts/train_lora.py \
--model medium-base \
--data_dir ./my_data \
--rank 16 \
--adapter_type dora-rows \
--steps 1000Good default for most datasets. `dora-rows` is the default adapter and tends to generalize well.
Reduced VRAM (`medium` on ~16 GB)
uv run python scripts/train_lora.py \
--model medium-base \
--data_dir ./my_data \
--rank 16 \
--adapter_type lora-xs \
--base_precision bf16 \
--steps 1000`lora-xs` has far fewer trainable parameters than standard LoRA, and `--base_precision bf16` halves the memory used by frozen weights. Quality difference is usually small.
---
Training Configuration
> Looking for more advanced training tooling? Check out [Underfit](https://github.com/dada-bots/underfit) from Dadabots. There you can find scripts to orchestrate LoRA training runs, monitor runs from a dashboard, and much much more.
CLI Arguments
| Argument | Default | Description |
|---|---|---|
| `--rank` | 16 | LoRA rank. Lower = fewer parameters, higher = more expressive. |
| `--lora_alpha` | same as `--rank` | Scaling factor. The effective scaling is `alpha / rank`. Setting `alpha = rank` gives a scaling of 1.0. |
| `--adapter_type` | `dora` | Adapter type. See table above for all options. |
| `--dropout` | 0.0 | Dropout probability applied to LoRA inputs during training. |
| `--include` | (all layers) | Only apply LoRA to modules whose name contains one of these substrings. |
| `--exclude` | (no exclusions) | Skip modules matching any of these substrings, even if they match `--include`. |
| `--svd_bases_path` | None | Path to a pre-computed SVD bases `.pt` file. Used by `-XS` adapter types to skip per-layer SVD at startup. |
| `--base_precision` | None | Cast frozen base weights to `bf16`, `bfloat16`, `fp16`, or `float16` after applying LoRA. LoRA params remain in fp32. Reduces VRAM usage. |
| `--lora_checkpoint` | None | Path to an existing `.safetensors` LoRA checkpoint to resume training from. |
How LoRA Training Works
When LoRA training is run:
1. Base weights are frozen. Both the diffusion model and conditioner are set to `eval()` mode with `requires_grad_(False)`. No base model weights are updated during training.
2. SVD bases are loaded (for -XS adapters). If `--svd_bases_path` is provided, pre-computed `U`/`V` bases are loaded from disk. Otherwise, SVD is computed per layer at initialisation time.
3. LoRA layers are added. `add_lora()` uses PyTorch's `nn.utils.parametrize` API to add LoRA parametrizations to `Linear` and `Conv1d` layers in both the model and conditioner. If `--include` or `--exclude` filters are specified, only matching layers receive LoRA.
4. Checkpoint is loaded (if resuming). The LoRA state dict is loaded with `strict=False`.
5. LoRA params are cast to fp32. All LoRA parameters are moved to float32 so the optimizer has full precision, even if the base model is in bfloat16.
6. Base weights are optionally downcast. If `--base_precision` is set, all frozen base weights (including the pretransform) are cast to bf16 or fp16 to reduce VRAM usage.
7. Only LoRA params are optimised. The optimizer receives only the LoRA parameters (collected via `get_lora_params()`), not the full model parameters.
8. Checkpoints save only LoRA. On save, `get_lora_state_dict()` extracts only the LoRA tensors and saves them as `.safetensors` with the `lora_config` embedded in the file metadata.
Layer Filtering
You can control which layers receive LoRA using `--include` and `--exclude`. Both accept lists of substring patterns with bracket expansion support (e.g., `layers[0-11]`).
When `--include` is specified, only layers whose fully-qualified name contains at least one of the include substrings are candidates for LoRA. When `--exclude` is specified, any layer matching an exclude pattern is skipped, even if it also matches an include pattern. When neither is specified, all layers of matching types (Linear, Conv1d) receive LoRA.
Examples
Exclude the `seconds_total` conditioner (prevents conditioner hijacking on small datasets):
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--rank 16 --adapter_type lora-xs --exclude seconds_totalOnly apply LoRA to transformer layers:
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--include transformer.layersOnly the first 12 transformer layers:
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--include "layers[0-11]"Everything except local embedding and seconds_total conditioner:
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--exclude to_local_embed seconds_totalLayer names are matched against the module's fully-qualified name relative to the submodel root. For the diffusion backbone these look like `transformer.layers.0.self_attn.to_qkv`, `to_timestep_embed.0`, etc. For the conditioner: `conditioners.seconds_total.embedder.embedding.1`, etc.
The `include`/`exclude` config is persisted in the checkpoint and automatically applied when loading the LoRA at inference time.
Memory Optimisation
`--base_precision bf16` casts all frozen base model weights to bfloat16 after LoRA is applied, while keeping LoRA parameters in fp32 for the optimizer. This halves the VRAM used by the frozen weights with negligible quality impact on most tasks:
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--adapter_type dora-rows --base_precision bf16Pre-computed SVD bases avoid the per-layer SVD computation at startup for `-XS` adapters. Compute once and reuse across training runs:
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--adapter_type lora-xs --svd_bases_path ./svd_bases.ptWithout `--svd_bases_path`, SVD is computed on-the-fly per layer (slower startup, same training quality).
Resuming Training
Use `--lora_checkpoint` to continue training from an existing checkpoint. The LoRA weights are loaded with `strict=False`, so the checkpoint does not need to cover exactly the same set of layers:
uv run python scripts/train_lora.py --model medium-base --data_dir ./my_data \
--lora_checkpoint ./lora_out/lora_step500.safetensors \
--steps 1000 --output_dir ./lora_out_continued---
Inference with LoRA
Loading LoRA Checkpoints
Use the `--lora-ckpt-path` argument when running the Gradio interface. You can load one or multiple LoRAs:
# Single LoRA
uv run python run_gradio.py --model medium-base --lora-ckpt-path lora.safetensors
# Multiple LoRAs
uv run python run_gradio.py --model medium-base \
--lora-ckpt-path style_a.safetensors style_b.safetensorsThe loading process (`load_and_apply_loras`):
1. The base model is created and loaded normally
2. The model is moved to GPU (important for LoRA-XS GPU-accelerated SVD)
3. SVD bases are loaded once if any LoRA is a `-XS` type
4. LoRA parametrizations are added based on the config embedded in each checkpoint
5. LoRA weights are loaded with `strict=False`
When multiple LoRAs are loaded, they are stacked using PyTorch's native `nn.utils.parametrize` API. Each call to `register_parametrization` appends to the `ParametrizationList` on each weight, and the forward chains them additively. Each LoRA is assigned a unique `lora_index` (0, 1, 2, ...) that enables independent control.
Multiple LoRAs can use different adapter types (e.g., one standard LoRA and one DoRA), different ranks, and different layer filters. They are all applied simultaneously during inference.
Gradio UI Controls
When LoRA checkpoints are loaded, the Gradio interface shows per-LoRA controls. Each LoRA gets its own collapsible accordion with independent settings:
### LoRA Diffusion Transformer Strength
Controls the strength of the LoRA effect on the diffusion model backbone. Default is 1.0 (full effect). Setting to 0 disables the LoRA entirely; values above 1.0 amplify the effect. Range: 0.0 - 10.0.
### LoRA Conditioner Strength
Controls the LoRA effect on the text conditioner independently from the Diffusion Transformer. Default is 1.0. This allows you to, for example, keep the LoRA effect on the conditioner while reducing it on the backbone.
### LoRA Interval
Controls when LoRA is active during the sampling process based on the noise level (sigma). The interval is specified as `[min, max]` where both are in the range 0.0 to 1.0.
- `[0.0, 1.0]` (default): LoRA is active at all noise levels
- `[0.0, 0.95]`: LoRA is active only during the low-noise (late) part of sampling
- `[0.95, 1.0]`: LoRA is active only during the high-noise (early) part of sampling
This can be useful for controlling which aspects of generation the LoRA influences - early steps affect global structure while later steps affect fine details.
### LoRA Layer Filter
A text field that selectively disables LoRA on specific layers by name. Comma-separated substrings are matched against layer names (logical OR). Bracket notation supports ranges.
Examples:
- `.to_global_embed` - disables LoRA on the global embedding projection
- `.transformer.layers[0-5]` - disables LoRA on transformer layers 0 through 5
- `.transformer.layers[0-10], .to_global_embed` - disables both
Layers matching any filter substring are disabled; all other LoRA layers remain active.
Multi-LoRA Example
With two LoRAs loaded, you could configure them independently:
- LoRA 1 (style): Diffusion Transformer strength 1.0, interval `[0.0, 1.0]` (active everywhere)
- LoRA 2 (detail): Diffusion Transformer strength 0.5, interval `[0.0, 0.5]` (active only in later denoising steps)
Each LoRA's interval and layer filter are evaluated independently at each sampling step. A LoRA is enabled for a step only if the current sigma falls within its interval.
---
Adapter Types
LoRA (Low-Rank Adaptation) enables parameter-efficient fine-tuning of large diffusion models. Instead of updating all model weights during training, LoRA freezes the pre-trained weights and injects small, trainable low-rank matrices into each layer. This dramatically reduces the number of trainable parameters and the size of saved checkpoints while still allowing the model to learn new behaviors.
Stable Audio 3 supports a family of adapter types that trade off expressiveness against parameter count:
| Adapter | Trainable Params per Layer | Use Case |
|---|---|---|
| lora | `rank * (fan_in + fan_out)` | General-purpose fine-tuning. Good balance of expressiveness and efficiency. |
| dora-rows (default) | `rank * (fan_in + fan_out) + fan_out` | DoRA with per-row (per-output-neuron) magnitude. Paper-correct variant. |
| dora-cols | `rank * (fan_in + fan_out) + fan_in` | DoRA with per-column (per-input-feature) magnitude. |
| bora | `rank * (fan_in + fan_out) + fan_in + fan_out` | Bi-dimensional DoRA — independent row and column magnitudes. |
| lora-xs | `rank²` | Maximum parameter efficiency. Only a tiny core matrix is trainable; bases are frozen SVD factors. |
| dora-rows-xs | `rank² + fan_out` | DoRA-rows + LoRA-XS. Frozen SVD bases plus per-row magnitude. |
| dora-cols-xs | `rank² + fan_in` | DoRA-cols + LoRA-XS. Frozen SVD bases plus per-column magnitude. |
| bora-xs | `rank² + fan_in + fan_out` | BoRA + LoRA-XS. Frozen SVD bases plus both row and column magnitudes. |
LoRA is applied to both the diffusion backbone and the conditioner (only trainable parameters like the `seconds_total` conditioner).
LoRA (Standard)
Standard LoRA decomposes each weight update into two low-rank matrices:
W' = W + (alpha/rank) * B @ AWhere `W` is the frozen pre-trained weight, `A` is a `(rank, fan_in)` matrix, and `B` is a `(fan_out, rank)` matrix. Only `A` and `B` are trained. The `alpha/rank` ratio controls the scaling of the LoRA update relative to the original weight.
`A` is initialized with Kaiming uniform initialization and `B` is initialized to zeros, so the LoRA update starts at zero and the model begins training from its pre-trained behavior.
DoRA (default: dora-rows)
DoRA (Weight-Decomposed Low-Rank Adaptation) extends standard LoRA by separating weight updates into direction and magnitude components:
V = W + (alpha/rank) * B @ A (low-rank update, same as LoRA)
V_hat = V / ||V||_row (row-normalized direction)
W' = V_hat * magnitude (scale by per-row magnitude)The `magnitude` vector is initialized from the row norms of the original pre-trained weight: `||W||_row`. This means the model starts with the same effective weights as the pre-trained model, but can independently adjust the direction and magnitude of each row during training.
There are two variants:
- `dora-rows` (default): magnitude per output neuron (`fan_out` values). This is the paper-correct variant.
- `dora-cols`: magnitude per input feature (`fan_in` values).
BoRA
BoRA (Bi-dimensional DoRA) applies independent magnitude scaling to both rows and columns:
V = W + (alpha/rank) * B @ A
V_r = V / ||V||_row (row-normalize)
H_r = magnitude_r * V_r (scale rows)
H_c = H_r / ||H_r||_col (column-normalize)
W' = H_c * magnitude_c (scale columns)This adds both a `magnitude_r` vector of size `fan_out` and a `magnitude_c` vector of size `fan_in`. More expressive than either DoRA variant, at the cost of slightly more parameters.
LoRA-XS
LoRA-XS takes parameter efficiency to the extreme by freezing the low-rank bases and only training a tiny core matrix:
U, S, V^T = SVD(W) (SVD of pre-trained weight)
W' = W + (alpha/rank) * U[:, :r] @ M_xs @ V[:, :r]^T`U` and `V` are frozen buffers derived from the SVD of the original weight. Only `M_xs`, a `(rank, rank)` matrix, is trainable. For a typical rank of 8, this means only 64 trainable parameters per layer, compared to thousands for standard LoRA.
Because LoRA-XS computes SVD during initialization, placing the model on GPU before adding LoRA enables GPU-accelerated SVD, which is significantly faster for large models. The inference code does this automatically.
For training, computing SVD for every layer at startup can be slow. Use `--svd_bases_path` to pass a pre-computed bases file instead (see [Training Configuration](#training-configuration)).
-XS Hybrid Variants
The `-xs` suffix can be combined with any DoRA or BoRA variant:
- `dora-rows-xs`: Frozen SVD bases + `M_xs` core + per-row magnitude
- `dora-cols-xs`: Frozen SVD bases + `M_xs` core + per-column magnitude
- `bora-xs`: Frozen SVD bases + `M_xs` core + both row and column magnitudes
These combine the parameter efficiency of LoRA-XS with the magnitude decomposition of DoRA/BoRA.
---
Advanced Features
Strength Control
The `set_lora_strength()` function adjusts the LoRA contribution at runtime without modifying the LoRA weights:
from stable_audio_3.models.lora import set_lora_strength
set_lora_strength(model, 0.5) # Half-strength on all LoRAs
set_lora_strength(model, 0.0) # Effectively disable all LoRAs
set_lora_strength(model, 2.0) # Double-strength on all LoRAs
# With multiple LoRAs, target a specific one by index:
set_lora_strength(model, 1.0, lora_index=0) # Full strength on first LoRA
set_lora_strength(model, 0.5, lora_index=1) # Half strength on second LoRAThis works by scaling the `lora_strength` buffer in each LoRA layer, which multiplies the low-rank delta before adding it to the base weight. When `lora_index` is `None` (the default), all LoRAs are affected.
Multiple LoRA Merging
You can merge multiple LoRA checkpoints with different weights into a single base model:
from stable_audio_3.models.lora.utils import merge_loras_into_base_model
lora_configurations = [
{
'name': 'style_a',
'state_dict': lora_sd_a,
'application_weight': 0.7
},
{
'name': 'style_b',
'state_dict': lora_sd_b,
'application_weight': 0.3
}
]
merge_loras_into_base_model(model, lora_configurations)This computes the weighted sum of LoRA deltas and applies them directly to the base weights. After merging, the LoRA parametrizations are disabled (the effect is baked into the base weights). This works with all adapter types (LoRA, DoRA, BoRA, LoRA-XS, and all hybrid variants).
Weight Tying
For models where the input embedding and output projection share weights, LoRA supports weight tying:
from stable_audio_3.models.lora.utils import tie_weights, untie_weights
tie_weights(linear_layer, embedding_layer) # Share LoRA params
untie_weights(linear_layer, embedding_layer) # Create independent copiesThis is only supported for standard LoRA (not DoRA, BoRA, or -XS variants).
Promotion Decision
Keep in the searchable backlog until it intersects a live paper or system.
Source Anchor
MotionMix/research/external/audio-ai/stable-audio-3/docs/workflows/lora.md
Detected Structure
Figures · Code Anchors · Architecture