░▒▓███████▓▒░▒▓█▓▒░▒▓███████▓▒░░▒▓████████▓▒░░▒▓███████▓▒░▒▓████████▓▒░▒▓████████▓▒░▒▓███████▓▒░
░▒▓█▓▒░ ░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░░▒▓█▓▒░
░▒▓█▓▒░ ░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░░▒▓█▓▒░
░▒▓██████▓▒░░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓██████▓▒░ ░▒▓██████▓▒░ ░▒▓█▓▒░ ░▒▓██████▓▒░ ░▒▓███████▓▒░
░▒▓█▓▒░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░
░▒▓█▓▒░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░ ░▒▓█▓▒░
░▒▓███████▓▒░░▒▓█▓▒░▒▓███████▓▒░░▒▓████████▓▒░▒▓███████▓▒░ ░▒▓█▓▒░ ░▒▓████████▓▒░▒▓█▓▒░
by dernet -- BETA
Standalone training toolkit for ACE-Step 1.5 audio generation models. Takes you from raw audio files to a working adapter without the friction. Variant-aware multi-adapter fine-tuning (LoRA, DoRA, LoKR, LoHA, OFT) with auto-detection, low-VRAM support, and three ways to work.
Status: v1.1.2-beta -- stable enough for daily use, though some features are still experimental. This project is maintained by one person; if you hit an issue, please report it in the issues tab.
Side-Step auto-detects your model variant (base, sft, or turbo), selects the scientifically correct training schedule, and runs on consumer hardware down to 8 GB VRAM. Version 1.1.2 adds user-selectable timestep sampling (continuous or discrete) across all three interfaces, building on 1.1.1's Music Flamingo/Transcriber Server providers, batched caption jobs, TensorBoard-parity charts, and training pipeline improvements.
- Auto-Configured Training -- All variants default to continuous logit-normal sampling + CFG dropout. Optionally switch to discrete 8-step sampling via `--timestep-mode discrete` (CLI), the Wizard, or the GUI dropdown. The upstream trainer forces the Turbo schedule on all models; Side-Step fixes this automatically.
- LoRA + LoKR Adapters -- Standard and Kronecker-product low-rank fine-tuning.
- Preprocessing++ (PP++) -- Fisher Information analysis assigns adaptive per-module ranks based on how important each layer is to your specific audio. Writes a `fisher_map.json` that training auto-detects.
- Two-Pass Preprocessing -- Converts raw audio to training tensors in two low-memory passes (~3 GB, then ~6 GB).
- Interactive Wizard -- Step-by-step prompts with "Go Back" support, presets, flow chaining, and session carry-over defaults.
- Dataset Builder -- Point at a folder of audio + sidecar `.txt` files and get a `dataset.json` automatically.
- Low VRAM -- 8-bit optimizers, gradient checkpointing, encoder offloading. Trains down to ~10 GB.
- Standalone & Portable -- Installs as its own project via `uv`. No need to touch your ACE-Step installation.
- Full Electron GUI -- Desktop application with Ez Mode (3-click training), Advanced Mode (every knob), real-time Monitor (loss charts, GPU stats), and a Lab workspace (datasets, preprocessing, PP++, export). CRT shader with phosphor bloom, scanlines, and chromatic aberration. Themeable (4 built-in themes + full editor).
- DoRA, LoHA, OFT Adapters -- Three additional adapter architectures alongside LoRA and LoKR.
- ComfyUI Export -- Convert PEFT LoRA/DoRA adapters to the single-file `.safetensors` format ComfyUI expects. LyCORIS adapters (LoKR, LoHA) are already natively compatible.
- AI Captioning -- Generate sidecar metadata with local AI (Qwen2.5-Omni, no API key, runs on your GPU), Google Gemini, OpenAI, or lyrics scraped from Genius.
- Offline Audio Analysis -- BPM, key, and time signature extraction via `demucs` stem separation + `librosa`. No API keys required.
- Built-in Music Player -- Play dataset audio directly in the GUI. Marquee display, EQ visualizer, volume control, auto-play, dockable bar.
- Live VRAM Estimation -- Segmented bar shows model + activation + optimizer breakdown before you start training. Changes reactively as you adjust settings.
- VRAM Presets -- One-click profiles: 8 GB, 12 GB, 16 GB, 24 GB+, Quick Test, High Quality, Recommended.
- Run History -- Persistent log of past training runs with best loss, adapter path, and hyperparameters.
- Tag Management -- Bulk add/remove trigger tags and convert legacy sidecar formats.
- Cross-Platform Entry Point -- `sidestep` (or `uv run sidestep` if not on PATH) works on all platforms.
- Cruise Control (Target Loss) -- Set a target loss value and Side-Step automatically damps the learning rate as training approaches it, holding the model at a sweet spot instead of over-fitting past it. EMA-smoothed loss signal, configurable warmup and floor. Works with all schedulers (conflict guards for Prodigy and cosine restarts). Resumes cleanly from checkpoints.
- Caption System Overhaul -- Richer song-focused prompts that emphasize audible content over generic descriptions. Configurable generation parameters (temperature, top_p, penalties). Structured response parsing extracts genre, BPM, key, and time signature alongside the caption. Google Search grounding for Gemini. Lossless audio auto-converted to MP3 before upload to save bandwidth.
- Local Captioner Rewrite -- Qwen2.5-Omni local captioner rebuilt with tiered VRAM configs, OOM recovery (retries with reduced token count), CPU offload option, audio transcoding fallback, cancellation support, and timing logs.
- Default Model Variant: Base -- Base is now the recommended default everywhere (CLI, GUI, wizard, TUI, presets). Turbo remains available but is no longer the automatic first choice. Model variant dropdown auto-selects base > sft > turbo based on what's available in your checkpoint directory.
- Preset Revamp -- All 7 built-in presets fleshed out with complete field coverage (adapter type, cruise control, checkpointing ratio, etc.). Presets now display adapter type, rank, LR, and epochs in the selection card. Type coercion fixes presets saved with numbers as strings.
- Encoding Error Resilience -- Genius, Gemini, and OpenAI providers now detect encoding errors (including errors wrapped by SDK exception types) and bail immediately instead of retrying 3× on deterministic failures. Saves ~70 seconds per batch when processing songs with non-ASCII titles.
- Linux Desktop Integration -- `.desktop` file and icon installed to XDG standard locations by the Linux installer. Side-Step appears in your application menu with its own icon.
- Electron Hardening -- Navigation guard prevents blank-page crashes, renderer crash detection, DevTools shortcut (F12), native desktop notifications for training completion.
- Prompt Helpers Fix -- `ask()` now correctly casts default values through `type_fn`, fixing numeric wizard defaults that were silently returned as strings.
- TensorBoard-Like Monitor -- Revamped Monitor tab with more relevant information and clearer data presentation, pairing TensorBoard-style charts with Side-Step's design language.
- Music Flamingo Provider -- Use Music Flamingo as a metadata and/or lyrics provider. Supports local servers via configurable URL and remote Hugging Face endpoints with token authentication.
- Transcriber Server Provider -- Dedicated lyrics provider backed by a configurable Transcriber Server URL. Nested response parsing, multipart transport, and automatic fallback handling.
- Batched Caption Jobs -- The GUI now runs caption generation as batched jobs. Multiple audio files are queued and processed in sequence with per-file progress, automatic retries, and cancellation support. No more one-at-a-time blocking. (Present since 1.0; the README now reflects it.)
- Overwrite-Lyrics-Only Mode -- Update only the lyrics field in existing sidecars without touching the rest of the metadata. Useful when re-running lyrics with a different provider.
- Explicit Sequence Crop Controls -- Choose between full sample, chunk by seconds, or max latent length. Backend, presets, UI, and VRAM estimation all support the new modes.
- Turbo Training Overhaul -- Replaced the old discrete 8-step Turbo schedule with continuous logit-normal timestep sampling and re-enabled CFG dropout. Turbo LoRA training now follows a proper training-oriented distribution.
- Cruise Control Progress -- Target loss scale and EMA are now reported in the progress file, visible in the GUI monitor.
- TensorBoard-Parity Charts -- The GUI training monitor now matches TensorBoard's smoothing algorithm, y-domain (P5-P95 with nice boundaries), grid counts, scroll zoom, pan, and closest-point finding. No external TensorBoard needed.
- CLI / Wizard / GUI Parity -- All new features (crop modes, provider selection, endpoint URLs, HF token) are available across all three interfaces.
- Bug Fixes -- `dtype` → `torch_dtype` in all `from_pretrained` calls (models were loading in default precision), LR restore on the gradient flush path, caption regex truncation on apostrophes, faster TensorBoard flush (5 s vs 30 s), and several provider integration fixes.
- Selectable Timestep Sampling -- Choose between continuous (logit-normal, recommended) and discrete (8-step turbo inference schedule) timestep sampling. Available in the GUI as a dropdown, in the Wizard under "All the Levers", and via `--timestep-mode` in the CLI. Default is continuous for all model variants; discrete mode is the legacy turbo behavior for users who want to train at exactly the 8 inference timesteps.
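The Cruise Control idea above can be sketched in a few lines: smooth the raw loss with an EMA, then scale the learning rate down as the smoothed value approaches the target. Everything here (class name, linear damping formula, default constants) is illustrative, not Side-Step's actual implementation:

```python
class CruiseControl:
    """Sketch of target-loss LR damping: EMA-smoothed loss signal,
    configurable warmup and LR floor (all names/defaults hypothetical)."""

    def __init__(self, target_loss, base_lr, ema_beta=0.98,
                 warmup_steps=100, lr_floor_frac=0.1):
        self.target = target_loss
        self.base_lr = base_lr
        self.beta = ema_beta
        self.warmup = warmup_steps
        self.floor = base_lr * lr_floor_frac
        self.ema = None
        self.step = 0

    def update(self, loss):
        """Feed the latest training loss; returns the LR to use next step."""
        self.step += 1
        self.ema = loss if self.ema is None else self.beta * self.ema + (1 - self.beta) * loss
        if self.step < self.warmup:
            return self.base_lr  # no damping during warmup
        # Damp LR linearly as the smoothed loss closes the gap to the target,
        # never dropping below the configured floor.
        gap = max(self.ema - self.target, 0.0)
        scale = min(gap / max(self.target, 1e-8), 1.0)
        return max(self.base_lr * scale, self.floor)
```

The key property: far above the target the LR is untouched, and as the EMA reaches the target the LR settles at the floor instead of letting the model over-fit past the sweet spot.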
The experimental TUI from 0.9.0 and before has been deprecated. The interactive Wizard is its definitive replacement.
Visual training, dataset management, live charts, and CRT-classic aesthetics (if you manage to find it :3).
```bash
uv run sidestep gui
```

Modes: Ez Mode | Advanced | Monitor | Lab (History, Tensor Datasets, Audio Library, Preprocess, PP++, Export)
Terminal prompts with back-navigation, presets, and flow chaining (preprocess -> train, PP++ -> train, build dataset -> preprocess -> train).
```bash
uv run sidestep
```

Automate pipelines or bypass menus entirely. Every argument has a `(default: X)` in `--help`.
```bash
uv run sidestep train \
  --checkpoint-dir ./checkpoints \
  --model base \
  --dataset-dir ./my_tensors \
  --output-dir ./output/my_lora \
  --adapter-type dora \
  --rank 64 --alpha 128 \
  --epochs 500
```

Linux:

```bash
git clone https://github.com/koda-dernet/Side-Step.git
cd Side-Step
chmod +x install_linux.sh && ./install_linux.sh
```

Windows:

```powershell
git clone https://github.com/koda-dernet/Side-Step.git
cd Side-Step
.\install_windows.ps1
```

The installer handles Python 3.11, PyTorch, Electron, and all dependencies via `uv`. Flash Attention is pulled from pre-built wheels -- no 20-minute (or longer) local compilation.
You need the ACE-Step 1.5 checkpoints. If you don't have them:
```bash
git clone https://github.com/ace-step/ACE-Step-1.5.git
cd ACE-Step-1.5 && uv sync && uv run acestep-download
```

Side-Step runs on everything from an RTX 3060 to an H100. Built-in presets configure these automatically.
| Profile | VRAM | Strategy |
|---|---|---|
| Comfortable | 24 GB+ | AdamW, Batch 2+, Rank 128 |
| Standard | 16-24 GB | AdamW, Batch 1, Rank 64 |
| Tight | 12-16 GB | AdamW8bit, Encoder offloading |
| Minimal | 8-10 GB | AdamW8bit, Offloading, Grad accumulation 8, Rank 16 |
Gradient checkpointing is on by default, reducing baseline VRAM to ~7 GB before optimizer state.
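The optimizer column above is the biggest lever. AdamW keeps two fp32 moment tensors per trainable parameter (about 8 bytes/param), while a bitsandbytes-style AdamW8bit quantizes both moments to roughly 2 bytes/param. A rough back-of-envelope calculator (illustrative only; real allocators add paging and overhead, and the 20M parameter count below is a hypothetical LoRA size):

```python
def optimizer_state_bytes(trainable_params: int, optimizer: str = "adamw") -> int:
    """Rough optimizer-state footprint in bytes.

    AdamW: two fp32 moments per parameter -> ~8 bytes/param.
    AdamW8bit: both moments quantized to int8 -> ~2 bytes/param.
    """
    bytes_per_param = {"adamw": 8, "adamw8bit": 2}
    return trainable_params * bytes_per_param[optimizer.lower()]

# Hypothetical ~20M trainable adapter params:
full = optimizer_state_bytes(20_000_000, "adamw")      # 160 MB of moments
low = optimizer_state_bytes(20_000_000, "adamw8bit")   # 40 MB of moments
```

This is why the Tight and Minimal profiles switch to AdamW8bit: the saved moment memory goes straight back to activations and batch size.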
Convert raw audio into training tensors. Two-pass approach keeps peak VRAM low.
```bash
uv run sidestep preprocess \
  --audio-dir ./my_songs \
  --tensor-output ./my_tensors \
  --normalize peak
```

Train an adapter on preprocessed tensors. Side-Step detects your variant and applies the correct schedule.
```bash
uv run sidestep train \
  --checkpoint-dir ./checkpoints \
  --model base \
  --dataset-dir ./my_tensors \
  --output-dir ./output/my_lora \
  --epochs 500
```

Find which layers matter most for your data, then allocate rank accordingly.
```bash
uv run sidestep analyze \
  --checkpoint-dir ./checkpoints \
  --model base \
  --dataset-dir ./my_tensors
```

Writes `fisher_map.json` into the dataset folder. Training auto-detects it and applies variable ranks.
Generate rich sidecar metadata for your audio files.
```bash
uv run sidestep captions \
  --audio-dir ./my_songs \
  --provider local_16gb   # or gemini, openai, lyrics_only

# With Music Flamingo metadata + Transcriber Server lyrics:
uv run sidestep captions \
  --audio-dir ./my_songs \
  --metadata-provider music_flamingo \
  --lyrics-provider transcriber_server \
  --music-flamingo-url http://localhost:5000 \
  --transcriber-server-url http://localhost:8000
```

```bash
uv run sidestep export \
  --adapter-dir ./output/my_lora/final \
  --target native
```

```bash
uv run sidestep dataset --input ./my_music_folder
```

Run `uv run sidestep --help` for full details.
| Subcommand | Description |
|---|---|
| `train` | Train an adapter (LoRA, DoRA, LoKR, LoHA, OFT) |
| `preprocess` | Convert audio to `.pt` tensors (two-pass pipeline) |
| `analyze` | PP++ -- Fisher analysis for adaptive rank assignment |
| `audio-analyze` | Offline BPM, key, and time signature extraction |
| `captions` | AI caption generation + lyrics scraping |
| `tags` | Bulk sidecar tag operations (add/remove triggers) |
| `dataset` | Build `dataset.json` from audio + sidecar folders |
| `convert-sidecars` | Migrate legacy sidecar formats |
| `history` | List past training runs and best loss values |
| `export` | Export adapter to ComfyUI `.safetensors` |
| `settings` | View/modify persistent configuration |
| `gui` | Launch the Electron desktop application |
Side-Step ensures your fine-tuning matches the base model's original training distribution:
- Continuous mode (default) -- Logit-normal sampling + CFG dropout. Recommended for all variants. Samples from a smooth distribution centered around the model's training regime.
- Discrete mode -- 8-step turbo inference schedule (shift=3.0). Legacy behavior for turbo models; trains at exactly the timestep values used during 8-step inference.

Select via `--timestep-mode continuous|discrete` (CLI), the "Timestep sampling" dropdown (GUI), or the Wizard. The upstream trainer often forces the Turbo schedule on all models, which is incorrect for Base/SFT. Side-Step defaults to continuous for all variants and lets you override when needed.
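The two modes can be sketched as follows. The logit-normal parameters and the exact 8-step grid construction are assumptions for illustration; the shift mapping `t' = s*t / (1 + (s-1)*t)` is the common flow-matching form, applied here with the shift=3.0 mentioned above:

```python
import math
import random

def sample_timestep_continuous(mean: float = 0.0, std: float = 1.0) -> float:
    """Continuous mode: logit-normal sampling. Draw from a normal distribution
    and squash through a sigmoid, so timesteps concentrate in the mid-range.
    mean/std are assumed defaults, not necessarily Side-Step's values."""
    x = random.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-x))

def discrete_turbo_timesteps(steps: int = 8, shift: float = 3.0) -> list[float]:
    """Discrete mode: a fixed grid matching shifted 8-step inference.
    Grid construction (evenly spaced over (0, 1]) is an assumption."""
    grid = [(i + 1) / steps for i in range(steps)]
    return [shift * t / (1 + (shift - 1) * t) for t in grid]
```

Continuous sampling covers the whole timestep range every epoch, which is why it suits Base/SFT; the discrete grid only ever visits eight fixed points, matching turbo's inference trajectory exactly.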
See `sidestep_documentation/` for detailed guides:
- Getting Started
- End-to-End Tutorial
- Dataset Preparation
- Training Guide
- Preprocessing++
- Preset Management
- VRAM Optimization Guide
- Shift and Timestep Sampling
- Using Your Adapter
- CLI Argument Reference
- Windows Notes
CC BY-NC-SA 4.0 -- free for personal and research use with attribution. Commercial use requires written permission from the author.
Contributions are always welcome. Audio transformer-based diffusion is still new territory, so these scripts evolve quickly, and every contribution helps the whole community. Open an issue, send a PR, or just share your results.
- Massive shoutout to @Signorlimone for designing and compositing the Side-Step logo.
- Amazing work by @robustini on the training pipeline and its lovely optimizations.



