Side-Step for ACE-Step 1.5

 ░▒▓███████▓▒░▒▓█▓▒░▒▓███████▓▒░░▒▓████████▓▒░░▒▓███████▓▒░▒▓████████▓▒░▒▓████████▓▒░▒▓███████▓▒░ 
░▒▓█▓▒░      ░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░      ░▒▓█▓▒░         ░▒▓█▓▒░   ░▒▓█▓▒░      ░▒▓█▓▒░░▒▓█▓▒░
░▒▓█▓▒░      ░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░      ░▒▓█▓▒░         ░▒▓█▓▒░   ░▒▓█▓▒░      ░▒▓█▓▒░░▒▓█▓▒░
 ░▒▓██████▓▒░░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓██████▓▒░  ░▒▓██████▓▒░   ░▒▓█▓▒░   ░▒▓██████▓▒░ ░▒▓███████▓▒░ 
       ░▒▓█▓▒░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░             ░▒▓█▓▒░  ░▒▓█▓▒░   ░▒▓█▓▒░      ░▒▓█▓▒░       
       ░▒▓█▓▒░▒▓█▓▒░▒▓█▓▒░░▒▓█▓▒░▒▓█▓▒░             ░▒▓█▓▒░  ░▒▓█▓▒░   ░▒▓█▓▒░      ░▒▓█▓▒░       
░▒▓███████▓▒░░▒▓█▓▒░▒▓███████▓▒░░▒▓████████▓▒░▒▓███████▓▒░   ░▒▓█▓▒░   ░▒▓████████▓▒░▒▓█▓▒░       
 by dernet -- BETA

Standalone training toolkit for ACE-Step 1.5 audio generation models. Takes you from raw audio files to a working adapter without the friction. Variant-aware multi-adapter fine-tuning (LoRA, DoRA, LoKR, LoHA, OFT) with auto-detection, low-VRAM support, and three ways to work.

Status: v1.1.2-beta -- Stable enough for daily use, though some features are still experimental. This project is maintained by one person; if you run into a problem, please report it in the Issues tab.

Why Side-Step?

Side-Step auto-detects your model variant (base, sft, or turbo), selects the scientifically correct training schedule, and runs on consumer hardware down to 8 GB VRAM. Version 1.1.2 adds user-selectable timestep sampling (continuous or discrete) across all three interfaces, building on 1.1.1's Music Flamingo/Transcriber Server providers, batched caption jobs, TensorBoard-parity charts, and training pipeline improvements.

What was already here

  • Auto-Configured Training -- All variants default to continuous logit-normal sampling + CFG dropout. Optionally switch to discrete 8-step sampling via --timestep-mode discrete (CLI), the Wizard, or the GUI dropdown. The upstream trainer forces the Turbo schedule on all models; Side-Step fixes this automatically.
  • LoRA + LoKR Adapters -- Standard and Kronecker-product low-rank fine-tuning.
  • Preprocessing++ (PP++) -- Fisher Information analysis assigns adaptive per-module ranks based on how important each layer is to your specific audio. Writes a fisher_map.json that training auto-detects.
  • Two-Pass Preprocessing -- Converts raw audio to training tensors in two low-memory passes (~3 GB then ~6 GB).
  • Interactive Wizard -- Step-by-step prompts with "Go Back" support, presets, flow chaining, and session carry-over defaults.
  • Dataset Builder -- Point at a folder of audio + sidecar .txt files and get a dataset.json automatically.
  • Low VRAM -- 8-bit optimizers, gradient checkpointing, encoder offloading. Trains down to ~10 GB.
  • Standalone & Portable -- Installs as its own project via uv. No need to touch your ACE-Step installation.
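The Dataset Builder's pairing logic can be sketched as follows. This is an illustrative, hypothetical reimplementation -- the real dataset.json schema and keys are defined by Side-Step, and build_dataset is not part of its API:

```python
import json
from pathlib import Path

def build_dataset(folder: str) -> list[dict]:
    """Pair each audio file with its sidecar .txt caption (hypothetical schema)."""
    entries = []
    for audio in sorted(Path(folder).glob("*.mp3")):
        sidecar = audio.with_suffix(".txt")  # caption lives next to the audio
        entries.append({
            "audio": audio.name,
            "caption": sidecar.read_text().strip() if sidecar.exists() else "",
        })
    return entries

# Serialize to dataset.json the way the builder conceptually does:
# Path("dataset.json").write_text(json.dumps(build_dataset("./my_music_folder"), indent=2))
```

The real command (uv run sidestep dataset --input ./my_music_folder) also handles more audio formats and validation; the sketch only shows the audio-plus-sidecar pairing idea.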

New in 1.0.0

  • Full Electron GUI -- Desktop application with Ez Mode (3-click training), Advanced Mode (every knob), real-time Monitor (loss charts, GPU stats), and a Lab workspace (datasets, preprocessing, PP++, export). CRT shader with phosphor bloom, scanlines, and chromatic aberration. Themeable (4 built-in themes + full editor).
  • DoRA, LoHA, OFT Adapters -- Three additional adapter architectures alongside LoRA and LoKR.
  • ComfyUI Export -- Convert PEFT LoRA/DoRA adapters to the single-file .safetensors format ComfyUI expects. LyCORIS adapters (LoKR, LoHA) are already natively compatible.
  • AI Captioning -- Generate sidecar metadata with local AI (Qwen2.5-Omni, no API key, runs on your GPU), Google Gemini, OpenAI, or lyrics scraped from Genius.
  • Offline Audio Analysis -- BPM, key, and time signature extraction via demucs stem separation + librosa. No API keys required.
  • Built-in Music Player -- Play dataset audio directly in the GUI. Marquee display, EQ visualizer, volume control, auto-play, dockable bar.
  • Live VRAM Estimation -- Segmented bar shows model + activation + optimizer breakdown before you start training. Changes reactively as you adjust settings.
  • VRAM Presets -- One-click profiles: 8 GB, 12 GB, 16 GB, 24 GB+, Quick Test, High Quality, Recommended.
  • Run History -- Persistent log of past training runs with best loss, adapter path, and hyperparameters.
  • Tag Management -- Bulk add/remove trigger tags and convert legacy sidecar formats.
  • Cross-Platform Entry Point -- sidestep (or uv run sidestep if not on PATH) works on all platforms.

New in 1.1.0

  • Cruise Control (Target Loss) -- Set a target loss value and Side-Step automatically damps the learning rate as training approaches it, holding the model at a sweet spot instead of over-fitting past it. EMA-smoothed loss signal, configurable warmup and floor. Works with all schedulers (conflict guards for Prodigy and cosine restarts). Resumes cleanly from checkpoints.
  • Caption System Overhaul -- Richer song-focused prompts that emphasize audible content over generic descriptions. Configurable generation parameters (temperature, top_p, penalties). Structured response parsing extracts genre, BPM, key, and time signature alongside the caption. Google Search grounding for Gemini. Lossless audio auto-converted to MP3 before upload to save bandwidth.
  • Local Captioner Rewrite -- Qwen2.5-Omni local captioner rebuilt with tiered VRAM configs, OOM recovery (retries with reduced token count), CPU offload option, audio transcoding fallback, cancellation support, and timing logs.
  • Default Model Variant: Base -- Base is now the recommended default everywhere (CLI, GUI, wizard, TUI, presets). Turbo remains available but is no longer the automatic first choice. Model variant dropdown auto-selects base > sft > turbo based on what's available in your checkpoint directory.
  • Preset Revamp -- All 7 built-in presets fleshed out with complete field coverage (adapter type, cruise control, checkpointing ratio, etc.). Presets now display adapter type, rank, LR, and epochs in the selection card. Type coercion fixes presets saved with numbers as strings.
  • Encoding Error Resilience -- Genius, Gemini, and OpenAI providers now detect encoding errors (including errors wrapped by SDK exception types) and bail immediately instead of retrying 3× on deterministic failures. Saves ~70 seconds per batch when processing songs with non-ASCII titles.
  • Linux Desktop Integration -- .desktop file and icon installed to XDG standard locations by the Linux installer. Side-Step appears in your application menu with its own icon.
  • Electron Hardening -- Navigation guard prevents blank-page crashes, renderer crash detection, DevTools shortcut (F12), native desktop notifications for training completion.
  • Prompt Helpers Fix -- ask() now correctly casts default values through type_fn, fixing numeric wizard defaults that were silently returned as strings.

New in 1.1.1

  • TensorBoard-Like Monitor -- Revamped Monitor tab with more relevant information and better data representation, pairing TensorBoard-style charts with Side-Step's design language.
  • Music Flamingo Provider -- Use Music Flamingo as a metadata and/or lyrics provider. Supports local servers via configurable URL and remote Hugging Face endpoints with token authentication.
  • Transcriber Server Provider -- Dedicated lyrics provider backed by a configurable Transcriber Server URL. Nested response parsing, multipart transport, and automatic fallback handling.
  • Batched Caption Jobs -- The GUI now runs caption generation as batched jobs. Multiple audio files are queued and processed in sequence with per-file progress, automatic retries, and cancellation support. No more one-at-a-time blocking. (Present since 1.0; the README now documents it.)
  • Overwrite-Lyrics-Only Mode -- Update only the lyrics field in existing sidecars without touching the rest of the metadata. Useful when re-running lyrics with a different provider.
  • Explicit Sequence Crop Controls -- Choose between full sample, chunk by seconds, or max latent length. Backend, presets, UI, and VRAM estimation all support the new modes.
  • Turbo Training Overhaul -- Replaced the old discrete 8-step Turbo schedule with continuous logit-normal timestep sampling and re-enabled CFG dropout. Turbo LoRA training now follows a proper training-oriented distribution.
  • Cruise Control Progress -- Target loss scale and EMA are now reported in the progress file, visible in the GUI monitor.
  • TensorBoard-Parity Charts -- The GUI training monitor now matches TensorBoard's smoothing algorithm, y-domain (P5-P95 with nice boundaries), grid counts, scroll zoom, pan, and closest-point finding. No external TensorBoard needed.
  • CLI / Wizard / GUI Parity -- All new features (crop modes, provider selection, endpoint URLs, HF token) are available across all three interfaces.
  • Bug Fixes -- dtype -> torch_dtype in all from_pretrained calls (models were loading in default precision), LR restore on gradient flush path, caption regex truncation on apostrophes, faster TensorBoard flush (5s vs 30s), and several provider integration fixes.
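For reference, TensorBoard's scalar smoothing is a debiased exponential moving average; a minimal reimplementation of that algorithm (the one the Monitor now matches) looks like this:

```python
def tb_smooth(values: list[float], weight: float = 0.6) -> list[float]:
    """Debiased exponential moving average, as used by TensorBoard's
    scalar charts (`weight` corresponds to the UI smoothing slider)."""
    smoothed, last = [], 0.0
    for i, v in enumerate(values, start=1):
        last = last * weight + (1.0 - weight) * v
        smoothed.append(last / (1.0 - weight ** i))  # debias the early steps
    return smoothed
```

The debiasing term keeps the first few points from being dragged toward zero, which is why a TensorBoard curve starts at the first value rather than near the origin.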

New in 1.1.2

  • Selectable Timestep Sampling -- Choose between continuous (logit-normal, recommended) and discrete (8-step turbo inference schedule) timestep sampling. Available in the GUI as a dropdown, in the Wizard under "All the Levers", and via --timestep-mode in the CLI. Default is continuous for all model variants. Discrete mode is the legacy turbo behavior for users who want to train at exactly the 8 inference timesteps.

The Three Ways to Use Side-Step

The experimental TUI from 0.9.0 and before has been deprecated. The interactive Wizard is its definitive replacement.

1. The Desktop Window (GUI)

Visual training, dataset management, live charts, and CRT-classic aesthetics (if you manage to find it :3).

uv run sidestep gui

Modes: Ez Mode | Advanced | Monitor | Lab (History, Tensor Datasets, Audio Library, Preprocess, PP++, Export)


2. The Interactive Wizard

Terminal prompts with back-navigation, presets, and flow chaining (preprocess -> train, PP++ -> train, build dataset -> preprocess -> train).

uv run sidestep


3. The Command Line (CLI)

Automate pipelines or bypass menus entirely. Every argument has a (default: X) in --help.

uv run sidestep train \
    --checkpoint-dir ./checkpoints \
    --model base \
    --dataset-dir ./my_tensors \
    --output-dir ./output/my_lora \
    --adapter-type dora \
    --rank 64 --alpha 128 \
    --epochs 500

Quick Install

Linux / macOS

git clone https://github.com/koda-dernet/Side-Step.git
cd Side-Step
chmod +x install_linux.sh && ./install_linux.sh

Windows

git clone https://github.com/koda-dernet/Side-Step.git
cd Side-Step
.\install_windows.ps1

The installer handles Python 3.11, PyTorch, Electron, and all dependencies via uv. Flash Attention is pulled from pre-built wheels -- no 20-minute (or more) local compilation.

Get Models

You need the ACE-Step 1.5 checkpoints. If you don't have them:

git clone https://github.com/ace-step/ACE-Step-1.5.git
cd ACE-Step-1.5 && uv sync && uv run acestep-download

VRAM Profiles

Side-Step runs on everything from an RTX 3060 to an H100. Built-in presets configure these automatically.

| Profile     | VRAM     | Strategy                                              |
|-------------|----------|-------------------------------------------------------|
| Comfortable | 24 GB+   | AdamW, Batch 2+, Rank 128                             |
| Standard    | 16-24 GB | AdamW, Batch 1, Rank 64                               |
| Tight       | 12-16 GB | AdamW8bit, Encoder offloading                         |
| Minimal     | 8-10 GB  | AdamW8bit, Offloading, Grad accumulation 8, Rank 16   |

Gradient checkpointing is on by default, reducing baseline VRAM to ~7 GB before optimizer state.


Workflows

Preprocessing

Convert raw audio into training tensors. Two-pass approach keeps peak VRAM low.

uv run sidestep preprocess \
    --audio-dir ./my_songs \
    --tensor-output ./my_tensors \
    --normalize peak

Training

Train an adapter on preprocessed tensors. Side-Step detects your variant and applies the correct schedule.

uv run sidestep train \
    --checkpoint-dir ./checkpoints \
    --model base \
    --dataset-dir ./my_tensors \
    --output-dir ./output/my_lora \
    --epochs 500

Preprocessing++ (Adaptive Ranks)

Find which layers matter most for your data, then allocate rank accordingly.

uv run sidestep analyze \
    --checkpoint-dir ./checkpoints \
    --model base \
    --dataset-dir ./my_tensors

Writes fisher_map.json into the dataset folder. Training auto-detects it and applies variable ranks.

AI Captioning

Generate rich sidecar metadata for your audio files.

uv run sidestep captions \
    --audio-dir ./my_songs \
    --provider local_16gb       # or gemini, openai, lyrics_only

# With Music Flamingo metadata + Transcriber Server lyrics:
uv run sidestep captions \
    --audio-dir ./my_songs \
    --metadata-provider music_flamingo \
    --lyrics-provider transcriber_server \
    --music-flamingo-url http://localhost:5000 \
    --transcriber-server-url http://localhost:8000

Export to ComfyUI

uv run sidestep export \
    --adapter-dir ./output/my_lora/final \
    --target native

Dataset Building

uv run sidestep dataset --input ./my_music_folder

Complete Subcommand List

Run uv run sidestep --help for full details.

| Subcommand       | Description                                          |
|------------------|------------------------------------------------------|
| train            | Train an adapter (LoRA, DoRA, LoKR, LoHA, OFT)       |
| preprocess       | Convert audio to .pt tensors (two-pass pipeline)     |
| analyze          | PP++ -- Fisher analysis for adaptive rank assignment |
| audio-analyze    | Offline BPM, key, time signature extraction          |
| captions         | AI caption generation + lyrics scraping              |
| tags             | Bulk sidecar tag operations (add/remove triggers)    |
| dataset          | Build dataset.json from audio + sidecar folders      |
| convert-sidecars | Migrate legacy sidecar formats                       |
| history          | List past training runs and best loss values         |
| export           | Export adapter to ComfyUI .safetensors               |
| settings         | View/modify persistent configuration                 |
| gui              | Launch the Electron desktop application              |

Technical Notes: Timestep Sampling

Side-Step ensures your fine-tuning matches the base model's original training distribution:

  1. Continuous mode (default) -- Logit-normal sampling + CFG dropout. Recommended for all variants. Samples from a smooth distribution centered around the model's training regime.
  2. Discrete mode -- 8-step turbo inference schedule (shift=3.0). Legacy behavior for turbo models -- trains at exactly the timestep values used during 8-step inference.

Select via --timestep-mode continuous|discrete (CLI), the "Timestep sampling" dropdown (GUI), or the Wizard. The upstream trainer often forces the Turbo schedule on all models, which is incorrect for Base/SFT. Side-Step defaults to continuous for all variants and lets you override when needed.
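To make the two modes concrete, here is a hedged sketch. The logit-normal draw is standard flow-matching practice, and the discrete grid uses the common timestep-shift formula t' = s*t / (1 + (s - 1)*t); the mean/std and the exact shift formula ACE-Step uses are assumptions, not confirmed internals:

```python
import math
import random

def continuous_timestep(mean: float = 0.0, std: float = 1.0) -> float:
    """Logit-normal sampling: the sigmoid of a Gaussian draw gives t in (0, 1),
    concentrated around the model's training regime."""
    return 1.0 / (1.0 + math.exp(-random.gauss(mean, std)))

def discrete_schedule(steps: int = 8, shift: float = 3.0) -> list[float]:
    """Shifted uniform grid -- the t values an 8-step turbo inference run
    would visit (shift formula assumed, see the note above)."""
    grid = [(i + 1) / steps for i in range(steps)]
    return [shift * t / (1.0 + (shift - 1.0) * t) for t in grid]
```

In continuous mode every training step sees a fresh draw from the smooth distribution; in discrete mode training only ever touches the eight fixed grid values, which is why it matches turbo inference so tightly and generalizes less.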


Documentation

See sidestep_documentation/ for detailed guides.

License

CC BY-NC-SA 4.0 -- free for personal and research use with attribution. Commercial use requires written permission from the author.


Contributions are always welcome. Transformer-based audio diffusion is still novel territory, so these scripts evolve quickly, and your contributions benefit everyone. Open an issue, send a PR, or just share your results.


Contributors

  • Massive shoutout to @Signorlimone for designing and compositing the Side-Step logo.
  • Amazing work by @robustini on the training pipeline and its lovely optimizations.

About

The most powerful training scripts for ACE-Step 1.5 including a Command Line Interface, a Terminal Wizard and a Graphical User Interface.
