
Commit fe6777e

feat: add 5 AI integration skills (893L) for REVITHION STUDIO
New skills in .github/skills/ai-integration/:

- ace-step-inference: GGML/CUDA music generation, 3 modes, async threading
- ollama-copilot: in-DAW AI assistant, HTTP streaming, tool calling
- demucs-separation: LibTorch 4/6-stem separation, overlap-add
- matchering-mastering: reference-based AI mastering, batch processing
- whisper-transcription: CTranslate2 STT, VAD, streaming, timeline markers

Total skills library now: 46 SKILL.md files across 7 categories (6,939 lines)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
1 parent 75a054d commit fe6777e

5 files changed

Lines changed: 893 additions & 0 deletions


Lines changed: 189 additions & 0 deletions
@@ -0,0 +1,189 @@
---
name: ace-step-inference
version: "1.0"
description: ACE-Step 1.5 music generation — GGML inference, text-to-music, covers, repainting, CUDA acceleration, 48kHz stereo output for REVITHION STUDIO
tags: [ai, music-generation, ace-step, ggml, cuda, inference]
category: ai-integration
---

# ACE-Step 1.5 Music Generation Integration

ACE-Step 1.5 is a diffusion-based music generation model that produces full stereo audio from text prompts, reference tracks, or partial audio inputs. It supports three generation modes: **text-to-music** (prompt-only), **covers** (style transfer from a reference), and **repainting** (inpainting/outpainting on existing audio). REVITHION STUDIO integrates ACE-Step through a GGML-quantized C++ backend with CUDA acceleration, outputting 48kHz/32-bit float stereo suitable for direct insertion into the DAW timeline.

## Architecture Overview

The inference pipeline consists of a text encoder (CLAP), a latent diffusion UNet, and a vocoder (BigVGAN). The GGML backend loads quantized weights (Q4_K_M or Q8_0) into GPU VRAM via CUDA, keeping the host CPU free for DAW audio processing. A dedicated inference thread communicates with the audio engine through a lock-free FIFO, ensuring zero-glitch playback during generation.

## GGML Model Loading & CUDA Context
```cpp
#include <ggml/ggml.h>
#include <ggml/ggml-cuda.h>
#include <string>

struct AceStepContext {
    ggml_context* ctx = nullptr;
    ggml_backend_t backend = nullptr;
    ggml_backend_buffer_t buffer = nullptr;

    bool loadModel(const std::string& modelPath, int gpuLayers) {
        backend = ggml_backend_cuda_init(0);  // device 0
        if (!backend) return false;

        struct ggml_init_params params = {
            .mem_size = 512 * 1024 * 1024,
            .mem_buffer = nullptr,
            .no_alloc = true  // tensor data lives in the backend buffer, not host memory
        };
        ctx = ggml_init(params);
        if (!ctx) return false;

        // ggml_model_load is a project-side helper (not a ggml library call)
        // that maps the quantized weights into the CUDA backend buffer
        auto* model = ggml_model_load(modelPath.c_str(), ctx, backend, gpuLayers);
        return model != nullptr;
    }

    ~AceStepContext() {
        if (buffer) ggml_backend_buffer_free(buffer);
        if (ctx) ggml_free(ctx);
        if (backend) ggml_backend_free(backend);
    }
};
```

## Text-to-Music Generation

```cpp
struct GenerationParams {
    std::string prompt;
    float durationSec = 30.0f;
    int steps = 100;
    float cfgScale = 7.0f;
    int sampleRate = 48000;
    int seed = -1;  // -1 = random
};

std::vector<float> generateFromText(AceStepContext& ace, const GenerationParams& params) {
    auto tokens = ace.encodeText(params.prompt);

    // Diffusion loop with classifier-free guidance (CFG).
    // Guidance is applied once per step here, so the denoiser itself runs unguided.
    auto latent = ace.initNoise(params.durationSec, params.sampleRate, params.seed);
    for (int step = 0; step < params.steps; ++step) {
        auto conditioned = ace.denoise(latent, tokens, step);
        auto unconditioned = ace.denoise(latent, {}, step);
        latent = unconditioned + params.cfgScale * (conditioned - unconditioned);
    }

    return ace.vocoder(latent);  // 48kHz stereo interleaved float
}
```

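The guidance combine in the loop above is plain per-sample arithmetic. A pure-Python sketch (flat lists standing in for the real latent tensors, values chosen for exact float results) shows its limiting cases:

```python
def cfg_combine(uncond, cond, scale):
    """Classifier-free guidance: move from the unconditional prediction
    toward (and, for scale > 1, past) the conditional one."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]

uncond = [0.0, 1.0, -0.5]
cond = [1.0, 0.5, 0.5]

print(cfg_combine(uncond, cond, 0.0))  # [0.0, 1.0, -0.5] : purely unconditional
print(cfg_combine(uncond, cond, 1.0))  # [1.0, 0.5, 0.5]  : purely conditional
print(cfg_combine(uncond, cond, 7.0))  # [7.0, -2.5, 6.5] : extrapolated past cond
```

Large scales extrapolate far beyond the conditional prediction, which is why the anti-patterns below warn against cfg_scale > 15.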
## Cover & Repainting Modes

```cpp
enum class AceMode { TextToMusic, Cover, Repaint };

std::vector<float> generateWithReference(AceStepContext& ace,
                                         const GenerationParams& params,
                                         AceMode mode,
                                         const float* refAudio,
                                         int refSamples,
                                         float strength = 0.75f) {
    auto latent = ace.encodeAudio(refAudio, refSamples);

    if (mode == AceMode::Cover) {
        // Partial noise injection preserving melodic structure
        int startStep = static_cast<int>(params.steps * (1.0f - strength));
        latent = ace.addNoise(latent, startStep);
        return ace.denoiseFrom(latent, ace.encodeText(params.prompt), startStep, params);
    }

    if (mode == AceMode::Repaint) {
        auto mask = ace.buildTimeMask(params.durationSec, params.sampleRate);
        return ace.inpaint(latent, mask, ace.encodeText(params.prompt), params);
    }

    return generateFromText(ace, params);
}
```

## JUCE Integration — Async Generation Thread

```cpp
class AceStepProcessor : public juce::Thread {
    AceStepContext context;
    juce::AbstractFifo fifo { 48000 * 120 * 2 };  // 120s stereo buffer, counted in samples
    std::vector<float> ringBuffer;
    std::atomic<bool> generating { false };

public:
    AceStepProcessor() : Thread("ACE-Step-Inference") {
        ringBuffer.resize(static_cast<size_t>(fifo.getTotalSize()));
    }

    void startGeneration(const GenerationParams& params) {
        currentParams = params;
        generating = true;
        startThread(juce::Thread::Priority::normal);  // below the real-time audio thread
    }

    void run() override {
        auto audio = generateFromText(context, currentParams);
        int written = 0;
        while (written < static_cast<int>(audio.size()) && !threadShouldExit()) {
            auto scope = fifo.write(static_cast<int>(audio.size()) - written);
            // The scoped write may wrap around the ring: copy both regions, not just the first
            std::copy_n(audio.data() + written, scope.blockSize1, ringBuffer.data() + scope.startIndex1);
            std::copy_n(audio.data() + written + scope.blockSize1, scope.blockSize2, ringBuffer.data() + scope.startIndex2);
            written += scope.blockSize1 + scope.blockSize2;
            if (scope.blockSize1 + scope.blockSize2 == 0)
                wait(10);  // FIFO full: back off until the audio thread drains it
        }
        generating = false;
    }

    void pullSamples(float* dest, int numSamples) {
        auto scope = fifo.read(numSamples);
        // Same wrap-around handling on the read side
        std::copy_n(ringBuffer.data() + scope.startIndex1, scope.blockSize1, dest);
        std::copy_n(ringBuffer.data() + scope.startIndex2, scope.blockSize2, dest + scope.blockSize1);
    }

private:
    GenerationParams currentParams;
};
```

## Python API Bridge (ACE-Step HTTP)

```python
import httpx

async def generate_music(prompt: str, duration: float = 30.0,
                         mode: str = "text2music",
                         reference_path: str | None = None) -> bytes:
    """Call the ACE-Step API server at localhost:8001."""
    payload = {
        "prompt": prompt, "duration": duration, "mode": mode,
        "sample_rate": 48000, "cfg_scale": 7.0, "steps": 100,
    }
    if reference_path:
        payload["reference_audio"] = reference_path

    async with httpx.AsyncClient(timeout=300) as client:
        resp = await client.post("http://localhost:8001/generate", json=payload)
        resp.raise_for_status()
        return resp.content  # Raw 48kHz float32 PCM
```

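The raw PCM bytes are not directly loadable by most tools. A stdlib-only helper (hypothetical, not part of the skill; it assumes the server really returns little-endian interleaved float32 as noted above) converts them to a 16-bit WAV for quick auditioning:

```python
import array
import wave

def save_pcm_as_wav(pcm: bytes, path: str, sample_rate: int = 48000, channels: int = 2) -> None:
    """Convert raw little-endian float32 PCM to a 16-bit WAV file.
    The stdlib wave module only writes integer PCM, so clip and rescale."""
    samples = array.array("f")
    samples.frombytes(pcm)
    ints = array.array("h", (int(max(-1.0, min(1.0, s)) * 32767) for s in samples))
    with wave.open(path, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(2)  # 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(ints.tobytes())
```

Usage would be `save_pcm_as_wav(await generate_music("lo-fi piano"), "preview.wav")`; for lossless handoff into the DAW, keep the float32 buffer instead.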
## Anti-Patterns

- ❌ Don't run inference on the audio thread — always use a separate thread with FIFO handoff
- ❌ Don't load full FP32 weights on a 24GB GPU — use Q4_K_M or Q8_0 quantization to fit in VRAM
- ❌ Don't generate at 44.1kHz then resample — generate natively at 48kHz to avoid aliasing artifacts
- ❌ Don't block the UI thread waiting for generation — use async callbacks or polling
- ❌ Don't skip CUDA device synchronization before reading output buffers
- ❌ Don't use cfg_scale > 15 — it causes spectral collapse and harsh artifacts

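The quantization anti-pattern comes down to simple arithmetic. A sketch, where the bits-per-weight figures are approximate and the parameter count is a hypothetical stand-in (the actual ACE-Step checkpoint size is not given here):

```python
def weight_vram_gb(n_params: float, bits_per_weight: float) -> float:
    """VRAM for the weights alone; activations, the CUDA context, and the
    rest of the DAW's GPU usage come on top of this."""
    return n_params * bits_per_weight / 8.0 / 1e9

# Approximate bits/weight: FP32 = 32, Q8_0 ~ 8.5, Q4_K_M ~ 4.5
for name, bpw in [("FP32", 32.0), ("Q8_0", 8.5), ("Q4_K_M", 4.5)]:
    print(f"{name}: {weight_vram_gb(3.5e9, bpw):.2f} GB")  # hypothetical 3.5B params
```

The ratio is what matters: Q4_K_M needs roughly one seventh of the FP32 weight footprint, which is the headroom that keeps generation and DAW rendering on the same GPU.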
## Checklist

- [ ] GGML backend initialized with CUDA device 0 before model load
- [ ] Model weights quantized to Q4_K_M or Q8_0 and validated with checksum
- [ ] Inference thread priority set below audio thread priority
- [ ] Ring buffer sized for maximum generation duration (120s × 48kHz × 2ch)
- [ ] Output sample rate matches DAW session rate (48kHz default)
- [ ] VRAM usage monitored — abort generation if free VRAM < 2GB
- [ ] Seed stored with generated clip for reproducibility
- [ ] All three modes (text-to-music, cover, repaint) tested with reference audio
Lines changed: 173 additions & 0 deletions
@@ -0,0 +1,173 @@
---
name: demucs-separation
version: "1.0"
description: AI stem separation with Demucs/HTDemucs — 4-stem and 6-stem modes, GPU inference via LibTorch, real-time preview for REVITHION STUDIO
tags: [ai, stem-separation, demucs, libtorch, cuda, audio]
category: ai-integration
---

# AI Stem Separation with Demucs/HTDemucs

Demucs (Hybrid Transformer Demucs) is Meta's state-of-the-art source separation model that splits mixed audio into individual stems. REVITHION STUDIO integrates HTDemucs v4 via LibTorch C++ for native GPU-accelerated inference. The 4-stem model separates drums, bass, vocals, and other; the 6-stem model adds guitar and piano. Output is 44.1kHz float32 per stem, resampled to the session rate (typically 48kHz). A real-time preview mode processes small overlapping chunks for interactive auditioning before committing to full offline separation.

## Model Configuration

HTDemucs uses a hybrid architecture combining a temporal convolutional network with a spectral transformer. The C++ integration loads TorchScript-traced models, keeping the audio engine independent of Python at runtime.

```cpp
struct DemucsConfig {
    int numStems = 4;        // 4 = drums/bass/vocals/other; 6 adds guitar/piano
    int sampleRate = 44100;  // Native model rate
    int segmentLength = 7;   // Seconds per chunk (overlap-add)
    float overlap = 0.25f;   // 25% overlap between segments
    bool useCuda = true;
    int cudaDevice = 0;
    juce::String modelPath;  // Path to .pt TorchScript model
};
```

## LibTorch C++ Integration

```cpp
#include <torch/script.h>
#include <torch/cuda.h>

class DemucsEngine {
    torch::jit::script::Module model;
    torch::Device device;
    DemucsConfig config;

public:
    DemucsEngine(const DemucsConfig& cfg)
        : device(cfg.useCuda && torch::cuda::is_available()
                     ? torch::Device(torch::kCUDA, cfg.cudaDevice)
                     : torch::Device(torch::kCPU)),
          config(cfg) {}

    bool loadModel() {
        try {
            model = torch::jit::load(config.modelPath.toStdString(), device);
            model.eval();
            torch::NoGradGuard noGrad;
            // Warm up with a short tensor to trigger CUDA kernel compilation
            auto dummy = torch::randn({1, 2, config.sampleRate * 2}).to(device);
            model.forward({dummy});
            return true;
        } catch (const c10::Error& e) {
            DBG("Demucs load failed: " << e.what());
            return false;
        }
    }

    // Returns [numStems, 2, numSamples] tensor
    torch::Tensor separate(const float* interleavedStereo, int numFrames) {
        torch::NoGradGuard noGrad;
        // Input is interleaved [L R L R ...]; view it as [frames, 2] and
        // transpose into the planar [1, 2, frames] layout the model expects
        auto input = torch::from_blob(
            const_cast<float*>(interleavedStereo),
            {numFrames, 2}, torch::kFloat32
        ).t().unsqueeze(0).contiguous().to(device);

        auto output = model.forward({input}).toTensor();  // [1, stems, 2, samples]
        return output.squeeze(0).cpu();
    }
};
```

## Overlap-Add Segment Processing

Full tracks are too long for a single forward pass. Segment the input with overlap and crossfade the outputs to eliminate boundary artifacts.

```cpp
#include <cmath>

struct StemResult {
    std::vector<std::vector<float>> stems;  // [stemIdx][interleavedSamples]
};

StemResult separateFullTrack(DemucsEngine& engine, const float* audio,
                             int totalFrames, const DemucsConfig& cfg) {
    int segSamples = cfg.segmentLength * cfg.sampleRate;
    int hopSamples = static_cast<int>(segSamples * (1.0f - cfg.overlap));
    int numStems = cfg.numStems;

    StemResult result;
    result.stems.resize(static_cast<size_t>(numStems),
                        std::vector<float>(static_cast<size_t>(totalFrames) * 2, 0.0f));
    std::vector<float> weightSum(static_cast<size_t>(totalFrames), 0.0f);

    // Build Hann crossfade window
    std::vector<float> window(static_cast<size_t>(segSamples));
    for (int i = 0; i < segSamples; ++i)
        window[static_cast<size_t>(i)] =
            0.5f * (1.0f - std::cos(2.0f * static_cast<float>(M_PI) * i / segSamples));

    for (int offset = 0; offset < totalFrames; offset += hopSamples) {
        int chunkLen = std::min(segSamples, totalFrames - offset);
        auto output = engine.separate(audio + static_cast<size_t>(offset) * 2, chunkLen);

        for (int s = 0; s < numStems; ++s) {
            // output[s] is planar [2, chunkLen]: all left samples, then all right
            auto chunk = output[s].contiguous();
            const float* left = chunk.data_ptr<float>();
            const float* right = left + chunkLen;
            for (int i = 0; i < chunkLen; ++i) {
                float w = window[static_cast<size_t>(i)];
                size_t outIdx = static_cast<size_t>(offset + i) * 2;
                result.stems[static_cast<size_t>(s)][outIdx] += left[i] * w;
                result.stems[static_cast<size_t>(s)][outIdx + 1] += right[i] * w;
                weightSum[static_cast<size_t>(offset + i)] += w;
            }
        }
    }

    // Normalize by accumulated window weight
    for (size_t s = 0; s < static_cast<size_t>(numStems); ++s)
        for (size_t i = 0; i < static_cast<size_t>(totalFrames); ++i)
            if (weightSum[i] > 0.0f) {
                result.stems[s][i * 2] /= weightSum[i];
                result.stems[s][i * 2 + 1] /= weightSum[i];
            }

    return result;
}
```

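The divide-by-accumulated-weight step is what makes the Hann window safe at 25% overlap, where the raw windows do not sum to one. A small pure-Python sketch of the same accumulate-then-normalize scheme, applied to a constant signal, demonstrates the perfect-reconstruction property:

```python
import math

def overlap_add_identity(total: int, seg: int, overlap: float) -> list[float]:
    """Window each chunk of a constant 1.0 signal, accumulate, then divide
    by the summed window weight; the result should be 1.0 again."""
    hop = int(seg * (1.0 - overlap))
    win = [0.5 * (1.0 - math.cos(2.0 * math.pi * i / seg)) for i in range(seg)]
    acc = [0.0] * total
    wsum = [0.0] * total
    for off in range(0, total, hop):
        n = min(seg, total - off)
        for i in range(n):
            acc[off + i] += 1.0 * win[i]   # "separated" chunk == input of ones
            wsum[off + i] += win[i]
    return [a / w if w > 0.0 else 0.0 for a, w in zip(acc, wsum)]

out = overlap_add_identity(total=1000, seg=100, overlap=0.25)
print(all(abs(x - 1.0) < 1e-9 for x in out[1:]))  # True
```

Only sample 0 stays at zero (it is touched solely by the Hann window's zero endpoint); every other sample reconstructs exactly, which is the sum-to-one check the checklist below asks for.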
## Python — Batch Separation Script

```python
import torch
import torchaudio
from demucs.pretrained import get_model
from demucs.apply import apply_model

def separate_file(input_path: str, output_dir: str, model_name: str = "htdemucs") -> list[str]:
    model = get_model(model_name)
    model.cuda()
    wav, sr = torchaudio.load(input_path)
    wav = wav.unsqueeze(0).cuda()  # [1, channels, samples]

    with torch.no_grad():
        stems = apply_model(model, wav, shifts=1, overlap=0.25)

    paths = []
    for i, name in enumerate(model.sources):
        out = stems[0, i].cpu()
        path = f"{output_dir}/{name}.wav"
        torchaudio.save(path, out, sr)
        paths.append(path)
    return paths
```

## Anti-Patterns

- ❌ Don't run separation on the audio thread — even GPU inference takes 2–10× real-time
- ❌ Don't skip overlap-add windowing — hard segment boundaries cause audible clicks
- ❌ Don't assume 48kHz input — Demucs expects 44.1kHz; resample before inference
- ❌ Don't allocate CUDA tensors in a loop — pre-allocate and reuse buffers
- ❌ Don't mix LibTorch CUDA contexts with GGML CUDA contexts without stream synchronization
- ❌ Don't forget `torch::NoGradGuard` — gradient tracking wastes VRAM during inference

## Checklist

- [ ] LibTorch linked with CUDA 12+ and matching cuDNN version
- [ ] TorchScript model traced and validated against Python reference outputs
- [ ] Overlap-add window function tested for perfect reconstruction (sum-to-one)
- [ ] Sample rate conversion (48kHz ↔ 44.1kHz) uses high-quality sinc resampler
- [ ] GPU memory monitored — fallback to CPU if VRAM < 4GB free
- [ ] Preview mode processes ≤ 8 seconds to maintain interactive latency
- [ ] Stem output gain-matched to original mix (sum of stems ≈ input)
- [ ] All 4/6 stem names mapped to correct mixer channels
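
The gain-match item (sum of stems ≈ input) can be spot-checked with a peak-residual measure. A pure-Python sketch with toy lists standing in for the real stem buffers (`gain_match_residual` is a hypothetical helper name):

```python
def gain_match_residual(mix: list[float], stems: list[list[float]]) -> float:
    """Peak absolute difference between the original mix and the stem sum.
    A small residual means separation preserved overall level and content."""
    stem_sum = [sum(vals) for vals in zip(*stems)]
    return max(abs(m - s) for m, s in zip(mix, stem_sum))

# Toy example: two stems that sum exactly back to the mix
mix = [0.5, -0.5, 0.25, 0.0]
stems = [[0.25, -0.25, 0.125, 0.0],
         [0.25, -0.25, 0.125, 0.0]]
print(gain_match_residual(mix, stems))  # 0.0
```

In practice the residual will be small but nonzero; a sensible acceptance threshold (for example, residual below -40 dBFS relative to the mix peak) is a project decision, not something the model guarantees.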
