Description
Hi,
I built the latest acestep.cpp on my Windows PC. Unfortunately, ace-lm.exe is generating invalid audio codes, and the generated audio is garbled.
Please check the attached program output and the generated JSON file. Earlier I used ACE-Step with Python and it worked fine. If I remove the audio codes from the generated JSON file, ace-synth.exe gives me good output. Why is ace-lm.exe generating invalid audio codes? (I also tried both acestep-5Hz-lm-0.6B-Q8_0.gguf and acestep-5Hz-lm-1.7B-Q8_0.gguf, but the results were the same.)
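For anyone who wants to reproduce the workaround of removing the audio codes, here is a minimal Python sketch. It assumes the request file is plain JSON with an optional top-level `audio_codes` key, as in the files attached here; the helper names `strip_audio_codes` and `rewrite_without_codes` are hypothetical and not part of the project.

```python
import json


def strip_audio_codes(request: dict) -> dict:
    """Return a copy of the request without the 'audio_codes' key."""
    return {key: value for key, value in request.items() if key != "audio_codes"}


def rewrite_without_codes(src_path: str, dst_path: str) -> None:
    """Read a request JSON file and write a copy with 'audio_codes' removed."""
    with open(src_path, "r", encoding="utf-8") as src:
        request = json.load(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        json.dump(strip_audio_codes(request), dst, indent=2, ensure_ascii=False)
```

Calling `rewrite_without_codes("simple0.json", "simple0_nocodes.json")` would produce a file that can be passed to ace-synth.exe with `--request`. The output file name is only an example.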
simple.json
{
"caption": "pop rock with female vocal",
"lyrics": "",
"vocal_language": "en",
"inference_steps": 8,
"guidance_scale": 1.0,
"shift": 3.0,
"duration": 180
}
generated simple0.json
{
"caption": "An energetic pop-rock track driven by a crunchy, overdriven electric guitar riff and a punchy, straightforward drum beat. A powerful female lead vocal delivers the melody with a clear, confident tone, soaring into a belted, anthemic chorus. The arrangement is built on a tight rhythm section of bass and drums, with layered guitars providing both rhythmic drive and melodic hooks. The song follows a classic verse-chorus structure, culminating in a dynamic bridge that builds tension before a final, powerful chorus and a guitar-led outro that ends with a definitive crash.",
"lyrics": "[Intro - Guitar Riff]\n\n[Verse 1]\nShe's got a smile that lights up the night\nEyes that shine like stars so bright\nHeart racing fast, can't control the beat\nEvery time she walks by my life feels sweet\n\n[Pre-Chorus]\nShe sends me gifts filled with delight\nWhispers my name in the soft moonlight\nCan't get enough of her laugh so loud\nIn her presence I feel so proud\n\n[Chorus]\nCrazy love, it's taking over me\nLike a storm rolling wild and free\nEvery moment a rush, every touch a spark\nCrazy love, set my heart on fire\n\n[Guitar Riff]\n\n[Verse 2]\nShe dances around my mind all day\nSings a tune and she takes me away\nLips that spell magic I can't deny\nGet hypnotized just by her eye\n\n[Pre-Chorus]\nTime stops when she's near\nEvery seconds worth the fear\nHer voice a melody, soft and clear\nIn her arms the world disappears\n\n[Chorus]\nCrazy love, it's taking over me\nLike a storm rolling wild and free\nEvery moment a rush, every touch a spark\nCrazy love, set my heart on fire\n\n[Guitar Solo]\n\n[Outro]\n[Final chord sustains and fades]",
"bpm": 83,
"duration": 180.0,
"keyscale": "G major",
"timesignature": "4",
"vocal_language": "en",
"seed": 7592087709092679405,
"batch_size": 1,
"lm_temperature": 0.85,
"lm_cfg_scale": 2.0,
"lm_top_p": 0.90,
"lm_top_k": 0,
"lm_negative_prompt": "",
"use_cot_caption": true,
"inference_steps": 8,
"guidance_scale": 1.0,
"shift": 3.0,
"audio_cover_strength": 0.50,
"repainting_start": -1.0,
"repainting_end": -1.0,
"audio_codes": "44281,44281,23290,10106,10683,12571,12571,17498,43026,46649,46649,46714,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49467,49403,36603,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49404,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,36603,49403,49403,49403,49339,36539,36603,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,49403,36603,49403,49403,49403,49459,49467,49403,49403,49403,49403,49459,49403,49459,49403,49403,49403,49403,36603,49403,49459,49403,49403,49403,49339,49403,49395,49339,49403,49403,49395,49331,49403,48883,49403,49403,49403,49395,36603,49403,49403,49403,49403,49403,49395,49459,49395,49395,49403,49395,49395,36603,36595,49403,49395,49331,49395,49395,49403,49395,49395,49403,49395,49403,49395,36595,49395,49395,49395,49395,49395,49395,49395,49459,49395,49459,49459,49395,49459,49395,49395,49395,49331,49395,49395,49395,49459,49403,49395,36595,49395,49395,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49451,49387,49387,49387,49387,49323,49387,49386,49387,49387,49387,36587,49387,36587,49387,49387,49387,49387,36587,49387,49387,49387,49387,36587,49387,49387,49387,4938
7,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49323,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49387,49898,49898,49899,49899,49899,49899,49963,49963,49963,49963,47403,49963,49963,49963,49963,49963,49963,49963,49963,49963,49963,49963,50027,50027,50027,50027,50027,50027,
50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50027,50540,47467,25446"
}
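The `audio_codes` field above is dominated by long runs of the same value (49403, 49387, 50027). A short script can put numbers on that. This is a rough sketch, assuming the field is a comma-separated string of integers as in the file above; `summarize_codes` is a hypothetical helper name.

```python
from collections import Counter
from itertools import groupby


def summarize_codes(codes_field: str) -> dict:
    """Summarize repetition in a comma-separated string of integer codes."""
    # Skip empty tokens so a trailing comma or stray whitespace is harmless.
    codes = [int(tok) for tok in codes_field.split(",") if tok.strip()]
    if not codes:
        return {"total": 0, "distinct": 0, "most_common_code": None,
                "most_common_count": 0, "longest_run": 0}
    most_common_code, most_common_count = Counter(codes).most_common(1)[0]
    # groupby yields one group per run of consecutive identical codes.
    longest_run = max(len(list(group)) for _, group in groupby(codes))
    return {
        "total": len(codes),
        "distinct": len(set(codes)),
        "most_common_code": most_common_code,
        "most_common_count": most_common_count,
        "longest_run": longest_run,
    }
```

Loading simple0.json with `json.load` and passing `request["audio_codes"]` to `summarize_codes` should show a very small number of distinct codes relative to the total and a very long longest run, which would match the garbled audio.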
program output
C:\acecpp\build>ace-lm.exe --request simple.json --lm acestep-5Hz-lm-4B-Q8_0.gguf
[Request] Parsed simple.json
[Request] seed=-1 batch=1
[Request] caption: pop rock with female vocal
[Request] lyrics: 0 bytes
[Request] bpm=0 dur=180 key= ts= lang=en
[Request] lm: temp=0.85 cfg=2.0 top_p=0.90 top_k=0
[Request] dit: steps=8 guidance=1.0 shift=3.0
[Request] audio_codes: (none)
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
[Load] LM backend: CUDA0 (CPU threads: 2)
[LM] FP16 clamp enabled (cc=750)
[GGUF] acestep-5Hz-lm-4B-Q8_0.gguf: 398 tensors, data at offset 5346304
[LM-Config] 36L, H=2560, V=217204, Nh=32, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 398 tensors, 4245.7 MB into backend
[LM-KV] Allocated 2 sets x 36 layers (4D batched), 2304.0 MB
[FSM] Prefix trees: bpm=301, dur=601, key=108, lang=55, tsig=5 nodes
[Ace-LM] Loaded: vocab=217204, max_seq=8192, max_batch=1, kv_sets=2
[Fill] lyrics=generate metas=fill gaps | 36 tokens, CFG: 1.00, N=1
[Phase1] Prefill 177ms, 36 tokens, N=1, CFG=1.00
[Phase1] Step 100, 1 active, 43.2 tok/s
[Phase1] Step 200, 1 active, 43.9 tok/s
[Phase1] Step 300, 1 active, 45.4 tok/s
[Phase1] Step 400, 1 active, 46.4 tok/s
[Phase1] Decode 9411ms
[Phase1 Batch0] seed=7592087709092679405, 438 tokens
[Fill Batch0] seed=7592087709092679405:
bpm:83
caption: An energetic pop-rock track driven by a crunchy, overdriven electric guitar
riff and a punchy, straightforward drum beat. A powerful female lead vocal delivers
the melody with a clear, confident tone, soaring into a belted, anthemic chorus.
The arrangement is built on a tight rhythm section of bass and drums, with layered
guitars providing both rhythmic drive and melodic hooks. The song follows a classic
verse-chorus structure, culminating in a dynamic bridge that builds tension before
a final, powerful chorus and a guitar-led outro that ends with a definitive crash.
duration: 157
keyscale:G major
language:en
timesignature:4
Lyric
[Intro - Guitar Riff]
[Verse 1]
She's got a smile that lights up the night
Eyes that shine like stars so bright
Heart racing fast, can't control the beat
Every time she walks by my life feels sweet
[Pre-Chorus]
She sends me gifts filled with delight
Whispers my name in the soft moonlight
Can't get enough of her laugh so loud
In her presence I feel so proud
[Chorus]
Crazy love, it's taking over me
Like a storm rolling wild and free
Every moment a rush, every touch a spark
Crazy love, set my heart on fire
[Guitar Riff]
[Verse 2]
She dances around my mind all day
Sings a tune and she takes me away
Lips that spell magic I can't deny
Get hypnotized just by her eye
[Pre-Chorus]
Time stops when she's near
Every seconds worth the fear
Her voice a melody, soft and clear
In her arms the world disappears
[Chorus]
Crazy love, it's taking over me
Like a storm rolling wild and free
Every moment a rush, every touch a spark
Crazy love, set my heart on fire
[Guitar Solo]
[Outro]
[Final chord sustains and fades]
[Phase2] N=1, CoT[0]:
bpm: 83
caption: An energetic pop-rock track driven by a crunchy, overdriven electric guitar
riff and a punchy, straightforward drum beat. A powerful female lead vocal delivers
the melody with a clear, confident tone, soaring into a belted, anthemic chorus.
The arrangement is built on a tight rhythm section of bass and drums, with layered
guitars providing both rhythmic drive and melodic hooks. The song follows a classic
verse-chorus structure, culminating in a dynamic bridge that builds tension before
a final, powerful chorus and a guitar-led outro that ends with a definitive crash.
duration: 180
keyscale: G major
language: en
timesignature: 4
[Phase2] max_tokens: 1000, CFG: 2.00, seeds: 7592087709092679405..7592087709092679405
[Phase2] Prefill 440ms (shared, 1 cond + 1 uncond)
[Decode] Step 50, 1 active, 51 total codes, 50.7 tok/s
[Decode] Step 100, 1 active, 101 total codes, 50.7 tok/s
[Decode] Step 150, 1 active, 151 total codes, 50.6 tok/s
[Decode] Step 200, 1 active, 201 total codes, 50.4 tok/s
[Decode] Step 250, 1 active, 251 total codes, 50.3 tok/s
[Decode] Step 300, 1 active, 301 total codes, 50.3 tok/s
[Decode] Step 350, 1 active, 351 total codes, 50.2 tok/s
[Decode] Step 400, 1 active, 401 total codes, 50.0 tok/s
[Decode] Step 450, 1 active, 451 total codes, 49.9 tok/s
[Decode] Step 500, 1 active, 501 total codes, 49.7 tok/s
[Decode] Step 550, 1 active, 551 total codes, 49.6 tok/s
[Decode] Step 600, 1 active, 601 total codes, 49.4 tok/s
[Decode] Step 650, 1 active, 651 total codes, 49.3 tok/s
[Decode] Step 700, 1 active, 701 total codes, 49.2 tok/s
[Decode] Step 750, 1 active, 751 total codes, 49.1 tok/s
[Decode] Step 800, 1 active, 801 total codes, 49.0 tok/s
[Decode] Step 850, 1 active, 851 total codes, 48.8 tok/s
[Phase2] Decode 18341ms
[Batch 0] seed=7592087709092679405, 894 codes
[Ace-LM] Load 3396 | Total 28397ms | seed=7592087709092679405
[Request] Wrote simple0.json
C:\acecpp\build>ace-synth.exe --request simple0.json --embedding Qwen3-Embedding-0.6B-Q8_0.gguf --dit acestep-v15-turbo-Q8_0.gguf --vae vae-BF16.gguf --wav
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
[Load] DiT backend: CUDA0 (CPU threads: 2)
[Load] Backend init: 37.6 ms
[GGUF] acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 1295.2 ms
[GGUF] acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[Load] silence_latent: [15000, 64] from GGUF
[GGUF] vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: CUDA0 (shared)
[VAE] Backend: CUDA0, Weight buffer: 161.1 MB
[VAE] Loaded: 5 blocks, upsample=1920x, F32 activations
[Load] VAE weights: 1400.9 ms
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 135.0 ms
[Load] TextEncoder backend: CUDA0 (shared)
[GGUF] Qwen3-Embedding-0.6B-Q8_0.gguf: 310 tensors, data at offset 5337664
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 742.7 MB into backend
[Load] TextEncoder: 578.2 ms
[Load] CondEncoder backend: CUDA0 (shared)
[CondEncoder] FP16 clamp enabled (cc=750)
[GGUF] acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 463.6 ms
[GGUF] acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 83.8 ms
[Ace-Synth] All models loaded, turbo=yes
[Request] Parsed simple0.json
[Request] seed=7592087709092679405 batch=1
[Request] caption: An energetic pop-rock track driven by a crunchy, overdriven ...
[Request] lyrics: 1018 bytes
[Request] bpm=83 dur=180 key=G major ts=4 lang=en
[Request] lm: temp=0.85 cfg=2.0 top_p=0.90 top_k=0
[Request] dit: steps=8 guidance=1.0 shift=3.0
[Request] audio_codes: (present)
[Request 1/1] simple0.json (batch=1)
[Pipeline] 894 audio codes (178.8s @ 5Hz)
[Pipeline] T=4470, S=2235
[Pipeline] seed=7592087709092679405, steps=8, guidance=1.0, shift=3.0, duration=180.0s
[Pipeline] caption: 169 tokens, lyrics: 284 tokens
[Encode] TextEncoder (169 tokens): 103.2 ms
[Encode] Lyric vocab lookup (284 tokens): 0.9 ms
[CondEnc] Lyric sliding mask: 284x284, window=128
[CondEnc] Timbre sliding mask: 750x750, window=128
[Encode] Packed: lyric=284 + timbre=1 + text=169 = 454 tokens
[Encode] ConditionEncoder: 47.9 ms, enc_S=454
[Context] Decoded: 894 codes -> 4470 frames (178.8s @ 25Hz)
[Context] Detokenizer: 848.8 ms
[Context Batch0] Philox noise seed=7592087709092679405, [4470, 64]
[DiT] Starting: T=4470, S=2235, enc_S=454, steps=8, batch=1
[DiT] Batch N=1, T=4470, S=2235, enc_S=454
[DiT] Graph: 1841 nodes
[DiT] Step 1/8 t=1.000
[DiT] Step 2/8 t=0.955
[DiT] Step 3/8 t=0.900
[DiT] Step 4/8 t=0.833
[DiT] Step 5/8 t=0.750
[DiT] Step 6/8 t=0.643
[DiT] Step 7/8 t=0.500
[DiT] Step 8/8 t=0.300
[DiT] Total generation: 3533.3 ms (3533.3 ms/sample)
[VAE] Tiled decode: 35 tiles (chunk=256, overlap=64, stride=128)
[VAE] Graph: 335 nodes, T_latent=192
[VAE] Upsample factor: 1920.00 (expected ~1920)
[VAE] Graph: 335 nodes, T_latent=256
[VAE] Graph: 335 nodes, T_latent=182
[VAE] Tiled decode done: 35 tiles -> T_audio=8582400 (178.80s @ 48kHz)
[VAE Batch0] Decode: 8998.6 ms
[WAV] Wrote simple00.wav: 8582400 samples, 48000 Hz, stereo
[Request 1/1] Done
[Pipeline] All done
C:\acecpp\build>pause
Press any key to continue . . .
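As a sanity check, the sizes reported in the log are consistent with each other: 894 codes at 5 Hz, 4470 frames at 25 Hz, and 8582400 samples at 48 kHz all come to 178.8 s. So the code count and the pipeline lengths look fine, and the problem seems to be in the code values themselves. The sketch below redoes that arithmetic using only rates printed in the log; the constant and function names are my own.

```python
CODE_RATE_HZ = 5         # "[Pipeline] 894 audio codes (178.8s @ 5Hz)"
FRAMES_PER_CODE = 5      # 25 Hz latent frames per 5 Hz code
VAE_UPSAMPLE = 1920      # "[VAE] Loaded: 5 blocks, upsample=1920x"
SAMPLE_RATE_HZ = 48000   # "[WAV] ... 48000 Hz, stereo"


def pipeline_sizes(n_codes: int) -> dict:
    """Derive the frame count, sample count, and durations from a code count."""
    frames = n_codes * FRAMES_PER_CODE
    samples = frames * VAE_UPSAMPLE
    return {
        "code_seconds": n_codes / CODE_RATE_HZ,
        "frames": frames,
        "samples": samples,
        "audio_seconds": samples / SAMPLE_RATE_HZ,
    }
```

`pipeline_sizes(894)` gives 4470 frames and 8582400 samples, matching the `[Context]` and `[VAE]` lines. The requested 180 s would correspond to 900 codes at 5 Hz, so the LM stopped six codes short of that.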