Extremely short outputs #34

@firefox42

I compiled the program on an Ubuntu 24.04.4 system with ROCm 7.1.1 and an R9700. Generating the request0.json file with ace-lm works fine, e.g.

ace-lm output using simple.sh
![Image](https://github.com/user-attachments/assets/523bd7e6-5131-4658-afd5-46407c4cda67)

[Request] Parsed simple.json
[Request] seed=-1 batch=1
[Request] caption: Upbeat pop rock anthem with driving electric guitars, punchy...
[Request] lyrics: 0 bytes
[Request] bpm=0 dur=0 key= ts= lang=fr
[Request] lm: temp=0.85 cfg=2.0 top_p=0.90 top_k=0
[Request] dit: steps=8 guidance=1.0 shift=3.0
[Request] audio_codes: (none)
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
  Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
  Device 2: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
  Device 3: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
[Load] LM backend: ROCm0 (CPU threads: 8)
[GGUF] ../models/acestep-5Hz-lm-4B-Q8_0.gguf: 398 tensors, data at offset 5346304
[LM-Config] 36L, H=2560, V=217204, Nh=32, Nkv=8, D=128, tied=1
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 398 tensors, 4245.7 MB into backend
[LM-KV] Allocated 2 sets x 36 layers (4D batched), 2304.0 MB
[FSM] Prefix trees: bpm=301, dur=601, key=108, lang=55, tsig=5 nodes
[Ace-LM] Loaded: vocab=217204, max_seq=8192, max_batch=1, kv_sets=2
[Fill] lyrics=generate metas=fill gaps | 54 tokens, CFG: 1.00, N=1
[Phase1] Prefill 64ms, 54 tokens, N=1, CFG=1.00
[Phase1] Step 100, 1 active, 65.8 tok/s
[Phase1] Step 200, 1 active, 65.8 tok/s
[Phase1] Step 300, 1 active, 66.2 tok/s
[Phase1] Step 400, 1 active, 66.3 tok/s
[Phase1] Step 500, 1 active, 66.4 tok/s
[Phase1] Step 600, 1 active, 66.4 tok/s
[Phase1] Decode 9714ms
[Phase1 Batch0] seed=660049505072788704, 645 tokens
[Fill Batch0] seed=660049505072788704:
bpm:86
caption: An energetic and upbeat pop-rock track driven by a tight rhythm section of
  punchy drums and a solid bass guitar. Overdriven electric guitars provide a crunchy,
  rhythmic foundation, while a lead guitar plays catchy melodic fills and a spirited,
  melodic solo. The lead male vocal is clear and enthusiastic, delivering the lyrics
  with a straightforward, anthemic quality. The song features a dynamic structure
  with a memorable chorus, a lively guitar solo, and a singalong bridge with 'whoa-oh'
  vocalizations before a powerful final chorus and a clean, definitive ending.
duration: 210
keyscale:G major
language:fr
timesignature:4
</think>

# Lyric
[Intro - Guitar Riff]

[Verse 1]
Les p'tits chats de la ville se promènent tous les soirs
Dans les ruelles sombres sans jamais voir de miroir
Leurs yeux malicieux nous regardent en bas
Tout en attendant l'heure de rentrer chez eux

[Chorus]
Les p'tits chats de la ville s'endorment sans éclat
Leur vie silencieuse dans ce monde si bas
Rêvant des journées où ils pourrait ronronner
Sans devoir se presser ni dormir ni rentrer

[Guitar Riff]

[Verse 2]
Le matin ils sautent pour chercher du pain
Leur queue frémissante attend l'instant divin
Où l'instant divin de retrouver un nid
Pour coucher leur tête en douceur sous l'abri

[Chorus]
Les p'tits chats de la ville s'endorment sans éclat
Leur vie silencieuse dans ce monde si bas
Rêvant des journées où ils pourrait ronronner
Sans devoir se presser ni dormir ni rentrer

[Guitar Solo]

[Bridge]
Les p'tits chats de la ville s'endorment sans éclat
Leur vie silencieuse dans ce monde si bas
Rêvant des journées où ils pourrait ronronner
Sans devoir se presser ni dormir ni rentrer

[Chorus - Key Change]
Les p'tits chats de la ville s'endorment sans éclat
Leur vie silencieuse dans ce monde si bas
Rêvant des journées où ils pourrait ronronner
Sans devoir se presser ni dormir ni rentrer

[Outro]
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
(Woah-oh-oh-oh)
[Final guitar chord and cymbal crash]

[Phase2] N=1, CoT[0]:
bpm: 86
caption: An energetic and upbeat pop-rock track driven by a tight rhythm section of
  punchy drums and a solid bass guitar. Overdriven electric guitars provide a crunchy,
  rhythmic foundation, while a lead guitar plays catchy melodic fills and a spirited,
  melodic solo. The lead male vocal is clear and enthusiastic, delivering the lyrics
  with a straightforward, anthemic quality. The song features a dynamic structure
  with a memorable chorus, a lively guitar solo, and a singalong bridge with 'whoa-oh'
  vocalizations before a powerful final chorus and a clean, definitive ending.
duration: 210
keyscale: G major
language: fr
timesignature: 4
[Phase2] max_tokens: 1150, CFG: 2.00, seeds: 660049505072788704..660049505072788704
[Phase2] Prefill 244ms (shared, 1 cond + 1 uncond)
[Phase2] Decode 16ms
[Batch 0] seed=660049505072788704, 1 codes
[Ace-LM] Load 2990 | Total 10044ms | seed=660049505072788704
[Request] Wrote simple0.json

But the .mp3 file produced by ace-synth is always less than a second long, at only 3840 bytes each time, e.g.

ace-synth output using simple.sh
ggml_cuda_init: found 4 ROCm devices:
  Device 0: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
  Device 1: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
  Device 2: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
  Device 3: AMD Radeon AI PRO R9700, gfx1201 (0x1201), VMM: no, Wave Size: 32
[Load] DiT backend: ROCm0 (CPU threads: 8)
[Load] Backend init: 42.3 ms
[GGUF] ../models/acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[DiT] Self-attn: Q+K+V fused
[DiT] Cross-attn: Q+K+V fused
[DiT] MLP: gate+up fused
[Load] null_condition_emb found (CFG available)
[WeightCtx] Loaded 478 tensors, 1600.7 MB into backend
[Load] DiT: 24 layers, H=2048, Nh=16/8, D=128
[Load] DiT weight load: 615.7 ms
[GGUF] ../models/acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[Load] silence_latent: [15000, 64] from GGUF
[GGUF] ../models/vae-BF16.gguf: 365 tensors, data at offset 30048
[Load] VAE backend: ROCm0 (shared)
[VAE] Backend: ROCm0, Weight buffer: 161.1 MB
[VAE] Loaded: 5 blocks, upsample=1920x, F32 activations
[Load] VAE weights: 854.2 ms
[BPE] Loaded from GGUF: 151643 vocab, 151387 merges
[Load] BPE tokenizer: 43.7 ms
[Load] TextEncoder backend: ROCm0 (shared)
[GGUF] ../models/Qwen3-Embedding-0.6B-Q8_0.gguf: 310 tensors, data at offset 5337664
[Load] TextEncoder: 28L, H=1024, Nh=16/8
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 310 tensors, 742.7 MB into backend
[Load] TextEncoder: 223.3 ms
[Load] CondEncoder backend: ROCm0 (shared)
[GGUF] ../models/acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[Load] LyricEncoder: 8L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[Load] TimbreEncoder: 4L
[Qwen3] Attn: Q+K+V fused
[Qwen3] MLP: gate+up fused
[WeightCtx] Loaded 140 tensors, 616.6 MB into backend
[Load] CondEncoder: lyric(8L), timbre(4L), text_proj, null_cond
[Load] ConditionEncoder: 184.8 ms
[GGUF] ../models/acestep-v15-turbo-Q8_0.gguf: 678 tensors, data at offset 56864
[WeightCtx] Loaded 30 tensors, 106.5 MB into backend
[Load] Detokenizer: FSQ(6->2048) + 2L encoder(S=5, 2048->64)
[Load] Detokenizer: 33.5 ms
[Ace-Synth] All models loaded, turbo=yes
[Request] Parsed simple0.json
[Request] seed=660049505072788704 batch=1
[Request] caption: An energetic and upbeat pop-rock track driven by a tight rhy...
[Request] lyrics: 1429 bytes
[Request] bpm=86 dur=210 key=G major ts=4 lang=fr
[Request] lm: temp=0.85 cfg=2.0 top_p=0.90 top_k=0
[Request] dit: steps=8 guidance=1.0 shift=3.0
[Request] audio_codes: (present)
[Request 1/1] simple0.json (batch=1)
[Pipeline] 1 audio codes (0.2s @ 5Hz)
[Pipeline] T=6, S=3
[Pipeline] seed=660049505072788704, steps=8, guidance=1.0, shift=3.0, duration=210.0s
[Pipeline] caption: 169 tokens, lyrics: 495 tokens
[Encode] TextEncoder (169 tokens): 63.6 ms
[Encode] Lyric vocab lookup (495 tokens): 1.6 ms
[CondEnc] Lyric sliding mask: 495x495, window=128
[CondEnc] Timbre sliding mask: 750x750, window=128
[Encode] Packed: lyric=495 + timbre=1 + text=169 = 665 tokens
[Encode] ConditionEncoder: 27.6 ms, enc_S=665
[Context] Decoded: 1 codes -> 5 frames (0.2s @ 25Hz)
[Context] Detokenizer: 34.0 ms
[Context Batch0] Philox noise seed=660049505072788704, [6, 64]
[DiT] Starting: T=6, S=3, enc_S=665, steps=8, batch=1
[DiT] Batch N=1, T=6, S=3, enc_S=665
[DiT] Graph: 1841 nodes
[DiT] Step 1/8 t=1.000
[DiT] Step 2/8 t=0.955
[DiT] Step 3/8 t=0.900
[DiT] Step 4/8 t=0.833
[DiT] Step 5/8 t=0.750
[DiT] Step 6/8 t=0.643
[DiT] Step 7/8 t=0.500
[DiT] Step 8/8 t=0.300
[DiT] Total generation: 136.8 ms (136.8 ms/sample)
[VAE] Graph: 335 nodes, T_latent=6
[VAE] Decoded: T_latent=6 -> T_audio=11520 (0.24s @ 48kHz)
[VAE Batch0] Decode: 756.8 ms
[MP3] Encoding 0.2s @ 128 kbps, 48000 Hz stereo
[MP3] 3840 bytes (12.0:1), 53 ms (4.49x realtime), 1 threads
[MP3] Wrote simple00.mp3
[Request 1/1] Done
[Pipeline] All done

This happens regardless of the prompt or model quantization. Here's the output from this run:
simple00.mp3
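
As a sanity check, the sizes reported in the ace-synth log are consistent with each other: the short file seems to follow directly from the single audio code in the request, not from a fault in the MP3 encoder. The Python sketch below redoes that arithmetic. The rates and the 1920x upsample factor are copied from the log lines; the function names are made up for illustration and are not part of the project, and the byte estimate assumes a constant-bitrate MP3 with no extra header or padding.

```python
# Constants copied from the ace-synth log in this report.
CODE_RATE_HZ = 5          # "[Pipeline] 1 audio codes (0.2s @ 5Hz)"
VAE_UPSAMPLE = 1920       # "[VAE] Loaded: 5 blocks, upsample=1920x"
SAMPLE_RATE_HZ = 48000    # "[MP3] Encoding 0.2s @ 128 kbps, 48000 Hz stereo"
MP3_BITRATE_BPS = 128000  # 128 kbps, assumed constant bitrate


def codes_for_duration(seconds):
    """Number of 5 Hz audio codes needed to cover a given duration (hypothetical helper)."""
    return seconds * CODE_RATE_HZ


def expected_output(t_latent):
    """Return (samples, seconds, mp3_bytes) for a latent of length t_latent (hypothetical helper)."""
    samples = t_latent * VAE_UPSAMPLE
    seconds = samples / SAMPLE_RATE_HZ
    # Integer arithmetic: bits = seconds * bitrate, bytes = bits / 8.
    mp3_bytes = samples * MP3_BITRATE_BPS // (8 * SAMPLE_RATE_HZ)
    return samples, seconds, mp3_bytes


if __name__ == "__main__":
    # T_latent=6 is the value reported by "[VAE] Graph: 335 nodes, T_latent=6".
    print(expected_output(6))
    # duration=210 is the value reported by "[Request] bpm=86 dur=210 ...".
    print(codes_for_duration(210))
```

With T_latent=6 this gives 11520 samples, 0.24 s and 3840 bytes, matching the `[VAE] Decoded` and `[MP3]` lines. By the same rates, a 210 s track would need about 1050 codes, whereas the ace-lm log shows `[Batch 0] ... 1 codes` after a 16 ms Phase2 decode.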
