I found there aren't many YouTube videos showing this, so I would like to contribute something on a topic I feel is overlooked. People do seem to use the KSampler method, and that works, but I do not see people using the Ace-Step nodes. That could be because getting them to work requires some folders and files that are not included in the ComfyUI install for Ace-Step.
UPDATE, I made a video covering this:
https://www.youtube.com/watch?v=383OV2ZJ5pc
I got great results doing Text2Audio using the Ace-Step nodes in ComfyUI; it seems more consistent than using KSampler. Attached are pictures of the workflow. Note that you need to create some folders and download some files to get it working. Make sure you have these folders in the ComfyUI\models\TTS\ACE-Step-v1-3.5B\ folder:
ace_step_transformer
Loras (without this folder the loader will not work)
music_dcae_f8c8
music_vocoder
umt5-base
Then go to https://huggingface.co/ACE-Step/ACE-Step-v1-3.5B/tree/main, download both the config files and the safetensors files, and put them into the corresponding folders (do not rename the files; the loader needs them named exactly as they are).
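If you want to check the folder layout before loading the workflow, here is a minimal Python sketch. The folder names come from the list above; the function name ensure_model_folders is my own, and treating an empty Loras folder as acceptable is an assumption on my part, not something the nodes document.

```python
from pathlib import Path

# Subfolder names taken from the list above.
REQUIRED_SUBFOLDERS = [
    "ace_step_transformer",
    "Loras",
    "music_dcae_f8c8",
    "music_vocoder",
    "umt5-base",
]


def ensure_model_folders(model_root: Path) -> list[str]:
    """Create any missing subfolders and return the names of those still empty."""
    empty = []
    for name in REQUIRED_SUBFOLDERS:
        folder = model_root / name
        folder.mkdir(parents=True, exist_ok=True)
        # Assumption: Loras only has to exist, the others need downloaded files.
        if name != "Loras" and not any(folder.iterdir()):
            empty.append(name)
    return empty
```

Call it with the path to your own ACE-Step-v1-3.5B folder; every name it returns is a folder that still needs its config and safetensors files from the Hugging Face page.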
You should then be ready to make the following workflow (the added Yvann nodes and seed and normalize nodes are not required, but are there for added processing and convenience).
Guidance scaling seems to control how strongly the output responds to the prompt. Guidance scale text seems to control the prompt, and the lyrics one the vocals/lyrics. In the picture they are 0 (default), but when you set text to 0.8 or so, you'll find the prompt is taken into account.
Lastly, I would like to thank the people working on these Ace nodes and models for their hard work. I am truly impressed, and I get way better results than with Stable Audio 1.0. I'm really curious to see what will come out of this; I think it could do for audio what WAN and QWEN did in going beyond SD (at least I feel they did).
Some useful tips for prompting:
Just like with images, it really helps to specify what is generated and how it should sound. A little knowledge of (or research on) composition and the types of instruments and effects used in music will really improve the results by making your prompts more accurate.
Ace - Generated Horizons.mp3
Important note: not all seeds will give good results, but I found it to be very consistent, getting at least 1 out of 5 takes that I thought were great. Sometimes it seems to mess up and use the negative prompt as part of the positive, but the negative prompt is useful and shouldn't be ignored. Use the negative prompt for things you don't want, such as noise, white noise, bad quality, or specific kinds of instruments. You can also specify no reverb on bass, or not too much reverb on drums; it works really well at making the mix tighter with less bleed.
For positives, just saying it should be 'mixed in a professional studio' will change the result to match, as will saying it should be mixed like a genre in a specific era. Saying a high-pass filter should cut everything below 40 Hz will clean up the lows. Specifying music styles is important; also specify each instrument, its panning and how it should sound. A lot of the time electric guitar is mixed with one guitar channel panned to the left and one panned to the right, and specifying that really does open up the mix. I also found it knows what a P-bass is, which is very useful. In this way you can specify the gear used, for example applying a noise gate pedal. Specifying BPM is also useful if you want to avoid strange timing changes.
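To keep prompts organized while experimenting, something like the Python sketch below can help. It only joins descriptors with commas; build_prompt is a hypothetical helper of mine, the genre and tempo are example values, and I am assuming a comma-separated prompt is a sensible format rather than something the nodes require.

```python
def build_prompt(parts: list[str]) -> str:
    """Join non-empty descriptors into one comma-separated prompt string."""
    return ", ".join(p.strip() for p in parts if p.strip())


# Positive prompt: style, tempo, instruments with panning, gear, mix notes.
positive = build_prompt([
    "rock",      # example music style
    "120 bpm",   # example tempo, to avoid strange timing changes
    "electric guitar with one channel panned left and one panned right",
    "p-bass",
    "noise gate pedal on the guitar",
    "high-pass filter cutting everything below 40 Hz",
    "mixed in a professional studio",
])

# Negative prompt: things you do not want in the result.
negative = build_prompt(["noise", "white noise", "bad quality"])

print(positive)
print(negative)
```

Paste the two printed strings into the positive and negative prompt fields, then change one descriptor at a time so you can hear what each one does.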
Extending vowels in lyrics is easy: just type the letter multiple times, e.g.: I waaaaant to be long. :) If it doesn't pick up on it, or uses a weird pronunciation, use phonetic spelling instead.
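If you write lyrics in a script, the vowel trick can be automated. This is just a small Python sketch; stretch_vowel is a made-up helper, and it only handles the first plain a/e/i/o/u in a word, which is a simplification.

```python
def stretch_vowel(word: str, repeats: int = 5) -> str:
    """Replace the first vowel of `word` with `repeats` copies of itself."""
    for i, ch in enumerate(word):
        if ch.lower() in "aeiou":
            return word[:i] + ch * repeats + word[i + 1:]
    return word  # no vowel found, leave the word unchanged


line = f"I {stretch_vowel('want')} to be long"
print(line)
```

Whether the model actually holds the note still depends on the seed, so treat this as a convenience for typing and not a guarantee.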
Prompting goes a long way. It will not always work on the first try, or how you expect, but if you want something, you can ask for it. And if it doesn't work, use more general language. It even understands not having too much sizzle in the overheads.
Ace - Generated horizons 2.mp3