oh suno, why you gotta go and make things so complicated. just tell us how the stuff works pls. more info about the model, how it was trained, which text encoder and how, etc. this will make your (advanced) users so much happier and more productive.
my hypothesis:
ps: complete conjecture, no connection to suno and absolutely not an expert on this matter, just a guy who spends too much time on generative models of all kinds (with quite a bit of programming and audio experience)
-
suno has a fast but lower quality decoder that it uses to convert the latent into an mp3 stream. it's less GPU intensive and enables the extraordinarily fast streaming feature, where playback starts even before the song is done generating
-
when the song is fully generated, suno saves the full latent of the generated song to disk. this, in theory, is a tiny amount of data compared to the decoded audio.
-
suno has a higher quality, GPU intensive decoder that can't be used for streaming. when you trigger the "download WAV" command, it loads the previously saved latent, decodes it with this slower decoder, and creates the file
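
to make the three steps above concrete, here's a toy python sketch of the flow i'm imagining. everything in it is made up for illustration: the function names, the latent shape, the json file, and both "decoders" are stand-ins that just do arithmetic on lists. it says nothing about what suno actually runs, it only shows the shape of the idea (cheap decode while generating, save latent, expensive decode on demand).

```python
import json
import os
import tempfile

# NOTE: everything here is hypothetical. the names, the latent layout and the
# "decoders" are toy stand-ins, not anything suno has documented.


def generate_latent(num_frames, dim=4):
    """pretend generator: yields one latent frame (a small list of floats) at a time."""
    for t in range(num_frames):
        yield [((t * 31 + d * 7) % 100) / 100.0 for d in range(dim)]


def fast_decode(frame):
    """cheap decoder: one coarse sample per latent frame (stand-in for the mp3 stream)."""
    return [round(sum(frame) / len(frame), 2)]


def hq_decode(latent, upsample=8):
    """expensive decoder: needs the whole latent and emits `upsample` samples per frame."""
    samples = []
    for i, frame in enumerate(latent):
        nxt = latent[min(i + 1, len(latent) - 1)]
        a = sum(frame) / len(frame)
        b = sum(nxt) / len(nxt)
        # interpolate toward the next frame, which is why it can't run frame by frame
        for k in range(upsample):
            samples.append(a + (b - a) * k / upsample)
    return samples


def generate_and_stream(num_frames, latent_path):
    """steps 1 and 2: decode cheaply while generating, then save the full latent to disk."""
    latent, streamed = [], []
    for frame in generate_latent(num_frames):
        latent.append(frame)
        streamed.extend(fast_decode(frame))  # available before generation finishes
    with open(latent_path, "w") as f:
        json.dump(latent, f)
    return streamed


def download_wav(latent_path):
    """step 3: reload the saved latent later and run the expensive decoder on it."""
    with open(latent_path) as f:
        latent = json.load(f)
    return hq_decode(latent)


if __name__ == "__main__":
    path = os.path.join(tempfile.gettempdir(), "song_latent.json")
    preview = generate_and_stream(10, path)
    wav = download_wav(path)
    print(len(preview), "preview samples,", len(wav), "hq samples")
```

the point of the sketch is that the download path never re-runs generation: it only touches the saved latent, so the two outputs come from the same song but from different decoders.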
