I'm trying to build my own "singing" database by recording French vowels: a, e, o, u, y, 2, 9, O, @, E.
Should I record each vowel on its own at natural speech length (for example, an "a" lasting 1.2 seconds, i.e. about 19200 samples at 16 kHz), or should I sustain the sound for longer?
My goal is to generate a singing database, so notes can be longer than 1 second.
I have the same question about the silence before and after a phoneme (a._ or _.a). Which would be the best choice: 1 second (16000 samples at 16 kHz), more, or less, and why?
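
For reference, the sample counts I quote are just duration multiplied by sampling rate. A minimal sketch of that arithmetic, assuming a fixed 16 kHz rate (`seconds_to_samples` is a hypothetical helper name of mine, not part of any toolkit):

```python
SAMPLE_RATE_HZ = 16_000  # assumption: all recordings are made at 16 kHz


def seconds_to_samples(duration_s: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Return the number of samples in a recording of the given duration."""
    # round() guards against floating-point error in the product
    return round(duration_s * sample_rate_hz)


vowel_samples = seconds_to_samples(1.2)    # the "a" example above
silence_samples = seconds_to_samples(1.0)  # 1 second of leading/trailing silence
print(vowel_samples, silence_samples)
```

The same helper could be used to check whatever vowel and silence lengths end up being recommended.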