A TTS (Text-To-Speech) model for study and research. This repository is mainly based on ming024/FastSpeech2, with some code modified and added. We use the AI-HUB: Multi-Speaker-Speech dataset, the MLS (Multilingual LibriSpeech) dataset, and LJSpeech for training.
- AI-HUB: Multi-Speaker-Speech
  - Language: Korean 🇰🇷 | Sample rate: 48 kHz
- MLS (Multilingual LibriSpeech)
  - Language: German 🇩🇪 | Sample rate: 16 kHz
- LJSpeech
  - Language: English 🇺🇸 | Sample rate: 22.05 kHz
We trained the FastSpeech2 model on the following languages, embedding each language's phoneme set. We used the Montreal Forced Aligner (MFA) tool to obtain the alignments between the utterances and the phoneme sequences, as described in the paper. As you can see, we embedded IPA phoneme sets; the per-language sets are listed below, followed by a small embedding sketch.
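For reference, a typical MFA alignment invocation looks like the following. This is a hedged example rather than our exact commands; `korean_mfa` names MFA's pretrained Korean IPA dictionary and acoustic model, and the corpus/output paths are placeholders.

```
mfa model download acoustic korean_mfa
mfa model download dictionary korean_mfa
mfa align ./corpus korean_mfa korean_mfa ./aligned_textgrids
```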
🇰🇷 Korean
b d dʑ e eː h i iː j k kʰ k̚ k͈ m n o oː p pʰ p̚ p͈ s sʰ s͈ t tɕ tɕʰ tɕ͈ tʰ t̚ t͈ u uː w x ç ŋ ɐ ɕʰ ɕ͈ ɛ ɛː ɡ ɣ ɥ ɦ ɨ ɨː ɭ ɰ ɲ ɸ ɾ ʌ ʌː ʎ ʝ β
🇩🇪 German
a aj aw aː b c cʰ d eː f h iː j k kʰ l l̩ m m̩ n n̩ oː p pf pʰ s t ts tʃ tʰ uː v x yː z ç øː ŋ œ ɐ ɔ ɔʏ ə ɛ ɟ ɡ ɪ ɲ ʁ ʃ ʊ ʏ
🇺🇸 English(US)
a aj aw aː b bʲ c cʰ cʷ d dʒ dʲ d̪ e ej f fʲ fʷ h i iː j k kp kʰ kʷ l m mʲ m̩ n n̩ o ow p pʰ pʲ pʷ s t tʃ tʰ tʲ tʷ t̪ u uː v vʲ vʷ w z æ ç ð ŋ ɐ ɑ ɑː ɒ ɒː ɔ ɔj ə əw ɚ ɛ ɛː ɜ ɜː ɝ ɟ ɟʷ ɡ ɡb ɡʷ ɪ ɫ ɫ̩ ɲ ɹ ɾ ɾʲ ɾ̃ ʃ ʉ ʉː ʊ ʎ ʒ ʔ θ
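To make the embedding step concrete, here is a minimal PyTorch sketch (assumed, not the repo's exact code) of how such an IPA phoneme set can be mapped to IDs and fed to a trainable embedding table; the phoneme subset and dimensions are illustrative.

```python
import torch
import torch.nn as nn

# Illustrative subset of the Korean IPA phoneme set listed above.
PHONEMES = ["b", "d", "dʑ", "e", "eː", "h", "i", "iː", "j", "k"]
phoneme_to_id = {p: i + 1 for i, p in enumerate(PHONEMES)}  # 0 reserved for padding

embedding = nn.Embedding(num_embeddings=len(PHONEMES) + 1,
                         embedding_dim=256, padding_idx=0)

ids = torch.tensor([[phoneme_to_id[p] for p in ["h", "i"]]])
print(embedding(ids).shape)  # torch.Size([1, 2, 256])
```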
If you want to see the training status, you can check it here. You can find the following things at the wandb link:
- Listen to the samples (label speech & predicted speech)
  - Available only in some experiments in 🇩🇪 German.
  - You can hear samples in the `Tables` section of the dashboard and in the `Hidden Panels` section at the bottom of each run's board.
  - Runs with samples available: `T4MR_4_x_summed_1800k_BS1`, `T4MR_6_x_summed_max_ ...`, `T4MR_10_rs_22k_msl_ ...`, `T4MR_15_hate_energy_ ...`, `T4MR_17_basic_but_bs64`.
  - We wanted to keep collecting samples during training in 🇰🇷 Korean, but couldn't (we had to save storage).
- Training / evaluation Mel-spectrograms (the Korean runs below differ mainly in the resampling / noise-reduction order; see the sketch after this list)
  - `T27_Hope_that_u_can_replace_that_with_sth_better`
    - FastSpeech2 + PostNet | 🇺🇸 English | Single Speaker
    - Batch size: 64 | Epochs: 800
  - `T25_END_Game`
    - FastSpeech2 + PostNet | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48 kHz to 22.05 kHz)
    - Batch size: 64 | Epochs: 600
  - `T24_Thank_you_Mobius`
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Non-stationary noise reduction -> resampled (from 48 kHz to 22.05 kHz)
    - Batch size: 64 | Epochs: 600
  - `T23_You_Just_Chosse_ur_Burden`
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48 kHz to 22.05 kHz) -> non-stationary noise reduction
    - Batch size: 64 | Epochs: 600
  - `T22_Theres_No_comfort`
    - FastSpeech2 | 🇰🇷 Korean | Single Speaker: 8505
    - Resampled (from 48 kHz to 22.05 kHz)
    - Batch size: 64 | Epochs: 600
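A minimal sketch (assuming `librosa`; not the repo's exact preprocessing code) of the 48 kHz to 22.05 kHz resampling step used in the Korean runs above. Whether non-stationary noise reduction runs before (T24) or after (T23) this step is exactly what those experiments compare.

```python
import librosa

# Hypothetical input file from speaker 8505, recorded at 48 kHz.
wav, sr = librosa.load("speaker_8505_sample.wav", sr=48000)
wav_22k = librosa.resample(wav, orig_sr=sr, target_sr=22050)
```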
- 🤗 `accelerate` makes `multi-gpu` training easy: we trained on 2 x NVIDIA GeForce RTX 4090 GPUs.
  - `torchmalloc.py` and 🌈 `colorama` can show your resource usage in real time during training.
- 🔇 `noisereduce` is applied when you run `preprocessor.py` (see the sketch after this list).
  - Non-stationary noise reduction.
  - `prop_decrease` can avoid data distortion (0.0 ~ 1.0).
- `wandb` is used instead of `Tensorboard`; `wandb` is compatible with 🤗 `accelerate` and with 🔥 `pytorch`.
- 🔥 [Pytorch-Hub] NVIDIA/HiFi-GAN is used as a vocoder.
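The noise-reduction step itself can be sketched as follows, using the real `noisereduce` API (the file name is hypothetical): `stationary=False` selects non-stationary noise reduction, and a `prop_decrease` below 1.0 removes noise only partially to limit distortion of the speech signal.

```python
import librosa
import noisereduce as nr

wav, sr = librosa.load("speaker_8505_sample.wav", sr=22050)  # hypothetical file
# stationary=False -> non-stationary noise reduction;
# prop_decrease (0.0 ~ 1.0) controls how aggressively noise is removed.
cleaned = nr.reduce_noise(y=wav, sr=sr, stationary=False, prop_decrease=0.8)
```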
Running `preprocess.py` gives you the pitch, energy, duration, and phonemes from the TextGrid files.
```
python preprocess.py config/LibriTTS/preprocess.yaml
```
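As a rough illustration of the duration extraction, here is a minimal sketch (assumed, not the repo's exact code) that reads phonemes and frame-level durations from an MFA TextGrid with the `tgt` library; the 22.05 kHz sample rate and 256-sample hop length follow a typical FastSpeech2 config.

```python
import tgt

SAMPLE_RATE = 22050
HOP_LENGTH = 256

def get_phones_and_durations(textgrid_path):
    textgrid = tgt.io.read_textgrid(textgrid_path)
    tier = textgrid.get_tier_by_name("phones")
    phones, durations = [], []
    for interval in tier._objects:
        phones.append(interval.text)
        # Convert interval boundaries from seconds to Mel-spectrogram frames.
        start = int(round(interval.start_time * SAMPLE_RATE / HOP_LENGTH))
        end = int(round(interval.end_time * SAMPLE_RATE / HOP_LENGTH))
        durations.append(end - start)
    return phones, durations
```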
First, you should log in to wandb with your token key in the CLI.
```
wandb login --relogin '##### Token Key #######'
```
Next, you can set up your training environment with the following command.
```
accelerate config
```
With this command, you can start training.
```
accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
```
You can also pin training to specific GPUs with `CUDA_VISIBLE_DEVICES`:
```
CUDA_VISIBLE_DEVICES=0,3 accelerate launch train.py --n_epochs 990 --save_epochs 50 --synthesis_logging_epochs 30 --try_name T4_MoRrgetda
```
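Under the hood, the multi-GPU launch works because the training script follows the standard 🤗 `accelerate` pattern. Here is a condensed, runnable toy sketch (not `train.py` verbatim; the linear model and random data are placeholders) showing the parts that matter:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()

model = torch.nn.Linear(80, 80)  # placeholder for FastSpeech2
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(TensorDataset(torch.randn(256, 80), torch.randn(256, 80)),
                    batch_size=64)

# accelerate moves everything to the right device and shards the loader
# across the processes spawned by `accelerate launch`.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

for x, y in loader:
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```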
You can synthesize speech in the CLI with this command:
```
python synthesize.py --raw_texts <text to synthesize to speech> --restore_step 53100
```
You can also check this Jupyter notebook when you try to synthesize.


