Description
Checks
- This template is only for usage issues encountered.
- I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
- I have searched for existing issues, including closed ones, and couldn't find a solution.
- I am using English to submit this issue to facilitate community communication.
Environment Details
Hi F5-TTS team and community,
I'm currently working on fine-tuning the F5-TTS model on a custom dataset (specifically, a Cantonese dialect dataset). I've followed the provided documentation and successfully completed the training process using the finetune_cli.py script.
My training setup is as follows:
- Base Model: `F5TTS_v1_Base/model_1250000.safetensors`
- Dataset: Cantonese dialect (details on size, hours, etc. can be added if relevant)
- Training Parameters: `--epochs 1 --dataset_name cantonese --tokenizer_path ./data/cantonese/vocab.txt --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors`
- Training Environment: Google Colab with an NVIDIA A100-SXM4-80GB GPU
The training completed with a reported loss of around 0.544.
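Since a single end-of-run number like 0.544 is noisy and hard to interpret on its own, I've found it more informative to look at a smoothed loss curve. A minimal sketch of the smoothing I mean (plain Python, no F5-TTS dependency; the loss values below are made up for illustration):

```python
def ema(values, alpha=0.3):
    """Exponential moving average to smooth a noisy training-loss curve."""
    smoothed = []
    avg = values[0]
    for v in values:
        avg = alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

# Illustrative (fabricated) per-step losses from a fine-tuning run.
losses = [1.2, 0.9, 1.1, 0.8, 0.7, 0.75, 0.6, 0.58, 0.55, 0.544]
curve = ema(losses)
print(f"raw final: {losses[-1]:.3f}, smoothed final: {curve[-1]:.3f}")
```

The smoothed final value is what I'd compare across runs, rather than the last raw step.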
I'm now moving on to the inference and evaluation steps, but I would greatly appreciate it if anyone could share their experiences and insights on fine-tuning this model, especially for different languages or dialects.
Specifically, I'm interested in:
- Best practices for preparing custom datasets for fine-tuning.
- Expected loss values or metrics to look for during training to indicate successful fine-tuning.
- Tips for optimizing hyperparameters for better performance on a specific dialect.
- Any common issues encountered during fine-tuning and how to resolve them.
- Experiences with the objective evaluation metrics and how they correlate with perceived audio quality.
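On the dataset-preparation point, one concrete check I can share: with a character-level tokenizer, any Cantonese character missing from `vocab.txt` cannot be represented, so it seems worth verifying coverage before training. A minimal sketch, assuming the vocab file lists one token per line (that matches my understanding of the format, but please correct me if not; the demo file and transcripts are hypothetical):

```python
from pathlib import Path

def missing_chars(transcripts, vocab_path):
    """Return characters that appear in the transcripts but not in the vocab file."""
    vocab = set(Path(vocab_path).read_text(encoding="utf-8").splitlines())
    seen = set()
    for line in transcripts:
        seen.update(line.strip())
    return sorted(seen - vocab)

# Hypothetical usage with a toy vocab file and in-memory transcripts:
Path("vocab_demo.txt").write_text("\n".join(["你", "好", "嗎"]), encoding="utf-8")
print(missing_chars(["你好", "食咗飯未"], "vocab_demo.txt"))
```

Running something like this over my metadata before training caught a handful of Cantonese-specific characters I would otherwise have missed.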
I'm eager to learn from the community's experience to improve my results and contribute back if possible.
Thank you for your time and support!
Best regards,
Steps to Reproduce
- Cloned the F5-TTS repository.
- Mounted Google Drive to access the project directory.
- Changed the current directory to `/content/drive/MyDrive/AIAA2205-assignment2-F5-TTS`.
- Installed dependencies using `pip install -r requirement.txt`.
- Prepared the Cantonese dialect dataset using `data_prepare_text.py` with input directory `./data/dialect_data/` and output directory `./data/cantonese/`.
- Downloaded the pretrained model `F5TTS_v1_Base/model_1250000.safetensors` using `huggingface_hub`.
- Ran the fine-tuning script with the following command:
!CUDA_VISIBLE_DEVICES=0 python ./src/f5_tts/train/finetune_cli.py --epochs 1 --dataset_name cantonese --tokenizer_path ./data/cantonese/vocab.txt --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors
✔️ Expected Behavior
Based on the training completion, I expect that the fine-tuned model should be able to generate speech in the Cantonese dialect using the provided reference audio and text. The generated speech should ideally sound natural and reflect the characteristics of the Cantonese voice from the reference audio.
❌ Actual Behavior
No response