
Seeking Fine-tuning Guidance and Experience Sharing #1211

@RRiiiccckkk

Description


Checks

  • This template is only for usage issues encountered.
  • I have thoroughly reviewed the project documentation but couldn't find information to solve my problem.
  • I have searched for existing issues, including closed ones, and couldn't find a solution.
  • I am using English to submit this issue to facilitate community communication.

Environment Details

Hi F5-TTS team and community,

I'm currently working on fine-tuning the F5-TTS model on a custom dataset (specifically, a Cantonese dialect dataset). I've followed the provided documentation and successfully completed the training process using the finetune_cli.py script.

My training setup is as follows:

  • Base Model: F5TTS_v1_Base/model_1250000.safetensors
  • Dataset: Cantonese dialect (details on size, hours, etc. can be added if relevant)
  • Training Parameters: --epochs 1, --dataset_name cantonese, --tokenizer_path ./data/cantonese/vocab.txt, --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors
  • Training Environment: Google Colab with a GPU (NVIDIA A100-SXM4-80GB)
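One dataset-prep sanity check worth doing before a run like the one above: confirm that every character appearing in the training transcripts is present in the vocab.txt passed via --tokenizer_path, since out-of-vocabulary Cantonese characters would otherwise be unlearnable. This is a generic sketch, not F5-TTS's own tooling, and the transcript file name below is hypothetical:

```python
# Sketch: check character coverage of a custom vocab against transcripts.
# File paths/layout are hypothetical; adapt to your dataset format.

def missing_chars(transcripts, vocab):
    """Return the set of characters used in transcripts but absent from vocab."""
    vocab_set = set(vocab)  # vocab entries assumed to be one token per line
    used = set()
    for line in transcripts:
        used.update(line.strip())
    return used - vocab_set

if __name__ == "__main__":
    with open("./data/cantonese/vocab.txt", encoding="utf-8") as f:
        vocab = [ln.rstrip("\n") for ln in f if ln.rstrip("\n")]
    # Hypothetical "audio_path|transcript" metadata file:
    with open("./data/cantonese/metadata.txt", encoding="utf-8") as f:
        transcripts = [ln.split("|", 1)[-1] for ln in f]
    gaps = missing_chars(transcripts, vocab)
    if gaps:
        print(f"{len(gaps)} characters missing from vocab, e.g. {sorted(gaps)[:20]}")
```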

The training completed with a reported loss of around 0.544.
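A single end-of-run loss of ~0.544 says little on its own; the trend of a smoothed loss curve over the run is usually more informative. A minimal sketch (not part of F5-TTS) of an exponential moving average over logged per-step losses:

```python
def ema(values, alpha=0.1):
    """Exponential moving average of a sequence of per-step loss values."""
    smoothed = []
    avg = None
    for v in values:
        avg = v if avg is None else alpha * v + (1 - alpha) * avg
        smoothed.append(avg)
    return smoothed

# A smoothed curve still trending downward suggests more training could help;
# a flat curve suggests more epochs alone will not.
```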

I'm now moving on to the inference and evaluation steps, and I would greatly appreciate it if anyone could share their experiences and insights on fine-tuning this model, especially for other languages or dialects.

Specifically, I'm interested in:

  • Best practices for preparing custom datasets for fine-tuning.
  • Expected loss values or metrics to look for during training to indicate successful fine-tuning.
  • Tips for optimizing hyperparameters for better performance on a specific dialect.
  • Any common issues encountered during fine-tuning and how to resolve them.
  • Experiences with the objective evaluation metrics and how they correlate with perceived audio quality.
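On the last point: for Cantonese, a common objective proxy for intelligibility is the character error rate (CER) between the input text and an ASR transcript of the generated audio, typically reported alongside a speaker-similarity score. Below is a generic, self-contained CER sketch for comparing two strings; it is not the repo's evaluation implementation:

```python
def cer(ref, hyp):
    """Character error rate: Levenshtein distance / reference length."""
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)
```

Lower is better; in practice a low CER does not guarantee natural prosody, so it is worth pairing with informal listening tests.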

I'm eager to learn from the community's experience to improve my results and contribute back if possible.

Thank you for your time and support!

Best regards

Steps to Reproduce

  1. Cloned the F5-TTS repository.
  2. Mounted Google Drive to access the project directory.
  3. Changed the current directory to /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS.
  4. Installed dependencies using pip install -r requirement.txt.
  5. Prepared the Cantonese dialect dataset using data_prepare_text.py with input directory ./data/dialect_data/ and output directory ./data/cantonese/.
  6. Downloaded the pretrained model F5TTS_v1_Base/model_1250000.safetensors using huggingface_hub.
  7. Ran the fine-tuning script with the following command: !CUDA_VISIBLE_DEVICES=0 python ./src/f5_tts/train/finetune_cli.py --epochs 1 --dataset_name cantonese --tokenizer_path ./data/cantonese/vocab.txt --pretrain /content/drive/MyDrive/AIAA2205-assignment2-F5-TTS/ckpts/cantonese/pretrained_model_1250000.safetensors

✔️ Expected Behavior

Based on the training completion, I expect that the fine-tuned model should be able to generate speech in the Cantonese dialect using the provided reference audio and text. The generated speech should ideally sound natural and reflect the characteristics of the Cantonese voice from the reference audio.

❌ Actual Behavior

No response

Metadata

Assignees: no one assigned
Labels: help wanted (Extra attention is needed)
Projects: none
Milestone: none