
Raon-OpenTTS

Open Models and Data for Robust Text-to-Speech


Technical Report (Coming soon) | Raon-OpenTTS-1B (Coming soon)

Highlights

  • Fully open: both model weights and training data are publicly available.
  • Large-scale training: 510.1K hours of quality-filtered speech (Raon-OpenTTS-Core), drawn from a 615K-hour open pool (Raon-OpenTTS-Pool) comprising 11 English datasets.
  • Competitive with closed-data SOTA: matches or outperforms MaskGCT, VoxCPM, CosyVoice 3, and Qwen3-TTS on standard benchmarks while being the first system that is simultaneously open-weight and open-data at this scale.
  • Two model sizes: 0.3B and 1B parameters, both based on the F5-TTS DiT architecture.

Model Zoo

Model             | Params | Architecture                                  | Training Data                  | Download
Raon-OpenTTS-0.3B | 336M   | DiT (dim=1024, depth=22, heads=16, ff_mult=2) | Raon-OpenTTS-Core (510.1K hrs) | HuggingFace
Raon-OpenTTS-1B   | 1048M  | DiT (dim=1408, depth=28, heads=24, ff_mult=4) | Raon-OpenTTS-Core (510.1K hrs) | TBD

Both models use character-level tokenization (vocab size 5,512) with text_dim=512, and are trained on 80-channel log mel-spectrograms at 16 kHz (hop=256). A pretrained HiFi-GAN vocoder (16 kHz, LibriTTS) is used for waveform synthesis.
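At 16 kHz with hop 256, the mel representation runs at 62.5 frames per second. A minimal sketch of that arithmetic (the helper below is illustrative, not part of the codebase):

```python
# Sketch: mel-frame arithmetic for the 16 kHz / hop=256 setup described above.
# This helper is illustrative and not part of the Raon-OpenTTS codebase.

SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 256       # samples per mel frame
N_MELS = 80            # mel channels

FRAMES_PER_SECOND = SAMPLE_RATE / HOP_LENGTH  # 62.5

def mel_frames_for_duration(seconds: float) -> int:
    """Number of mel-spectrogram frames covering `seconds` of audio."""
    return int(seconds * FRAMES_PER_SECOND)

# A 10-second clip yields an (80, 625) mel spectrogram.
print(mel_frames_for_duration(10.0))  # 625
```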

Benchmark Results

Seed-TTS-Eval (English)

WER measured via Whisper-large-v3; speaker similarity (SIM) via WavLM-large.

TBD

CV3-Eval

SIM measured via ERes2Net.

TBD

Raon-OpenTTS-Eval

Covers 4 acoustic regimes (Clean, Noisy, Wild, Expressive) across 12 datasets with 6K prompt-text pairs.

TBD

Installation

git clone https://github.com/krafton-ai/Raon-OpenTTS.git
cd Raon-OpenTTS
pip install -e .

# With evaluation dependencies (WER, SIM, DNSMOS)
pip install -e ".[eval]"

Vocoder

We use a HiFi-GAN vocoder fine-tuned on LibriTTS at 16 kHz (originally from speechbrain/tts-hifigan-libritts-16kHz). Our standalone loader requires no speechbrain dependency.

mkdir -p pretrained_models
huggingface-cli download speechbrain/tts-hifigan-libritts-16kHz generator.ckpt --local-dir pretrained_models

Quick Start: Inference

python -m f5_tts.infer.infer_cli \
    --config src/f5_tts/configs/03b.yaml \
    --ckpt_dir checkpoints/Raon-OpenTTS-0.3B \
    --ckpt_name model_last.pt \
    --lst_path data/librispeech_pc_test_clean_cross_sentence.lst \
    --audio_root data/librispeech/test-clean \
    --output_dir output/inference

VAD-based Duration Estimation

The inference pipeline uses VAD-trimmed reference length for generation-length estimation, while the original (non-trimmed) audio is used as the conditioning signal. A dynamic silence threshold adapts to the speaker's volume, and a minimum speech rate (12 chars/sec) is enforced to prevent excessively long generations.

from f5_tts.infer.utils_infer import infer_process

audio, sr, _ = infer_process(ref_audio, ref_text, gen_text, model, vocoder)
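The length heuristic above can be sketched as follows. The function and the exact scaling formula are illustrative assumptions; only the 12 chars/sec minimum speech rate is stated by the pipeline:

```python
# Sketch of the VAD-based generation-length heuristic described above.
# `estimate_gen_seconds` is a hypothetical helper; the actual formula in
# f5_tts may differ. Only the 12 chars/sec floor comes from the docs.

MIN_CHARS_PER_SEC = 12.0  # enforced minimum speech rate

def estimate_gen_seconds(trimmed_ref_seconds: float,
                         ref_text: str,
                         gen_text: str) -> float:
    # Scale the VAD-trimmed reference duration by the text-length ratio...
    proportional = trimmed_ref_seconds * len(gen_text) / max(len(ref_text), 1)
    # ...but never allow a rate slower than 12 characters per second,
    # which caps excessively long generations.
    cap = len(gen_text) / MIN_CHARS_PER_SEC
    return min(proportional, cap)
```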

Training

Both models are trained from the Raon-OpenTTS-Pool HuggingFace dataset using the core split (quality-filtered).

Launch training

# 0.3B model (1 node x 8 GPUs)
accelerate launch --multi_gpu --mixed_precision bf16 \
    --num_processes 8 --num_machines 1 \
    -m f5_tts.train.train --config-name=03b

# 1B model (1 node x 8 GPUs)
accelerate launch --multi_gpu --mixed_precision bf16 \
    --num_processes 8 --num_machines 1 \
    -m f5_tts.train.train --config-name=1b

Adapting to different hardware

If you use a different number of GPUs or batch size, recompute total_updates_per_epoch with a dry run:

# Run one step and check log output for "TOTAL UPDATES <N>"
accelerate launch --multi_gpu --mixed_precision bf16 \
    --num_processes <num_gpus> \
    -m f5_tts.train.train --config-name=03b
# Then set: total_updates_per_epoch = TOTAL_UPDATES / (epochs x num_gpus)
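The arithmetic in the comment above, as a minimal sketch (assuming the logged "TOTAL UPDATES <N>" counts optimizer steps over the full run):

```python
# Sketch of the total_updates_per_epoch recomputation described above.
def total_updates_per_epoch(total_updates: int, epochs: int, num_gpus: int) -> float:
    """Per-epoch update count implied by a dry run's 'TOTAL UPDATES <N>' log line."""
    return total_updates / (epochs * num_gpus)

# e.g. a dry run logging TOTAL UPDATES 1_600_000, for 100 epochs on 8 GPUs:
print(total_updates_per_epoch(1_600_000, epochs=100, num_gpus=8))  # 2000.0
```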

Evaluation

We evaluate on 3 benchmarks measuring intelligibility (WER) and speaker similarity (SIM):

Benchmark          | Metrics                                   | Description
Seed-TTS-Eval (EN) | WER (Whisper-large-v3), SIM (WavLM-large) | Standard zero-shot TTS evaluation with cross-sentence prompts
CV3-Eval           | WER, SIM (ERes2Net), DNSMOS               | CV3-EN and CV3-Hard-EN subsets with diverse speakers
Raon-OpenTTS-Eval  | WER, SIM                                  | 4 acoustic regimes (Clean, Noisy, Wild, Expressive), 12 datasets, 6K prompt-text pairs

# Run evaluation across all benchmarks
bash src/f5_tts/eval/run_infer_eval.sh

Data

Raon-OpenTTS-Pool (615K hours, 11 English speech datasets) is publicly available on HuggingFace.

Raon-OpenTTS-Core (510.1K hours, 194.5M segments) is the quality-filtered subset used for training. It is obtained by applying a combined filter based on DNSMOS, WER, and VAD rank scores, removing the bottom 15% of Raon-OpenTTS-Pool. The core split in the HuggingFace dataset corresponds to Raon-OpenTTS-Core.
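A minimal sketch of a bottom-15% combined-rank filter, assuming per-segment DNSMOS, WER, and VAD scores are each rank-normalized and averaged (the exact combination used to build Raon-OpenTTS-Core may differ):

```python
# Sketch of the quality filter described above: rank-normalize DNSMOS, WER,
# and VAD scores per segment, average the ranks, and drop the bottom 15%.
# The precise combination used for Raon-OpenTTS-Core is an assumption here.

def rank_scores(values, higher_is_better=True):
    """Rank-normalize to [0, 1]; 1.0 = best segment."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i], reverse=higher_is_better)
    ranks = [0.0] * n
    for pos, i in enumerate(order):          # order[0] is the best index
        ranks[i] = 1.0 - pos / max(n - 1, 1)
    return ranks

def keep_mask(dnsmos, wer, vad, drop_fraction=0.15):
    """Boolean mask: True for segments surviving the bottom-15% cut."""
    combined = [
        (a + b + c) / 3
        for a, b, c in zip(rank_scores(dnsmos, True),   # higher DNSMOS = cleaner
                           rank_scores(wer, False),     # lower WER = better
                           rank_scores(vad, True))      # higher VAD score = better
    ]
    cutoff = sorted(combined)[int(drop_fraction * len(combined))]
    return [s >= cutoff for s in combined]
```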

Acknowledgement

This project is built upon F5-TTS by SWivid. We thank the authors for their excellent open-source work.

License

This project is licensed under Apache 2.0.

Citation

@article{raon2026opentts,
    title={Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech},
    author={TBD},
    year={2026},
    url={https://github.com/krafton-ai/Raon-OpenTTS}
}

About

Open-source text-to-speech model from KRAFTON trained exclusively on public speech data, with curated datasets and reproducible training support.
