OronTTS is a multi-speaker Text-to-Speech system for Mongolian Cyrillic (Khalkha), built on the VITS architecture.
- VITS Architecture: End-to-end TTS with variational inference and adversarial training
- Multi-speaker: Support for distinct male and female voices
- Mongolian Text Processing: Custom rule-based phonemizer for Cyrillic script
- Number Normalization: Mongolian number-to-word expansion for cardinals and ordinals
- Audio Denoising: DeepFilterNet integration for preprocessing non-professional recordings
- Hugging Face Integration: Dataset and model hub support
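The rule-based phonemizer can be pictured as a longest-match mapping from Cyrillic graphemes to phoneme symbols. The sketch below is illustrative only: the grapheme table and `phonemize` function are hypothetical stand-ins, not the project's actual Khalkha rules in `src/utils/`.

```python
# Minimal sketch of a rule-based Cyrillic-to-phoneme mapper.
# The mapping table is illustrative; the real phonemizer implements
# the project's full Khalkha rule set.
G2P = {
    "сайн": "sain",   # hypothetical whole-word exception, matched first
    "с": "s", "а": "a", "й": "j", "н": "n",
    "б": "p", "у": "u", "ы": "iː",
}

def phonemize(text: str) -> list[str]:
    """Greedy longest-match conversion of Cyrillic text to phoneme symbols."""
    phones, i = [], 0
    text = text.lower()
    while i < len(text):
        for length in range(min(8, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in G2P:
                phones.append(G2P[chunk])
                i += length
                break
        else:
            i += 1  # skip characters with no rule (spaces, punctuation)
    return phones
```

Longest-match ordering lets multi-character exceptions take priority over single-letter rules without a separate exception pass.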
```bash
# Create a virtual environment
python3.11 -m venv .venv
source .venv/bin/activate

# Install dependencies
pip install -e ".[dev]"
```

Project structure:

```text
oron-tts/
├── src/
│   ├── data/        # Dataset wrappers, denoising, preprocessing
│   ├── models/      # VITS architecture components
│   ├── training/    # Training loop, losses, checkpointing
│   └── utils/       # Audio processing, text normalization
├── scripts/
│   ├── prepare.py   # Dataset preparation
│   ├── train.py     # Model training
│   └── infer.py     # Inference/synthesis
└── configs/         # YAML configuration files
```
Clean and denoise audio from the Common Voice and MBSpeech datasets:

```bash
python scripts/prepare.py \
    --output-dir data/processed \
    --dataset all \
    --upload \
    --hf-repo btsee/oron-tts-dataset
```

For detailed RunPod setup instructions, see RUNPOD.md.
Quick start on RunPod:

```bash
# Run the setup script
wget https://raw.githubusercontent.com/btseee/oron-tts/main/runpod_setup.sh
chmod +x runpod_setup.sh
./runpod_setup.sh

# Start training
python scripts/train.py \
    --config configs/vits_runpod.yaml \
    --from-hf \
    --dataset btsee/mbspeech_mn \
    --push-to-hub \
    --hf-repo btsee/orontts
```

Local/custom training:
```bash
# Single GPU
python scripts/train.py \
    --config configs/vits_runpod.yaml \
    --from-hf \
    --dataset btsee/oron-tts-dataset \
    --push-to-hub \
    --hf-repo btsee/oron-tts-model

# Multi-GPU
python scripts/train.py \
    --config configs/vits_runpod.yaml \
    --from-hf \
    --dataset btsee/oron-tts-dataset \
    --num-gpus 4
```

Generate speech from text:
```bash
python scripts/infer.py \
    --checkpoint checkpoints/vits_best.pt \
    --text "Сайн байна уу" \
    --speaker 0 \
    --output output.wav
```

Speaker IDs:

- 0: Female voice
- 1: Male voice
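For batch synthesis, the `infer.py` CLI can be driven from Python. This sketch only assembles the command shown above for each speaker; nothing beyond those flags is assumed about the script's interface (the actual run is commented out):

```python
import subprocess

def synthesize(text: str, speaker: int, out_path: str,
               checkpoint: str = "checkpoints/vits_best.pt") -> list[str]:
    """Build (and optionally run) one scripts/infer.py invocation."""
    cmd = [
        "python", "scripts/infer.py",
        "--checkpoint", checkpoint,
        "--text", text,
        "--speaker", str(speaker),
        "--output", out_path,
    ]
    # subprocess.run(cmd, check=True)  # uncomment to actually synthesize
    return cmd

# Render the same sentence with both speakers.
for spk in (0, 1):
    synthesize("Сайн байна уу", spk, f"output_spk{spk}.wav")
```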
| Input | Output |
|---|---|
| 10 | арван |
| 25 | хорин тав |
| 100 | зуун |
| 1-р | нэгдүгээр |
| 2024 | хоёр мянга хорин дөрөв |
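The cardinal rows in the table can be reproduced with a small rule-based converter. This is an illustrative sketch covering 0–9999 with attributive tens forms, as the table uses; the project's normalizer in `src/utils/` also handles ordinals such as `1-р` and larger magnitudes:

```python
# Illustrative Mongolian cardinal speller (sketch, 0-9999 only).
UNITS = {0: "", 1: "нэг", 2: "хоёр", 3: "гурав", 4: "дөрөв", 5: "тав",
         6: "зургаа", 7: "долоо", 8: "найм", 9: "ес"}
TENS = {1: "арван", 2: "хорин", 3: "гучин", 4: "дөчин", 5: "тавин",
        6: "жаран", 7: "далан", 8: "наян", 9: "ерэн"}

def number_to_mn(n: int) -> str:
    """Spell a cardinal 0-9999 in Khalkha Mongolian words."""
    if n == 0:
        return "тэг"
    parts = []
    if n >= 1000:
        parts.append(UNITS[n // 1000] + " мянга")
        n %= 1000
    if n >= 100:
        h = n // 100
        parts.append("зуун" if h == 1 else UNITS[h] + " зуун")
        n %= 100
    if n >= 10:
        parts.append(TENS[n // 10])
        n %= 10
    if n:
        parts.append(UNITS[n])
    return " ".join(parts)
```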
Key hyperparameters in `configs/vits_base.yaml`:

```yaml
sample_rate: 22050
batch_size: 16
learning_rate: 0.0002
model:
  hidden_channels: 192
  n_layers: 6
  n_heads: 2
```

OronTTS supports two logging modes:
Local training (with tqdm):

```yaml
use_tqdm: true     # Progress bars for interactive training
log_interval: 100  # Log every 100 steps
```

RunPod/cloud training (structured logs):

```yaml
use_tqdm: false   # Disable tqdm for container logs
log_interval: 50  # More frequent logging
```

Container logs will look like:

```text
[2026-01-28 14:37:22] [INFO] Starting Epoch 1
[2026-01-28 14:37:24] [INFO] Step 0 | Batch 1/320 | Loss: 281.89 | Mel: 100.38 | KL: 175.99 | Dur: 0.01 | LR: 0.000200
[2026-01-28 14:37:29] [INFO] Step 10 | Batch 11/320 | Loss: 139.46 | Mel: 72.75 | KL: 63.69 | Dur: 0.01 | LR: 0.000200
```
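A formatter producing structured lines in that shape is straightforward with the standard `logging` module. This is a sketch of the pattern, not the project's actual logging setup:

```python
import logging

def make_logger(name: str = "orontts") -> logging.Logger:
    """Console logger emitting the [timestamp] [LEVEL] message format."""
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "[%(asctime)s] [%(levelname)s] %(message)s",
        datefmt="%Y-%m-%d %H:%M:%S",
    ))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger

log = make_logger()
log.info("Step %d | Batch %d/%d | Loss: %.2f | LR: %.6f",
         0, 1, 320, 281.89, 2e-4)
```

Plain timestamped lines like these survive container log collectors intact, which is why tqdm is disabled for cloud runs.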
```bash
# Install dev dependencies
pip install -e ".[dev]"

# Lint
ruff check src/ scripts/

# Format
ruff format src/ scripts/
isort src/ scripts/
```

License: MIT
If you use OronTTS in your research, please cite:
```bibtex
@software{orontts2024,
  title = {OronTTS: Mongolian Text-to-Speech},
  year = {2024},
  url = {https://github.com/btsee/oron-tts}
}
```