Few-shot Korean voice cloning with VITS architecture
Ullim VITS is a few-shot voice cloning system based on the VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech) architecture, optimized for the Korean language and a single A100 40GB GPU. The system enables high-quality voice synthesis from only a handful of reference audio samples.
- Few-shot Voice Cloning: Generate speech in new voices with just 3-10 audio samples
- Korean Language Support: Native Korean phoneme processing with g2pk2
- Multi-speaker Training: Trained on Zeroth-Korean dataset with 115 speakers
- A100 Optimized: Configuration optimized for A100 40GB GPU with mixed precision training
- Modern Architecture: VITS with HiFi-GAN decoder, normalizing flows, and stochastic duration prediction
- End-to-end text-to-speech synthesis
- Few-shot voice adaptation
- Multi-speaker voice cloning
- Korean text processing with automatic phoneme conversion (see the g2pk2 sketch below)
- WandB integration for experiment tracking
- Hydra-based configuration management
- Mixed precision training (FP16)
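Korean text is converted to a phoneme-level representation before it reaches the text encoder. A minimal sketch of that conversion, assuming g2pk2 exposes the same G2p interface as g2pk (the project's text frontend may add its own cleaning and tokenization):

```python
# Minimal sketch of Korean grapheme-to-phoneme conversion with g2pk2.
# Assumes g2pk2 provides the same G2p interface as g2pk; the actual text
# frontend may apply extra normalization and tokenization on top of this.
from g2pk2 import G2p

g2p = G2p()
text = "안녕하세요, 음성 합성 테스트입니다."
phonemized = g2p(text)  # applies Korean pronunciation rules to the raw text
print(phonemized)
```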
- Python 3.10+
- CUDA 11.8+ (for GPU training)
- Poetry
- Install Poetry:

```bash
curl -sSL https://install.python-poetry.org | python3 -
```

- Clone the repository:

```bash
git clone https://github.com/yourusername/ullim-vits.git
cd ullim-vits
```

- Install dependencies:

```bash
poetry install
```

- For development (includes jupyter, pytest, etc.):

```bash
poetry install --with dev
```

- Activate the environment:

```bash
poetry shell
```

Create a .env file from the example:

```bash
cp .env.example .env
```

Edit .env and add your WandB API key:

```
WANDB_API_KEY=your-wandb-api-key-here
```

Get your API key from wandb.ai/authorize.
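The key only needs to be present in the process environment when training starts. A minimal sketch of pulling it from .env, assuming the python-dotenv package (the training code may load it differently):

```python
# Minimal sketch: load WANDB_API_KEY from .env into the process environment.
# Assumes the python-dotenv package; illustrative only.
import os
from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory
assert os.environ.get("WANDB_API_KEY"), "WANDB_API_KEY is not set"
```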
Option 1: Automatic Download (Recommended)
Download and preprocess the Zeroth-Korean dataset automatically:
```bash
poetry run ullim-preprocess
```

This will:
- Download Zeroth-Korean from HuggingFace
- Resample audio to 22050 Hz (see the sketch below)
- Generate metadata files
- Compute dataset statistics
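A minimal sketch of the resampling step from the list above, assuming librosa and soundfile are installed (the actual preprocessor batches this over the whole dataset):

```python
# Minimal sketch of resampling one utterance to 22050 Hz.
# Assumes librosa and soundfile; the real preprocessing CLI also handles
# downloading, metadata generation, and dataset statistics.
import librosa
import soundfile as sf

TARGET_SR = 22050

def resample_wav(src_path: str, dst_path: str) -> None:
    # librosa resamples on load when sr is given
    audio, _ = librosa.load(src_path, sr=TARGET_SR, mono=True)
    sf.write(dst_path, audio, TARGET_SR)
```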
Option 2: Manual Download
If the automatic download fails, download the dataset manually from HuggingFace, place it in the data/zeroth/ directory, and then run:
```bash
poetry run ullim-preprocess
```

Start training with the default configuration:

```bash
poetry run ullim-train
```

Or with a custom config:

```bash
poetry run ullim-train model=vits_large train.batch_size=16
```

Generate speech from text:

```bash
poetry run ullim-infer \
    --checkpoint experiments/vits_fewshot_zeroth/checkpoints/checkpoint_100000.pt \
    --text "안녕하세요, 음성 합성 테스트입니다." \
    --output output.wav \
    --speaker_id 0
```

Adapt the model to a new voice with reference audio:
```python
from ullim_vits.inference.fewshot_adapter import FewShotAdapter

# config and checkpoint_path come from your Hydra config and a trained checkpoint
adapter = FewShotAdapter(config, checkpoint_path)

# Fine-tune on a handful of reference utterances from the new speaker
adapted_model = adapter.adapt(
    reference_audio_dir="path/to/reference/audio",
    reference_metadata="path/to/metadata.txt",
    target_speaker_id=999,
    n_epochs=100,
)

adapter.save_adapted_model("adapted_model.pt")
```

The training pipeline uses Hydra for configuration management (see the composition sketch below). Key configuration files:
- configs/config.yaml: Main configuration
- configs/model/vits_base.yaml: Model architecture
- configs/data/zeroth.yaml: Dataset configuration
- configs/train/default.yaml: Training hyperparameters
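A minimal sketch of composing this configuration programmatically, assuming Hydra >= 1.2 and that the groups are exposed under the model, data, and train keys (the ullim-train and ullim-infer entry points normally handle this for you):

```python
# Minimal sketch: compose the Hydra config outside the CLI entry points.
# Assumes Hydra >= 1.2 and the configs/ layout listed above; overrides use
# the same syntax as the command line (e.g. model=vits_large).
from hydra import compose, initialize

with initialize(version_base=None, config_path="configs"):
    cfg = compose(config_name="config", overrides=["train.batch_size=16"])

print(cfg.train.batch_size)  # 16
```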
Training logs are automatically sent to WandB; view your runs at https://wandb.ai/your-username/ullim-vits.

To resume training from a checkpoint:

```bash
poetry run ullim-train resume=experiments/vits_fewshot_zeroth/checkpoints/latest.pt
```

Basic synthesis:
```python
from ullim_vits.inference.synthesizer import Synthesizer

synthesizer = Synthesizer(config, checkpoint_path)

audio = synthesizer.synthesize(
    text="음성 합성 테스트",
    speaker_id=0,
    noise_scale=0.667,   # sampling temperature of the prior (prosody variation)
    length_scale=1.0,    # >1.0 slows speech down, <1.0 speeds it up
)
synthesizer.save_audio(audio, "output.wav")
```

Key hyperparameters in configs/train/default.yaml:
```yaml
batch_size: 32              # A100 40GB optimized
mixed_precision: true       # FP16 training
gradient_clip_val: 1000.0
epochs: 1000

optimizer:
  generator:
    lr: 2.0e-4
  discriminator:
    lr: 2.0e-4

losses:                     # combined into the generator objective (sketch below)
  mel_loss_weight: 45.0
  kl_loss_weight: 1.0
  feature_loss_weight: 2.0
```
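For orientation, these weights typically enter a VITS-style generator objective as a weighted sum of the mel reconstruction, KL, and feature-matching terms on top of the adversarial loss. A minimal sketch under that assumption (not the repo's actual trainer code, which also handles the duration loss and discriminator updates):

```python
# Minimal sketch of combining the weights above into a VITS-style generator
# loss. Illustrative only; the real trainer also adds the duration loss and
# performs the discriminator updates separately.
import torch.nn.functional as F

def generator_objective(mel_hat, mel, kl_term, feature_matching, adversarial,
                        mel_w=45.0, kl_w=1.0, feat_w=2.0):
    mel_recon = F.l1_loss(mel_hat, mel)  # L1 on mel-spectrograms
    return adversarial + mel_w * mel_recon + kl_w * kl_term + feat_w * feature_matching
```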
```
ullim-vits/
├── ullim_vits/              # Main package
│   ├── models/              # Neural network models
│   │   ├── vits/            # VITS components
│   │   ├── discriminator/   # MPD & MSD
│   │   ├── speaker_encoder/
│   │   └── common/          # Shared modules
│   ├── data/                # Dataset and data processing
│   ├── losses/              # Loss functions
│   ├── utils/               # Audio, text, alignment utilities
│   ├── training/            # Training loop and optimization
│   ├── inference/           # TTS and few-shot adaptation
│   └── cli/                 # Command-line interfaces
├── configs/                 # Hydra configurations
├── tools/                   # Preprocessing scripts
├── recipes/                 # Dataset-specific recipes
├── tests/                   # Unit tests
├── docs/                    # Documentation
└── experiments/             # Training outputs
```
See the detailed documentation in the docs/ directory.
Ullim VITS implements the VITS architecture with the following components:
- Text Encoder (Prior Network)
  - Transformer encoder with relative positional encoding
  - Normalizing flows for a flexible prior distribution
  - Speaker conditioning via FiLM layers
- Posterior Encoder
  - WaveNet-style residual blocks
  - Extracts a latent representation from the mel-spectrogram
  - Variational inference with speaker embedding
- Duration Predictor
  - Stochastic duration prediction with flows
  - Monotonic Alignment Search (MAS) for alignment learning
- Decoder
  - HiFi-GAN style vocoder
  - Multi-receptive field fusion
  - Transposed convolutions with residual blocks
- Discriminators
  - Multi-Period Discriminator (MPD)
  - Multi-Scale Discriminator (MSD)
- Speaker Encoder (see the sketch after this list)
  - CNN + GRU architecture
  - Extracts speaker embeddings from reference audio
  - Enables few-shot voice cloning
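A minimal sketch of a CNN + GRU speaker encoder of this kind, to illustrate the shape of the component; the layer sizes are assumptions, not the project's actual hyperparameters:

```python
# Illustrative CNN + GRU speaker encoder: mel-spectrogram in, fixed-size
# speaker embedding out. Layer sizes are assumptions, not this repo's values.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEncoderSketch(nn.Module):
    def __init__(self, n_mels: int = 80, emb_dim: int = 256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(n_mels, 256, kernel_size=5, padding=2),
            nn.ReLU(),
            nn.Conv1d(256, 256, kernel_size=5, padding=2),
            nn.ReLU(),
        )
        self.gru = nn.GRU(256, emb_dim, batch_first=True)

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, frames)
        h = self.conv(mel).transpose(1, 2)  # (batch, frames, 256)
        _, last = self.gru(h)               # final hidden state
        emb = last[-1]                      # (batch, emb_dim)
        return F.normalize(emb, dim=-1)     # unit-norm speaker embedding
```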
- Source: HuggingFace - kresnik/zeroth_korean (loading sketch below)
- Speakers: 115
- Utterances: 51,000+
- Sampling Rate: 22050 Hz (resampled)
- Language: Korean
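A minimal sketch of pulling the dataset from the HuggingFace Hub with the datasets library (ullim-preprocess wraps this together with resampling and metadata generation):

```python
# Minimal sketch: load Zeroth-Korean from the HuggingFace Hub.
# Illustrative only; add a configuration name if the dataset card requires one.
from datasets import load_dataset

ds = load_dataset("kresnik/zeroth_korean")
print(ds)                                # splits and sizes
sample = ds["train"][0]
print(sample["text"])                    # transcript
print(sample["audio"]["sampling_rate"])  # original rate before resampling
```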
To use your own dataset, format your data as:
```
data/
├── train/
│   ├── audio/
│   │   ├── speaker1_001.wav
│   │   └── ...
│   └── metadata.txt
└── test/
    ├── audio/
    └── metadata.txt
```
Metadata format: `audio_filename|speaker_id|transcript`

```
speaker1_001.wav|speaker1|안녕하세요
```
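A minimal sketch of reading this metadata format, to make the expected fields concrete (the dataset class in ullim_vits/data/ is the authoritative implementation):

```python
# Minimal sketch: parse metadata.txt lines of the form
#   audio_filename|speaker_id|transcript
# Purely illustrative; the real dataset loader lives in ullim_vits/data/.
from pathlib import Path

def read_metadata(path: str):
    entries = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        filename, speaker_id, transcript = line.split("|", maxsplit=2)
        entries.append({"file": filename, "speaker": speaker_id, "text": transcript})
    return entries
```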
- Recommended: NVIDIA A100 40GB
- Minimum: RTX 3090 24GB (reduce batch size to 16)
- RAM: 64GB+
- Storage: 100GB+ for dataset and checkpoints
On A100 40GB with default config:
- ~2-3 days for 1000 epochs on full Zeroth-Korean
- ~500k steps to convergence
The model is configured for a single A100 40GB GPU:
- Batch size: 32
- Mixed precision: FP16
- Gradient accumulation: 1
- Learning rate: 2e-4 (both G and D)
- Scheduler: Exponential (gamma=0.999875); see the sketch below
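A minimal sketch of that optimizer/scheduler pairing in PyTorch, assuming AdamW as in the original VITS recipe (the list above does not pin the optimizer type, and the discriminator side is omitted):

```python
# Minimal sketch of the generator optimizer and exponential LR schedule with
# the values above. Illustrative only; the trainer also maintains a
# discriminator optimizer and steps both each epoch.
import torch

def build_generator_optimizer(generator: torch.nn.Module):
    opt = torch.optim.AdamW(generator.parameters(), lr=2.0e-4)
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.999875)
    return opt, sched
```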
- Start with the smaller model (vits_base) for faster prototyping
- Use gradient checkpointing if running out of memory (see the sketch after this list)
- Monitor KL loss - should stabilize after 50k steps
- Duration predictor loss should decrease steadily
- Validation mel loss < 3.0 indicates good quality
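A minimal sketch of the gradient-checkpointing tip using torch.utils.checkpoint; which submodules to wrap is a judgment call left to the training code:

```python
# Minimal sketch of activation (gradient) checkpointing to trade compute for
# memory. Illustrative; the training code decides which blocks to wrap.
import torch
from torch.utils.checkpoint import checkpoint

def forward_with_checkpointing(block: torch.nn.Module, x: torch.Tensor) -> torch.Tensor:
    # Recomputes block's activations during backward instead of storing them
    return checkpoint(block, x, use_reentrant=False)
```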
Training results and audio samples will be added after the initial training run.
- Mel-spectrogram reconstruction loss
- KL divergence
- Duration prediction accuracy
- MOS (Mean Opinion Score) - to be evaluated
```bibtex
@inproceedings{kim2021conditional,
  title={Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech},
  author={Kim, Jaehyeon and Kong, Jungil and Son, Juhee},
  booktitle={International Conference on Machine Learning},
  pages={5530--5540},
  year={2021},
  organization={PMLR}
}

@misc{zeroth_korean,
  title={Zeroth-Korean},
  author={Goodatlas},
  year={2017},
  url={https://github.com/goodatlas/zeroth}
}

@inproceedings{kong2020hifi,
  title={HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis},
  author={Kong, Jungil and Kim, Jaehyeon and Bae, Jaekyoung},
  booktitle={Advances in Neural Information Processing Systems},
  volume={33},
  pages={17022--17033},
  year={2020}
}
```

MIT License - see the LICENSE file for details.
- VITS implementation inspired by jaywalnut310/vits
- HiFi-GAN decoder from jik876/hifi-gan
- Zeroth-Korean dataset from goodatlas/zeroth
- Korean phonemizer: g2pk2
Detailed documentation is available in the docs/ directory.
- Add pretrained model weights
- Add audio sample demos
- Implement emotion control
- Add voice conversion mode
- Multi-GPU training support
- ONNX export for deployment
- Real-time inference optimization