A simple WebUI that lets you run inference with VITS TTS models.
It also comes with an API for interacting with other processes.
- VITS Text-to-Speech
- GPU Acceleration
- Support for Multiple Models
- Automatic Language Recognition & Processing
- Customizable Parameters
- Batch Processing for Long Text
- Paths are no longer hardcoded in the config. The project is now portable.
- Model paths are loaded automatically. No more manually editing the entries in the config.
- Prioritize PyTorch with Nvidia GPU support (built on CUDA 11.8). Edit `requirements.txt` if using other CUDA versions.
- Should no longer throw errors when installing `fasttext`, at least on Windows.
- Cleaned up a few entries in the config.
- Removed everything related to Docker...
- By default, only VITS models are supported. You will need to edit `config.py` and some other scripts to use VITS2, etc.

Some original features might be missing...
Open the console at the target location, then run the following:

```
git clone https://github.com/HaomingXR/vits-webui
```

- Create a virtual environment using the Python installed on your system (tested on 3.10.10):

  ```
  python -m venv venv
  venv\scripts\activate
  ```

- Alternatively, download the self-contained Python runtime, the Windows Embeddable Package, then:
  - Open the `python3<version>._pth` file with a text editor
  - Uncomment the `import site` line (see the example below)
  - Then, download and run get-pip.py to install `pip`
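For reference, the `._pth` file inside the embeddable package (for example `python310._pth` on Python 3.10) looks roughly like this; change `#import site` to `import site`:

```
python310.zip
.

# Uncomment to run site.main() automatically
#import site
```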
Edit `requirements.txt` if using other CUDA versions, or not using an Nvidia GPU.

```
pip install -r requirements.txt
```

Run the following command to start the service:

```
python app.py
```

On Windows, you can also run `webui.bat` to launch the service directly. Edit the file to point to your Python runtime.
- You may find various VITS models online, usually on HuggingFace Spaces
- Download the VITS model files (including both the `.pth` and `.json` files)
- Place both the model and config into their own folder, then place that folder inside the `models` directory (see the example layout below)
- On launch, the system should automatically detect the models
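For example, a model called `my_model` might be laid out as follows (the folder and file names are placeholders):

```
models/
└── my_model/
    ├── my_model.pth
    └── config.json
```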
The file `config.py` contains a few default options. After launching the service for the first time,
it will generate a `config.yaml` in the directory. All future launches will load this config instead.
The Admin Backend allows loading and unloading models, with login authentication.
For added security, you can simply disable the backend in `config.yaml`:

```
'IS_ADMIN_ENABLED': !!bool 'false'
```

When enabled, it will automatically generate a username and password pair in `config.yaml`.
You can enable this setting so that API usage requires a key to connect:

```
'API_KEY_ENABLED': !!bool 'false'
```

When enabled, it will automatically generate a random key in `config.yaml`.
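As a rough sketch of what a client request might look like once the key is enabled (the `api_key` parameter name here is only an illustrative assumption; check `app.py` for how the server actually expects the key to be supplied):

```python
import requests

# "api_key" is a placeholder parameter name; verify the real mechanism in app.py
params = {"text": "Hello world", "api_key": "<key from config.yaml>"}
response = requests.get("http://127.0.0.1:8888/voice/vits", params=params)
```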
You can edit this setting to set the local server port for the API:

```
'PORT': !!int '8888'
```

- Returns the dictionary mapping of IDs to speakers (see the usage example after the parameter table below)

  ```
  GET http://127.0.0.1:8888/voice/speakers
  ```
- Returns the audio data speaking the given prompt (default parameters are used when not specified)

  ```
  GET http://127.0.0.1:8888/voice/vits?text=prompt
  ```
VITS
| Parameter | Required | Default Value | Type | Instruction |
|---|---|---|---|---|
| text | true | | str | Text to speak |
| id | false | From `config.yaml` | int | Speaker ID |
| format | false | From `config.yaml` | str | wav / ogg / mp3 / flac |
| lang | false | From `config.yaml` | str | The language of the text to be synthesized |
| length | false | From `config.yaml` | float | The length of the synthesized speech; the larger the value, the slower the speed |
| noise | false | From `config.yaml` | float | The randomness of the synthesis |
| noisew | false | From `config.yaml` | float | The length of phoneme pronunciation |
| segment_size | false | From `config.yaml` | int | Divide the text into paragraphs based on punctuation marks |
| streaming | false | false | bool | Stream the synthesized speech for a faster initial response |
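As a quick sketch of how another process might call the API with Python (the prompt text, speaker ID, and output filename below are placeholders):

```python
import requests

BASE_URL = "http://127.0.0.1:8888"

# List the available speakers (a JSON mapping of IDs to speakers)
speakers = requests.get(f"{BASE_URL}/voice/speakers").json()
print(speakers)

# Synthesize speech; any parameter left out falls back to the defaults in config.yaml
params = {
    "text": "Hello world",  # placeholder prompt
    "id": 0,                # placeholder speaker ID
    "format": "wav",
}
response = requests.get(f"{BASE_URL}/voice/vits", params=params)

with open("output.wav", "wb") as f:
    f.write(response.content)
```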
Check the original repos for more info:
- vits: https://github.com/jaywalnut310/vits
- MoeGoe: https://github.com/CjangCjengh/MoeGoe
- vits-uma-genshin-honkai: https://huggingface.co/spaces/zomehwh/vits-uma-genshin-honkai
- vits-models: https://huggingface.co/spaces/zomehwh/vits-models