| Model | APS (%) | DSD (%) | RP (%) | AVG (%) |
|---|---|---|---|---|
| Gemini 2.5-Flash* | 88.2 | 90.9 | 77.3 | 85.4 |
| Gemini 2.5-Pro* | 89.0 | 90.1 | 75.5 | 84.8 |
| GPT-4o-Mini-TTS* | 54.9 | 52.3 | 46.0 | 51.1 |
| ElevenLabs* | 42.8 | 50.9 | 59.1 | 50.9 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 47.5 |
| MiMo-Audio-7B-Instruct | 70.1 | 66.1 | 57.1 | 64.5 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | 67.6 |
Note:
- Models marked with * are commercial models.
- APS, DSD, and RP denote the three InstructTTSEval subtasks (Acoustic-Parameter Specification, Descriptive-Style Directive, and Role-Play); AVG is their average.
- Results are measured on InstructTTSEval: Benchmarking Complex Natural-Language Instruction Following in Text-to-Speech Systems. arXiv preprint arXiv:2506.16381.
- [2026-1-18] We have released the RAG and WebUI code, and the WebUI supports vLLM.
- [2026-1-16] We have released the technical report! VoiceSculptor
- [2026-1-8] We have released the Demo Page and Demo Video! VoiceSculptor-Demo-Page
- [2026-1-2] We opened the repository and uploaded the voice design models! VoiceSculptor
Follow the steps below to clone the repository and install the required environment.
```bash
# Clone the repository and enter the directory
git clone https://github.com/ASLP-lab/VoiceSculptor.git
cd VoiceSculptor

# Create and activate a Conda environment
conda create -n VoiceSculptor python=3.10 -y
conda activate VoiceSculptor

# Install dependencies
pip install -r requirements.txt
```

Next, download the pretrained model weights from Hugging Face:

```bash
git lfs install
git clone https://huggingface.co/ASLP-lab/VoiceSculptor-VD
git clone https://huggingface.co/HKUSTAudio/xcodec2
```

For detailed instructions on how to design high-quality voice prompts,
please refer to Voice Design Guide or Voice Design Guide EN.
You need to specify the local paths to the voice-design model and the xcodec2 model in the infer.py file.
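For illustration, the edit might look like the sketch below; the variable names here are assumptions, so use whatever identifiers infer.py actually defines:

```python
# In infer.py: point these at your local checkpoints.
# NOTE: these variable names are illustrative assumptions, not the
# actual identifiers in infer.py; adapt them to the real file.
VOICE_DESIGN_MODEL_PATH = "/path/to/VoiceSculptor-VD"  # cloned from ASLP-lab/VoiceSculptor-VD
XCODEC2_MODEL_PATH = "/path/to/xcodec2"                # cloned from HKUSTAudio/xcodec2
```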
Then run inference:

```bash
python infer.py
```

This project also provides a simple workflow to build and deploy a private text vector database for Retrieval-Augmented Generation (RAG). You can create your own database from text files and run a lightweight server to query it via a client.
Use the create_database.py script located in the rag/ directory.
Before running the script, make sure to:
- Replace the model path
- Replace the input text file path
Input Text File Format
Each line in the input file should follow this structure:
```
utt_file_name \t wav_path \t text_command <|endofprompt|> target_text
```
Example:

```
ZH_B00074_S00400_W000029 Emilia/ZH/ZH_B00074/ZH_B00074_S00400/mp3/ZH_B00074_S00400_W000029.mp3 这是一位中年男性的中低音有声书朗读,嗓音浑厚略带粗砺,以标准普通话清晰咬字,通过多变的语调 动态的语速和戏剧化的停顿,生动演绎充满张力的动作场景<|endofprompt|>而就在此时,邵飞忽然露出坏笑,他一脚踹在赵和的窝锅子上。赵和顿时扑通一下,跪了下去。
```

Here the text_command (in Chinese) describes a middle-aged male audiobook narrator with a rich, slightly gravelly mid-to-low voice, clear standard-Mandarin articulation, and varied intonation, dynamic pacing, and dramatic pauses for a tense action scene; the text after <|endofprompt|> is the Chinese sentence to synthesize.
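If you need to validate or pre-process such a file, one line can be parsed as in this minimal sketch (illustrative only, not code from the repository; the manifest filename is a placeholder):

```python
def parse_line(line: str) -> dict:
    """Split one manifest line of the form:
    utt_file_name \t wav_path \t text_command <|endofprompt|> target_text
    """
    utt, wav_path, rest = (f.strip() for f in line.rstrip("\n").split("\t", 2))
    command, target = (f.strip() for f in rest.split("<|endofprompt|>", 1))
    return {"utt": utt, "wav_path": wav_path, "command": command, "target_text": target}

# Placeholder path: point this at your own input file.
with open("manifest.txt", encoding="utf-8") as f:
    for line in f:
        record = parse_line(line)
        print(record["utt"], record["command"][:40])
```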
Run the Database Creation Script
```bash
python rag/create_database.py
```

Once completed, your private RAG vector database will be generated.
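Conceptually, building such a database amounts to embedding each text_command and indexing the vectors. The sketch below illustrates this idea with sentence-transformers and FAISS; it is an assumption about the general approach, not the repository's actual create_database.py, and all paths are placeholders:

```python
import faiss
from sentence_transformers import SentenceTransformer

# Placeholder paths: replace with your local embedding model and manifest.
model = SentenceTransformer("/path/to/embedding_model")
lines = open("manifest.txt", encoding="utf-8").read().splitlines()

# Keep only the instruction part of each line (before <|endofprompt|>).
commands = [l.split("\t", 2)[2].split("<|endofprompt|>", 1)[0].strip() for l in lines]

# Embed and build an inner-product (cosine, after normalization) index.
vectors = model.encode(commands, normalize_embeddings=True)
index = faiss.IndexFlatIP(vectors.shape[1])
index.add(vectors)
faiss.write_index(index, "voice_prompts.index")
```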
After building the database, you can launch the query service using the server script in the rag/ folder.
Before starting the server:
- Update the database path
- Update the model path
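For illustration only, a query service over that index could look like the Flask sketch below; the endpoint name, port, and payload format are assumptions, and run_server.sh launches the repository's actual server rather than this code:

```python
import faiss
from flask import Flask, jsonify, request
from sentence_transformers import SentenceTransformer

app = Flask(__name__)
# Placeholder paths: match the database and model paths you configured above.
model = SentenceTransformer("/path/to/embedding_model")
index = faiss.read_index("voice_prompts.index")
texts = open("manifest.txt", encoding="utf-8").read().splitlines()

@app.route("/search", methods=["POST"])
def search():
    # Embed the query and return the top-5 most similar stored prompts.
    query = request.json["query"]
    vec = model.encode([query], normalize_embeddings=True)
    scores, ids = index.search(vec, 5)
    return jsonify({"hits": [{"score": float(s), "text": texts[int(i)]}
                             for s, i in zip(scores[0], ids[0])]})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)
```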
```bash
bash run_server.sh
```

To connect to the running service, modify the IP address and port in rag/client.py so that they match the server configuration.
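A matching minimal client might look like the following; the URL and JSON payload mirror the hypothetical server sketch above and must be adapted to the real service:

```python
import requests

# Placeholder: match the host, port, and endpoint of your running server.
SERVER_URL = "http://127.0.0.1:8000/search"

response = requests.post(SERVER_URL, json={"query": "a deep, warm male audiobook voice"})
response.raise_for_status()
for hit in response.json()["hits"]:
    print(f"{hit['score']:.3f}  {hit['text'][:80]}")
```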
Run the Client
```bash
python client.py
```

The overall workflow is summarized below:

| Step | Action |
|---|---|
| 1 | Prepare and format your text dataset |
| 2 | Build the vector database |
| 3 | Configure and start the RAG server |
| 4 | Query and retrieve results via the client |
To launch the WebUI, run:

```bash
python gradio/webui.py
```

To-do list:

- 🌐 Demo website
- 🔓 Release inference code
- 🤗 Release HuggingFace model
- 🤗 HuggingFace Space
- 📝 Release Technical Report
- 💡 Release RAG code
- ✨ Release gradio code
- 🚀 Release vLLM code
- 🔓 Release both the text generation code and a sample dataset
- 🔓 Release the code for voice cloning
- 🔓 Release training code
```bibtex
@misc{hu2026voicesculptorvoicedesigned,
      title={VoiceSculptor: Your Voice, Designed By You},
      author={Jingbin Hu and Huakang Chen and Linhan Ma and Dake Guo and Qirui Zhan and Wenhao Li and Haoyu Zhang and Kangxiang Xia and Ziyu Zhang and Wenjie Tian and Chengyou Wang and Jinrui Liang and Shuhan Guo and Zihang Yang and Bengu Wu and Binbin Zhang and Pengcheng Zhu and Pengyuan Xie and Chuan Xie and Qiang Zhang and Jie Liu and Lei Xie},
      year={2026},
      eprint={2601.10629},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2601.10629},
}
```
```bibtex
@misc{ye2025llasascalingtraintimeinferencetime,
      title={Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis},
      author={Zhen Ye and Xinfa Zhu and Chi-Min Chan and Xinsheng Wang and Xu Tan and Jiahe Lei and Yi Peng and Haohe Liu and Yizhu Jin and Zheqi Dai and Hongzhan Lin and Jianyi Chen and Xingjian Du and Liumeng Xue and Yunlin Chen and Zhifei Li and Lei Xie and Qiuqiang Kong and Yike Guo and Wei Xue},
      year={2025},
      eprint={2502.04128},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2502.04128},
}
```

We use the Apache 2.0 license. Researchers and developers are free to use the code and model weights of VoiceSculptor. See LICENSE for details.
Additional Notice on Generated Voices
This project provides a speech synthesis model for voice design, intended for academic research, educational purposes, and legitimate applications, such as personalized speech synthesis, assistive technologies, and linguistic research.
Please note:
Do not use this model for unauthorized voice cloning, impersonation, fraud, scams, deepfakes, or any illegal or malicious activities.
Ensure compliance with local laws and regulations when using this model and uphold ethical standards.
The developers assume no liability for any misuse of this model.
Important clarification regarding generated voices:
As a generative model, the voices produced by this system are synthetic outputs inferred by the model, not recordings of real human voices.
The generated voice characteristics do not represent or reproduce any specific real individual, and are not derived from or intended to imitate identifiable persons.
We advocate for the responsible development and use of AI and encourage the community to uphold safety and ethical principles in AI research and applications.
If you have questions or feedback about our work, feel free to email jingbin.hu@mail.nwpu.edu.cn or lxie@nwpu.edu.cn.
You're welcome to join our WeChat group for technical discussions and updates.