Voxtral Voice Clone

Training the missing codec encoder for Mistral's Voxtral-4B-TTS, enabling zero-shot voice cloning on the open-weight model.

Status: The encoder produces intelligible speech from reference audio. Voice identity preservation is actively improving with V3 training.

Results

We successfully trained a codec encoder that:

Produces codes the LLM accepts without any fine-tuning (no LoRA needed)
Generates clear, intelligible speech from reference audio clips
Produces embeddings with norms close to those of the preset voices (4.6 vs. a 3.7 target)
Uses 50-200+ unique semantic codes per utterance (comparable to the preset range of 52-152)

Voice identity is improving with V3 training (speaker verification loss + native 24kHz data).

What This Does

Mistral released Voxtral-4B-TTS-2603 with an important gap: the codec encoder weights were not included. Without them, the model is limited to 20 preset voices and cannot clone new voices from audio.

This project trains the missing encoder from scratch using techniques from the Voxtral paper, EnCodec, and speaker verification research.

Architecture

The Voxtral codec is a VQ-FSQ hybrid that compresses audio to 2.14 kbps:

12.5 Hz frame rate (240-sample patches at 24kHz give 100 Hz, then 8x downsampling)
1 semantic code (VQ, 8192 entries) + 36 acoustic codes (FSQ, 21 levels)
Voice embeddings = sum of 37 codebook lookups per frame -> [N, 3072]
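
The frame rate and bitrate quoted above follow from the listed numbers. A minimal sketch of the arithmetic, assuming the semantic code costs log2(8192) bits and each acoustic code costs log2(21) bits per frame (the constant and function names are illustrative, not taken from the repository):

```python
import math

SAMPLE_RATE_HZ = 24_000    # native sample rate
PATCH_SAMPLES = 240        # samples per input patch
DOWNSAMPLE = 8             # temporal downsampling after patching
SEMANTIC_ENTRIES = 8192    # VQ codebook size
ACOUSTIC_CODES = 36        # FSQ codes per frame
FSQ_LEVELS = 21            # levels per FSQ code


def frame_rate_hz() -> float:
    """Patches per second divided by the downsampling factor."""
    return SAMPLE_RATE_HZ / PATCH_SAMPLES / DOWNSAMPLE


def bits_per_frame() -> float:
    """Information carried by one frame of 37 codes."""
    semantic_bits = math.log2(SEMANTIC_ENTRIES)
    acoustic_bits = ACOUSTIC_CODES * math.log2(FSQ_LEVELS)
    return semantic_bits + acoustic_bits


def bitrate_kbps() -> float:
    return frame_rate_hz() * bits_per_frame() / 1000


if __name__ == "__main__":
    print(f"frame rate: {frame_rate_hz()} Hz")
    print(f"bitrate: {bitrate_kbps():.2f} kbps")
```

With these inputs the result is 12.5 frames per second at roughly 171 bits each, which rounds to the 2.14 kbps stated above.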

See ARCHITECTURE.md for the full technical breakdown, weight mapping, and research findings.
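
The voice-embedding construction from the list above (one lookup per codebook, summed per frame) can be sketched in plain Python. This is a toy illustration only: it assumes a separate embedding table for each of the 37 codebooks, uses random tables, and shrinks the embedding width from 3072 to 4 so it runs instantly. The real table layout is described in ARCHITECTURE.md and may differ.

```python
import random

# 1 semantic codebook (8192 entries) + 36 acoustic codebooks (21 levels each)
CODEBOOK_SIZES = [8192] + [21] * 36


def make_tables(sizes, dim, seed=0):
    """Build one random embedding table per codebook (stand-in for real weights)."""
    rng = random.Random(seed)
    return [[[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(size)]
            for size in sizes]


def frame_embedding(codes, tables):
    """Sum the embedding rows selected by the 37 codes of one frame."""
    if len(codes) != len(tables):
        raise ValueError("expected one code per codebook")
    dim = len(tables[0][0])
    out = [0.0] * dim
    for code, table in zip(codes, tables):
        row = table[code]
        for i in range(dim):
            out[i] += row[i]
    return out


def voice_embedding(frames, tables):
    """Map N frames of codes to an [N, dim] list of summed embeddings."""
    return [frame_embedding(codes, tables) for codes in frames]


if __name__ == "__main__":
    tables = make_tables(CODEBOOK_SIZES, dim=4)
    frames = [[5] + [10] * 36, [100] + [0] * 36]
    emb = voice_embedding(frames, tables)
    print(len(emb), len(emb[0]))
```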

Quick Start

Requirements

Training (encoder):

Resource	Minimum	Recommended
GPU	1x A100-80GB	4x A100-80GB SXM
VRAM per GPU	80GB	80GB
Total VRAM	80GB	320GB
Batch size	2 per GPU	8 per GPU
Epoch time	~20h (1 GPU)	~3-4h (4 GPU)
Disk	100GB (model + data)	300GB

Inference (voice cloning):

Resource	Minimum	Recommended
GPU	1x with >= 16GB VRAM	1x A100/H100
VRAM	16GB	40GB+
RAM	16GB	32GB
Disk	8GB (model checkpoint)	8GB

The model runs on a single GPU for inference. Two stages (LLM + acoustic transformer) share the GPU with --gpu-memory-utilization 0.45.

pip install -r requirements.txt
pip install speechbrain  # for ECAPA-TDNN speaker verification loss

Train Codec Encoder

export MODEL_DIR=/path/to/Voxtral-4B-TTS-2603
export HF_CACHE=/path/to/data_cache
export OUTPUT_DIR=/path/to/encoder_output

torchrun --nproc_per_node=4 train_encoder.py

Training uses LibriTTS-R (native 24kHz, 585h, 2456 speakers), which is downloaded automatically from Hugging Face.

Inject Weights & Test

python inject_encoder.py
python patch_tokenizer.py

# Serve with vLLM
vllm serve /path/to/model --omni --gpu-memory-utilization 0.45

Training Recipe (V3)

The current training recipe combines techniques from several research papers with two additions of our own:

Component	Source	Purpose
Stochastic VQ (50/25/25)	Voxtral paper	Prevents code saturation
ASR distillation (Whisper)	Voxtral paper	Semantic token diversity
Codebook diversity loss	Our innovation	Breaks semantic collapse (1 -> 200+ codes)
Frozen speaker loss (ECAPA-TDNN)	SAC paper	Explicit speaker identity preservation
Gradient-norm balancer	EnCodec/AudioCraft	Auto-scales loss contributions
Acoustic distribution shaping	Preset analysis	Matches preset code statistics
Multi-resolution STFT disc (64ch, 8 scales)	Voxtral paper + EnCodec	Waveform quality
Native 24kHz data (LibriTTS-R)	Our discovery	Fixes 16kHz upsampling artifacts
Adam beta1=0.5	EnCodec/HiFi-GAN	GAN training stability

V3 Training Metrics (current best)

Metric	V1 (initial)	V2 (diversity fix)	V3 (current)
mel loss	1.8 (plateau)	1.5 (plateau)	0.87 (dropping)
sem_util	1/8192	70-100	228/8192
speaker loss	N/A	N/A	0.37 (dropping)
acoustic codes	all 18s	diverse 3-15	diverse 3-18
speech output	hums	intelligible, wrong voice	intelligible, identity improving

Research Narrative

The Problem

Voxtral's codec encoder is deliberately withheld from the open-weight release. The paper provides full architecture details, but training a compatible encoder requires solving multiple interacting problems.

The Invisible Walls (and how we broke them)

Wall 1 - Codebook Collapse (sem_util=1/8192): All encoder outputs map to one codebook entry. ASR distillation alone is insufficient because the encoder can produce diverse continuous features that all land in one Voronoi cell. Solved with entropy-based codebook diversity loss (sem_util 1 -> 200+).
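
One way to write an entropy-based diversity loss is sketched below. The text does not give the exact formulation used in train_encoder.py, so this is an assumed variant: soft-assign each feature to the codebook with a softmax over negative squared distances, average the assignments over the batch, and penalise the gap between the entropy of that average and its maximum log(K). All function names are illustrative.

```python
import math


def soft_assign(feature, codebook, temperature=1.0):
    """Softmax over negative squared distances to every codebook entry."""
    logits = [-sum((f - c) ** 2 for f, c in zip(feature, entry)) / temperature
              for entry in codebook]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def diversity_loss(features, codebook, temperature=1.0):
    """log(K) minus the entropy of batch-averaged usage; 0 when usage is uniform."""
    k = len(codebook)
    usage = [0.0] * k
    for feature in features:
        for i, p in enumerate(soft_assign(feature, codebook, temperature)):
            usage[i] += p / len(features)
    entropy = -sum(p * math.log(p) for p in usage if p > 0.0)
    return math.log(k) - entropy


if __name__ == "__main__":
    codebook = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    collapsed = [[0.9, 0.0], [1.0, 0.1], [1.1, -0.1]]  # all in one Voronoi cell
    spread = [list(entry) for entry in codebook]        # one feature per cell
    print(diversity_loss(collapsed, codebook), diversity_loss(spread, codebook))
```

In the toy example, the features that all sit in one cell receive a clearly positive loss, while the evenly spread features score approximately zero. That gradient pressure is what pushes the encoder away from a single code.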

Wall 2 - Acoustic Code Saturation: Without stochastic quantization, codes collapse to extremes (0 and 20). Solved with the paper's 50/25/25 schedule plus acoustic distribution shaping targeting preset statistics (mean~10).
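
A sketch of stochastic quantization for a single 21-level FSQ dimension follows. The text only names a 50/25/25 schedule. The interpretation used here (50% hard quantization, 25% uniform dither of half a step, 25% unquantized pass-through) is an assumption and should be checked against the Voxtral paper. The function names are hypothetical.

```python
import random

FSQ_LEVELS = 21                      # levels per acoustic code (from the text)
HALF_RANGE = (FSQ_LEVELS - 1) // 2   # 10, so indices run 0..20 with 10 at the centre
STEP = 1.0 / HALF_RANGE              # spacing between levels on [-1, 1]


def clamp(x: float) -> float:
    return max(-1.0, min(1.0, x))


def fsq_quantize(x: float) -> tuple[int, float]:
    """Snap a bounded value to the nearest level; return (index, dequantized value)."""
    index = round(clamp(x) * HALF_RANGE) + HALF_RANGE
    return index, (index - HALF_RANGE) * STEP


def stochastic_fsq(x: float, rng: random.Random) -> tuple[str, float]:
    """Pick one of three training-time branches (assumed meaning of 50/25/25)."""
    u = rng.random()
    if u < 0.50:
        return "quantize", fsq_quantize(x)[1]
    if u < 0.75:
        noise = rng.uniform(-STEP / 2, STEP / 2)  # dither within one quantization cell
        return "dither", clamp(x + noise)
    return "passthrough", clamp(x)


if __name__ == "__main__":
    rng = random.Random(0)
    print(fsq_quantize(0.0), fsq_quantize(1.0), stochastic_fsq(0.26, rng))
```

An encoder output of 0.0 maps to index 10, the centre of the 0..20 range, which matches the preset mean of about 10 targeted by the distribution-shaping term.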

Wall 3 - Speaker Identity Loss: Mel reconstruction alone does NOT preserve speaker identity at low bitrates (2.14 kbps). The codec can sound good while erasing speaker-discriminative features. Solved with frozen ECAPA-TDNN speaker verification loss (cosine similarity between original and reconstructed speaker embeddings).
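
The speaker term compares two embedding vectors by cosine similarity. Below is a minimal sketch of the loss itself, assuming the common form 1 - cos(a, b). In training the two vectors would come from the frozen ECAPA-TDNN applied to the original and reconstructed audio. Here they are plain Python lists, and the function names are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def speaker_loss(reference_embedding, reconstructed_embedding):
    """0 when the embeddings point the same way, 2 when they are opposite."""
    return 1.0 - cosine_similarity(reference_embedding, reconstructed_embedding)


if __name__ == "__main__":
    print(speaker_loss([1.0, 0.0], [1.0, 0.0]))  # same direction
    print(speaker_loss([1.0, 0.0], [0.0, 1.0]))  # orthogonal
```

Because the verification network is frozen, gradients from this loss update only the codec, not the speaker model.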

Wall 4 - Sample Rate Mismatch: 16kHz audio carries nothing above its 8kHz Nyquist limit, so training on 16kHz LibriSpeech upsampled to 24kHz means the 8-12kHz band (where speaker timbre lives) holds no real signal, only interpolation artifacts. Solved by switching to LibriTTS-R (native 24kHz).
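
The affected band follows directly from the Nyquist limits of the two sample rates. A small sketch of that arithmetic (helper names are illustrative):

```python
def nyquist_hz(sample_rate_hz: float) -> float:
    """Highest frequency a signal sampled at this rate can represent."""
    return sample_rate_hz / 2


def empty_band_after_upsampling(source_rate_hz: float, target_rate_hz: float):
    """Band the target rate can represent but the source never contained."""
    if target_rate_hz <= source_rate_hz:
        return None  # no upsampling, so no missing band
    return nyquist_hz(source_rate_hz), nyquist_hz(target_rate_hz)


if __name__ == "__main__":
    print(empty_band_after_upsampling(16_000, 24_000))
```

For 16kHz audio upsampled to 24kHz this gives the band from 8000 Hz to 12000 Hz, the range named in Wall 4.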

Wall 5 - Loss Balancing: Manual loss weights (ASR=5, diversity=5) caused mel to plateau at 1.5 for 2+ epochs. Solved with EnCodec's gradient-norm balancer, which rescales each loss's gradient by its running magnitude so that no single term dominates.
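
The idea behind a gradient-norm balancer can be sketched without any framework. Each loss's gradient (taken with respect to the model output) is divided by a running average of its own norm and then scaled by its share of the total weight. This is a simplified illustration of the EnCodec-style scheme, not AudioCraft's code and not necessarily what train_encoder.py does. The class name, the weights, and the `total_norm` default are assumptions.

```python
import math


class GradientBalancer:
    """Rescale per-loss gradients so their norms follow the chosen weights."""

    def __init__(self, weights, total_norm=1.0, ema_decay=0.999):
        self.weights = dict(weights)
        self.total_norm = total_norm
        self.ema_decay = ema_decay
        self._ema_sum = {name: 0.0 for name in weights}
        self._ema_fix = {name: 0.0 for name in weights}

    def _average_norm(self, name, norm):
        # Bias-corrected exponential moving average of the gradient norm.
        d = self.ema_decay
        self._ema_sum[name] = d * self._ema_sum[name] + (1 - d) * norm
        self._ema_fix[name] = d * self._ema_fix[name] + (1 - d)
        return self._ema_sum[name] / self._ema_fix[name]

    def combine(self, grads):
        """grads maps loss name -> gradient vector; returns the balanced sum."""
        weight_total = sum(self.weights.values())
        dim = len(next(iter(grads.values())))
        combined = [0.0] * dim
        for name, grad in grads.items():
            norm = math.sqrt(sum(g * g for g in grad))
            avg_norm = self._average_norm(name, norm)
            share = self.weights[name] / weight_total
            scale = self.total_norm * share / (avg_norm + 1e-12)
            for i, g in enumerate(grad):
                combined[i] += scale * g
        return combined


if __name__ == "__main__":
    balancer = GradientBalancer({"mel": 1.0, "speaker": 1.0})
    # Raw gradients differ by four orders of magnitude.
    print(balancer.combine({"mel": [100.0, 0.0], "speaker": [0.0, 0.01]}))
```

In the example the two raw gradients differ by a factor of 10,000, yet after balancing each contributes half of the update. This is the behaviour that removes the need to hand-tune weights such as ASR=5.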

Key Discovery: LoRA Is Not Needed

We initially assumed the LLM would reject our encoder's embeddings and require LoRA fine-tuning. Testing showed the opposite: the original unmodified LLM produces clear speech from our encoder's codes. LoRA at rank 8 destroyed the base model; rank 2 partially worked but damaged preset voices. The encoder's codes are "legal" tokens that the LLM accepts natively.

License

This project is licensed under CC BY-NC 4.0. The trained weights are derivative of Mistral's Voxtral-4B-TTS model and subject to its license terms.

Citation

@misc{voxtral-voice-clone,
  title={Training the Missing Voxtral Codec Encoder for Zero-Shot Voice Cloning},
  author={al0olo},
  year={2025},
  url={https://github.com/al0olo/voxtral-voice-clone}
}

Acknowledgements

Mistral AI for the Voxtral-4B-TTS model and paper
Meta AI for EnCodec's gradient balancer
SpeechBrain for the ECAPA-TDNN speaker verification model
OpenAI Whisper for ASR distillation
The LibriTTS-R and Common Voice communities for open audio datasets
