Fork of TMElyralab/MuseTalk
Extended with a real-time, browser-based talking-head avatar pipeline: LLM → Kokoro TTS → MuseTalk → browser canvas.
The original MuseTalk repo provides offline video-dubbing inference. This fork adds a fully streaming, conversational avatar system built on top of it:
| File | Role |
|---|---|
| `run_musetalk_avatar.py` | Flask server; entry point for the live avatar |
| `musetalk_avatar_pipeline.py` | Core pipeline: LLM → TTS → MuseTalk → output queue |
| `tts_kokoro.py` | Kokoro TTS wrapper (24 kHz → 16 kHz resampling) |
| `llm_wrapper.py` | Streaming LLM backend (Ollama / OpenAI / Echo) |
| `musetalk/utils/enhancer.py` | Optional GFPGAN face-restoration post-processing |
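
The 24 kHz → 16 kHz conversion listed for `tts_kokoro.py` can be illustrated with a small rate converter. This is only a sketch: it uses plain linear interpolation with no anti-aliasing filter, which is an assumption made to keep the example dependency-free, and the function name `resample_linear` is hypothetical. The actual wrapper may well use a filtered resampler such as `scipy.signal.resample_poly(x, 2, 3)` instead.

```python
from typing import List, Sequence

SRC_RATE = 24_000  # source rate, per the table above
DST_RATE = 16_000  # target rate, per the table above


def resample_linear(samples: Sequence[float],
                    src_rate: int = SRC_RATE,
                    dst_rate: int = DST_RATE) -> List[float]:
    """Resample by linear interpolation (illustrative, no low-pass filter)."""
    if not samples:
        return []
    # Number of output samples that cover the same duration.
    n_out = max(1, round(len(samples) * dst_rate / src_rate))
    step = src_rate / dst_rate  # source samples advanced per output sample
    last = len(samples) - 1
    out: List[float] = []
    for i in range(n_out):
        pos = i * step
        lo = min(int(pos), last)   # sample at or before pos
        hi = min(lo + 1, last)     # next sample, clamped at the end
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out
```

One second of input (24,000 samples) yields 16,000 output samples, so the clip duration is preserved while the sample count drops by a factor of 2/3.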
User text input (browser)
         │
         ▼
┌─────────────────┐  sentence stream   ┌──────────────────────────────┐
│   LLM thread    │ ─────────────────► │    TTS + MuseTalk worker     │
│   (streaming)   │                    │    (serial, per sentence)    │
└─────────────────┘                    └──────────────┬───────────────┘
                                                      │ SyncedChunk
                                                      │ (audio + frames)
                                                      ▼
                                             ┌───────────────────┐
                                             │Flask SSE/sync_feed│
                                             └────────┬──────────┘
                                                      │
                                                      ▼
                                        Browser (canvas + Web Audio)
                                        AudioContext as master clock
                                          lip-sync via playbackRate
- LLM thread: streams tokens from the LLM backend and flushes complete sentences into `text_q`.
- TTS + MuseTalk worker: picks sentences from `text_q`, synthesizes audio with Kokoro, runs MuseTalk UNet inference, blends frames, optionally enhances them with GFPGAN, and pushes `SyncedChunk` objects to `output_q`.
- Flask SSE: reads `output_q` and streams each chunk to the browser as a Server-Sent Event containing base64-encoded WAV audio and JPEG frames.
- Browser: decodes chunks, schedules audio via `AudioContext`, renders frames on `<canvas>`, and adjusts `playbackRate` to maintain A/V sync.
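
The data flow through these stages can be sketched with the standard library alone. The code below is an illustration under stated assumptions, not the code in `musetalk_avatar_pipeline.py`: the `SyncedChunk` field names, the JSON keys (`audio`, `frames`, `fps`), the regex-based sentence boundary rule, and the helper names `flush_sentences` and `to_sse_event` are all hypothetical. Only `text_q`, `SyncedChunk`, and the base64 WAV + JPEG payload come from the description above.

```python
import base64
import json
import queue
import re
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class SyncedChunk:
    """One sentence of output: WAV bytes plus the JPEG frames that go with it."""
    audio_wav: bytes
    frames_jpeg: List[bytes] = field(default_factory=list)
    fps: int = 25


# Assumed boundary rule: '.', '!' or '?' followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def flush_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed tokens and yield each sentence once it is complete."""
    buffer = ""
    for token in tokens:
        buffer += token
        parts = _SENTENCE_END.split(buffer)
        # Every part except the last one ended at a sentence boundary.
        for sentence in parts[:-1]:
            if sentence.strip():
                yield sentence.strip()
        buffer = parts[-1]
    if buffer.strip():
        yield buffer.strip()  # trailing text without end punctuation


def to_sse_event(chunk: SyncedChunk) -> str:
    """Serialize one chunk as a single Server-Sent Event with base64 payloads."""
    payload = {
        "audio": base64.b64encode(chunk.audio_wav).decode("ascii"),
        "frames": [base64.b64encode(f).decode("ascii") for f in chunk.frames_jpeg],
        "fps": chunk.fps,
    }
    # An SSE message is a 'data:' line terminated by a blank line.
    return f"data: {json.dumps(payload)}\n\n"


# Usage: the LLM side fills text_q; the TTS + MuseTalk worker would drain it.
text_q: "queue.Queue[str]" = queue.Queue()
for sentence in flush_sentences(["Hel", "lo there. ", "How are", " you? I am", " fine"]):
    text_q.put(sentence)
```

Flushing on sentence boundaries rather than on every token gives the TTS stage complete prosodic units to synthesize, while still letting the first sentence start rendering before the LLM has finished its reply.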
Audio playback time is the master clock. The browser schedules each audio chunk against `AudioContext.currentTime`, then paces the corresponding frames so that each one is shown for `audio_duration / frame_count` seconds. Because frame timing is read from the audio clock rather than accumulated from JavaScript timers, timer drift cannot build up between audio and video.
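
The timing rule itself is simple arithmetic. The sketch below writes it in Python for consistency with the rest of the repo, although the real logic runs in browser JavaScript; the function names and the clamping behaviour at chunk boundaries are assumptions made for illustration.

```python
def frame_interval(audio_duration: float, frame_count: int) -> float:
    """Seconds each frame stays on screen so the frames span the audio exactly."""
    if frame_count <= 0 or audio_duration <= 0:
        raise ValueError("audio_duration and frame_count must be positive")
    return audio_duration / frame_count


def frame_index_at(clock_now: float, chunk_start: float,
                   audio_duration: float, frame_count: int) -> int:
    """Map the audio clock reading to the frame that should be visible now."""
    if frame_count <= 0 or audio_duration <= 0:
        raise ValueError("audio_duration and frame_count must be positive")
    elapsed = clock_now - chunk_start
    if elapsed <= 0:
        return 0  # chunk has not started yet: show its first frame
    index = int(elapsed * frame_count / audio_duration)
    return min(index, frame_count - 1)  # hold the last frame at chunk end
```

Here `clock_now` plays the role of `AudioContext.currentTime` and `chunk_start` is the time at which the chunk's audio was scheduled. Since the frame index is recomputed from the clock on every render, a late or skipped render callback delays one frame but does not shift the ones after it.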
Python 3.10 + CUDA 11.8 recommended (tested on RTX 3080).
conda create -n MuseTalk python==3.10
conda activate MuseTalk

pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 \
    --index-url https://download.pytorch.org/whl/cu118

pip install -r requirements.txt

# Quote the version specifier so the shell does not treat ">" as a redirect
pip install flask loguru "kokoro>=0.9.2" scipy soundfile
# espeak-ng is required by Kokoro for phonemisation
sudo apt-get install espeak-ng

pip install --no-cache-dir -U openmim
mim install mmengine
mim install "mmcv==2.0.1"
mim install "mmdet==3.1.0"
mim install "mmpose==1.1.0"

# Ubuntu
sudo apt-get install ffmpeg
# Or export path to a static build
export FFMPEG_PATH=/path/to/ffmpeg-static

pip install gfpgan
# Download weights
mkdir -p experiments/pretrained_models
wget https://github.com/TencentARC/GFPGAN/releases/download/v1.3.4/GFPGANv1.4.pth \
    -O experiments/pretrained_models/GFPGANv1.4.pth

# Linux
sh ./download_weights.sh
# Windows
download_weights.bat

Expected layout after download:
./models/
├── musetalkV15/
│   ├── musetalk.json
│   └── unet.pth
├── musetalk/
│   ├── musetalk.json
│   └── pytorch_model.bin
├── whisper/
│   ├── config.json
│   ├── pytorch_model.bin
│   └── preprocessor_config.json
├── dwpose/
│   └── dw-ll_ucoco_384.pth
├── face-parse-bisent/
│   ├── 79999_iter.pth
│   └── resnet18-5c106cde.pth
├── sd-vae/
│   ├── config.json
│   └── diffusion_pytorch_model.bin
└── syncnet/
    └── latentsync_syncnet.pt
python run_musetalk_avatar.py \
--avatar_image /path/to/your/face.jpg \
--llm_backend ollama \
--llm_model llama3.2 \
--tts_voice af_heart \
    --port 7860

Then open http://localhost:7860 in your browser.
| Argument | Default | Description |
|---|---|---|
| `--avatar_image` | (required) | Path to reference face image (PNG/JPG) |
| `--unet_config` | `./models/musetalkV15/musetalk.json` | MuseTalk UNet config |
| `--unet_model_path` | `./models/musetalkV15/unet.pth` | MuseTalk UNet weights |
| `--whisper_dir` | `./models/whisper` | Whisper feature extractor path |
| `--vae_type` | `sd-vae` | VAE type |
| `--use_float16` | `True` | Use fp16 (recommended on an RTX 3080) |
| `--batch_size` | `8` | Frames per UNet batch |
| `--bbox_shift` | `0` | Vertical shift for mouth crop (px) |
| `--extra_margin` | `10` | Extra pixels around face crop |
| `--fps` | `25` | Output frame rate |
| `--tts_voice` | `af_heart` | Kokoro voice tag |
| `--tts_speed` | `1.0` | TTS speech rate |
| `--llm_backend` | `echo` | `echo` / `openai` / `ollama` |
| `--llm_model` | `llama3.2` | LLM model name |
| `--llm_api_key` | `None` | API key (OpenAI / compatible) |
| `--llm_base_url` | `None` | Custom API base URL |
| `--ollama_host` | `http://localhost:11434` | Ollama server URL |
| `--port` | `7860` | Flask server port |
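
For orientation, a subset of the arguments in the table could be declared with `argparse` roughly as follows. This is a sketch and not the parser in `run_musetalk_avatar.py`: the defaults are taken from the table, but the declaration style is assumed, in particular the use of `argparse.BooleanOptionalAction` (Python 3.9+) for `--use_float16` and of `choices` for `--llm_backend`.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Declare a subset of the documented CLI flags with their table defaults."""
    p = argparse.ArgumentParser(description="Live MuseTalk avatar server (sketch)")
    p.add_argument("--avatar_image", required=True,
                   help="Path to reference face image (PNG/JPG)")
    # Assumption: a boolean toggle; this also generates --no-use_float16.
    p.add_argument("--use_float16", action=argparse.BooleanOptionalAction,
                   default=True, help="Use fp16")
    p.add_argument("--batch_size", type=int, default=8, help="Frames per UNet batch")
    p.add_argument("--fps", type=int, default=25, help="Output frame rate")
    p.add_argument("--tts_voice", default="af_heart", help="Kokoro voice tag")
    p.add_argument("--tts_speed", type=float, default=1.0, help="TTS speech rate")
    p.add_argument("--llm_backend", choices=["echo", "openai", "ollama"],
                   default="echo", help="LLM backend")
    p.add_argument("--llm_model", default="llama3.2", help="LLM model name")
    p.add_argument("--ollama_host", default="http://localhost:11434",
                   help="Ollama server URL")
    p.add_argument("--port", type=int, default=7860, help="Flask server port")
    return p


# Parse an explicit argument list so the example does not depend on sys.argv.
args = build_parser().parse_args(["--avatar_image", "face.jpg",
                                  "--llm_backend", "ollama"])
```

With this declaration, omitting `--llm_backend` falls back to `echo`, which matches the table and lets the rest of the pipeline be exercised without any LLM service running.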
# Local Ollama (zero cost)
python run_musetalk_avatar.py --avatar_image face.jpg --llm_backend ollama --llm_model llama3.2
# OpenAI
python run_musetalk_avatar.py --avatar_image face.jpg --llm_backend openai \
--llm_model gpt-4o-mini --llm_api_key sk-...
# Echo (test pipeline without an LLM — reflects user input directly)
python run_musetalk_avatar.py --avatar_image face.jpg --llm_backend echo

On startup the pipeline pre-processes the reference image once:
- Detects face landmarks and bounding box
- Crops and resizes the face region to 256×256
- Encodes the crop through the VAE to get a reference latent
- Prepares blending masks via the face-parsing model
Because these results are cached, the avatar is never re-encoded at inference time; each batch only runs the audio-conditioned UNet, followed by VAE decoding and blending.
The musetalk/utils/enhancer.py module applies GFPGAN face restoration as a post-processing step after the full VAE decode + blending pipeline. Enhancement runs in pixel space and is optional.
To enable it, ensure the GFPGAN weights are downloaded (see Installation §7) and the enhancer is imported in your inference script:
from musetalk.utils.enhancer import enhance_frame
# called per-frame after get_image_blending()
frame = enhance_frame(frame)

Standard offline inference from the original repo still works:
# MuseTalk 1.5 (recommended)
sh inference.sh v1.5 normal
# MuseTalk 1.0
sh inference.sh v1.0 normal

See the original Getting Started section below for Gradio demo and training instructions.
.
├── run_musetalk_avatar.py        # Live avatar Flask server (entry point)
├── musetalk_avatar_pipeline.py   # Core streaming pipeline
├── tts_kokoro.py                 # Kokoro TTS wrapper
├── llm_wrapper.py                # Streaming LLM (Ollama / OpenAI / Echo)
├── musetalk/
│   └── utils/
│       ├── enhancer.py           # GFPGAN post-processing (added)
│       ├── blending.py
│       ├── audio_processor.py
│       ├── face_parsing.py
│       ├── preprocessing.py
│       └── utils.py
├── models/                       # Downloaded weights (see above)
├── experiments/
│   └── pretrained_models/
│       └── GFPGANv1.4.pth        # Optional GFPGAN weights
├── scripts/
│   └── inference.py
├── configs/
├── requirements.txt
├── download_weights.sh
└── app.py                        # Original Gradio demo
| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3050 Ti (4 GB VRAM) | RTX 3080 (10 GB VRAM) |
| RAM | 16 GB | 32 GB |
| CUDA | 11.7 | 11.8 |
| Python | 3.10 | 3.10 |
fp16 mode (--use_float16) is strongly recommended on consumer GPUs. On an RTX 3080, the pipeline sustains real-time 25 fps output with batch size 8.
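
The real-time claim implies a per-batch time budget: at 25 fps, a batch of 8 frames covers 8 / 25 = 0.32 s of video, so the per-batch work has to average under roughly 320 ms for output to keep pace with playback. The helper below just makes that arithmetic explicit. The function names are hypothetical, and the model ignores TTS synthesis time and encoding overhead, so it is a lower bound on what the worker actually has to fit in.

```python
def batch_budget_seconds(batch_size: int, fps: int) -> float:
    """Playback time covered by one batch of frames."""
    if batch_size <= 0 or fps <= 0:
        raise ValueError("batch_size and fps must be positive")
    return batch_size / fps


def real_time_factor(batch_latency_s: float, batch_size: int, fps: int) -> float:
    """Processing time divided by playback time; below 1.0 means real time."""
    return batch_latency_s / batch_budget_seconds(batch_size, fps)


# Example with a made-up latency: 160 ms per batch of 8 frames at 25 fps.
budget = batch_budget_seconds(8, 25)
rtf = real_time_factor(0.160, 8, 25)
```

A measured factor close to 1.0 leaves no headroom for TTS or GFPGAN enhancement, which is one reason to measure the latency of the full per-sentence path rather than the UNet alone.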
The sections below are preserved from the upstream MuseTalk repo for reference.
MuseTalk is a real-time high-quality audio-driven lip-syncing model (30 fps+ on Tesla V100), operating in the latent space of ft-mse-vae. It modifies a face region of 256×256 px, supports multiple languages, and is distinct from diffusion models — it inpaints in a single UNet step.
python app.py --use_float16 --ffmpeg_path /path/to/ffmpeg

# Data preprocessing
python -m scripts.preprocess --config ./configs/training/preprocess.yaml
# Stage 1
sh train.sh stage1
# Stage 2
sh train.sh stage2

- MuseTalk (TMElyralab): base model and architecture
- Kokoro TTS: text-to-speech synthesis
- GFPGAN (TencentARC): face restoration
- Whisper (OpenAI): audio feature extraction
- dwpose, face-parsing, face-alignment
@article{musetalk,
title={MuseTalk: Real-Time High-Fidelity Video Dubbing via Spatio-Temporal Sampling},
author={Zhang, Yue and Zhong, Zhizhou and Liu, Minhao and Chen, Zhaokang and Wu, Bin and
Zeng, Yubin and Zhan, Chao and He, Yingjie and Huang, Junxin and Zhou, Wenjiang},
journal={arxiv},
year={2025}
}