Imagine typing a script and watching a lifelike digital avatar bring it to life — speaking your text with natural expressions, synchronized lip movements, and realistic voice modulation. Welcome to Avatar Lab, where we merge the power of AI, deep learning, and speech synthesis to create real-time talking avatars.
The purpose of Avatar Lab is to build an end-to-end deep learning pipeline capable of converting textual input into realistic talking head videos. This is achieved by integrating a powerful text-to-speech model (SMALL-E) with a facial animation model (DiffDub). Our key focus is to enable seamless avatar generation that can be used in various domains like:
- Virtual Assistants 🚀
- Education/Online Learning 🎓
- Gaming Industry (NPC Speech) 🎮
- Digital Storytelling/Media Production 🎬
- Customer Support Automation 💬
The overall architecture is designed to convert text into talking-avatar videos efficiently: Small-E generates speech from the input text, and DiffDub animates a reference face to match that speech. The workflow can be sketched as follows:
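A minimal sketch of how the two stages could be wired together; the function names below (`synthesize_speech`, `animate_face`) are hypothetical wrappers around Small-E and DiffDub, not their actual APIs:

```python
# Hypothetical glue code: wrapper names and signatures are illustrative,
# not the real Small-E / DiffDub interfaces.

def synthesize_speech(text: str, speaker_prompt_wav: str) -> str:
    """Small-E stage: text + a short reference clip -> path to the generated speech WAV."""
    raise NotImplementedError("replace with the actual Small-E inference call")

def animate_face(reference_face: str, speech_wav: str) -> str:
    """DiffDub stage: reference face + speech audio -> path to the lip-synced video."""
    raise NotImplementedError("replace with the actual DiffDub inference call")

def text_to_talking_avatar(text: str, speaker_prompt_wav: str, reference_face: str) -> str:
    # Stage 1: zero-shot TTS, cloning the voice from the prompt clip.
    wav_path = synthesize_speech(text, speaker_prompt_wav)
    # Stage 2: audio-driven facial animation on the reference face.
    return animate_face(reference_face, wav_path)
```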
- Develop an efficient TTS model for natural speech and zero-shot voice cloning on limited hardware, using linear attention instead of quadratic transformer attention (see the sketch after this list).
- Replace decoder-only transformers with recurrent architectures (GLA, RWKV, Mamba) for efficient training on long sequences (up to 30 s) at lower cost.
- Introduce a modified cross-attention with positional feedback to prevent TTS failure modes such as skipping and repetition by improving text-audio alignment.
- Model TTS as language modeling over RVQ audio tokens, enabling zero-shot voice cloning without a separate speaker encoder.
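As a rough illustration of the linear-attention idea mentioned above (not the Small-E code itself), here is a minimal PyTorch sketch of kernelized linear attention with an `elu(x) + 1` feature map, which scales linearly rather than quadratically with sequence length:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    """Non-causal kernelized linear attention: O(n) in sequence length instead of O(n^2).

    q, k, v: tensors of shape (batch, seq_len, dim).
    The feature map elu(x) + 1 keeps the attention scores positive.
    """
    q = F.elu(q) + 1
    k = F.elu(k) + 1
    # Sum key-value outer products over positions once: (batch, dim, dim).
    kv = torch.einsum("bnd,bne->bde", k, v)
    # Per-query normalizer against the summed keys: (batch, seq_len).
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    # Apply queries to the aggregated state and normalize: (batch, seq_len, dim).
    return torch.einsum("bnd,bde,bn->bne", q, kv, z)
```

The recurrent architectures listed above (GLA, RWKV, Mamba) go further by carrying a running state across the sequence, which is what makes training on sequences of up to 30 s of audio affordable.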
- Text is processed with byte-pair encoding and a non-causal transformer encoder. Audio is compressed with an RVQ codec and then encoded by a stack of LCLM blocks.
- The Position-Aware Cross-Attention mechanism fuses the text and audio embeddings, using sinusoidal positional embeddings and a recurrent feedback loop to keep the alignment accurate (see the sketch after this list).
- An audio decoder (mirroring the encoder architecture) reconstructs audio tokens from the fused embeddings, outputting logits for final token prediction.
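The exact Position-Aware Cross-Attention and its feedback loop are specific to Small-E; the sketch below only illustrates the generic ingredients described above (sinusoidal positions added to queries and keys, audio states attending to text states, and alignment weights exposed so they could be fed back), with class and helper names of our own choosing:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(length, dim):
    """Standard sinusoidal positional embeddings of shape (length, dim); assumes dim is even."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32) * (-math.log(10000.0) / dim))
    pe = torch.zeros(length, dim)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe

class TextAudioCrossAttention(nn.Module):
    """Audio states attend to text states; positions are injected so the model can
    track how far along the text it is (reducing skips and repetitions)."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        # dim must be divisible by num_heads.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, audio_states, text_states):
        # audio_states: (B, T_audio, dim), text_states: (B, T_text, dim)
        audio_pos = sinusoidal_positions(audio_states.size(1), audio_states.size(-1)).to(audio_states.device)
        text_pos = sinusoidal_positions(text_states.size(1), text_states.size(-1)).to(text_states.device)
        q = audio_states + audio_pos
        k = text_states + text_pos
        fused, attn_weights = self.attn(q, k, text_states, need_weights=True)
        # attn_weights has shape (B, T_audio, T_text); in a position-aware design these
        # alignment weights can be fed back recurrently to bias later steps toward
        # a monotonic text-audio alignment.
        return fused, attn_weights
```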
Small-E trains faster, makes fewer errors, and sounds more natural than models such as YourTTS, though it does not yet match MetaVoice in quality.
At 64M parameters the model is efficient, but this small scale trades off some performance. Future work includes improving the linear-attention blocks and exploring streaming TTS for better quality and embedded (on-device) use.
DiffDub is a diffusion-based model designed to generate accurate lip-sync animation from any audio input.
Initially, we planned to use DiffTalk, but we later realized that DiffDub provides:
✅ Higher lip-sync accuracy.
✅ Better performance on low-latency video generation.
✅ Easier model integration.
- Generates realistic facial movements and expressions.
- Low-latency inference suitable for real-time generation.
- High-quality video outputs.
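In our pipeline, the animation stage simply takes the reference face and the Small-E speech audio and produces the final video. The snippet below is a hypothetical invocation; the entry-point script name and flags are placeholders, not DiffDub's actual CLI:

```python
import subprocess

def run_lip_sync(face_path: str, audio_path: str, out_path: str) -> None:
    """Call a lip-sync inference script as a subprocess.

    The script name and flags below are placeholders; the point is the data flow:
    reference face + driving audio in, lip-synced talking-head video out.
    """
    subprocess.run(
        [
            "python", "run_diffdub.py",   # placeholder entry point
            "--face", face_path,          # reference face (image or video)
            "--audio", audio_path,        # speech generated by Small-E
            "--output", out_path,         # resulting talking-head video
        ],
        check=True,
    )
```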
| Contributor Name | First Milestone PPT | First Milestone Video | Second Milestone PPT | Second Milestone Video |
|---|---|---|---|---|
| P HRITHIK RAJ | PPT | Video | PPT | Video |
| A YASHWANTH | PPT | Video | PPT | Video |
| NIKHILESH NILAGIRI | PPT | Video | PPT | Video |
| N MAHESH | PPT | Video | PPT | Video |
| V VISHAL RAJ | PPT | Video | PPT | Video |
| K PRASANA KUMAR | PPT | Video | PPT | Video |
We plan to enhance Avatar Lab by:
✅ Adding more realistic facial expressions.
✅ Integrating multi-lingual text-to-speech models.
✅ Improving real-time performance using CUDA acceleration.
✅ Adding customizable avatars and background settings.