Real-time video chat with AI: it can see you and hear you, then talks back.
Built with Groq APIs for blazing-fast inference. Single-file server, no frameworks, runs locally.
```
🎤 You speak    → Groq Whisper (STT)
📷 Camera frame → Groq Llama 4 Scout (Vision)
        ↓ (parallel)
🧠 Groq Llama 3.3 70B (Conversation) ← combines what it heard + saw
        ↓
🔊 edge-tts (Text-to-Speech) → AI speaks back
```
All processing runs through Groq's API, so no local GPU is needed. Typical round-trip: 2-4 seconds.
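The "(parallel)" step above can be sketched with `asyncio.gather`. This is a minimal illustration, not the project's actual code: `transcribe_audio` and `describe_frame` are hypothetical stand-ins for the real Groq calls, with sleeps simulating network latency.

```python
import asyncio

# Hypothetical stand-ins for the two Groq calls; the real server would
# call the Groq SDK here. The sleeps simulate network latency.
async def transcribe_audio(audio: bytes) -> str:
    await asyncio.sleep(0.1)          # simulated Whisper STT round-trip
    return "hello, what do you see?"

async def describe_frame(image: bytes) -> str:
    await asyncio.sleep(0.1)          # simulated vision-model round-trip
    return "a person waving at the camera"

async def handle_turn(audio: bytes, image: bytes) -> str:
    # STT and vision run concurrently, so this stage costs roughly
    # max(stt, vision) rather than their sum.
    transcript, scene = await asyncio.gather(
        transcribe_audio(audio),
        describe_frame(image),
    )
    return f"heard: {transcript} | saw: {scene}"

print(asyncio.run(handle_turn(b"", b"")))
```

Running STT and vision concurrently is what keeps the round-trip in the 2-4 second range: the conversation model only waits for the slower of the two, not both in sequence.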
Sign up at console.groq.com and create an API key.
```bash
git clone https://github.com/littleshuai-bot/ai-video-chat.git
cd ai-video-chat

# Set your API key
export GROQ_API_KEY=gsk_your_key_here

# Install dependencies
pip install -r requirements.txt

# Run
python server.py
```

Go to http://localhost:8765 → allow camera & microphone → click 🎤 to talk.
Copy `.env.example` to `.env` and customize:

```bash
cp .env.example .env
```

| Variable | Default | Description |
|---|---|---|
| `GROQ_API_KEY` | (required) | Your Groq API key |
| `AGENT_NAME` | `AI Assistant` | Name displayed on the AI avatar |
| `USER_NAME` | `You` | Name displayed on your video |
| `PORT` | `8765` | Server port |
| `LANGUAGE` | `zh` | STT language code (`en`, `zh`, `ja`, `ko`, `es`, `fr`, etc.) |
| `TTS_VOICE` | `zh-CN-XiaoxiaoNeural` | edge-tts voice (list voices) |
| `LLM_MODEL` | `llama-3.3-70b-versatile` | Groq LLM model for conversation |
| `VISION_MODEL` | `meta-llama/llama-4-scout-17b-16e-instruct` | Groq vision model |
| `AGENT_PERSONA` | (auto-generated) | Custom system prompt override |
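One way the server might resolve these variables, sketched here for illustration. `load_config` is a hypothetical helper (not lifted from `server.py`); the defaults are copied from the table above.

```python
import os

# Hypothetical config loader; defaults mirror the table above.
# GROQ_API_KEY is required, everything else falls back to a default.
def load_config() -> dict:
    return {
        "groq_api_key": os.environ["GROQ_API_KEY"],  # KeyError if unset
        "agent_name": os.getenv("AGENT_NAME", "AI Assistant"),
        "user_name": os.getenv("USER_NAME", "You"),
        "port": int(os.getenv("PORT", "8765")),
        "language": os.getenv("LANGUAGE", "zh"),
        "tts_voice": os.getenv("TTS_VOICE", "zh-CN-XiaoxiaoNeural"),
        "llm_model": os.getenv("LLM_MODEL", "llama-3.3-70b-versatile"),
        "vision_model": os.getenv(
            "VISION_MODEL", "meta-llama/llama-4-scout-17b-16e-instruct"
        ),
    }
```

Because everything except the API key has a sensible default, `export GROQ_API_KEY=...` alone is enough for a first run.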
English:
```
LANGUAGE=en
TTS_VOICE=en-US-AriaNeural
```

Chinese:
```
LANGUAGE=zh
TTS_VOICE=zh-CN-XiaoxiaoNeural
```

Japanese:
```
LANGUAGE=ja
TTS_VOICE=ja-JP-NanamiNeural
```

Requirements:

- Python 3.10+
- ffmpeg for audio conversion (`brew install ffmpeg` / `apt install ffmpeg`)
- Groq API key (free tier at console.groq.com)
- Modern browser with camera & microphone support
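A quick preflight check for the list above can be written in a few lines. `check_prereqs` is a hypothetical helper for illustration, not part of the repo:

```python
import os
import shutil
import sys

# Hypothetical preflight check mirroring the requirements list above;
# server.py may or may not perform similar checks itself.
def check_prereqs() -> list[str]:
    problems = []
    if sys.version_info < (3, 10):
        problems.append("Python 3.10+ required")
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH (brew/apt install ffmpeg)")
    if not os.getenv("GROQ_API_KEY"):
        problems.append("GROQ_API_KEY is not set")
    return problems

for p in check_prereqs():
    print("warning:", p)
```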
```
┌──────────────────────────────────────────────────┐
│                   Browser (UI)                   │
│   ┌──────────┐      ┌────────────────────┐       │
│   │  Camera  │      │     AI Avatar      │       │
│   │  (user)  │      │    + Subtitles     │       │
│   └──────────┘      └────────────────────┘       │
│   🎤 Record  → POST /api/chat (audio+image)      │
│              ← { text, audio_url }               │
└──────────────────────────────────────────────────┘
                        ↕
┌──────────────────────────────────────────────────┐
│             Python Server (FastAPI)              │
│                                                  │
│  Audio ──→ [ffmpeg] ──→ Groq Whisper (STT)       │
│  Image ──→ Groq Llama 4 Scout (Vision)           │ ← parallel
│                      ↓                           │
│  transcript + scene ──→ Groq Llama 3.3 (LLM)     │
│                      ↓                           │
│  reply text ──→ edge-tts (TTS) ──→ MP3           │
└──────────────────────────────────────────────────┘
```
The frontend is a single HTML file with no build step. The backend is a single Python file with FastAPI.
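The `[ffmpeg]` box deserves a note: browsers typically record WebM/Opus, which gets transcoded before the STT call. A sketch of the command the server might run; the exact flags (16 kHz mono WAV) are an assumption, not read from `server.py`:

```python
import subprocess  # used in the commented invocation below

# Hypothetical helper building the ffmpeg transcode command.
def ffmpeg_cmd(src: str, dst: str) -> list[str]:
    return [
        "ffmpeg", "-y",    # overwrite output if it exists
        "-i", src,         # e.g. recording.webm from the browser
        "-ar", "16000",    # 16 kHz sample rate, common for STT input
        "-ac", "1",        # downmix to mono
        dst,               # e.g. recording.wav for the Whisper call
    ]

# The server would then run it, e.g.:
# subprocess.run(ffmpeg_cmd("recording.webm", "recording.wav"),
#                check=True, capture_output=True)
```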
- 🎤 Voice Input: press to record, release to send
- 📷 Vision: the AI can see your camera feed
- 🔊 Voice Output: the AI speaks its replies
- 💬 Subtitles: typewriter-style text animation
- ⏱️ Call Timer: FaceTime-style UI
- 📱 Responsive: works on mobile & desktop
- 🌐 Multi-language: configurable STT language and TTS voice
- 🎭 Custom Persona: fully customizable AI personality
| Component | Technology | Why |
|---|---|---|
| STT | Groq Whisper Large v3 Turbo | Fastest Whisper inference available |
| Vision | Groq Llama 4 Scout | Multimodal understanding |
| LLM | Groq Llama 3.3 70B | Fast, high-quality conversation |
| TTS | edge-tts | Free, many voices, low latency |
| Server | FastAPI + uvicorn | Async Python, minimal overhead |
| Frontend | Vanilla HTML/CSS/JS | No build step, just works |
MIT
Built by ExtraSmall ✨