Turn audio into videos with AI-generated visuals and word-by-word captions.
Send an audio file to a Telegram bot and get back a video with matching images and synced captions. Uses AssemblyAI for transcription, your choice of LLM (DeepSeek or Kimi) for scene descriptions, and FFmpeg for video rendering.
Built with TypeScript and Bun. Supports multiple video styles, AI image generation via Cloudflare Workers or Together AI, and optional MinIO/S3 upload.
Setup Guide • Docker Guide • License
- Upload audio through Telegram (or send a URL)
- AI transcribes it with word-level timestamps
- AI generates image descriptions for each segment
- Images are generated (AI) or searched (DuckDuckGo)
- Video is rendered with word-by-word captions
- You get the finished video back
Processing time: 3-7 minutes for a typical 2-minute audio file.
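The word-level timestamps from the transcription step are what make word-by-word captions possible. As a rough illustration only (not the project's actual code), the sketch below assumes each transcribed word has a `text`, `start`, and `end`, with times in milliseconds. The names `Word`, `CaptionChunk`, `groupWords`, and `activeWordIndex` are hypothetical.

```typescript
// Hypothetical shape for one transcribed word; times are assumed to be in milliseconds.
interface Word {
  text: string;
  start: number;
  end: number;
}

// A group of consecutive words shown on screen together.
interface CaptionChunk {
  words: Word[];
  start: number;
  end: number;
}

// Group consecutive words into caption chunks of at most `maxWords` words.
function groupWords(words: Word[], maxWords = 4): CaptionChunk[] {
  const chunks: CaptionChunk[] = [];
  for (let i = 0; i < words.length; i += maxWords) {
    const slice = words.slice(i, i + maxWords);
    const first = slice[0];
    const last = slice[slice.length - 1];
    if (!first || !last) continue; // defensive: skip an empty slice
    chunks.push({ words: slice, start: first.start, end: last.end });
  }
  return chunks;
}

// Return the index of the word being spoken at time `t` (ms) within a chunk,
// or -1 if no word is active (for example, during a pause).
function activeWordIndex(chunk: CaptionChunk, t: number): number {
  return chunk.words.findIndex((w) => t >= w.start && t < w.end);
}
```

A renderer could call `activeWordIndex` once per frame to decide which word to highlight in karaoke-style captions.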
Prerequisites:
- Docker and Docker Compose
- Telegram bot token (get one from @BotFather)
- API keys for AssemblyAI and your chosen AI provider (DeepSeek or Kimi)
Setup:
cp .env.example .env
# Edit .env and add your API keys
Run:
docker-compose up -d
See DOCKER.md for detailed Docker instructions.
Prerequisites:
Install:
bun install
bun font/add.ts # Install caption font
Configure:
cp .env.example .env
# Edit .env and add your API keys
Run:
bun start
See the Setup Guide for detailed instructions, optional features, and API service setup.
The bot supports different video styles. Add a hashtag when sending audio to pick a style:
| Style | Hashtag | Look |
|---|---|---|
| History | #history (default) | Oil painting aesthetic, karaoke captions, pan effect |
| WW2 | #ww2 | Black-and-white archival photos, simple white captions |
You can also override specific settings with options:
#history --pan # Enable pan effect
#ww2 --karaoke # Enable karaoke highlighting
#history --highlight=yellow # Change highlight color
#ww2 --no-pan # Disable pan effect
Send /styles in Telegram to see all available styles and options.
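To make the hashtag and option syntax concrete, here is a minimal sketch of how a caption such as `#ww2 --karaoke --highlight=yellow` could be parsed. This is an illustration under stated assumptions, not the bot's actual parser: the names `StyleRequest` and `parseStyleCaption` are hypothetical, and the fallback to `history` simply mirrors the default marked in the table above.

```typescript
// Hypothetical result type: the chosen style plus any option overrides.
interface StyleRequest {
  style: string; // hashtag without the leading '#'
  options: Record<string, string | boolean>;
}

// Parse a caption like "#ww2 --karaoke --highlight=yellow".
// Assumed rules: "--name" sets true, "--no-name" sets false, "--name=value" sets a string.
function parseStyleCaption(caption: string): StyleRequest {
  const tokens = caption.trim().split(/\s+/).filter((t) => t.length > 0);
  let style = "history"; // assumed default, per the styles table
  const options: Record<string, string | boolean> = {};

  for (const token of tokens) {
    if (token.startsWith("#") && token.length > 1) {
      style = token.slice(1).toLowerCase();
    } else if (token.startsWith("--no-") && token.length > 5) {
      options[token.slice(5)] = false;
    } else if (token.startsWith("--") && token.length > 2) {
      const body = token.slice(2);
      const eq = body.indexOf("=");
      if (eq === -1) {
        options[body] = true;
      } else if (eq > 0) {
        options[body.slice(0, eq)] = body.slice(eq + 1);
      }
    }
    // Any other token is ignored in this sketch.
  }
  return { style, options };
}
```

Returning a plain object keeps style defaults and per-message overrides separate, so the overrides can be merged on top of whatever the chosen style defines.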
| Command | What it does |
|---|---|
| /start | Get started |
| /upload | Upload an audio file |
| /url | Process audio from a URL |
| /queue | Check pending jobs |
| /styles | List available styles |
| /help | Show usage instructions |
| /cleanup | Clear temp files |
Telegram limits downloads to 20MB. For bigger files:
- Compress your audio first, or
- Upload to a file host and use /url <link>
/url https://example.com/large-audio.mp3 #history
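The size check and the `/url` fallback can be sketched as follows. This is a hedged illustration, not the bot's implementation: `TELEGRAM_DOWNLOAD_LIMIT_BYTES`, `exceedsTelegramLimit`, and `parseUrlCommand` are hypothetical names, and treating 20MB as 20 × 1024 × 1024 bytes is an assumption.

```typescript
// Assumption: the 20MB limit is interpreted as 20 * 1024 * 1024 bytes.
const TELEGRAM_DOWNLOAD_LIMIT_BYTES = 20 * 1024 * 1024;

// True when a file is too large for the bot to download directly from Telegram.
function exceedsTelegramLimit(fileSizeBytes: number): boolean {
  return fileSizeBytes > TELEGRAM_DOWNLOAD_LIMIT_BYTES;
}

// Parse a message like "/url https://example.com/large-audio.mp3 #history".
// Returns the URL and the remaining text (style hashtag and options), or null if invalid.
function parseUrlCommand(message: string): { url: string; rest: string } | null {
  const [command, rawUrl, ...rest] = message.trim().split(/\s+/);
  if (command !== "/url" || !rawUrl) return null;
  // Only accept http(s) links in this sketch.
  if (!/^https?:\/\/\S+$/i.test(rawUrl)) return null;
  return { url: rawUrl, rest: rest.join(" ") };
}
```

In this sketch, a handler would suggest `/url` whenever `exceedsTelegramLimit` returns true, and pass `rest` on to whatever parses the style hashtag.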
- AssemblyAI - Transcription with word-level timing
- DeepSeek / Kimi - LLM for scene descriptions
- Cloudflare Workers - AI image generation (SDXL 1.0)
- Together AI - AI image generation (FLUX.1-schnell)
- MinIO / AWS S3 - Optional video storage
OCL v1.0. Free for personal use. Commercial use requires contributing back. See LICENSE.md.