# Video Caption Suite

Batch video captioning using the Qwen3-VL-8B vision-language model. Select a directory, process videos, and get captions saved alongside them.
## Requirements

- Python 3.10+
- CUDA-capable GPU (8GB+ VRAM recommended)
- Node.js 18+ (for frontend build)
## Installation

Windows:

```
install.bat
```

Linux/Mac:

```
chmod +x install.sh
./install.sh
```
This creates a virtual environment and installs all dependencies.
## Usage

Windows:

```
start.bat
```

Linux/Mac:

```
./start.sh
```
Open http://localhost:8000 in your browser.
- Click Settings and select your working directory
- Videos from that directory appear in the grid
- Select videos and click Process
- Captions are saved as `.txt` files alongside the videos
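For example, after processing, the working directory might look like this (file names are illustrative, and this assumes each caption reuses its video's base name):

```
clips/
├── beach_sunset.mp4
├── beach_sunset.txt
├── city_drone.mp4
└── city_drone.txt
```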
## Configuration

Edit `config.py` to adjust:

| Setting | Default | Description |
|---|---|---|
| `MODEL_ID` | `Qwen/Qwen3-VL-8B-Instruct` | HuggingFace model |
| `MAX_FRAMES_PER_VIDEO` | `128` | Frames extracted per video |
| `FRAME_SIZE` | `336` | Frame dimension in pixels |
| `MAX_TOKENS` | `512` | Maximum caption length in tokens |
| `TEMPERATURE` | `0.3` | Sampling temperature; higher values give more varied captions |
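For reference, the defaults above correspond to entries in `config.py` roughly like this (a sketch; the real file may define additional settings):

```python
# config.py — sketch of the settings listed in the table above
MODEL_ID = "Qwen/Qwen3-VL-8B-Instruct"  # HuggingFace model to load
MAX_FRAMES_PER_VIDEO = 128              # frames extracted per video
FRAME_SIZE = 336                        # frame dimension in pixels
MAX_TOKENS = 512                        # maximum caption length in tokens
TEMPERATURE = 0.3                       # sampling temperature for generation
```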
## Multi-GPU Support

On systems with multiple CUDA GPUs, the suite automatically detects available devices and enables parallel processing:

- Auto-detection: GPUs are detected on startup via `/api/system/gpu` (see the sketch after this list)
- Batch size: set how many videos to process simultaneously (1 per GPU, max 8)
- Parallel workers: each GPU loads its own model copy and processes videos independently
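A minimal sketch of what that detection can look like on the backend (an assumed implementation using PyTorch; the actual handler behind `/api/system/gpu` may differ):

```python
import torch

def list_gpus() -> list[dict]:
    """Report visible CUDA devices; returns an empty list on CPU-only hosts."""
    if not torch.cuda.is_available():
        return []
    gpus = []
    for i in range(torch.cuda.device_count()):
        props = torch.cuda.get_device_properties(i)
        gpus.append({
            "index": i,
            "name": props.name,
            "total_vram_gb": round(props.total_memory / 1024**3, 1),
        })
    return gpus
```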
The batch size slider appears in Settings → Optimization only when multiple GPUs are detected. Each GPU requires ~16GB VRAM to hold the Qwen3-VL-8B model.
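With the server running, you can query the endpoint directly to see what was detected; the response shape shown below is illustrative, not a documented schema:

```python
import requests

# Assumes the server started via start.bat/start.sh is listening on port 8000.
resp = requests.get("http://localhost:8000/api/system/gpu")
resp.raise_for_status()
print(resp.json())  # e.g. [{"index": 0, "name": "RTX 4090", "total_vram_gb": 24.0}]
```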
## Project Structure

```
Video Caption Suite/
├── backend/          # FastAPI server
├── frontend/         # Vue 3 UI
├── models/           # Downloaded model cache
├── config.py         # Settings
├── install.bat/sh    # Installation
└── start.bat/sh      # Launch server
```
## License

MIT
