Interactive Video-Language Model (VLM) Playground with Live Prompt Editing
A real-time sandbox for experimenting with VideoLLaMA3-7B on short video clips through a web UI. Designed as a university project for the course Prompt Engineering (VITMAV82). The app:
- captures video (webcam or static file),
- buffers short clips,
- runs VLM inference,
- streams output to a browser UI,
- supports interactive prompt editing + preset management.
Tested on Ubuntu + RTX 4090.
- 🎥 Webcam or static video input (OpenCV)
- ⏱ Rolling clip buffer (few-second segments)
- 🧠 VLM inference via DAMO-NLP-SG/VideoLLaMA3-7B
- 🌡 Adjustable temperature (UI-controlled)
- 🧵 Thread-safe prompt updates
- 💾 Persistent user presets
- 🗂 Prompt history logging with timestamps
- 🌐 Live video streaming to the browser
- 📜 Rich structured logging (Loguru)
Pipeline:

```
Video Source (Webcam / File)
        ↓
ClipRecorder (temporal buffer → .mp4)
        ↓
ThreadPoolExecutor
        ↓
VideoLLaMA3-7B (Transformers)
        ↓
WebSocket broadcast
        ↓
Browser UI
```
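The ThreadPoolExecutor stage can be sketched as below. This is a minimal sketch: `run_clip`, `infer`, and `broadcast` are hypothetical names standing in for the repo's inference and WebSocket code.

```python
from concurrent.futures import ThreadPoolExecutor

# A single worker means only one clip touches the GPU at a time,
# matching the "avoids GPU contention" design note.
executor = ThreadPoolExecutor(max_workers=1)

def run_clip(clip_path, prompt, infer, broadcast):
    """Queue one clip for inference; broadcast the answer when done.

    `infer` and `broadcast` are placeholders for the model call and
    the WebSocket fan-out, respectively.
    """
    future = executor.submit(infer, clip_path, prompt)
    # The done-callback fires in the worker thread once inference finishes,
    # so the capture loop never blocks on the GPU.
    future.add_done_callback(lambda f: broadcast(f.result()))
```

Because submissions queue up behind the single worker, a slow inference naturally backpressures the pipeline instead of oversubscribing VRAM.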
- Clips are written to disk temporarily, then deleted after inference.
- Single-worker inference executor (avoids GPU contention).
- Prompt state shared via a thread-safe Prompt object.
- Presets stored in JSON.
- Prompt history persisted only when content changes.
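The thread-safe prompt sharing might look like the sketch below. The shape is an assumption; the repo's actual Prompt class may differ.

```python
import threading

class Prompt:
    """Minimal sketch of a thread-safe prompt holder.

    The UI thread calls set() while the inference worker calls get(),
    so a lock guards the shared text.
    """

    def __init__(self, text=""):
        self._lock = threading.Lock()
        self._text = text

    def set(self, text):
        """Update the prompt; return True only if the content changed,
        mirroring the 'history persisted only on change' rule."""
        with self._lock:
            if text == self._text:
                return False
            self._text = text
            return True

    def get(self):
        with self._lock:
            return self._text
```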
Model: DAMO-NLP-SG/VideoLLaMA3-7B
LLM Framework: HuggingFace Transformers
Attention: FlashAttention2
Precision: bfloat16
Device: cuda:0
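Under these settings, loading the model with Transformers would look roughly like this. The `Auto*` classes and `trust_remote_code` flag are assumptions based on how custom-code checkpoints are usually loaded; check the model card for the canonical snippet.

```python
MODEL_ID = "DAMO-NLP-SG/VideoLLaMA3-7B"

# Loading options matching the configuration above.
LOAD_KWARGS = dict(
    torch_dtype="bfloat16",                   # Precision: bfloat16
    attn_implementation="flash_attention_2",  # Attention: FlashAttention2
    device_map={"": "cuda:0"},                # Device: cuda:0
    trust_remote_code=True,                   # model ships custom code
)

def load_model():
    # Imports deferred so the config can be inspected without a GPU setup.
    from transformers import AutoModelForCausalLM, AutoProcessor
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID, **LOAD_KWARGS)
    processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
    return model, processor
```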
→ Ubuntu
→ RTX 4090 (24GB VRAM)
→ CUDA compatible environment
→ ~24GB VRAM required (no quantization)
→ ~200 ms inference latency at 10 FPS resampling
→ ~400 ms at 30–40 FPS (in a similar setup)
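The latency gap comes from how many frames the model must encode per clip. A minimal sketch of FPS resampling (illustrative, not the repo's implementation):

```python
def resample_indices(n_frames, src_fps, target_fps):
    """Indices of frames to keep when downsampling a clip from src_fps
    to target_fps. E.g. 30 -> 10 FPS keeps every third frame, which
    cuts the vision tokens (and hence latency) roughly 3x.
    """
    if target_fps >= src_fps:
        return list(range(n_frames))
    step = src_fps / target_fps
    idx, out = 0.0, []
    while int(idx) < n_frames:
        out.append(int(idx))
        idx += step
    return out
```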
```bash
git clone https://github.com/LordLokator/VLMPromptEngineeringSandbox VLMDemo
cd VLMDemo
bash setup.sh
source .venv/bin/activate
python main.py
```
Then open:
http://localhost:8000/
Top left:
- Live video stream

Bottom left:
- Preset buttons (random colors)
- Save preset (disabled if empty)
- Load preset
- Temperature slider

Right panel:
- 🟢 VLM outputs
- 🔵 User messages
- ⚫ System acknowledgments
Presets stored in `./presets/user_presets.json`.
Prompt history stored in `./prompting/history.json`.
- Dominant Action
- Vehicle Types
- OCR Pass
- Parking Events
- Unusual Behavior
- Activity Level
This was built for a university course on Prompt Engineering.
The goal was not to build another LLM chat interface, but to:
- explore prompt design for multimodal models
- experiment with temporal video chunking
- observe latency vs. FPS trade-offs
- create a controlled sandbox for VLM behavior