Research: Video Resolution vs Token Usage
Qwen3-VL Token Calculation
Formula: tokens = total_pixels / 1024 (32×32 pixel block per token)
| Resolution | Pixels | Tokens per Frame |
|---|---|---|
| 1080p (1920×1080) | 2,073,600 | ~2,025 tokens |
| 4K (3840×2160) | 8,294,400 | ~8,100 tokens |
4K uses exactly 4× as many tokens as 1080p, because it has exactly 4× the pixels.
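The per-frame figures in the table can be reproduced with a minimal sketch of the formula above. The function name `tokens_per_frame` is hypothetical, and the sketch ignores any rounding of frame dimensions to multiples of 32 that the model may apply.

```python
def tokens_per_frame(width: int, height: int, block: int = 32) -> float:
    """Estimate visual tokens for one frame as pixels / (block * block)."""
    return (width * height) / (block * block)


# Resolutions taken from the table above.
for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    print(f"{name}: {tokens_per_frame(w, h):,.0f} tokens/frame")
# 1080p: 2,025 tokens/frame
# 4K: 8,100 tokens/frame
```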
Video-Specific Details
- Temporal compression: 2× (every 2 frames compressed together)
- Default limits: Model may resize frames to fit a pixel budget
- Default max: ~20,480 × 32 × 32 pixels total across all frames
- Recommended token range: 256-16,384 per video
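To get a feel for when the default pixel budget would force a resize, here is a rough sketch. It assumes the budget applies to the raw pixel count summed over all frames and that frames are shrunk uniformly in both dimensions; the model's actual resizing logic may differ. The name `budget_scale` is hypothetical.

```python
import math


def budget_scale(width: int, height: int, frames: int,
                 max_tokens: int = 20_480, block: int = 32) -> float:
    """Return the per-side scale factor needed to fit the pixel budget.

    Assumption: budget = max_tokens * block * block pixels across all frames.
    A result of 1.0 means no resize is needed.
    """
    budget_pixels = max_tokens * block * block
    total_pixels = width * height * frames
    if total_pixels <= budget_pixels:
        return 1.0
    # Pixel count scales with the square of the per-side factor.
    return math.sqrt(budget_pixels / total_pixels)


print(budget_scale(1920, 1080, frames=1))   # a single 1080p frame fits
print(budget_scale(1920, 1080, frames=20))  # 20 frames need shrinking
```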
Practical Impact
For a 10-second clip at 2 fps (20 frames):
- 1080p: ~40,500 tokens (with temporal compression: ~20,250)
- 4K: ~162,000 tokens (with temporal compression: ~81,000)
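The clip-level estimate is the per-frame token count times the number of frames, divided by the temporal compression factor. A minimal sketch, assuming the 2× temporal compression simply halves the total; `clip_tokens` is a hypothetical helper name.

```python
def clip_tokens(width: int, height: int, seconds: float, fps: float,
                temporal_merge: int = 2, block: int = 32) -> tuple[float, float]:
    """Return (raw_tokens, tokens_after_temporal_compression) for a clip."""
    frames = int(seconds * fps)
    per_frame = (width * height) / (block * block)
    raw = frames * per_frame
    # Assumption: merging every `temporal_merge` frames divides tokens evenly.
    return raw, raw / temporal_merge


for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    raw, compressed = clip_tokens(w, h, seconds=10, fps=2)
    print(f"{name}: raw={raw:,.0f} compressed={compressed:,.0f}")
```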
Recommendation
Downscale videos to 1080p or even 720p before sending. This will:
- Use about a quarter of the tokens (1080p compared to 4K)
- Reduce KV cache pressure
- Likely maintain similar accuracy for kill detection
Implementation Ideas
- Add optional `--resize` flag to client script
- Use ffmpeg to downscale before base64 encoding
- Consider 720p for even more savings (~900 tokens/frame)
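One way the downscale step could look in the client script, as a sketch only. The helper names and file paths are hypothetical, it assumes `ffmpeg` is on `PATH`, and it always scales to the target height (including upscaling smaller inputs), so a real `--resize` flag would probably want a guard.

```python
import base64
import subprocess
from pathlib import Path


def build_ffmpeg_cmd(src: str, dst: str, height: int = 1080) -> list[str]:
    """Build an ffmpeg command that rescales `src` to the given height."""
    return [
        "ffmpeg", "-y",               # overwrite dst if it exists
        "-i", src,
        "-vf", f"scale=-2:{height}",  # -2 keeps aspect ratio with an even width
        "-c:a", "copy",               # leave any audio stream untouched
        dst,
    ]


def downscale_and_encode(src: str, dst: str, height: int = 1080) -> str:
    """Downscale with ffmpeg, then return the result as a base64 string."""
    subprocess.run(build_ffmpeg_cmd(src, dst, height), check=True)
    return base64.b64encode(Path(dst).read_bytes()).decode("ascii")
```

Calling `downscale_and_encode("clip.mp4", "clip_720p.mp4", height=720)` would write the resized file and return the payload ready to send.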