Optimization: Downscale videos to reduce token usage #1

@1kuna

Research: Video Resolution vs Token Usage

Qwen3-VL Token Calculation

Formula: tokens per frame ≈ (width × height) / 1024, i.e. one token per 32×32-pixel block

| Resolution | Pixels | Tokens per frame |
|---|---|---|
| 1080p (1920×1080) | 2,073,600 | ~2,025 |
| 4K (3840×2160) | 8,294,400 | ~8,100 |

4K has exactly 4× the pixels of 1080p, so under this formula it uses 4× as many tokens per frame.
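
As a quick check of the formula above, here is a minimal sketch that reproduces the per-frame figures. The function name `tokens_per_frame` is made up for illustration, and the code assumes the simple pixels / 1024 estimate with no rounding of frame dimensions to multiples of 32, which the real preprocessor may apply.

```python
PATCH = 32  # side length in pixels of the block that maps to one token


def tokens_per_frame(width: int, height: int, patch: int = PATCH) -> float:
    """Estimate visual tokens for one frame as total_pixels / patch**2."""
    return (width * height) / (patch * patch)


for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    print(f"{name}: {tokens_per_frame(w, h):,.0f} tokens/frame")
# 1080p: 2,025 tokens/frame
# 4K: 8,100 tokens/frame
```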

Video-Specific Details

  • Temporal compression: 2× (every 2 frames compressed together)
  • Default limits: Model may resize frames to fit a pixel budget
    • Default max: ~20,480 × 32 × 32 pixels total across all frames
  • Recommended token range: 256-16,384 per video
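
To get a feel for when the default pixel budget would kick in, the sketch below computes the linear scale factor needed to fit a clip inside it. This is only an approximation under stated assumptions: the budget is taken as 20,480 × 32 × 32 pixels summed over all sampled frames, frames are scaled uniformly, and any rounding or temporal merging the actual preprocessor does is ignored. `scale_to_fit` is a hypothetical helper, not part of any library.

```python
import math

PATCH = 32
# Approximate default budget quoted above, in pixels across all frames.
MAX_VIDEO_PIXELS = 20_480 * PATCH * PATCH


def scale_to_fit(width: int, height: int, num_frames: int,
                 max_pixels: int = MAX_VIDEO_PIXELS) -> float:
    """Return the per-side scale factor (<= 1.0) that fits the clip in max_pixels."""
    total = width * height * num_frames
    if total <= max_pixels:
        return 1.0  # already within budget, no resize needed
    # Pixel count grows with the square of the side length, hence the sqrt.
    return math.sqrt(max_pixels / total)


print(f"{scale_to_fit(1920, 1080, 20):.2f}")
```

Under these assumptions, 20 frames of 1080p already exceed the budget, so the model would shrink each side to roughly 71% on its own. Downscaling on the client makes that resize explicit and avoids uploading pixels that would be discarded anyway.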

Practical Impact

For a 10-second clip at 2 fps (20 frames):

  • 1080p: ~40,500 tokens (with temporal compression: ~20,250)
  • 4K: ~162,000 tokens (with temporal compression: ~81,000)
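
The clip-level estimate can be sketched as frames × tokens per frame, then divided by the temporal merge factor. The helper `clip_tokens` is hypothetical, and it assumes the per-frame formula and the 2× temporal compression described above, with no pixel-budget resizing applied.

```python
import math


def clip_tokens(width: int, height: int, duration_s: float, fps: float,
                patch: int = 32, temporal_merge: int = 2):
    """Return (frames, tokens before temporal compression, tokens after)."""
    frames = int(duration_s * fps)
    per_frame = (width * height) / (patch * patch)
    # Every `temporal_merge` consecutive frames share one set of tokens.
    temporal_steps = math.ceil(frames / temporal_merge)
    return frames, per_frame * frames, per_frame * temporal_steps


for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    frames, raw, merged = clip_tokens(w, h, duration_s=10, fps=2)
    print(f"{name}: {frames} frames, {raw:,.0f} raw, {merged:,.0f} after merge")
```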

Recommendation

Downscale videos to 1080p or even 720p before sending. This will:

  • Use about one quarter of the tokens at 1080p (compared to 4K)
  • Reduce KV cache pressure
  • Likely maintain similar accuracy for kill detection

Implementation Ideas

  • Add optional --resize flag to client script
  • Use ffmpeg to downscale before base64 encoding
  • Consider 720p for even more savings (~900 tokens/frame, since 1280×720 = 921,600 pixels)
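
A rough sketch of how these ideas could fit together in the client script follows. The function names, the `--resize HEIGHT` semantics, and the temporary output file are all assumptions for illustration; only the ffmpeg options (`-y`, `-i`, `-vf scale=-2:H`, `-c:a copy`) are standard. Note that `scale=-2:H` will also upscale sources smaller than `H`, so a real implementation would probably want to guard against that.

```python
from __future__ import annotations

import argparse
import base64
import subprocess
import tempfile
from pathlib import Path


def build_ffmpeg_cmd(src: str, dst: str, height: int) -> list[str]:
    """Build an ffmpeg command that rescales to `height`, keeping aspect ratio."""
    return [
        "ffmpeg", "-y",                # overwrite dst if it already exists
        "-i", src,
        "-vf", f"scale=-2:{height}",   # -2: pick an even width matching the aspect ratio
        "-c:a", "copy",                # leave the audio stream untouched
        dst,
    ]


def encode_video(path: str, resize_height: int | None = None) -> str:
    """Return the video as a base64 string, optionally downscaled first."""
    if resize_height is None:
        return base64.b64encode(Path(path).read_bytes()).decode("ascii")
    with tempfile.TemporaryDirectory() as tmp:
        out = str(Path(tmp) / "resized.mp4")
        subprocess.run(build_ffmpeg_cmd(path, out, resize_height), check=True)
        return base64.b64encode(Path(out).read_bytes()).decode("ascii")


def parse_args(argv: list[str] | None = None) -> argparse.Namespace:
    """Parse the hypothetical client CLI: a video path plus optional --resize."""
    parser = argparse.ArgumentParser(description="Send a clip to the model")
    parser.add_argument("video", help="path to the input video")
    parser.add_argument("--resize", type=int, default=None, metavar="HEIGHT",
                        help="downscale to this height (e.g. 1080 or 720) before encoding")
    return parser.parse_args(argv)
```

The client would then call `encode_video(args.video, args.resize)` after `args = parse_args()`, so existing invocations without `--resize` keep their current behavior.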
