Skip to content

Latest commit

 

History

History
194 lines (137 loc) · 6.88 KB

File metadata and controls

194 lines (137 loc) · 6.88 KB

video2ai

Turn any video into AI-ready structured content. Extract frames, transcribe audio, auto-detect key moments — all running locally on your Mac's Neural Engine.

No cloud. No API keys. No PyTorch. Just Apple Silicon doing what it does best.

PyPI version License: MIT Python 3.10+


Installation

Prerequisites

  • macOS (Apple Vision framework required)
  • ffmpegbrew install ffmpeg

Option 1: pip (recommended)

pip install video2ai

With Apple Vision support (OCR, embeddings, classification):

pip install "video2ai[vision]"

Option 2: Homebrew

brew tap sameeeeeeep/video2ai https://github.com/sameeeeeeep/video2ai.git
brew install video2ai

Option 3: Download binary

Grab the latest pre-built macOS binary from Releases — no Python required:

curl -L https://github.com/sameeeeeeep/video2ai/releases/latest/download/video2ai -o video2ai
chmod +x video2ai
sudo mv video2ai /usr/local/bin/

Option 4: Install from source

git clone https://github.com/sameeeeeeep/video2ai.git && cd video2ai
pip install -e ".[vision]"

Optional extras

pip install openai-whisper    # transcription (local, base model is fine)
brew install yt-dlp           # URL downloads (YouTube, Vimeo, etc.)

The Problem

You have a video. You need an AI to understand it. But LLMs can't watch videos — they need frames + text. Manually scrubbing through to pick the right frames is tedious. Existing tools are slow, memory-hungry, or require cloud APIs.

The Solution

Video → ffmpeg + Whisper + Apple Vision → structured content in seconds

Drop a video in. Get back:

  • Timestamped transcript — Whisper, fully local
  • Key frames auto-selected per transcript segment — Apple Vision Neural Engine embeddings + cosine similarity
  • Visual theme clusters — k-means on frame embeddings, filter out talking heads, keep product shots
  • Lightweight Markdown export — local image paths, no base64 bloat, AI reads text instantly and loads images on demand
  • Self-contained HTML export — images embedded inline, for human viewing
  • Screen capture — record any tab/screen directly from the browser, bypasses all platform download restrictions

Quick Start

Web UI

video2ai --web
# → http://localhost:8910

Three input modes:

  • Upload — drag a video file
  • Paste URL — YouTube, Threads, Vimeo, anything yt-dlp supports
  • Screen Capture — share any browser tab or screen, record at 1fps + audio, process through the same pipeline. Works with Instagram, TikTok, Netflix — anything on screen.

CLI

video2ai video.mp4 -o output/

Claude Code Skill

# Invoke from Claude Code:
/video2ai /path/to/video.mp4

The skill runs the full pipeline and outputs a lightweight Markdown file that Claude can read with local image paths.

How It Works

Video file / URL / Screen capture
  │
  ├─ ffmpeg ──────────── frames (1/sec, JPEG)
  │
  ├─ Whisper ─────────── transcript segments + timestamps
  │
  ├─ Apple Vision ────── 768-dim embedding per frame (Neural Engine)
  │    │
  │    ├─ per-segment ── cosine distance → visual state changes → key frame suggestions
  │    │
  │    └─ global ─────── k-means clustering → visual theme groups
  │
  ├─ Apple Vision OCR ── optional, on-demand text extraction from key frames
  │
  └─ Apple Intelligence ── on-device OCR summary via FoundationModels (auto-launches server)

The key insight: frame selection is a vector math problem, not an LLM problem. Embed every frame, embed (or timestamp-match) every transcript segment, pick the frames with the highest visual distinctiveness per segment. Runs in seconds, not minutes.

Zero ML overhead in Python. VNGenerateImageFeaturePrintRequest runs on the Neural Engine — the Python process just shuffles bytes. No PyTorch, no CLIP, no transformers loaded into RAM.

The Workflow

  1. Upload, paste URL, or screen capture — any input mode
  2. Pipeline runs — probe → extract → transcribe → embed → suggest
  3. Review — transcript sidebar, frame grid per segment, pre-selected key frames
  4. Filter by visual theme — click to deselect/select all frames in a theme, right-click to suppress
  5. OCR (optional) — run Apple Vision OCR on selected key frames, auto-summarized by Apple Intelligence on-device
  6. Export — Markdown (for AI) or HTML (for humans). OCR summary included by default, raw OCR opt-in.

Export Formats

Format Mode Best for
Markdown Download for AI AI consumption — lightweight text + local image paths, ~150 lines vs 170k tokens
HTML Download HTML Human viewing — self-contained, base64 images, opens in any browser
HTML (AI) ?mode=ai Compressed thumbnails, still self-contained

Architecture

Module What it does
probe.py ffprobe wrapper — duration, resolution, codecs, audio detection
frames.py ffmpeg frame extraction at configurable intervals
transcribe.py Whisper speech-to-text, returns timed segments
clip_match.py Apple Vision embeddings, visual change detection, k-means clustering
vision.py Apple Vision OCR + image classification + Apple Intelligence summarization
llm.py Ollama LLM analysis — optional, for summaries
web.py Flask web UI — upload, URL, screen capture, review, export
embed.py Bake metadata into video via ffmpeg

Why Not Just Use CLIP?

We tried. CLIP + PyTorch eats ~2GB RAM and requires loading a 600MB model. Apple Vision's VNGenerateImageFeaturePrintRequest runs on the Neural Engine with near-zero memory overhead — it's already on your machine, already optimized, and produces 768-dim embeddings that work great for frame similarity.

For transcript↔frame matching, we don't even need cross-modal embeddings. The transcript gives us timestamps → we know which frames belong to which segment → we pick the most visually distinct ones within each segment. Simple, fast, accurate.

Contributing

git clone https://github.com/sameeeeeeep/video2ai.git && cd video2ai
make dev

Releasing

make release VERSION=0.2.0

This bumps the version, commits, tags, and pushes. GitHub Actions handles PyPI publishing, binary builds, and Homebrew formula updates automatically.

License

MIT


Built with Claude Code.