by Raymond Gan. See YouTube video of my demo. Try it at https://raymond.hopto.org.
Inspired by Netflix's In-Video Search System, I built my own in 3 days at Tubi's Oct. 2025 company hackathon.
It's a semantic video search system that lets you quickly find and play specific moments in videos using natural language queries. You can search multiple videos by dialogue, descriptions, people, objects, or scenes.
Built in Python with a FastAPI backend and a Streamlit frontend. I used OpenAI CLIP for image embeddings, OpenAI Whisper for speech recognition, and FAISS (Facebook AI Similarity Search) to index and search the vector embeddings.
Viewers can use this to quickly find and jump to favorite scenes, video editors to cut movie trailers and social media clips, and advertisers to jump to specific ad products in videos. Content moderators or standards-and-practices lawyers could use it to quickly find objectionable content.
My coworkers voted my project into the top 18 of the 62 hackathon projects!
- 🎬 Automatic Shot Detection: Intelligently detects shot changes in videos
- 🖼️ Multi-Frame Pooling: Netflix-style shot representation using 3 frames per shot
- 🔍 Dual-Modal Search: Search by both visual content and spoken dialogue
- ⏱️ Precise Timestamps: Jump directly to relevant moments with exact timing
- 🎥 Inline Video Player: Play videos directly in browser at exact timestamps
- 🎛️ Search Balance Control: Adjustable alpha slider to weight image search against dialogue search
- 🚀 Real-time Processing: Fast video processing and indexing pipeline. It can run locally on a laptop or, for speed, on a virtual machine with a GPU. I ran my demo on a Nebius virtual machine with an NVIDIA H200 NVLink GPU and Intel Sapphire Rapids CPUs, running Ubuntu 22.04 (CUDA 12). I configured this Nebius VM from scratch.
- 🧹 Data Management: Tools to clear processed data and restart
```
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Streamlit UI   │    │     FastAPI     │    │   Video Files   │
│   (Frontend)    │◄──►│    (Backend)    │◄──►│  (Processing)   │
│   Port 8501     │    │    Port 8000    │    │                 │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         │                      │                      │
         │                      ▼                      │
         │             ┌─────────────────┐             │
         │             │    AI Models    │             │
         │             │  OpenAI CLIP /  │             │
         │             │     Whisper     │             │
         │             └─────────────────┘             │
         │                      │                      │
         ▼                      ▼                      ▼
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│    Search UI    │    │    Vector DB    │    │   Thumbnails    │
│    (Results)    │    │     (FAISS)     │    │   (JPG Files)   │
└─────────────────┘    └─────────────────┘    └─────────────────┘
```
- FastAPI Application (`app.py`): REST API endpoints for video processing and search
- Video Processing (`video_tools.py`): Shot detection using PySceneDetect and FFmpeg
- AI Models (`models.py`): OpenAI speech (Whisper) and image embeddings (CLIP)
- Vector Search (`index.py`): FAISS (Facebook AI Similarity Search)
- Data Storage (`store.py`): JSONL-based metadata persistence
- Subtitle Search (`subs_index.py`): FAISS vector search for subtitle embeddings
- Speech Recognition (`asr.py`): Automatic Speech Recognition using `faster-whisper`
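For orientation, here is a minimal sketch of how these backend pieces might be wired together in `app.py`. It is not the actual implementation; the function bodies, defaults, and static-files mount are assumptions based on the parameters documented later in this README.

```python
# Minimal sketch of the backend wiring; bodies are placeholders, not the real code.
from fastapi import FastAPI
from fastapi.staticfiles import StaticFiles

app = FastAPI(title="In-Video Search")
app.mount("/static", StaticFiles(directory="data"), name="static")  # thumbnails + videos

@app.post("/process_video")
def process_video(video_path: str, video_id: str, shot_threshold: float = 27.0):
    # Detect shots, extract 3 frames per shot, embed them with CLIP,
    # run ASR for subtitles, then write FAISS indexes + JSONL metadata.
    ...

@app.post("/search")
def search(query: str, k: int = 10, alpha: float = 0.6):
    # Embed the query with CLIP, query the image and subtitle indexes,
    # and fuse the two score lists using the alpha weight.
    ...
```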
- Streamlit Interface (`app.py`): Web-based user interface for video upload and search
- Environment Detection: Automatically detects Mac vs Nebius VM and switches configurations
- Real-time Results: Displays search results with thumbnails and timestamps
- Cross-Platform: Works seamlessly on both local development (Mac) and remote deployment (Nebius VM)
- FastAPI: Modern, fast web framework for building APIs
- Uvicorn: ASGI server for FastAPI applications
- PySceneDetect: Automatic shot change detection in videos
- FFmpeg: Video processing and multi-frame extraction
- Sentence Transformers: CLIP model for semantic embeddings
- FAISS: Facebook AI Similarity Search for vector operations
- Faster-Whisper: Automatic Speech Recognition (ASR) for subtitle generation
- PIL (Pillow): Image processing and manipulation
- NumPy: Multi-frame embedding averaging
- Streamlit: Rapid web app development framework
- Requests: HTTP client for API communication
- CLIP (Contrastive Language-Image Pre-training):
  - Model: `clip-ViT-B-32`
  - Handles both text queries and image embeddings
  - Enables semantic similarity between text and images
- Multi-Frame Pooling: Averages embeddings from 3 frames per shot for better representation
- Faster-Whisper ASR: Automatic speech recognition for subtitle generation
- Vector Embeddings: 512-dimensional embeddings for similarity search
- Shot Detection: Content-based shot change detection with configurable thresholds
- Dual-Modal Search: Fuses image and subtitle search with adjustable weights
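A minimal sketch of the shot detection step described above, assuming PySceneDetect's `detect()` / `ContentDetector` API; the threshold is the same 20-40 sensitivity knob exposed in the UI.

```python
# Sketch: content-based shot detection with PySceneDetect.
from scenedetect import ContentDetector, detect

def detect_shots(video_path: str, threshold: float = 27.0) -> list[tuple[float, float]]:
    scenes = detect(video_path, ContentDetector(threshold=threshold))
    # Each scene is a (start, end) FrameTimecode pair for one shot.
    return [(start.get_seconds(), end.get_seconds()) for start, end in scenes]
```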
- Multi-Frame Thumbnails: JPG files stored in `/data/thumbs/` (3 per shot)
- Image Metadata: JSONL format in `/data/shots_meta.jsonl`
- Image Vector Index: FAISS index file `/data/shots.faiss`
- Subtitle Metadata: JSONL format in `/data/subs_meta.jsonl`
- Subtitle Vector Index: FAISS index file `/data/subs.faiss`
- Static Files: Served via FastAPI static file mounting
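As a rough sketch of how the image side of this storage layout could be written, assuming normalized 512-dimensional CLIP embeddings, an inner-product FAISS index, and illustrative JSONL field names:

```python
# Sketch: persisting shot embeddings (FAISS) and metadata (JSONL).
import json

import faiss
import numpy as np

DIM = 512  # CLIP ViT-B/32 embedding size

def save_shots(embeddings: np.ndarray, records: list[dict]) -> None:
    embeddings = np.ascontiguousarray(embeddings, dtype="float32")
    faiss.normalize_L2(embeddings)          # normalized vectors + inner product = cosine
    index = faiss.IndexFlatIP(DIM)
    index.add(embeddings)
    faiss.write_index(index, "data/shots.faiss")
    with open("data/shots_meta.jsonl", "w") as f:
        for rec in records:                 # e.g. {"video_id": ..., "start": ..., "end": ...}
            f.write(json.dumps(rec) + "\n")
```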
- `POST /process_video`: Process a video file with multi-frame pooling and ASR
  - Parameters: `video_path`, `video_id`, `shot_threshold`
  - Returns: Number of shots detected, frames processed, and subtitle segments
- `POST /search`: Dual-modal search for video content using text queries
  - Parameters: `query`, `k` (number of results), `alpha` (image vs. subtitle weight)
  - Returns: Ranked list of matching video segments with timestamps and relevance scores
  - Alpha: 0.0 = subtitle only, 1.0 = image only, 0.6 = balanced (default)
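Example client calls against these endpoints using `requests`; the exact request and response shapes here are assumptions and may differ slightly from the real API.

```python
# Example client calls (request/response shapes are illustrative).
import requests

API = "http://localhost:8000"

# Index a video: shot detection, multi-frame CLIP embeddings, ASR subtitles.
resp = requests.post(f"{API}/process_video", params={
    "video_path": "data/videos/episode1.mp4",
    "video_id": "episode1",
    "shot_threshold": 27,
})
print(resp.json())

# Dual-modal search; alpha=0.6 leans slightly toward image similarity.
resp = requests.post(f"{API}/search", params={
    "query": "woman walks by red shoes in window",
    "k": 5,
    "alpha": 0.6,
})
for hit in resp.json().get("results", []):
    print(hit)
```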
- Python 3.13 (required for `faster-whisper` compatibility)
- FFmpeg installed and available in PATH
- Virtual environment (recommended)

Backend:

```bash
cd app
pip install -r requirements.txt
./run.sh   # Starts server on port 8000
```

Frontend:

```bash
cd ui
pip install -r requirements.txt
./run.sh   # Starts Streamlit UI on port 8501
```

Note: The frontend automatically detects whether it's running on Mac or Nebius VM and configures API endpoints and media URLs accordingly. No manual configuration needed!
- Start Backend: Run the FastAPI server (`./run.sh` in `/app/`)
- Start Frontend: Run the Streamlit UI (`./run.sh` in `/ui/`)
  - The UI automatically detects your environment (Mac or Nebius VM)
  - API endpoints and media URLs are configured automatically
- Process Video:
  - Select a video from the dropdown (episode1.mp4 or episode2.mp4)
  - Adjust the shot detection threshold (20-40, default 27)
  - Click "Process" to extract multi-frame thumbnails and subtitles
  - View processing time and statistics
- Search Content:
  - Choose from predefined search examples or type custom queries
  - Adjust the search balance (alpha slider: 0.0 = dialogue, 1.0 = images, 0.6 = balanced)
  - Set the number of results (K slider)
- View Results:
  - Browse thumbnails and subtitle snippets at full size
  - Click video players to play at exact timestamps
  - See exact match badges for perfect quote matches
  - Use the data management tools to delete processed data
The UI now includes a dropdown with common search patterns:
"have you tried turning it off and on again?"(exact quote)woman in elevator(visual description)man with glasses and big hair(visual description)woman walks by red shoes in window(visual description)old woman falls down stairs(visual description)0118999(phone number)tv ad(content type)"I am declaring war"(exact quote)"80 million people"(exact quote)bike shorts(visual description)trying on shoes(visual description)
- "woman walks by red shoes in window"
- "outdoor scene with trees"
- "close-up shot of hands"
- "trying on shoes"
- "old woman falling down stairs"
- "Have you tried turning it off and on again?"
- "emergency services"
- "stress is a disease"
- "I am declaring war"
- "man in suit talking"
- "person saying hello"
- "office conversation"
- "stress management"
```
ivs/
├── app/                     # Backend API
│   ├── app.py               # FastAPI application with dual-modal search
│   ├── models.py            # AI/ML models (OpenAI CLIP)
│   ├── video_tools.py       # Video processing with multi-frame pooling
│   ├── index.py             # Image vector search (FAISS)
│   ├── subs_index.py        # Subtitle vector search (FAISS)
│   ├── asr.py               # Automatic Speech Recognition (OpenAI faster-whisper)
│   ├── store.py             # Metadata storage
│   ├── requirements.txt     # Backend dependencies
│   └── run.sh               # Server startup script with process cleanup
├── ui/                      # Frontend UI
│   ├── app.py               # Streamlit interface with video players
│   ├── requirements.txt     # Frontend dependencies
│   └── run.sh               # Frontend startup script
├── data/                    # Generated data (excluded from git)
│   ├── thumbs/              # Multi-frame thumbnails (3 per shot)
│   ├── videos/              # Source video files
│   ├── shots_meta.jsonl     # Image metadata
│   ├── shots.faiss          # Image vector index
│   ├── subs_meta.jsonl      # Subtitle metadata
│   └── subs.faiss           # Subtitle vector index
└── full_videos/             # Additional videos (excluded from git)
```
- Shot Detection: Configurable threshold (20-40) for sensitivity
- Multi-Frame Processing: 3x slower than single-frame (3 frames per shot)
- ASR Processing: ~1-2x real-time depending on hardware (CUDA recommended)
- Search Speed: Sub-second response for dual-modal semantic queries
- Memory Usage: CLIP model requires ~2GB RAM for embeddings
- Storage: ~150-300KB per shot (3 thumbnails + metadata)
- Video Player: Streamlit video component with automatic timestamp seeking
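Timestamp seeking relies on Streamlit's built-in video component; a minimal sketch is below (the URL construction and function name are illustrative, not the actual UI code):

```python
# Sketch: play a search result at its matched timestamp in Streamlit.
import streamlit as st

def show_result(video_url: str, start_seconds: float, caption: str) -> None:
    st.caption(caption)
    st.video(video_url, start_time=int(start_seconds))  # seek to the matched moment
```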
- 3 frames per shot for better representation
- Averaged embeddings capture shot dynamics
- Significantly improved search accuracy
- 3x storage cost but much better results
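A minimal sketch of the pooling step, assuming the Sentence Transformers `clip-ViT-B-32` model and pre-extracted frame thumbnails:

```python
# Sketch: multi-frame pooling — average the CLIP embeddings of the
# 3 frames sampled from one shot, then renormalize.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer

clip = SentenceTransformer("clip-ViT-B-32")

def shot_embedding(frame_paths: list[str]) -> np.ndarray:
    frames = [Image.open(p) for p in frame_paths]        # e.g. 3 JPG thumbnails
    embs = clip.encode(frames, convert_to_numpy=True)    # shape (3, 512)
    pooled = embs.mean(axis=0)                           # average pooling
    return pooled / np.linalg.norm(pooled)               # renormalize for cosine search
```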
- Image search: CLIP text→image similarity
- Subtitle search: CLIP text→text similarity
- Adjustable fusion: Alpha slider controls balance
- ASR integration: Automatic speech recognition
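The fusion itself is an alpha-weighted combination of the two similarity scores; a sketch (the real ranking code may differ):

```python
# Sketch: alpha-weighted fusion of image and subtitle similarity scores.
def fuse_scores(image_score: float, subtitle_score: float, alpha: float = 0.6) -> float:
    # alpha = 1.0 -> images only, alpha = 0.0 -> dialogue only
    return alpha * image_score + (1.0 - alpha) * subtitle_score
```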
- Inline video players with timestamp seeking
- Auto-generated video IDs from filenames
- Search balance controls (alpha slider)
- Data management tools for cleanup
- Accessibility improvements (proper labels)
- Video Selection Dropdown: Choose between episode1.mp4 and episode2.mp4
- Smart Processing Detection: Automatically detects already processed videos
- Predefined Search Examples: Dropdown with common search queries
- Custom Search Input: Option to type your own queries
- Processing Time Display: Shows how long video processing took
- Search Query Logging: Console output for debugging search requests
- Improved Error Handling: Better connection error management
- Full-Size Thumbnails: Thumbnails display at full resolution
- Exact Match Highlighting: Special badges for exact quote matches
- Hybrid GPU/CPU Processing: CLIP on GPU, Whisper on CPU for stability
- Automatic Environment Detection: Seamlessly works on both Mac (localhost) and Nebius VM (HTTPS proxy)
- Unified Codebase: A single `app.py` file works on both environments with automatic configuration
- Smart URL Construction: Video and thumbnail URLs automatically adapt to the environment (local FastAPI vs remote proxy)
Local development (Mac):
- Backend: `http://localhost:8000` (FastAPI direct)
- Frontend: `http://localhost:8501` (Streamlit)
- Media: Served via the FastAPI `/static` mount

Nebius VM deployment:
- Backend: `https://raymond.hopto.org/api` (HTTPS proxy via Nginx)
- Frontend: `https://raymond.hopto.org` (HTTPS)
- Media: Served directly from the `/data` directory
- Automatic Detection: The frontend automatically detects the environment and configures URLs
The system uses environment detection based on:
- Hostname check (looks for "computeinstance" in Nebius VMs)
- Network connectivity test (checks for Nebius internal IP)
- Falls back to Mac configuration if detection fails
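A sketch of that detection logic; the hostname substring comes from the notes above, while the internal-IP connectivity test is elided since its target address isn't documented here:

```python
# Sketch: pick the API base URL from the runtime environment.
import socket

API_LOCAL = "http://localhost:8000"
API_REMOTE = "https://raymond.hopto.org/api"

def detect_api_base() -> str:
    try:
        if "computeinstance" in socket.gethostname().lower():
            return API_REMOTE       # running on the Nebius VM
    except OSError:
        pass
    # The real code also tries a network check against the Nebius internal IP;
    # if everything fails, fall back to the Mac/localhost configuration.
    return API_LOCAL
```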
Built for hackathon demonstration with focus on:
- Rapid prototyping and iteration
- Clear separation of concerns (API + UI)
- Scalable architecture for future enhancements
- Easy deployment and setup
- Production-ready features (multi-frame pooling, dual-modal search)
- Cross-platform compatibility (Mac and Nebius VM)