A full-stack web application that extracts captions and timestamps from video/audio files using faster-whisper for automatic speech recognition (ASR). Features a FastAPI backend and a modern Node.js frontend with Tailwind CSS.
## Features

- Web Interface: Modern, responsive UI built with Tailwind CSS
- FastAPI Backend: RESTful API with automatic documentation
- Multiple Output Formats:
  - `transcript.txt` - Full text transcript
  - `segments.json` - Segments with timestamps
  - `words.json` - Individual words with timestamps
  - `transcript.srt` - SRT subtitle file
  - `transcript.vtt` - VTT subtitle file
- Smart Subtitle Grouping: Optimized for mobile viewing (6 words max per caption)
- Configurable Processing: Support for multiple Whisper model sizes
- Voice Activity Detection: Optional VAD filtering
- CPU/GPU Support: Optimized processing options
- File Upload: Drag-and-drop interface with progress tracking
- LLM Integration: AI-powered clip selection and processing
## Prerequisites

- Python 3.7+
- Node.js 16+ and npm
- FFmpeg (for audio/video processing)
## Quick Start

```bash
# Make scripts executable (if not already)
chmod +x start-*.sh

# Start both backend and frontend
./start-all.sh
```

Or run each server separately:

```bash
# Terminal 1: Start Backend
./start-backend.sh

# Terminal 2: Start Frontend (in a new terminal)
./start-frontend.sh
```
## Installation

- Clone or download this repository
- Install FFmpeg:
  - macOS: `brew install ffmpeg`
  - Ubuntu/Debian: `sudo apt install ffmpeg`
  - Windows: Download from https://ffmpeg.org/download.html
- Install Node.js:
  - Visit https://nodejs.org/ and download the LTS version
  - Or use a package manager, e.g. `brew install node` (macOS)
- The start scripts will automatically install the Python and Node.js dependencies
## Backend

- Port: 8000
- Framework: FastAPI with Uvicorn
- Features:
- File upload handling
- Caption extraction using faster-whisper
- RESTful API with automatic documentation
- CORS support for frontend communication
- Job-based processing with unique IDs
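
The last point is the crux of the design: each upload gets a unique job ID that the frontend polls and later uses to fetch results. Below is a minimal sketch of how such an endpoint can be wired up in FastAPI; it is illustrative only, and the handler names and in-memory `jobs` dict are assumptions, not the project's actual `main.py`:

```python
# Illustrative sketch of job-based upload handling (not the actual main.py)
import uuid
from pathlib import Path

from fastapi import BackgroundTasks, FastAPI, UploadFile
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
# CORS so the frontend on port 3000 can call the API on port 8000
app.add_middleware(CORSMiddleware, allow_origins=["http://localhost:3000"],
                   allow_methods=["*"], allow_headers=["*"])

jobs = {}  # job_id -> "processing" | "done" (assumed in-memory job store)

def run_extraction(job_id, path):
    # ... run the faster-whisper pipeline, write files for this job ...
    jobs[job_id] = "done"

@app.post("/upload")
async def upload(file: UploadFile, background_tasks: BackgroundTasks):
    job_id = uuid.uuid4().hex  # unique ID per job
    dest = Path("uploads") / f"{job_id}_{file.filename}"
    dest.parent.mkdir(exist_ok=True)
    dest.write_bytes(await file.read())
    jobs[job_id] = "processing"
    background_tasks.add_task(run_extraction, job_id, dest)
    return {"job_id": job_id}
```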
## Frontend

- Port: 3000
- Framework: Express.js serving static files
- Styling: Tailwind CSS with custom components
- Features:
- Drag-and-drop file upload
- Real-time progress tracking
- Responsive design
- Download management
- Error handling and user feedback
## Project Structure

```
├── main.py               # FastAPI backend server
├── server.js             # Express.js frontend server
├── extract_captions.py   # Core caption extraction logic
├── package.json          # Node.js dependencies
├── requirements.txt      # Python dependencies
├── tailwind.config.js    # Tailwind CSS configuration
├── public/               # Frontend assets
│   ├── index.html        # Main HTML template
│   ├── css/              # CSS files
│   └── js/               # JavaScript files
├── uploads/              # Temporary upload directory
├── captions_out/         # Generated caption files
└── start-*.sh            # Deployment scripts
```
## Usage

- Start the application using one of the methods above
- Open your browser and go to http://localhost:3000
- Upload a file by dragging and dropping or clicking to browse
- Configure options:
  - Model: Choose Whisper model size (tiny = fastest, large = most accurate)
  - Language: Select a language or leave auto-detect
  - Compute Type: Choose CPU or GPU processing
  - VAD: Enable voice activity detection for better accuracy
- Click "Extract Captions" and wait for processing
- Download results in various formats (TXT, JSON, SRT, VTT)
## API

The backend provides a REST API at http://localhost:8000:

- API Documentation: http://localhost:8000/docs
- Health Check: `GET /health`
- Upload File: `POST /upload`
- Download Files: `GET /download/{job_id}/{filename}`
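
For scripted use, the same endpoints can be driven by any HTTP client. Here is a sketch using `requests`; the `job_id` response field is an assumption based on the endpoint descriptions above, so check `/docs` for the actual schema:

```python
import requests

BASE = "http://localhost:8000"

# Upload a media file; the backend responds with the ID of the new job
with open("video.mp4", "rb") as f:
    resp = requests.post(f"{BASE}/upload", files={"file": f})
resp.raise_for_status()
job_id = resp.json()["job_id"]  # assumed field name; see /docs

# Once processing has finished, fetch a generated caption file by name
srt = requests.get(f"{BASE}/download/{job_id}/transcript.srt")
srt.raise_for_status()
with open("transcript.srt", "wb") as out:
    out.write(srt.content)
```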
## Command-Line Usage

You can still use the original command-line script:

```bash
python frontend/extra_captions.py input_videos/input.mp4 --outdir captions_out --model small --compute-type int8
```

Note: The script now generates better subtitle grouping with fewer words per caption (6 words max) for improved readability on mobile devices and social media platforms.
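
The grouping idea itself is simple: word-level timestamps are chunked into short captions. A hypothetical sketch of the technique (the function and field names are illustrative, not the script's actual code):

```python
# Hypothetical illustration of the 6-words-max grouping (not the actual code)
def group_words(words, max_words=6):
    """Chunk word-level timestamps into captions of at most max_words words."""
    captions = []
    for i in range(0, len(words), max_words):
        chunk = words[i:i + max_words]
        captions.append({
            "start": chunk[0]["start"],  # caption begins with its first word
            "end": chunk[-1]["end"],     # and ends with its last word
            "text": " ".join(w["word"] for w in chunk),
        })
    return captions
```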
### Options

- `input` (required): Path to input video/audio file
- `--outdir`: Output directory (default: `captions_out`)
- `--model`: Whisper model size (default: `small`)
  - Options: `tiny`, `base`, `small`, `medium`, `large`, `large-v2`, `large-v3`
- `--language`: Language code (e.g., `en`, `es`, `fr`); auto-detects if not specified
- `--compute-type`: Compute type (default: `int8_float16`)
  - Options: `float16`, `int8_float16`, `int8`
  - Use `int8` for CPU-only processing
- `--beam-size`: Beam size for decoding (default: 5)
- `--vad`: Enable voice activity detection filtering
### Examples

```bash
# Basic usage with small model
python frontend/extra_captions.py video.mp4

# Use tiny model for faster processing
python frontend/extra_captions.py video.mp4 --model tiny --compute-type int8

# Specify language and output directory
python frontend/extra_captions.py video.mp4 --language en --outdir my_captions --model medium

# Enable VAD filtering
python frontend/extra_captions.py video.mp4 --vad --model small --compute-type int8
```
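
Under the hood, these flags map onto the faster-whisper API. A simplified sketch of the core calls (not the full script):

```python
from faster_whisper import WhisperModel

# --model and --compute-type select the checkpoint and numeric precision
model = WhisperModel("small", compute_type="int8")

# --language, --beam-size, and --vad map to transcribe() arguments;
# word_timestamps=True enables the per-word output used for words.json
segments, info = model.transcribe(
    "video.mp4",
    language="en",
    beam_size=5,
    vad_filter=True,
    word_timestamps=True,
)

for seg in segments:
    print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
```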
## Repository Layout

```
.
├─ backend/               # FastAPI app and APIs
│  ├─ __init__.py
│  └─ api.py              # Chat, stream, clips/select, process/sample
├─ frontend/              # CLI and (future) UI assets
│  ├─ __init__.py
│  └─ extra_captions.py   # Caption extraction CLI (moved)
├─ artifacts/
│  └─ system_prompt.txt   # System prompt used for clip selection
├─ captions_out/          # Generated outputs (ignored in VCS)
├─ input_videos/          # Optional input media folder (ignored in VCS)
├─ requirements.txt       # Python dependencies
├─ setup.sh               # Installer helper
├─ .env                   # Local environment (API keys, config)
└─ README.md
```
Note: `extract_captions.py` remains as a thin shim that delegates to `frontend/extra_captions.py` to preserve existing commands.
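
One way such a shim can be written, shown here only as a sketch (the real file may differ):

```python
# extract_captions.py - sketch of a thin forwarding shim (real file may differ)
import runpy

# Re-execute the relocated CLI as if it were run directly, so existing
# `python extract_captions.py ...` invocations keep working; sys.argv is
# passed through untouched
runpy.run_module("frontend.extra_captions", run_name="__main__")
```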
## Output Files

The script generates 5 files in the output directory:

- `transcript.txt` - Full text transcript
- `segments.json` - Array of segments with start/end times and text
- `words.json` - Array of individual words with timestamps
- `transcript.srt` - SRT subtitle format
- `transcript.vtt` - VTT subtitle format
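
As a quick sanity check of the JSON outputs, the segments can be loaded and printed in a few lines of Python (the field names `start`, `end`, and `text` are assumed from the descriptions above):

```python
import json

# Load the generated segments; field names are assumed from the list above
with open("captions_out/segments.json") as f:
    segments = json.load(f)

for seg in segments:
    print(f"{seg['start']:7.2f}s - {seg['end']:7.2f}s  {seg['text']}")
```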
## Adding Subtitles to Video

After extracting captions, you can overlay them onto your video using FFmpeg:

```bash
ffmpeg -i input_videos/input.mp4 -vf "subtitles=captions_out/transcript.srt" -c:a copy output_with_subtitles.mp4
```

Or with custom styling:

```bash
ffmpeg -i input_videos/input.mp4 -vf "subtitles=captions_out/transcript.srt:force_style='FontName=Helvetica,FontSize=24,PrimaryColour=&HFFFFFF&,OutlineColour=&H000000&,Outline=2,Alignment=2'" -c:a copy output_with_subtitles.mp4
```

You can customize the appearance by modifying the `force_style` parameter:

- `FontName`: Font family (e.g., `Helvetica`, `Arial-Bold`, `Roboto`)
- `FontSize`: Font size in pixels (20-28 recommended)
- `PrimaryColour`: Text color in BGR hex format (`&HFFFFFF&` = white)
- `OutlineColour`: Outline color (`&H000000&` = black)
- `Outline`: Outline thickness (1-3 recommended)
- `Alignment`: Text position (2 = bottom center)
For example:

```bash
ffmpeg -i input_videos/input.mp4 -vf "subtitles=captions_out/transcript.srt:force_style='FontName=Roboto,FontSize=26,PrimaryColour=&HFFFFFF&,OutlineColour=&H000000&,Outline=2,Alignment=2'" -c:a copy output_styled_subtitles.mp4
```

## Model Comparison

| Model | Speed | Accuracy | Best For |
|---|---|---|---|
| `tiny` | ~2x real-time | Good | Quick processing |
| `base` | ~1x real-time | Better | Balanced |
| `small` | ~0.5x real-time | Good | Recommended |
| `medium` | ~0.3x real-time | Better | High accuracy |
| `large` | ~0.2x real-time | Best | Maximum accuracy |
## Troubleshooting

- "faster-whisper not installed"
  - Run `pip install faster-whisper`
- "FFmpeg not found"
  - macOS: `brew install ffmpeg`
  - Ubuntu/Debian: `sudo apt install ffmpeg`
  - Windows: Download from https://ffmpeg.org/download.html
- "int8_float16 compute type not supported"
  - Use `--compute-type int8` for CPU-only processing
- No progress bar
  - This is normal for the standard Whisper library
  - Check CPU usage to verify processing is happening
## Performance Tips

- Use `--model tiny` for fastest processing
- Use `--compute-type int8` for CPU-only processing
- Use `--vad` to filter out silence and improve accuracy
- For long videos, consider using a smaller model
## License

This project is open source and available under the MIT License.