---
title: AI Caption Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
An automated tool to transcribe and translate video/audio content into `.srt` subtitles using OpenAI's Whisper model.
Check out the running application on Hugging Face Spaces:
This application solves the problem of manual subtitling by leveraging state-of-the-art AI. It takes raw video (.mp4, .mov) or audio (.mp3, .wav) files, extracts the speech track with FFmpeg, and processes it through OpenAI's Whisper model to generate accurately timestamped subtitles.
It is fully containerized with Docker and set up with a CI/CD pipeline via GitHub Actions for seamless deployment.
This project relies on OpenAI's Whisper, a state-of-the-art automatic speech recognition (ASR) system. Specifically, we use the `small` model, which offers a strong balance between accuracy and computational efficiency for local deployment.
Whisper is a Transformer-based sequence-to-sequence model. It treats audio processing as a language modeling task, trained on 680,000 hours of multilingual and multitask supervised data collected from the web.
- Preprocessing:
- The input video/audio is processed by FFmpeg to extract the audio track.
- The audio is resampled to 16,000 Hz (mono).
- Feature Extraction:
- The raw audio waveform is converted into a Log-Mel Spectrogram, a visual representation of the audio frequencies over time.
- Encoder-Decoder Processing:
- The Encoder: Reads the spectrogram and extracts high-level features (patterns in speech).
- The Decoder: Predicts the next text token based on the audio features and the previous tokens. It uses an attention mechanism to focus on the specific part of the audio that corresponds to the word it is currently writing.
- Timestamp Prediction:
- Unlike traditional models, Whisper is trained to predict timestamp tokens alongside text tokens. This allows the model to output precise start and end times for every segment of speech, which we format into the `.srt` standard.
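The preprocessing step above can be sketched as a small Python wrapper around FFmpeg. This is a minimal illustration, not the project's actual `src/` code; the function and file names are assumptions, and running the commented line requires FFmpeg on your PATH.

```python
import subprocess


def ffmpeg_extract_cmd(src: str, dst: str = "audio.wav") -> list[str]:
    """Build the FFmpeg command that strips the video stream and
    resamples the audio to 16 kHz mono PCM, the input Whisper expects."""
    return [
        "ffmpeg", "-y",          # overwrite the output without prompting
        "-i", src,               # input video or audio file
        "-vn",                   # drop any video stream
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16,000 Hz
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        dst,
    ]


# Requires FFmpeg installed and on the PATH:
# subprocess.run(ffmpeg_extract_cmd("lecture.mp4"), check=True)
```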
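The final formatting step, turning Whisper's per-segment start/end times into the `.srt` standard, can be sketched as follows. This is a stand-in for the helpers in `src/utils.py`; the function names here are assumptions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments (dicts with 'start', 'end', 'text')
    as numbered SRT blocks separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```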
We selected the `small` model (~244 million parameters) for this deployment because:
- Accuracy: It significantly outperforms the `tiny` and `base` models, especially on low-resource languages and complex accents.
- Efficiency: It requires approximately 2 GB of VRAM/RAM, making it feasible to run on the Hugging Face Free Tier (CPU Basic) without crashing, unlike the `medium` or `large` models.
- 🎥 Multi-Format Support: Handles both Video (MP4, AVI) and Audio (MP3, WAV) inputs.
- 📝 Auto-Transcription: Generates precise text transcripts from speech.
- 🌍 AI Translation: Automatically translates foreign languages (e.g., Sinhala, French) into English subtitles.
- ⏱️ Precision Subtitles: Exports standard `.srt` files ready for YouTube or VLC.
- 🐳 Dockerized: Runs consistently across any environment using Docker containers.
- ☁️ Cloud Native: Deployed on Hugging Face Spaces with automated sync.
- Python 3.9: Core logic.
- OpenAI Whisper: Automatic Speech Recognition (ASR) model.
- Streamlit: Interactive web frontend.
- FFmpeg: Multimedia processing engine.
- Docker: Containerization.
- GitHub Actions: CI/CD Pipeline.
Follow these steps to run the project locally on your machine.
- Python 3.8+ installed.
- FFmpeg installed and added to your system PATH.
  - Windows: Download here
  - macOS: `brew install ffmpeg`
  - Linux: `sudo apt install ffmpeg`
1. Clone the repository
2. Create a virtual environment (optional but recommended)
3. Install dependencies
4. Run the app
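Putting those steps together looks roughly like this (the repository URL is a placeholder; substitute the real one):

```shell
# 1. Clone the repository (URL is a placeholder).
git clone https://github.com/<your-username>/ClosedCaption.git
cd ClosedCaption

# 2. Optional but recommended: isolate dependencies in a virtual environment.
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate

# 3. Install dependencies.
pip install -r requirements.txt

# 4. Launch the Streamlit frontend.
streamlit run app.py
```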
If you have Docker installed, you can run the app without installing Python or FFmpeg manually.
1. Build the image
2. Run the container
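A minimal sketch of the two Docker steps (the image name is illustrative; port 7860 matches the `app_port` in the Space configuration):

```shell
# Build the image from the Dockerfile in the repository root.
docker build -t closedcaption .

# Run the container, mapping the app's port to the host.
docker run -p 7860:7860 closedcaption
```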
```
ClosedCaption/
├── .github/workflows/   # CI/CD Pipeline for Hugging Face
├── src/                 # Source Code
│   ├── model.py         # Whisper Model Logic
│   └── utils.py         # Timestamp Formatting Helpers
├── app.py               # Main Streamlit Application
├── Dockerfile           # Docker Configuration
├── requirements.txt     # Python Dependencies
└── README.md            # Documentation
```