---
title: AI Caption Generator
emoji: 🎬
colorFrom: blue
colorTo: purple
sdk: docker
app_port: 7860
pinned: false
---
An automated tool to transcribe and translate video/audio content into `.srt` subtitles using OpenAI's Whisper model.
Check out the running application on Hugging Face Spaces:
This application solves the problem of manual subtitling by leveraging state-of-the-art AI. It takes raw video (.mp4, .mov) or audio (.mp3, .wav) files, extracts the speech track with FFmpeg, and processes it through OpenAI's Whisper model to generate accurately timestamped subtitles.
It is fully containerized with Docker and set up with a CI/CD pipeline via GitHub Actions for seamless deployment.
This project relies on OpenAI's Whisper, a state-of-the-art automatic speech recognition (ASR) system. Specifically, we use the `small` model, which offers a strong balance between accuracy and computational efficiency for local deployment.
Whisper is a Transformer-based sequence-to-sequence model. It treats audio processing as a language modeling task, trained on 680,000 hours of multilingual and multitask supervised data collected from the web.
- Preprocessing:
- The input video/audio is processed by FFmpeg to extract the audio track.
- The audio is resampled to 16,000 Hz (mono).
- Feature Extraction:
- The raw audio waveform is converted into a Log-Mel Spectrogram, a visual representation of the audio frequencies over time.
- Encoder-Decoder Processing:
- The Encoder: Reads the spectrogram and extracts high-level features (patterns in speech).
- The Decoder: Predicts the next text token based on the audio features and the previous tokens. It uses an attention mechanism to focus on the specific part of the audio that corresponds to the word it is currently writing.
- Timestamp Prediction:
- Unlike traditional models, Whisper is trained to predict timestamp tokens alongside text tokens. This allows the model to output precise start and end times for every segment of speech, which we format into the `.srt` standard.
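The preprocessing step above can be sketched as a small Python wrapper around FFmpeg. This is a minimal illustration, not the project's actual `src/` code; the function and file names are assumptions, and running the commented line requires FFmpeg on your PATH.

```python
import subprocess


def ffmpeg_extract_cmd(src: str, dst: str = "audio.wav") -> list[str]:
    """Build the FFmpeg command that strips the video stream and
    resamples the audio to 16 kHz mono PCM, the input Whisper expects."""
    return [
        "ffmpeg", "-y",          # overwrite the output without prompting
        "-i", src,               # input video or audio file
        "-vn",                   # drop any video stream
        "-ac", "1",              # downmix to mono
        "-ar", "16000",          # resample to 16,000 Hz
        "-acodec", "pcm_s16le",  # 16-bit PCM WAV
        dst,
    ]


# Requires FFmpeg installed and on the PATH:
# subprocess.run(ffmpeg_extract_cmd("lecture.mp4"), check=True)
```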
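The final formatting step, turning Whisper's per-segment start/end times into the `.srt` standard, can be sketched as follows. This is a stand-in for the helpers in `src/utils.py`; the function names here are assumptions.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as the SRT 'HH:MM:SS,mmm' timestamp."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Render Whisper-style segments (dicts with 'start', 'end', 'text')
    as numbered SRT blocks separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```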
We selected the `small` model (~244 million parameters) for this deployment because:
- Accuracy: It significantly outperforms the `tiny` and `base` models, especially on low-resource languages and complex accents.
- Efficiency: It requires approximately 2 GB of VRAM/RAM, making it feasible to run on the Hugging Face Free Tier (CPU Basic) without crashing, unlike the `medium` or `large` models.
- 🎥 Multi-Format Support: Handles both Video (MP4, AVI) and Audio (MP3, WAV) inputs.
- 📝 Auto-Transcription: Generates precise text transcripts from speech.
- 🌍 AI Translation: Automatically translates foreign languages (e.g., Sinhala, French) into English subtitles.
- ⏱️ Precision Subtitles: Exports standard `.srt` files ready for YouTube or VLC.
- 🐳 Dockerized: Runs consistently across any environment using Docker containers.
- ☁️ Cloud Native: Deployed on Hugging Face Spaces with automated sync.
- Python 3.9: Core logic.
- OpenAI Whisper: Automatic Speech Recognition (ASR) model.
- Streamlit: Interactive web frontend.
- FFmpeg: Multimedia processing engine.
- Docker: Containerization.
- GitHub Actions: CI/CD Pipeline.
Follow these steps to run the project locally on your machine.
- Python 3.8+ installed.
- FFmpeg installed and added to your system PATH.
  - Windows: Download here
  - macOS: `brew install ffmpeg`
  - Linux: `sudo apt install ffmpeg`
1. Clone the repository
2. Create a virtual environment (optional but recommended)
3. Install dependencies
4. Run the app
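Putting those steps together looks roughly like this (the repository URL is a placeholder; substitute the real one):

```shell
# 1. Clone the repository (URL is a placeholder).
git clone https://github.com/<your-username>/ClosedCaption.git
cd ClosedCaption

# 2. Optional but recommended: isolate dependencies in a virtual environment.
python -m venv .venv
source .venv/bin/activate    # Windows: .venv\Scripts\activate

# 3. Install dependencies.
pip install -r requirements.txt

# 4. Launch the Streamlit frontend.
streamlit run app.py
```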
If you have Docker installed, you can run the app without installing Python or FFmpeg manually.
1. Build the image
2. Run the container
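A minimal sketch of the two Docker steps (the image name is illustrative; port 7860 matches the `app_port` in the Space configuration):

```shell
# Build the image from the Dockerfile in the repository root.
docker build -t closedcaption .

# Run the container, mapping the app's port to the host.
docker run -p 7860:7860 closedcaption
```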
```
ClosedCaption/
├── .github/workflows/   # CI/CD Pipeline for Hugging Face
├── src/                 # Source Code
│   ├── model.py         # Whisper Model Logic
│   └── utils.py         # Timestamp Formatting Helpers
├── app.py               # Main Streamlit Application
├── Dockerfile           # Docker Configuration
├── requirements.txt     # Python Dependencies
└── README.md            # Documentation
```