This repository contains an end-to-end, optimized machine learning pipeline for processing and summarizing short-form video content. Using the SmolVLM2-2.2B-Instruct Vision Language Model (VLM), the system ingests video files and audio transcripts to generate grounded summaries that aim to avoid hallucination and to categorize the content. This project was completed as part of the seminar "Efficient Training of Large Language Models".
A major focus of this project is inference optimization: a baseline model is benchmarked against several acceleration techniques, with the goal of reducing execution time while preserving summary quality (measured via BERTScore) and categorization accuracy.
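The comparison described above comes down to two questions per run: how much faster is the optimized pipeline, and did quality hold? The following is a minimal sketch, assuming each run is summarized by a total time in seconds plus predicted and ground-truth category labels. The function names and the sample values are hypothetical and are not taken from the repository; BERTScore is left out here because it needs the external `bert-score` package.

```python
from typing import Sequence


def category_accuracy(predicted: Sequence[str], truth: Sequence[str]) -> float:
    """Fraction of videos whose predicted category matches the ground truth."""
    if len(predicted) != len(truth):
        raise ValueError("predicted and truth must have the same length")
    if not truth:
        raise ValueError("cannot compute accuracy on an empty set")
    # Assumption: labels are compared case-insensitively, ignoring outer whitespace.
    hits = sum(p.strip().lower() == t.strip().lower() for p, t in zip(predicted, truth))
    return hits / len(truth)


def speedup(baseline_seconds: float, optimized_seconds: float) -> float:
    """How many times faster the optimized run is than the baseline."""
    if optimized_seconds <= 0:
        raise ValueError("optimized_seconds must be positive")
    return baseline_seconds / optimized_seconds


# Illustrative values only, not measured results.
truth = ["cooking", "sports", "music", "travel"]
baseline_pred = ["cooking", "sports", "music", "news"]
optimized_pred = ["Cooking", "sports", "news", "news"]

baseline_acc = category_accuracy(baseline_pred, truth)
optimized_acc = category_accuracy(optimized_pred, truth)
accuracy_drop = baseline_acc - optimized_acc
```

Reporting the accuracy drop next to the speedup makes the trade-off explicit: an optimization that is faster but loses accuracy can be judged on both axes at once.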
- End-to-End Processing: Seamlessly handles video loading, audio transcription extraction, prompt formatting, VLM text generation, and post-processing.
- Automated Evaluation: Built-in benchmarking against ground-truth data using BERTScore for summary quality and Accuracy for category prediction.
- Inference Profiling: Granular time-tracking for each pipeline component (loading, transcription, generation) to identify bottlenecks.
- Optimization Strategies: Implements advanced ML optimization techniques to reduce total inference time, including:
  - Model quantization (bitsandbytes)
  - Data Parallelism
  - Batch processing and parallel data loading
  - Mixed Precision Training
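The per-component time tracking mentioned in the feature list can be sketched with a small context manager. This is an illustration under assumptions, not the repository's implementation: the `StageTimer` class and the stage names are hypothetical, and `time.sleep` stands in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock time per named pipeline stage."""

    def __init__(self):
        self.totals = defaultdict(float)

    @contextmanager
    def stage(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the elapsed time even if the stage raises.
            self.totals[name] += time.perf_counter() - start

    def slowest(self):
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


timer = StageTimer()
with timer.stage("loading"):
    time.sleep(0.01)  # stand-in for video loading
with timer.stage("generation"):
    time.sleep(0.05)  # stand-in for VLM text generation
```

Calling `timer.slowest()` then names the bottleneck stage. When a stage runs on a GPU, the timed block would also need to wait for the device to finish (for example with `torch.cuda.synchronize()`), because CUDA kernels are launched asynchronously and the wall-clock reading would otherwise be too short.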
- Model: SmolVLM2-2.2B-Instruct – an open-source, efficient Vision Language Model.
- Dataset: 270 short-form videos (ranging from 5 seconds to 5 minutes) in .mp4 format, accompanied by audio transcripts in .txt format.
├── data/
│ ├── inputs/
│ │ ├── videos/ # Place .mp4 files here
│ │ └── audio_transcripts/ # Place .txt transcripts here
│ └── outputs/ # CSV reports and generated summaries saved here
├── src/
│ ├── config.py # Pipeline configuration and path management
│ ├── main.py # Main execution script
│ └── evaluate.py # BERTScore and Accuracy calculation logic
├── docs/
│ └── project_report.pdf # Detailed analysis of baseline vs. optimized performance
├── requirements.txt # Project dependencies
└── README.md
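Given the layout above, each video has to be matched with its transcript before processing. The sketch below shows one way to do that, assuming a transcript shares its video's file name stem (for example `clip01.mp4` and `clip01.txt`). That naming convention and the `pair_inputs` helper are assumptions for illustration and are not confirmed by the repository.

```python
from pathlib import Path


def pair_inputs(inputs_dir):
    """Return (video, transcript) path pairs plus the videos lacking a transcript."""
    inputs_dir = Path(inputs_dir)
    video_dir = inputs_dir / "videos"
    transcript_dir = inputs_dir / "audio_transcripts"

    pairs, missing = [], []
    for video in sorted(video_dir.glob("*.mp4")):
        # Assumption: the transcript shares the video's file name stem.
        transcript = transcript_dir / f"{video.stem}.txt"
        if transcript.is_file():
            pairs.append((video, transcript))
        else:
            missing.append(video)
    return pairs, missing
```

A call such as `pairs, missing = pair_inputs("data/inputs")` returns the matched pairs in sorted order; checking `missing` before a long benchmark run avoids discovering an absent transcript partway through.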
Prerequisites: Python 3.11 is strictly required for dependency compatibility.
1. Clone the repository:
git clone
cd video-summarization-pipeline
2. Create and activate a virtual environment:
python3.11 -m venv venv
# On macOS/Linux
source venv/bin/activate
# On Windows
venv\Scripts\activate
3. Install dependencies:
pip install -r requirements.txt
📢 Note: The complete output files containing baseline and optimized performance logs, as well as the detailed 2-3 page optimization report, can be found in this Google Drive link.