Efficient Video Summarization Pipeline using SmolVLM2

This repository contains an end-to-end machine learning pipeline for processing and summarizing short-form video content. Using the SmolVLM2-2.2B-Instruct Vision Language Model (VLM), the system ingests video files and their audio transcripts, generates summaries grounded in that input to limit hallucination, and assigns each video a content category. The project was developed as part of the seminar Efficient Training of Large Language Models.

A major focus of this project is inference optimization: a baseline model is benchmarked against several acceleration techniques, with the goal of reducing execution time while preserving summary quality (measured via BERTScore) and categorization accuracy.
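As a rough illustration of the two metrics named above, the sketch below computes category accuracy in plain Python and wraps the `bert_score` package for summary quality. The function names, the case-insensitive label matching, and the choice to report the mean F1 are assumptions for illustration and are not taken from `src/evaluate.py`.

```python
from typing import Sequence


def category_accuracy(predicted: Sequence[str], expected: Sequence[str]) -> float:
    """Fraction of predictions that match the ground truth.

    Labels are compared case-insensitively (an assumption for this sketch).
    """
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        return 0.0
    hits = sum(p.strip().lower() == e.strip().lower() for p, e in zip(predicted, expected))
    return hits / len(expected)


def mean_bertscore_f1(candidates: Sequence[str], references: Sequence[str]) -> float:
    """Mean BERTScore F1 over all summary pairs; requires the `bert-score` package."""
    # Imported lazily: the package pulls in torch and downloads a scoring model on first use.
    from bert_score import score

    _precision, _recall, f1 = score(list(candidates), list(references), lang="en")
    return f1.mean().item()


if __name__ == "__main__":
    print(category_accuracy(["Sports", "cooking", "news"], ["sports", "Cooking", "music"]))
```

Reporting both numbers for the baseline and for each optimized run makes it visible when a speed-up costs quality.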

🚀 Key Features

  • End-to-End Processing: Handles video loading, audio transcript extraction, prompt formatting, VLM text generation, and post-processing.
  • Automated Evaluation: Built-in benchmarking against ground-truth data using BERTScore for summary quality and Accuracy for category prediction.
  • Inference Profiling: Granular time-tracking for each pipeline component (loading, transcription, generation) to identify bottlenecks.
  • Optimization Strategies: Implements advanced ML optimization techniques to reduce total inference time, including:
    • Model quantization (bitsandbytes)
    • Data Parallelism
    • Batch processing and parallel data loading
    • Mixed-precision inference
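The per-component time tracking described under Inference Profiling could look roughly like the sketch below. `StageTimer` and the stage names are hypothetical and are not taken from `src/main.py`; the `time.sleep` calls stand in for real work.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class StageTimer:
    """Accumulates wall-clock seconds per named pipeline stage."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)

    @contextmanager
    def track(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Add to the running total so repeated stages (one per video) sum up.
            self.totals[stage] += time.perf_counter() - start

    def slowest(self) -> str:
        """Name of the stage with the largest accumulated time."""
        return max(self.totals, key=self.totals.get)


timer = StageTimer()
with timer.track("loading"):
    time.sleep(0.01)  # stand-in for reading a video
with timer.track("generation"):
    time.sleep(0.05)  # stand-in for VLM text generation

for stage, seconds in timer.totals.items():
    print(f"{stage}: {seconds:.3f}s")
```

When the timed stage runs on a GPU, CUDA kernels execute asynchronously, so calling `torch.cuda.synchronize()` before reading the clock is needed for the measurement to reflect the actual compute time.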

🧠 Model & Dataset

  • Model: SmolVLM2-2.2B-Instruct – an open-source, efficient Vision Language Model.
  • Dataset: 270 short-form videos (ranging from 5 seconds to 5 minutes) in .mp4 format, accompanied by audio transcripts in .txt format.
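A minimal sketch of loading the model with bitsandbytes quantization through Hugging Face Transformers is shown below. The Hub identifier, the choice of 4-bit NF4 over 8-bit, and the bfloat16 compute dtype are assumptions; the settings actually used in this project may differ. Running it requires a CUDA GPU plus the `transformers`, `accelerate`, and `bitsandbytes` packages.

```python
MODEL_ID = "HuggingFaceTB/SmolVLM2-2.2B-Instruct"  # Hugging Face Hub id (assumed)


def load_quantized_model(model_id: str = MODEL_ID):
    """Return (processor, model) with the model weights quantized to 4 bits."""
    # Heavy imports stay inside the function so importing this module is cheap.
    import torch
    from transformers import (
        AutoModelForImageTextToText,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,                      # assumption: 4-bit rather than 8-bit
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # matrix multiplications run in bf16
    )
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",  # let accelerate place the weights on the available GPU(s)
    )
    model.eval()  # inference only: disables dropout and similar training behavior
    return processor, model
```

Quantization mainly reduces memory use, which in turn allows larger batches; whether it also lowers latency depends on the hardware, so it is worth timing against the unquantized baseline.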

📂 Project Structure

├── data/
│   ├── inputs/
│   │   ├── videos/              # Place .mp4 files here
│   │   └── audio_transcripts/   # Place .txt transcripts here
│   └── outputs/                 # CSV reports and generated summaries saved here
├── src/
│   ├── config.py                # Pipeline configuration and path management
│   ├── main.py                  # Main execution script
│   └── evaluate.py              # BERTScore and Accuracy calculation logic
├── docs/
│   └── project_report.pdf       # Detailed analysis of baseline vs. optimized performance
├── requirements.txt             # Project dependencies
└── README.md
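Given the input layout above, each video has to be matched with its transcript. The sketch below assumes the two files share a filename stem (for example `clip01.mp4` and `clip01.txt`); that naming convention and the `iter_pairs` helper are assumptions, not something `src/config.py` is known to do.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_pairs(inputs_dir: Path) -> Iterator[Tuple[Path, Path]]:
    """Yield (video, transcript) paths for every video that has a transcript."""
    videos_dir = inputs_dir / "videos"
    transcripts_dir = inputs_dir / "audio_transcripts"
    for video in sorted(videos_dir.glob("*.mp4")):
        # Assumed convention: the transcript has the same stem as the video.
        transcript = transcripts_dir / f"{video.stem}.txt"
        if transcript.is_file():
            yield video, transcript


if __name__ == "__main__":
    for video, transcript in iter_pairs(Path("data/inputs")):
        print(video.name, "->", transcript.name)
```

Videos without a matching transcript are skipped silently here; logging them instead may be preferable when checking a new dataset.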

βš™οΈ Installation & Setup

Prerequisites: Python 3.11 is strictly required for dependency compatibility.
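Because the version requirement is strict, a guard at the start of a script can fail early with a clear message instead of an obscure import error. The helper names below are hypothetical and are not part of the repository.

```python
import sys

REQUIRED = (3, 11)  # major, minor version stated in the prerequisites


def is_supported(version=None) -> bool:
    """True if the given (or running) interpreter is Python 3.11.x."""
    major, minor = tuple(version or sys.version_info)[:2]
    return (major, minor) == REQUIRED


def require_supported_python() -> None:
    """Exit with a readable message when run on an unsupported interpreter."""
    if not is_supported():
        found = sys.version.split()[0]
        raise SystemExit(f"Python 3.11 is required, found {found}")
```

Calling `require_supported_python()` as the first statement of an entry-point script would apply the check before any heavy dependency is imported.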

1. Clone the repository:

git clone
cd video-summarization-pipeline

2. Create and activate a virtual environment:

python3.11 -m venv venv
# On macOS/Linux
source venv/bin/activate  
# On Windows
venv\Scripts\activate

3. Install dependencies:

pip install -r requirements.txt

📒 Note: The complete output files containing baseline and optimized performance logs, as well as the detailed 2-3 page optimization report, can be found in this Google Drive link.