A project by Harshil Pradhan
This project focuses on summarizing video lecture transcriptions using state-of-the-art deep learning models. The objective is to transform long, often noisy transcripts into concise, structured summaries. We evaluate and compare the performance of three leading transformer-based models—T5, Pegasus, and BART—and introduce an Ensemble model to enhance the quality and robustness of summaries.
The VT-SSum dataset, purpose-built for video transcript segmentation and summarization, is used to fine-tune these models. This effort makes educational content more consumable, searchable, and accessible.
- Dataset: VT-SSum
- 9,616 videos
- 125,000 transcript-summary pairs
- Transcripts are cleaned, segmented, and tokenized into input-output pairs for summarization.
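The cleaning and segmentation step above can be sketched in plain Python. This is an illustrative sketch, not the project's actual pipeline; the filler-word list and the 512-word chunk size are assumptions chosen to match typical encoder input limits.

```python
import re

def clean_transcript(text: str) -> str:
    """Normalize a raw transcript: drop common spoken fillers and collapse whitespace.
    The filler list here is a hypothetical example; tune it to the real corpus."""
    text = re.sub(r"\b(um+|uh+|you know)\b", "", text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()

def segment(text: str, max_words: int = 512) -> list[str]:
    """Split a cleaned transcript into word-bounded chunks sized for the encoder."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]
```

Each chunk would then be paired with its reference summary to form the input-output pairs used for fine-tuning.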
- T5: Text-to-Text Transfer Transformer by Google
- Pegasus: Optimized summarization model by Google
- BART: Bidirectional and Auto-Regressive Transformers by Facebook AI
- Ensemble Model (Proposed): Combines predictions from all three models via averaging or voting techniques.
- Models were fine-tuned using GPU acceleration.
- Dataset was split into 80% training and 20% validation.
- Early stopping was applied to avoid overfitting.
- ROUGE metrics were used for performance evaluation.
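To make the evaluation metric concrete, here is a simplified ROUGE-1 F1 in pure Python. The official ROUGE toolkit additionally applies stemming and its own tokenization, so treat this as a didactic sketch rather than the scorer used to produce the table below.

```python
from collections import Counter

def rouge1_f1(candidate: str, reference: str) -> float:
    """Simplified ROUGE-1: F1 over unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

ROUGE-2 and ROUGE-L follow the same precision/recall/F1 pattern over bigrams and the longest common subsequence, respectively.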
- Developed by Google.
- Converts all tasks into a unified text-to-text form.
- Uses an encoder-decoder architecture that performs exceptionally well in summarization tasks.
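T5's text-to-text framing means summarization inputs simply carry a task prefix; the target side is the summary string. A minimal sketch of the input formatting (the `"summarize: "` prefix is T5's standard summarization prefix; the commented generation call assumes the Hugging Face `transformers` API):

```python
def to_t5_input(transcript: str, prefix: str = "summarize: ") -> str:
    """Format a transcript chunk as a T5 text-to-text summarization input."""
    return prefix + transcript.strip()

# With Hugging Face transformers (not run here), generation would look roughly like:
#   inputs = tokenizer(to_t5_input(transcript), return_tensors="pt", truncation=True)
#   summary_ids = model.generate(**inputs, max_new_tokens=128)
```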
- Also by Google, designed specifically for abstractive summarization.
- Pre-training involves masking and predicting entire sentences, enabling excellent understanding of salient content.
- Delivers high-quality, coherent summaries.
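The gap-sentence objective above can be illustrated with a toy version: score each sentence by its word overlap with the rest of the document, mask the most "principal" ones, and use them as the pseudo-summary target. This is a simplified stand-in for PEGASUS's GSG selection, which uses ROUGE against the remaining document.

```python
def gap_sentence_mask(sentences: list[str], num_masked: int = 1,
                      mask_token: str = "<mask>") -> tuple[list[str], list[str]]:
    """Mask the sentences that overlap most with the rest of the document;
    the masked sentences become the generation target (toy GSG objective)."""
    def overlap(i: int) -> float:
        rest = {w for j, s in enumerate(sentences) if j != i for w in s.lower().split()}
        words = set(sentences[i].lower().split())
        return len(words & rest) / max(len(words), 1)

    ranked = sorted(range(len(sentences)), key=overlap, reverse=True)
    masked_ids = set(ranked[:num_masked])
    inputs = [mask_token if i in masked_ids else s for i, s in enumerate(sentences)]
    targets = [sentences[i] for i in sorted(masked_ids)]
    return inputs, targets
```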
- Developed by Facebook AI.
- Trained as a denoising autoencoder: corrupts inputs and learns to recover the original.
- Blends BERT-style encoding with GPT-style decoding for strong generative performance.
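BART's denoising setup can be sketched with its text-infilling corruption: a contiguous span of tokens is replaced by a single mask token, and the model must regenerate the original sequence. The span length and seeding below are illustrative choices, not BART's actual noising schedule.

```python
import random

def corrupt(tokens: list[str], span_len: int = 2,
            mask: str = "<mask>", seed: int = 0) -> list[str]:
    """BART-style text infilling: replace one contiguous token span with a
    single mask token; training recovers the original uncorrupted sequence."""
    rng = random.Random(seed)  # seeded for reproducibility in this sketch
    start = rng.randrange(max(len(tokens) - span_len, 1))
    return tokens[:start] + [mask] + tokens[start + span_len:]
```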
- Combines the predictions of T5, Pegasus, and BART using voting or averaging strategies.
- Aims to produce more consistent and accurate summaries by integrating the strengths of each individual model.
- Trained on a single NVIDIA P100 GPU, achieving the highest ROUGE scores of the four configurations.
| Model | ROUGE-1 | ROUGE-2 | ROUGE-L | Execution Time (hrs) | GPU Used |
|---|---|---|---|---|---|
| T5 | 0.90 | 0.83 | 0.88 | 10–12 | NVIDIA T4 ×2 |
| Pegasus | 0.87 | 0.81 | 0.85 | 8–10 | NVIDIA T4 ×2 |
| BART | 0.86 | 0.80 | 0.84 | 6–8 | NVIDIA T4 ×2 |
| 🌟 Ensemble | 0.91 | 0.84 | 0.89 | 12 | NVIDIA P100 |
- T5 achieved the best performance among the standalone models.
- The Ensemble Model produced the highest overall ROUGE scores and more stable results across different input types.
- Ensemble utilized a single NVIDIA P100 GPU and completed in approximately 12 hours.
- 🔍 Explore additional transformer models such as LongT5, LED, or GPT variants.
- 🎯 Perform automated hyperparameter tuning using frameworks like Optuna or Ray.
- 🧠 Apply data augmentation techniques (paraphrasing, back-translation).
- 📹 Integrate multimodal data such as video frames or audio transcripts.
- 🌐 Deploy as a web service or plugin for use in e-learning platforms (e.g., Moodle, Coursera).
- T. Lv, L. Cui, M. Vasilijevic, and F. Wei, "VT-SSum: A Benchmark Dataset for Video Transcript Segmentation and Summarization," arXiv, 2021.
- A. Vaswani et al., "Attention is All You Need," NIPS, 2017.
- M. Lewis et al., "BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension," ACL, 2020.
- J. Zhang et al., "PEGASUS: Pre-training with Extracted Gap-Sentences for Abstractive Summarization," ICML, 2020.
If you'd like to contribute, issues and pull requests are welcome!
🧑‍💻 Developed by Harshil Pradhan