- Demo
- Overview
- Motivation
- What It Does
- Getting Started
- Usage
- Directory Tree
- To Do
- Bug / Feature Request
- Techstack Used
- License
- Credits
## Demo
Project_Demo_Showcase.mp4
## Overview
This is an end-to-end text summarization web application built with FastAPI. It uses a fine-tuned version of the Pegasus model to summarize text from any source. The project involved building an end-to-end pipeline encompassing data ingestion, data transformation, model training, evaluation, and prediction, along with API integration and web hosting for seamless user interaction. Additional features include GitHub Actions for continuous integration, robust Python logging for efficient debugging, and workflow optimization.
## Motivation
Imagine you want specific information from a newspaper article. Reading the entire article would be tedious. If instead you know the gist of each section, you can decide whether that section is worth reading in full. This summarizer helps you find out which parts of a newspaper, or any other source, you are actually interested in. The approach is quick and reliable because it uses a trained, fine-tuned sequence-to-sequence model that produces an accurate summary of the input text.
## What It Does
Before describing how the project works, here is a brief overview of the initial setup:
- Create a requirements.txt file listing all the dependencies to install.
- Create a series of directories and files to organize the project effectively. Configure logging to record actions such as directory and file creation, or skipping files that already exist, which keeps the logging modular.
- Create a pipeline that downloads the data from Hugging Face and extracts it, preprocesses and transforms the data, trains the model on the prepared data, and evaluates the model's performance.
- The stages are executed in a sequential manner, ensuring that each step is completed before moving to the next.
- Each stage logs its initiation and completion for traceability. Errors are logged using logger.exception(), which captures both the error message and the stack trace.
- Each stage is wrapped in a try-except block to handle exceptions gracefully. If an exception occurs in any stage, it is logged and re-raised to stop further execution, so every issue is properly captured and the pipeline never continues in a failed state (see the sketch after this list).
- Now we have the model and tokenizer, which we load to run predictions on unseen text. A prediction pipeline is created that uses the Hugging Face summarization pipeline with this model and tokenizer to generate a summary for completely new text.
- For all of the above, we continuously integrate with GitHub by committing and pushing to the GitHub repository.
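A minimal sketch of how such a sequential, logged runner can look. The pipeline module names mirror the files under src/textSummarizer/pipeline; the stage class names, their main() entry point, and the logger import path are assumptions, not copied from the repository:

```python
# main.py -- sketch of a sequential stage runner with per-stage logging.
# Class names, main() entry points, and the logger import path are assumptions.
from src.textSummarizer.logging import logger
from src.textSummarizer.pipeline.stage1_dataIngestion_pipeline import DataIngestionPipeline
from src.textSummarizer.pipeline.stage2_DataTransformation_pipeline import DataTransformationPipeline
from src.textSummarizer.pipeline.stage3_modeltrainer_pipeline import ModelTrainerPipeline
from src.textSummarizer.pipeline.stage4_modelevaluation_pipeline import ModelEvaluationPipeline

STAGES = [
    ("Data Ingestion", DataIngestionPipeline),
    ("Data Transformation", DataTransformationPipeline),
    ("Model Trainer", ModelTrainerPipeline),
    ("Model Evaluation", ModelEvaluationPipeline),
]

for stage_name, stage_cls in STAGES:
    try:
        logger.info(f">>> stage {stage_name} started <<<")
        stage_cls().main()      # each stage completes before the next one starts
        logger.info(f">>> stage {stage_name} completed <<<")
    except Exception as e:
        logger.exception(e)     # records the error message and the stack trace
        raise e                 # stop the pipeline rather than continue in a failed state
```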
Next comes setting up a FastAPI web application that exposes a simple endpoint for text summarization by integrating with the prediction pipeline. The FastAPI application serves a REST API with two routes (a sketch follows the list below):
- Root Route (/): Redirects users to the API documentation (/docs).
- Prediction Route (/predict): Accepts a POST request with text, processes it through the PredictionPipeline, and returns the summarized text.
- The server runs on http://0.0.0.0:8080 when executed.
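A condensed sketch of what the two routes in app.py can look like. The PredictionPipeline class name comes from the pipeline module listed in the directory tree; its predict() method and the exact request shape are assumptions:

```python
# app.py -- sketch of the two-route FastAPI service described above.
# PredictionPipeline's predict() interface is an assumption.
import uvicorn
from fastapi import FastAPI
from fastapi.responses import RedirectResponse

from src.textSummarizer.pipeline.prediction_pipeline import PredictionPipeline

app = FastAPI()

@app.get("/")
async def index():
    # Root route: send users to the interactive API documentation.
    return RedirectResponse(url="/docs")

@app.post("/predict")
async def predict_route(text: str):
    # Prediction route: summarize the submitted text via the prediction pipeline.
    summary = PredictionPipeline().predict(text)
    return summary

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8080)
```

Once the server is up, opening http://localhost:8080/ redirects to /docs, where the /predict route can be tried interactively.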
## Getting Started
To get started with installation and setup, clone the repository and open the folder in VS Code:
```bash
git clone https://github.com/dhanushpittala11/SummarizerText_Hf_End2End_1.git
```
Create a virtual environment:

```bash
conda create -p venv python==3.10 -y
```

If conda is not installed, run this command in the terminal of the project environment:

```bash
pip install conda
```

Activate the environment and install the dependencies:

```bash
conda activate venv/
pip install -r requirements.txt
```

## Usage
Run the training pipeline, then start the web app:

```bash
python main.py
python app.py
```

## Directory Tree

```
.
├── .github
│   └── workflows                         -> workflow configuration files for GitHub Actions
├── .gitkeep
├── artifacts
│   ├── data_ingestion                    -> data files extracted from Hugging Face
│   │   ├── huggingface_dialogsum_dataset
│   │   └── data.zip
│   ├── data_transformation               -> transformed data: train, test and validation datasets
│   │   ├── test
│   │   ├── train
│   │   └── validation
│   ├── model_trainer                     -> the fine-tuned model and tokenizer
│   │   ├── pegasus-dialogsum-model
│   │   └── tokenizer
│   └── model_evaluation                  -> metrics saved after model evaluation
│       └── metrics.csv
├── config
│   └── config.yaml                       -> configuration for the end-to-end summarization pipeline: paths, URLs, model/tokenizer details
├── logs
│   └── continuos_logs.log                -> runtime information, errors and activity logs
├── research
│   ├── ModelTrainer1.ipynb               -> notebook for the pipeline that trains the model on the transformed data
│   ├── dialogsum.ipynb                   -> notebook for fine-tuning the model
│   ├── model_evaluation.ipynb            -> notebook for the pipeline that evaluates the model
│   ├── DataTransformation.ipynb          -> notebook for the pipeline that splits the ingested data into train, test and validation sets
│   ├── DataIngestion1.ipynb              -> notebook for setting up the data ingestion pipeline
│   └── research.ipynb
├── src
│   └── textSummarizer
│       ├── __pycache__
│       ├── components
│       │   ├── __init__.py
│       │   ├── data_ingestion.py         -> downloads a file from a specified URL and extracts it into a designated directory
│       │   ├── data_transformation.py    -> tokenizes dialogues and their summaries for model training and saves the data to disk
│       │   ├── model_trainer.py          -> trains a Pegasus summarization model and saves the trained model and tokenizer
│       │   └── model_evaluation.py       -> evaluates the trained model by computing ROUGE metrics on a test dataset
│       ├── config
│       │   ├── __init__.py
│       │   └── configuration.py          -> reads configuration and parameter files, creates directories, and builds configuration objects
│       ├── constants
│       │   └── __init__.py               -> paths to the YAML files
│       ├── entity
│       │   └── __init__.py               -> data classes that structure configuration settings for each pipeline component
│       ├── logging
│       │   └── __init__.py               -> logging setup that writes messages to file and console for tracking the summarization process
│       ├── pipeline
│       │   ├── __init__.py
│       │   ├── prediction_pipeline.py                -> prediction pipeline
│       │   ├── stage1_dataIngestion_pipeline.py      -> data ingestion pipeline
│       │   ├── stage2_DataTransformation_pipeline.py -> data transformation pipeline
│       │   ├── stage3_modeltrainer_pipeline.py       -> model trainer pipeline
│       │   └── stage4_modelevaluation_pipeline.py    -> model evaluation pipeline
│       ├── utils
│       │   ├── __init__.py
│       │   └── common.py                 -> reads YAML files into a ConfigBox and creates directories while logging the process (sketched below)
│       └── __init__.py
├── venv                                  -> virtual environment for the project
├── .gitignore                            -> files not to be tracked in version control
├── app.py                                -> FastAPI application serving the root and /predict routes
├── huggingface_dialogsum_dataset.zip     -> dataset used in the project
├── LICENSE
├── main.py                               -> orchestrates and logs the execution of all pipeline stages, handling exceptions for each stage
├── params.yaml                           -> training parameters
├── README.md
├── requirements.txt                      -> all required dependencies, libraries and packages
└── template.py                           -> creates the project directory and file structure, logging each creation or skip
```
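To illustrate the common.py utilities mentioned in the tree, here is a hedged sketch of what they could look like; the function names read_yaml and create_directories and their signatures are assumptions, not copied from the repository:

```python
# src/textSummarizer/utils/common.py -- sketch of the YAML/ConfigBox helpers.
# Function names and signatures are assumptions based on the description above.
import os
from pathlib import Path

import yaml
from box import ConfigBox  # python-box: attribute-style access to nested keys

from src.textSummarizer.logging import logger


def read_yaml(path_to_yaml: Path) -> ConfigBox:
    """Read a YAML file and return its contents as a ConfigBox."""
    with open(path_to_yaml) as f:
        content = yaml.safe_load(f)
    logger.info(f"yaml file {path_to_yaml} loaded successfully")
    return ConfigBox(content)


def create_directories(paths: list, verbose: bool = True):
    """Create every directory in `paths`, logging each creation."""
    for path in paths:
        os.makedirs(path, exist_ok=True)
        if verbose:
            logger.info(f"created directory at: {path}")
```

Returning a ConfigBox means nested settings from config.yaml can be read with attribute access (e.g. config.data_ingestion rather than config["data_ingestion"]; key name shown for illustration only), which keeps the code in configuration.py concise.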
## To Do
Deploy the app in the cloud and monitor the training, evaluation, and prediction parts.
## Bug / Feature Request
If you find a bug (the website couldn't handle the query and/or gave undesired results), kindly open an issue here, including your search query and the expected result.

If you'd like to request a new feature, feel free to do so by opening an issue here. Please include sample queries and their corresponding expected results.
## Techstack Used
- Git
- GitHub
- HuggingFace
- Transformers
- AutoTokenizer
- nltk
- Pegasus
- ROUGE
- PyTorch
- CUDA
- nvidia-smi
- Python Logging
- Logger
- WandB
- Pandas
- yaml
- ConfigBox
- FastAPI
- uvicorn
- Starlette
## License
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc. https://fsf.org/ Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for software and other kinds of works.
## Credits
Dhanush Pittala - @Linkedin - dhanushpittala05@gmail.com
- Link for the dataset used in the project:

```bibtex
@inproceedings{chen-etal-2021-dialogsum,
    title = "{D}ialog{S}um: {A} Real-Life Scenario Dialogue Summarization Dataset",
    author = "Chen, Yulong and Liu, Yang and Chen, Liang and Zhang, Yue",
    booktitle = "Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-acl.449",
    doi = "10.18653/v1/2021.findings-acl.449",
    pages = "5062--5074",
}
```
- Link for the model used:

```bibtex
@misc{zhang2019pegasus,
    title = {PEGASUS: Pre-training with Extracted Gap-sentences for Abstractive Summarization},
    author = {Jingqing Zhang and Yao Zhao and Mohammad Saleh and Peter J. Liu},
    year = {2019},
    eprint = {1912.08777},
    archivePrefix = {arXiv},
    primaryClass = {cs.CL}
}
```
