NLP

An unsupervised machine-learning project that builds a complete summarisation pipeline using Latent Semantic Analysis (LSA) and sentiment analysis.

A news article summarization system that uses Latent Semantic Analysis (LSA) to automatically generate concise summaries of news articles.

Overview

This project implements a complete Natural Language Processing (NLP) pipeline for text summarization using Latent Semantic Analysis. The system passes raw news articles through parsing, tokenization, and stemming, then extracts the most important sentences to form an extractive summary.

Architecture

The summarizer follows a pipeline architecture with the following components:

PlainTextParser - Article preprocessing and parsing
Tokenizer - Text segmentation and tokenization
Stemmer - Word normalization and stemming
LSASummarizer - Core summarization engine using LSA
Output Generator - Summary formatting and presentation

📊 App Flow

Component Details

1. PlainTextParser

Purpose: Extracts and cleans text content from news articles

Key Functions:

HTML/XML tag removal and content extraction
Text normalization and encoding handling
Article metadata extraction (title, author, date)
Paragraph and sentence boundary detection
Special character and noise removal

Implementation Features:

Supports multiple input formats (HTML, plain text, XML)
Handles different encoding standards
Removes advertisements and non-content elements
Preserves paragraph structure for context
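The parsing step above can be sketched with the Python standard library alone. This is a minimal illustration, not the project's actual parser: the class name `TextExtractor`, the function `extract_paragraphs`, and the choice of which tags count as non-content are all assumptions made for the example.

```python
from html.parser import HTMLParser
import re


class TextExtractor(HTMLParser):
    """Collects visible text and skips elements that rarely hold article content."""

    # Hypothetical tag sets chosen for illustration only.
    SKIP = {"script", "style", "nav", "aside", "footer"}
    BLOCK = {"p", "div", "br", "h1", "h2", "h3", "li"}

    def __init__(self):
        super().__init__(convert_charrefs=True)  # decodes entities such as &amp;
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1
        elif tag in self.BLOCK:
            self.parts.append("\n\n")  # mark a paragraph boundary

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1
        elif tag in self.BLOCK:
            self.parts.append("\n\n")

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def extract_paragraphs(raw):
    """Return a list of cleaned paragraphs from HTML or plain text."""
    parser = TextExtractor()
    parser.feed(raw)
    parser.close()  # flush any buffered text
    text = "".join(parser.parts)
    paragraphs = []
    for block in re.split(r"\n\s*\n", text):
        cleaned = re.sub(r"\s+", " ", block).strip()  # collapse whitespace
        if cleaned:
            paragraphs.append(cleaned)
    return paragraphs
```

Plain text passes through unchanged apart from whitespace cleanup, because blank lines are treated as paragraph boundaries in both cases.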

2. Tokenizer

Purpose: Splits text into meaningful units for processing

Key Functions:

Sentence tokenization (sentence boundary detection)
Word tokenization (breaking sentences into words)
Punctuation handling and removal
Contraction expansion (e.g., "don't" → "do not")
Case normalization

Implementation Features:

Language-specific tokenization rules
Support for multiple languages
Customizable tokenization patterns
Handles abbreviations and special characters
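A rough sketch of the tokenization functions listed above, using only the standard library. The abbreviation and contraction tables are tiny, hypothetical samples, and the function names `split_sentences` and `tokenize_words` are assumptions for illustration; a production tokenizer would need far larger tables or a dedicated library.

```python
import re

# Illustrative samples only; real tables would be much larger.
ABBREVIATIONS = {"mr.", "mrs.", "dr.", "u.s.", "e.g.", "i.e."}
CONTRACTIONS = {
    "don't": "do not",
    "can't": "cannot",
    "won't": "will not",
    "it's": "it is",
}


def split_sentences(text):
    """End a sentence at a token ending in . ! or ? unless it is a known abbreviation."""
    sentences, current = [], []
    for token in text.split():
        current.append(token)
        if token[-1] in ".!?" and token.lower() not in ABBREVIATIONS:
            sentences.append(" ".join(current))
            current = []
    if current:  # trailing text without closing punctuation
        sentences.append(" ".join(current))
    return sentences


def tokenize_words(sentence):
    """Lowercase, expand known contractions, and drop punctuation."""
    lowered = sentence.lower().replace("\u2019", "'")  # normalize curly apostrophes
    for short, full in CONTRACTIONS.items():
        lowered = re.sub(r"\b" + re.escape(short) + r"\b", full, lowered)
    # Keep alphanumeric runs; allow one internal apostrophe for unlisted contractions.
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?", lowered)
```

Contractions that are not in the table (for example "didn't") are kept as single tokens instead of being expanded.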

3. Stemmer

Purpose: Reduces words to their root form for better semantic analysis

Key Functions:

Porter stemming algorithm implementation
Lemmatization alternatives
Stop word removal
Word normalization
Handling irregular forms

Implementation Features:

Multiple stemming algorithms supported
Custom stop word lists for different domains
Performance optimization for large texts
Language-specific stemming rules
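To make the normalization step concrete, here is a deliberately crude suffix-stripping sketch combined with stop word removal. It is not the Porter algorithm (which applies several ordered rule phases with measure conditions); the rule table, the stop word list, and the names `crude_stem` and `normalize` are assumptions for illustration.

```python
# Small illustrative stop word list; domain-specific lists would replace it.
STOP_WORDS = frozenset({
    "a", "an", "the", "and", "or", "of", "to", "in", "on",
    "is", "are", "was", "were", "it", "that", "this", "for", "with",
})

# (suffix, replacement) pairs, tried in order. A simplified stand-in for Porter's rules.
SUFFIX_RULES = (
    ("sses", "ss"),
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("ly", ""),
    ("ss", "ss"),  # keep a double "s" intact, e.g. "press"
    ("s", ""),
)


def crude_stem(word, min_stem=3):
    """Apply the first rule that leaves a stem of at least min_stem characters."""
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)] + replacement
            if len(stem) >= min_stem:
                return stem
    return word


def normalize(tokens):
    """Drop stop words, then stem the remaining lowercase tokens."""
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

The `min_stem` guard keeps short words such as "sing" or "bed" from being reduced to a single letter.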

4. LSASummarizer (Core Component)

Purpose: Performs Latent Semantic Analysis to identify important sentences

Key Functions:

Term-sentence matrix construction (each sentence is treated as a document)
Singular Value Decomposition (SVD) implementation
Sentence scoring based on semantic importance
Optimal sentence selection algorithm
Redundancy reduction

Mathematical Foundation:

Constructs the term-sentence matrix A (m x n: m terms, n sentences)
Computes SVD: A = UΣVᵀ
Uses the right singular vectors (the V matrix) to identify important sentences
Applies dimensionality reduction for efficiency
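The steps above can be sketched with NumPy. The README does not say which scoring rule the project uses, so this example assumes one common choice: score each sentence by the length of its vector in the reduced topic space, that is, the square root of the sum of (σᵢ · Vᵀ[i, j])² over the top k topics. The function name `lsa_sentence_scores`, the raw term-count weighting, and the `n_topics` parameter are assumptions for illustration, and the input is assumed to contain at least one non-empty sentence.

```python
import numpy as np


def lsa_sentence_scores(sentences_tokens, n_topics=2):
    """Score sentences from the SVD of a term-sentence count matrix.

    sentences_tokens: list of token lists, one per sentence (already normalized).
    """
    vocab = sorted({term for sent in sentences_tokens for term in sent})
    index = {term: i for i, term in enumerate(vocab)}

    # A has one row per term (m) and one column per sentence (n).
    A = np.zeros((len(vocab), len(sentences_tokens)))
    for j, sent in enumerate(sentences_tokens):
        for term in sent:
            A[index[term], j] += 1.0

    # A = U @ diag(sigma) @ Vt; singular values come back in descending order.
    U, sigma, Vt = np.linalg.svd(A, full_matrices=False)

    # Dimensionality reduction: keep only the k strongest topics.
    k = min(n_topics, len(sigma))
    weighted = sigma[:k, None] * Vt[:k, :]

    # Assumed scoring rule: vector length of each sentence in topic space.
    return np.sqrt((weighted ** 2).sum(axis=0))
```

Squaring the entries makes the scores independent of the arbitrary sign of each singular vector. With k equal to the full rank, each score reduces to the Euclidean norm of the sentence's column in A, so the reduction to fewer topics is what lets dominant themes outweigh isolated sentences.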

Implementation Features:

Configurable summary length (percentage or absolute)
Tuning parameters for different article types
Memory-efficient matrix operations
Parallel processing support for large articles
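One way the configurable summary length could work is sketched below. It assumes per-sentence scores are already available and treats a float as a fraction of the article and an int as an absolute sentence count. The function name `select_sentences` and this float/int convention are assumptions, not the project's documented interface.

```python
import math


def select_sentences(sentences, scores, length):
    """Pick the top-scoring sentences and return them in original order.

    length: float in (0, 1] means a fraction of the sentences;
            int >= 1 means an absolute number of sentences.
    """
    if isinstance(length, float):
        if not 0.0 < length <= 1.0:
            raise ValueError("fractional length must be in (0, 1]")
        count = max(1, math.ceil(len(sentences) * length))
    else:
        count = min(max(1, int(length)), len(sentences))

    # Rank sentence indices by score, highest first.
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    # Restore document order so the summary reads naturally.
    chosen = sorted(ranked[:count])
    return [sentences[i] for i in chosen]
```

Returning the selected sentences in their original order, not in score order, keeps the summary readable as a shortened version of the article.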
