Keyword Extractor

A web-based application that extracts the most important keywords from documents using TF-IDF (Term Frequency-Inverse Document Frequency) analysis.

Features

📄 Multiple File Formats: Supports TXT, PDF, and DOCX files
🔍 TF-IDF Analysis: Ranks terms by TF-IDF score to surface keywords that are specific to the document
🚫 Custom Stop Words: Exclude generic or unwanted words from results
📊 Configurable Results: Choose how many keywords to extract (1-50)
🎨 Modern UI: Clean, intuitive web interface
🔤 N-gram Support: Extracts both single words and two-word phrases

How It Works

The application uses TF-IDF (Term Frequency-Inverse Document Frequency) to identify keywords:

TF (Term Frequency): Measures how often a word appears in your document
IDF (Inverse Document Frequency): Measures how rare a word is across documents; words that appear in every document score lowest
TF-IDF Score: Combines both to find words that are frequent in your document but rare in general

As a result, common words like "the" or "and" score low, and the top results are terms specific to your document.
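
The scoring described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code in keyword_extractor.py: the function name tf_idf, the toy documents, and the unsmoothed formula tf * log(N / df) are all assumptions made for the example, and real libraries usually apply smoothing and normalization on top.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: score} dict per document.

    docs is a list of token lists. TF is the term count divided by the
    document length; IDF is the unsmoothed log(N / df).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores

# Toy corpus (hypothetical text, three tiny "documents").
docs = [
    "the gpu revenue grew and the gpu demand grew".split(),
    "the quarter was strong and the outlook is stable".split(),
    "the team shipped the update and the fix".split(),
]
scores = tf_idf(docs)

# Highest-scoring terms of the first document.
top = sorted(scores[0].items(), key=lambda kv: kv[1], reverse=True)[:2]
for term, score in top:
    print(f"{term}: {score:.4f}")
```

Because "the" and "and" occur in every toy document, their IDF is log(1) = 0 and they drop to the bottom of the ranking, while "gpu" and "grew" rise to the top.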

Installation

Prerequisites

Python 3.8 or higher
pip (Python package manager)

Setup

Clone the repository:

git clone <your-repo-url>
cd keyword-extractor

Create a virtual environment (recommended):

python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate

Install dependencies:

pip install -r requirements.txt

Usage

Start the application:

python app.py

Open your browser and go to:

http://localhost:5000

Use the application:
- Upload a document (TXT, PDF, or DOCX)
- Set the number of keywords you want to extract
- (Optional) Add custom words to exclude
- Click "Extract Keywords"

Example

For an NVIDIA earnings transcript, the application might extract keywords like:

data center
artificial intelligence
gpu
gaming revenue
h100
cuda
automotive

Generic terms such as "company," "quarter," and "growth" are filtered out automatically.

Configuration

Custom Stop Words

The application includes a comprehensive list of stop words:

Common English words (the, is, and, etc.)
Business jargon (quarter, growth, company, etc.)
Filler words (obviously, basically, actually, etc.)

You can add more custom stop words through the web interface.
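
As a rough sketch of how built-in and custom stop words might be combined, the snippet below filters a token list against both. The names BUILTIN_STOP_WORDS and remove_stop_words are hypothetical, the short word set is only a stand-in for the application's full list, and case-insensitive matching is an assumption.

```python
# Hypothetical stand-in for the built-in list; the real list is longer.
BUILTIN_STOP_WORDS = {"the", "is", "and", "quarter", "growth", "company", "basically"}

def remove_stop_words(tokens, custom_stop_words=()):
    """Drop tokens found in the built-in or user-supplied stop lists.

    Matching is case-insensitive (an assumption for this sketch).
    """
    custom = {w.strip().lower() for w in custom_stop_words if w.strip()}
    stop = BUILTIN_STOP_WORDS | custom
    return [t for t in tokens if t.lower() not in stop]

tokens = "The company said GPU demand is basically unchanged".split()
filtered = remove_stop_words(tokens, custom_stop_words=["said", "Unchanged"])
print(filtered)
```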

File Size Limit

The default maximum upload size is 16 MB. To change it, edit this line in app.py:

app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024  # Change this value

Project Structure

keyword-extractor/
├── app.py                 # Flask application and routes
├── keyword_extractor.py   # Core keyword extraction logic
├── requirements.txt       # Python dependencies
├── README.md             # This file
├── templates/
│   └── index.html        # Web interface
└── uploads/              # Temporary file storage (auto-created)

Technical Details

TF-IDF Implementation

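With only one document, every term would receive the same IDF, so the score would reduce to term frequency alone. To avoid this, the application splits each document into chunks and treats the chunks as separate documents when computing IDF, even for single-document analysis. This helps identify truly distinctive terms.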
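
The effect of chunking on IDF can be illustrated with a small sketch. The helper names split_into_chunks and chunk_idf, the fixed chunk size, and the unsmoothed log(N / df) formula are assumptions for the example; the application's actual chunking strategy is not shown here.

```python
import math

def split_into_chunks(tokens, chunk_size):
    """Split a token list into consecutive chunks of at most chunk_size tokens."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def chunk_idf(chunks):
    """IDF per term, treating each chunk as one document: log(N / df)."""
    chunk_sets = [set(chunk) for chunk in chunks]
    n_chunks = len(chunk_sets)
    vocab = set().union(*chunk_sets) if chunk_sets else set()
    return {
        term: math.log(n_chunks / sum(term in s for s in chunk_sets))
        for term in vocab
    }

tokens = "the gpu sales rose the cloud sales rose the gpu margin fell".split()

# Whole document as a single "chunk": every term gets IDF 0.
single = chunk_idf([tokens])

# Three chunks of four tokens: terms confined to one chunk now stand out.
chunks = split_into_chunks(tokens, chunk_size=4)
idf = chunk_idf(chunks)
for term in ("the", "gpu", "cloud"):
    print(term, round(idf[term], 4))
```

In the single-chunk case all IDF values are zero, which is why some form of splitting is needed before TF-IDF can rank terms within one document.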

N-grams

The extractor looks for both:

Unigrams: Single words (e.g., "revenue")
Bigrams: Two-word phrases (e.g., "data center")

This captures important multi-word concepts that would be lost with single-word analysis.
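
A minimal sketch of unigram and bigram generation follows. The function names are hypothetical, and whitespace tokenization with lowercasing is an assumption; the application may tokenize differently.

```python
def ngrams(tokens, n):
    """Return all runs of n consecutive tokens, joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def unigrams_and_bigrams(text):
    """Lowercase, split on whitespace, and return unigrams followed by bigrams."""
    tokens = text.lower().split()
    return ngrams(tokens, 1) + ngrams(tokens, 2)

terms = unigrams_and_bigrams("Data center revenue grew")
print(terms)
```

Each candidate term, single word or phrase, would then be scored in the same way.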

API Endpoint

POST /extract

Extract keywords from an uploaded document.

Parameters:

file: The document file (multipart/form-data)
n_keywords: Number of keywords to extract (integer, 1-50)
exclude_words: Comma- or newline-separated list of words to exclude (optional)
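
The sketch below shows one way a server could interpret these two form fields. It is illustrative only: the helper names, the fallback default of 10, and the choice to clamp out-of-range values rather than reject them are assumptions, not the behavior of app.py.

```python
import re

def parse_n_keywords(raw, default=10, low=1, high=50):
    """Convert the n_keywords form value to an int clamped to [low, high].

    Falls back to `default` (an assumed value) when the input is missing
    or not an integer.
    """
    try:
        value = int(raw)
    except (TypeError, ValueError):
        return default
    return max(low, min(high, value))

def parse_exclude_words(raw):
    """Split a comma- or newline-separated string into lowercase words."""
    if not raw:
        return []
    return [w.strip().lower() for w in re.split(r"[,\n]", raw) if w.strip()]

print(parse_n_keywords("75"))
print(parse_exclude_words("Revenue, quarter\nguidance\n"))
```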

Response:

{
  "keywords": [
    ["keyword1", 0.1234],
    ["keyword2", 0.0987],
    ...
  ],
  "count": 10
}
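
A client could unpack a response of this shape as sketched below. The sample body is made up for illustration (the terms and scores are not real output), and parse_keywords is a hypothetical helper name.

```python
import json

def parse_keywords(body):
    """Turn the JSON response text into a list of (keyword, score) tuples."""
    payload = json.loads(body)
    return [(term, float(score)) for term, score in payload["keywords"]]

# Illustrative response body following the documented shape.
sample_body = '{"keywords": [["data center", 0.1234], ["gpu", 0.0987]], "count": 2}'

ranked = parse_keywords(sample_body)
for term, score in ranked:
    print(f"{score:.4f}  {term}")
```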

Troubleshooting

"No module named 'flask'"

Make sure you've activated your virtual environment and installed dependencies:

pip install -r requirements.txt

PDF extraction fails

PDFs that consist of scanned images have no text layer, so extraction may return little or no text. Use text-based PDFs for best results.

Port already in use

If port 5000 is busy, modify the port in app.py:

app.run(debug=True, host='0.0.0.0', port=5001)  # Use a different port
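
To find out which port is available before editing app.py, a quick check with the standard library might look like the sketch below. The helper names is_port_free and first_free_port are hypothetical, and the check only probes the loopback address by default, so treat the result as a hint rather than a guarantee.

```python
import socket

def is_port_free(port, host="127.0.0.1"):
    """Return True if a TCP socket can be bound to (host, port) right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            # Typically "address already in use".
            return False
    return True

def first_free_port(start=5000, attempts=20):
    """Return the first free port in [start, start + attempts)."""
    for port in range(start, start + attempts):
        if is_port_free(port):
            return port
    raise RuntimeError(f"no free port found in {start}-{start + attempts - 1}")

print(first_free_port(5000))
```

The printed number can then be passed as the port argument to app.run.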

License

MIT License - feel free to use this project for any purpose.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Future Enhancements

Potential features for future versions:

Named Entity Recognition (NER) for identifying proper nouns
Topic modeling with LDA
Export results to CSV/JSON
Batch processing of multiple documents
Keyword visualization with word clouds
Language detection and multi-language support
