A web application that extracts important keywords from documents using TF-IDF analysis

PhilSing24/keyword-extractor

Keyword Extractor

A web-based application that extracts the most important keywords from documents using TF-IDF (Term Frequency-Inverse Document Frequency) analysis.

Features

  • 📄 Multiple File Formats: Supports TXT, PDF, and DOCX files
  • 🔍 TF-IDF Analysis: Ranks terms by TF-IDF score to surface keywords specific to the document
  • 🚫 Custom Stop Words: Exclude generic or unwanted words from results
  • 📊 Configurable Results: Choose how many keywords to extract (1-50)
  • 🎨 Modern UI: Clean, intuitive web interface
  • 🔤 N-gram Support: Extracts both single words and two-word phrases

How It Works

The application uses TF-IDF (Term Frequency-Inverse Document Frequency) to identify keywords:

  1. TF (Term Frequency): Measures how often a word appears in your document
  2. IDF (Inverse Document Frequency): Measures how rare a word is across documents (here, across chunks of the uploaded document)
  3. TF-IDF Score: Combines both to find words that are frequent in your document but rare in general

This ensures you get meaningful, document-specific keywords rather than common words like "the" or "and".
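
As a rough illustration of the three steps above, here is a minimal sketch of the textbook TF-IDF formula in plain Python. The function name `tf_idf` and the sample sentences are made up for this example, and the application's actual weighting may differ (many libraries add smoothing or normalization).

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: score} dict per document (hypothetical helper).

    TF  = term count / total terms in the document
    IDF = log(N / number of documents containing the term)
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

# Illustrative sentences, not taken from the project.
docs = [
    "the gpu powers the data center",
    "the quarter was strong",
    "the company grew",
]
scores = tf_idf(docs)
```

In this toy corpus "the" occurs in every document, so its IDF is log(3/3) = 0 and its score is zero, while "gpu" occurs in only one document and gets a positive score.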

Installation

Prerequisites

  • Python 3.8 or higher
  • pip (Python package manager)

Setup

  1. Clone the repository:
git clone <your-repo-url>
cd keyword-extractor
  2. Create a virtual environment (recommended):
python -m venv venv

# On Windows:
venv\Scripts\activate

# On macOS/Linux:
source venv/bin/activate
  3. Install dependencies:
pip install -r requirements.txt

Usage

  1. Start the application:
python app.py
  2. Open your browser and go to:
http://localhost:5000
  3. Use the application:
    • Upload a document (TXT, PDF, or DOCX)
    • Set the number of keywords you want to extract
    • (Optional) Add custom words to exclude
    • Click "Extract Keywords"

Example

For an Nvidia earnings transcript, the application might extract keywords like:

  • data center
  • artificial intelligence
  • gpu
  • gaming revenue
  • h100
  • cuda
  • automotive

Generic terms such as "company," "quarter," and "growth" are filtered out automatically.

Configuration

Custom Stop Words

The application ships with a built-in stop word list covering:

  • Common English words (the, is, and, etc.)
  • Business jargon (quarter, growth, company, etc.)
  • Filler words (obviously, basically, actually, etc.)

You can add more custom stop words through the web interface.
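
A minimal sketch of how custom stop words could be merged with a built-in list and applied to tokens. The names `DEFAULT_STOP_WORDS`, `parse_custom_stop_words`, and `remove_stop_words` are hypothetical, and the word set is a tiny subset chosen for illustration. The comma- or newline-separated input format follows the `exclude_words` parameter described under API Endpoint.

```python
import re

# Small illustrative subset (assumption); the application's built-in list is longer.
DEFAULT_STOP_WORDS = {"the", "is", "and", "quarter", "growth", "company", "basically"}

def parse_custom_stop_words(raw):
    """Split comma- or newline-separated input into a set of lowercase words."""
    return {word.strip().lower() for word in re.split(r"[,\n]", raw) if word.strip()}

def remove_stop_words(tokens, custom_raw=""):
    """Drop tokens that appear in the default or the user-supplied stop word list."""
    stop_words = DEFAULT_STOP_WORDS | parse_custom_stop_words(custom_raw)
    return [token for token in tokens if token.lower() not in stop_words]

tokens = ["the", "gpu", "company", "revenue", "and", "cuda"]
filtered = remove_stop_words(tokens, custom_raw="revenue,\nmargin")
```

Here `filtered` keeps only "gpu" and "cuda": the other tokens are either in the default set or in the custom input.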

File Size Limit

The default maximum upload size is 16 MB. To change it, edit this line in app.py:

app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024  # Change this value

Project Structure

keyword-extractor/
├── app.py                 # Flask application and routes
├── keyword_extractor.py   # Core keyword extraction logic
├── requirements.txt       # Python dependencies
├── README.md             # This file
├── templates/
│   └── index.html        # Web interface
└── uploads/              # Temporary file storage (auto-created)

Technical Details

TF-IDF Implementation

The application splits documents into chunks to improve IDF calculations, even for single-document analysis. This helps identify truly distinctive terms.
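
To see why chunking matters: with a single document, the textbook IDF is log(1/1) = 0 for every term, so nothing stands out. The sketch below treats each chunk as its own pseudo-document. The sentence-based splitting and the helper names `split_into_chunks` and `chunk_idf` are assumptions for illustration, not necessarily how keyword_extractor.py does it.

```python
import math
import re
from collections import Counter

def split_into_chunks(text, sentences_per_chunk=2):
    """Split text into pseudo-documents of a few sentences each (assumed strategy)."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [
        " ".join(sentences[i:i + sentences_per_chunk])
        for i in range(0, len(sentences), sentences_per_chunk)
    ]

def chunk_idf(chunks):
    """IDF of each term, treating every chunk as a separate document."""
    n_chunks = len(chunks)
    doc_freq = Counter()
    for chunk in chunks:
        doc_freq.update(set(re.findall(r"[a-z0-9]+", chunk.lower())))
    return {term: math.log(n_chunks / df) for term, df in doc_freq.items()}

# Illustrative text, not taken from the project.
text = (
    "The data center business grew. The gaming segment was flat. "
    "The automotive unit expanded. The outlook is stable."
)
chunks = split_into_chunks(text, sentences_per_chunk=1)
idf = chunk_idf(chunks)
```

With four one-sentence chunks, "the" appears in all of them and gets an IDF of 0, while "gaming" appears in one chunk and gets log(4).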

N-grams

The extractor looks for both:

  • Unigrams: Single words (e.g., "revenue")
  • Bigrams: Two-word phrases (e.g., "data center")

This captures important multi-word concepts that would be lost with single-word analysis.
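
A minimal sketch of unigram and bigram extraction in plain Python. The function `extract_ngrams` and its simple alphanumeric tokenizer are assumptions for illustration; the application may tokenize differently.

```python
import re

def extract_ngrams(text, max_n=2):
    """Return all n-grams of length 1..max_n as space-joined strings."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    ngrams = []
    for n in range(1, max_n + 1):
        # Slide a window of n tokens across the token list.
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

terms = extract_ngrams("Data center revenue grew")
```

For this input, `terms` holds the four single words followed by the three adjacent pairs, including "data center".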

API Endpoint

POST /extract

Extract keywords from an uploaded document.

Parameters:

  • file: The document file (multipart/form-data)
  • n_keywords: Number of keywords to extract (integer, 1-50)
  • exclude_words: Comma- or newline-separated list of words to exclude (optional)

Response:

{
  "keywords": [
    ["keyword1", 0.1234],
    ["keyword2", 0.0987],
    ...
  ],
  "count": 10
}
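
A client could consume this response roughly as follows. This is a sketch using only the standard library: `parse_extract_response` is a hypothetical helper, and the sample body is shortened to two entries with a matching count, which assumes `count` reflects the number of keywords returned.

```python
import json

def parse_extract_response(body):
    """Turn the JSON body into a list of (keyword, score) tuples plus the count."""
    payload = json.loads(body)
    keywords = [(term, float(score)) for term, score in payload["keywords"]]
    return keywords, payload["count"]

# Shortened sample body (assumption: count equals the number of keywords returned).
sample = '{"keywords": [["keyword1", 0.1234], ["keyword2", 0.0987]], "count": 2}'
keywords, count = parse_extract_response(sample)

for term, score in keywords:
    print(f"{term}: {score:.4f}")
```

Each entry in "keywords" is a two-element array, so unpacking it into a tuple keeps the keyword and its score together.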

Troubleshooting

"No module named 'flask'"

Make sure you've activated your virtual environment and installed dependencies:

pip install -r requirements.txt

PDF extraction fails

PDFs that consist of scanned images usually have no text layer, so extraction may return little or no text. Use text-based PDFs for best results.

Port already in use

If port 5000 is busy, modify the port in app.py:

app.run(debug=True, host='0.0.0.0', port=5001)  # Use a different port

License

MIT License - feel free to use this project for any purpose.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Future Enhancements

Potential features for future versions:

  • Named Entity Recognition (NER) for identifying proper nouns
  • Topic modeling with LDA
  • Export results to CSV/JSON
  • Batch processing of multiple documents
  • Keyword visualization with word clouds
  • Language detection and multi-language support
