A web-based application that extracts the most important keywords from documents using TF-IDF (Term Frequency-Inverse Document Frequency) analysis.
- 📄 Multiple File Formats: Supports TXT, PDF, and DOCX files
- 🔍 TF-IDF Analysis: Scores terms statistically to surface meaningful keywords
- 🚫 Custom Stop Words: Exclude generic or unwanted words from results
- 📊 Configurable Results: Choose how many keywords to extract (1-50)
- 🎨 Modern UI: Clean, intuitive web interface
- 🔤 N-gram Support: Extracts both single words and two-word phrases
The application uses TF-IDF (Term Frequency-Inverse Document Frequency) to identify keywords:
- TF (Term Frequency): Measures how often a word appears in your document
- IDF (Inverse Document Frequency): Measures how unique a word is across documents
- TF-IDF Score: Combines both to find words that are frequent in your document but rare in general
This ensures you get meaningful, document-specific keywords rather than common words like "the" or "and".
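The scoring above can be sketched in a few lines of Python. This is an illustrative toy implementation of the TF-IDF formula, not the app's actual code in keyword_extractor.py:

```python
# Toy TF-IDF: score terms that are frequent in one document but rare
# across the corpus. Each document is a list of tokens.
import math
from collections import Counter

def tf_idf(docs):
    """Return a {term: score} dict for each document in `docs`."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores

docs = [
    ["gpu", "gpu", "revenue"],
    ["revenue", "growth"],
]
scores = tf_idf(docs)
# "gpu" appears only in the first document, so it gets a positive score;
# "revenue" appears in every document, so its IDF is log(2/2) = 0.
```

Note how a term that occurs in every document scores exactly zero, which is precisely why ubiquitous words like "the" drop out.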
- Python 3.8 or higher
- pip (Python package manager)
- Clone the repository:

```bash
git clone <your-repo-url>
cd keyword-extractor
```

- Create a virtual environment (recommended):

```bash
python -m venv venv
# On Windows:
venv\Scripts\activate
# On macOS/Linux:
source venv/bin/activate
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Start the application:

```bash
python app.py
```

- Open your browser and go to:

```
http://localhost:5000
```
- Use the application:
- Upload a document (TXT, PDF, or DOCX)
- Set the number of keywords you want to extract
- (Optional) Add custom words to exclude
- Click "Extract Keywords"
For an NVIDIA earnings transcript, the application might extract keywords like:
- data center
- artificial intelligence
- gpu
- gaming revenue
- h100
- cuda
- automotive
Generic terms like "company," "quarter," and "growth" are filtered out automatically.
The application includes a comprehensive list of stop words:
- Common English words (the, is, and, etc.)
- Business jargon (quarter, growth, company, etc.)
- Filler words (obviously, basically, actually, etc.)
You can add more custom stop words through the web interface.
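Merging the built-in list with user-supplied words could look like the sketch below. The set contents and function name are illustrative, not the app's actual implementation; the input format matches the exclude_words field described later (comma- or newline-separated):

```python
# Illustrative stop-word merging; BASE_STOP_WORDS is a stand-in for the
# app's real built-in list.
BASE_STOP_WORDS = {"the", "is", "and", "quarter", "growth", "basically"}

def build_stop_words(custom=""):
    """Merge a comma- or newline-separated string of extra stop words."""
    extra = {
        w.strip().lower()
        for w in custom.replace("\n", ",").split(",")
        if w.strip()
    }
    return BASE_STOP_WORDS | extra

sw = build_stop_words("company, obviously")
# "company" and "obviously" are now filtered alongside the defaults.
```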
Default maximum file size is 16MB. You can modify this in app.py:
```python
app.config['MAX_CONTENT_LENGTH'] = 16 * 1024 * 1024  # Change this value
```

```
keyword-extractor/
├── app.py                  # Flask application and routes
├── keyword_extractor.py    # Core keyword extraction logic
├── requirements.txt        # Python dependencies
├── README.md               # This file
├── templates/
│   └── index.html          # Web interface
└── uploads/                # Temporary file storage (auto-created)
```
The application splits documents into chunks to improve IDF calculations, even for single-document analysis. This helps identify truly distinctive terms.
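The chunking idea can be sketched as follows: fixed-size slices of a single document act as a mini-corpus, so IDF has something to compare against. The chunk size here is illustrative, not the app's actual setting:

```python
# Split one document's tokens into consecutive fixed-size chunks so
# that IDF can be computed across chunks instead of across documents.
def chunk(tokens, size=200):
    """Return consecutive slices of `size` tokens (last may be shorter)."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

tokens = ["word"] * 450
chunks = chunk(tokens, size=200)
# 450 tokens -> chunks of 200, 200, and 50.
```

A term concentrated in one chunk then earns a high IDF relative to the rest of the same document, which is what makes it "distinctive."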
The extractor looks for both:
- Unigrams: Single words (e.g., "revenue")
- Bigrams: Two-word phrases (e.g., "data center")
This captures important multi-word concepts that would be lost with single-word analysis.
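Candidate generation for both n-gram sizes can be sketched like this (illustrative; the app's real tokenizer may differ):

```python
# Generate unigram and bigram candidates from a token list.
def ngrams(tokens):
    unigrams = list(tokens)
    # Pair each token with its successor to form two-word phrases.
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return unigrams + bigrams

candidates = ngrams(["data", "center", "revenue"])
# -> ['data', 'center', 'revenue', 'data center', 'center revenue']
```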
Extract keywords from an uploaded document.
Parameters:
- `file`: The document file (multipart/form-data)
- `n_keywords`: Number of keywords to extract (integer, 1-50)
- `exclude_words`: Comma- or newline-separated list of words to exclude (optional)
Response:
```json
{
  "keywords": [
    ["keyword1", 0.1234],
    ["keyword2", 0.0987],
    ...
  ],
  "count": 10
}
```

Make sure you've activated your virtual environment and installed dependencies:
```bash
pip install -r requirements.txt
```

Some PDFs may not extract properly if they're scanned images. Use text-based PDFs for best results.
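For programmatic use, a minimal client sketch is below. It assumes the upload route is `/extract` and that the form field names match the parameters listed above; check `app.py` for the actual route before relying on it. It also requires the third-party `requests` package:

```python
# Hypothetical client for the keyword-extraction endpoint.
# Assumes POST /extract -- verify the route and field names in app.py.
def extract_keywords(path, n_keywords=10, exclude_words=""):
    """POST a document and return the parsed [keyword, score] pairs."""
    import requests  # third-party: pip install requests

    with open(path, "rb") as f:
        resp = requests.post(
            "http://localhost:5000/extract",
            files={"file": f},
            data={"n_keywords": n_keywords, "exclude_words": exclude_words},
        )
    resp.raise_for_status()
    return resp.json()["keywords"]
```

Each returned pair mirrors the response shape shown above: a keyword and its TF-IDF score.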
If port 5000 is busy, modify the port in app.py:
```python
app.run(debug=True, host='0.0.0.0', port=5001)  # Use a different port
```

MIT License - feel free to use this project for any purpose.
Contributions are welcome! Please feel free to submit a Pull Request.
Potential features for future versions:
- Named Entity Recognition (NER) for identifying proper nouns
- Topic modeling with LDA
- Export results to CSV/JSON
- Batch processing of multiple documents
- Keyword visualization with word clouds
- Language detection and multi-language support