Document Analyzer

A web application that analyzes PDF documents and provides detailed text statistics and insights.

Features

- PDF document upload (drag & drop or file selection)
- Text analysis, including:
  - Word count
  - Character count (with and without spaces)
  - Sentence count
  - Average word length
- Word frequency analysis:
  - Top 20 most frequent words
  - Option to exclude common words (stop words)
- Responsive design (mobile-friendly)
- Real-time analysis with loading indicators
- Interactive word frequency visualization
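
To make the statistics above concrete, here is a minimal sketch of how they could be computed from text that has already been extracted from a PDF. It is not the project's implementation: it swaps spaCy for plain regular expressions so it runs with the standard library only. The function name `text_stats` and the tokenization rules are assumptions made for illustration; the dictionary keys mirror the API response fields.

```python
import re


def text_stats(text: str) -> dict:
    """Compute basic statistics for already-extracted document text."""
    # Words: runs of letters/digits, allowing internal apostrophes ("don't").
    words = re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)
    # Sentences: naive split on whitespace that follows '.', '!' or '?'.
    # spaCy's sentence segmentation would handle abbreviations far better.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    total_letters = sum(len(w) for w in words)
    return {
        "word_count": len(words),
        "char_count": len(text),
        # Assumption: "without spaces" means all whitespace is excluded.
        "char_count_no_spaces": len(re.sub(r"\s", "", text)),
        "sentence_count": len(sentences),
        "avg_word_length": round(total_letters / len(words), 2) if words else 0.0,
    }


stats = text_stats("Hello world. This is a test!")
print(stats)
```

The regex-based splitting is intentionally simple; results from the application may differ wherever spaCy tokenizes or segments sentences differently.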

Tech Stack

Frontend: HTML, CSS (Tailwind CSS), JavaScript
Backend: Python, Django
Text Processing: spaCy
PDF Processing: PyPDF2
Visualization: Chart.js

Installation

1. Create and activate a virtual environment:

```bash
python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate
```

2. Clone the repository:

```bash
git clone https://github.com/yuk14/document-analyzer.git
cd document-analyzer
```

3. Install dependencies:

```bash
pip install -r requirements.txt
```

4. Download the spaCy model:

```bash
python -m spacy download en_core_web_sm
```

5. Run the development server:

```bash
python manage.py runserver
```

Visit http://localhost:8000/upload/ to use the application.

Usage

1. Upload a PDF document using drag & drop or file selection.
2. Wait for the analysis to complete.
3. View detailed statistics about your document.
4. Toggle the "Exclude common words" option to filter out stop words.
5. Explore the word frequency visualization.

API Documentation

POST /upload/

Analyzes an uploaded PDF document.

Request:

Method: POST
Content-Type: multipart/form-data
Body:
- file: PDF file (required)
- exclude_stopwords: boolean (optional)

Response:

{
    "word_count": integer,
    "char_count": integer,
    "char_count_no_spaces": integer,
    "sentence_count": integer,
    "avg_word_length": float,
    "word_frequency": {
        "word1": count1,
        "word2": count2,
        ...
    }
}
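
As an illustration of the request and response described above, the following sketch builds the multipart upload with the Python standard library and decodes the JSON reply. Several details are assumptions rather than documented behavior: that `exclude_stopwords` is sent as the string `true` or `false`, that the endpoint accepts requests without a CSRF token, and that the server runs at `http://localhost:8000`. The helper names `build_upload_request` and `analyze_pdf` are hypothetical.

```python
import json
import os
import urllib.request
import uuid


def build_upload_request(pdf_bytes: bytes, filename: str,
                         exclude_stopwords: bool = False,
                         base_url: str = "http://localhost:8000") -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST for /upload/."""
    boundary = uuid.uuid4().hex
    # Assumption: the boolean is encoded as the string "true" or "false".
    flag = "true" if exclude_stopwords else "false"
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="exclude_stopwords"\r\n\r\n',
        flag.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/pdf\r\n\r\n",
        pdf_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(f"{base_url}/upload/", data=body,
                                  headers=headers, method="POST")


def analyze_pdf(path: str, exclude_stopwords: bool = False) -> dict:
    """Send the PDF at `path` to a running dev server and decode the JSON reply."""
    with open(path, "rb") as fh:
        request = build_upload_request(fh.read(), os.path.basename(path),
                                       exclude_stopwords)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With the development server running, a call such as `analyze_pdf("report.pdf", exclude_stopwords=True)` (where `report.pdf` is a placeholder path) would be expected to return a dictionary with the fields listed in the response above, provided the assumptions hold.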

Design Decisions and Trade-offs

- spaCy vs NLTK
  - Chose spaCy for better performance and easier setup
  - Trade-off: Larger package size but better accuracy
- Frontend Framework
  - Used vanilla JavaScript for simplicity
  - Trade-off: Less structured code but no build process required
- File Storage
  - Local storage in media directory
  - Trade-off: Simple setup but not scalable for production
- Word Frequency Analysis
  - Limited to top 20 words for performance
  - Trade-off: Less comprehensive but better visualization
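
The top-20 limit in the word frequency decision can be sketched with `collections.Counter`. This is an illustrative approximation, not the project's code: the function name `top_words` is hypothetical, the tokenization is a simple regex, and the stop-word set is a tiny stand-in for whatever list the application actually uses (presumably spaCy's).

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word set for illustration only.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "on", "it"}


def top_words(text: str, limit: int = 20, exclude_stopwords: bool = False) -> dict:
    """Return the `limit` most frequent lowercase words mapped to their counts."""
    words = re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*", text.lower())
    if exclude_stopwords:
        words = [w for w in words if w not in STOP_WORDS]
    # most_common() orders by descending count; the dict keeps that order.
    return dict(Counter(words).most_common(limit))


sample = "The cat sat on the mat. The mat is flat."
print(top_words(sample))
print(top_words(sample, exclude_stopwords=True))
```

Capping the result keeps the response payload and the chart small regardless of document length, which matches the trade-off stated above.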

Future Improvements

- Features
  - Support for PDFs containing images, handwritten text, scanned pages, etc.
  - Support for more file formats (DOC, DOCX, TXT)
  - Advanced text analysis (readability scores, sentiment analysis)
  - Export of analysis results
  - Batch processing of multiple files
- Technical
  - Add user authentication
  - Implement file upload progress
  - Add unit tests
  - Use Redis for caching analysis results
  - Implement cloud storage for files
- UI/UX
  - Add dark mode
  - Implement more interactive visualizations
  - Add comparison feature for multiple documents
  - Add a chatbot for real-time queries

Contributing

1. Fork the repository
2. Create your feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

License

This project is licensed under the MIT License (2025, Yukthi R); see the LICENSE file for details.
