A web application that analyzes PDF documents and provides detailed text statistics and insights.
- PDF document upload (drag & drop or file selection)
- Text analysis including:
  - Word count
  - Character count (with and without spaces)
  - Sentence count
  - Average word length
- Word frequency analysis
  - Top 20 most frequent words
  - Option to exclude common words (stop words)
- Responsive design (mobile-friendly)
- Real-time analysis with loading indicators
- Interactive word frequency visualization
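
The statistics listed above can be illustrated with a short standard-library sketch. This is not the project's implementation: the application uses spaCy, while this example substitutes naive regex tokenization so it runs without any model download. The function name `analyze_text`, the tiny `STOP_WORDS` set, and the choice to apply stop-word filtering only to the frequency table are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; spaCy ships a much larger one.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "it", "on"}


def analyze_text(text: str, exclude_stopwords: bool = False, top_n: int = 20) -> dict:
    """Compute basic text statistics using regex-based tokenization."""
    # Words: runs of letters, digits, or apostrophes, lowercased.
    words = [w.lower() for w in re.findall(r"[A-Za-z0-9']+", text)]
    # Sentences: naive split on ., !, or ? followed by whitespace or end of text.
    sentences = [s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()]
    # Assumption: stop words are removed from the frequency table only,
    # not from the overall word count.
    counted = [w for w in words if not (exclude_stopwords and w in STOP_WORDS)]
    avg_len = sum(len(w) for w in words) / len(words) if words else 0.0
    return {
        "word_count": len(words),
        "char_count": len(text),
        "char_count_no_spaces": len(re.sub(r"\s", "", text)),
        "sentence_count": len(sentences),
        "avg_word_length": round(avg_len, 2),
        # Keep only the top_n most frequent words (20 by default).
        "word_frequency": dict(Counter(counted).most_common(top_n)),
    }


stats = analyze_text("The cat sat. The cat ran!", exclude_stopwords=True)
print(stats["word_frequency"])
```

A real tokenizer such as spaCy's handles abbreviations, hyphenation, and sentence boundaries far more carefully than these regular expressions, so counts from this sketch may differ from what the application reports.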
- Frontend: HTML, CSS (Tailwind CSS), JavaScript
- Backend: Python, Django
- Text Processing: spaCy
- PDF Processing: PyPDF2
- Visualization: Chart.js
1. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate   # macOS/Linux
venv\Scripts\activate      # Windows
```
2. Clone the repository:
```bash
git clone https://github.com/yuk14/document-analyzer.git
cd document-analyzer
```
3. Install dependencies:
```bash
pip install -r requirements.txt
```
4. Download the spaCy model:
```bash
python -m spacy download en_core_web_sm
```
5. Run the development server:
```bash
python manage.py runserver
```
Visit http://localhost:8000/upload/ to use the application.
- Upload a PDF document using drag & drop or file selection
- Wait for the analysis to complete
- View detailed statistics about your document
- Toggle the "Exclude common words" option to filter out stop words
- Explore the word frequency visualization
Analyzes an uploaded PDF document.
Request:
- Method: POST
- Content-Type: multipart/form-data
- Body:
  - file: PDF file (required)
  - exclude_stopwords: boolean (optional)
Response:
{
  "word_count": integer,
  "char_count": integer,
  "char_count_no_spaces": integer,
  "sentence_count": integer,
  "avg_word_length": float,
  "word_frequency": {
    "word1": count1,
    "word2": count2,
    ...
  }
}
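
The request and response described above could be exercised from a script along these lines. This is a hedged sketch using only the standard library: the helper names (`build_multipart`, `parse_response`, `analyze_pdf`) are hypothetical, the endpoint URL is left as a parameter because this section does not state it, and sending the boolean as the string `true`/`false` is an assumption about how the server parses the field.

```python
import json
import os
import urllib.request
import uuid

# Fields documented in the response schema above.
REQUIRED_KEYS = {
    "word_count", "char_count", "char_count_no_spaces",
    "sentence_count", "avg_word_length", "word_frequency",
}


def build_multipart(pdf_bytes: bytes, filename: str, exclude_stopwords: bool):
    """Encode the two documented fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="exclude_stopwords"\r\n\r\n'
        f"{'true' if exclude_stopwords else 'false'}\r\n"
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/pdf\r\n\r\n"
    ).encode("utf-8")
    body = head + pdf_bytes + f"\r\n--{boundary}--\r\n".encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def parse_response(raw: str) -> dict:
    """Decode the JSON response and check that every documented field is present."""
    data = json.loads(raw)
    missing = REQUIRED_KEYS - set(data)
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return data


def analyze_pdf(url: str, pdf_path: str, exclude_stopwords: bool = False) -> dict:
    """POST a PDF to the analysis endpoint (URL supplied by the caller)."""
    with open(pdf_path, "rb") as fh:
        body, content_type = build_multipart(
            fh.read(), os.path.basename(pdf_path), exclude_stopwords
        )
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return parse_response(response.read().decode("utf-8"))
```

Because Django enables CSRF protection by default, a scripted POST like this may be rejected unless the view is exempt or a valid CSRF token is supplied.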
- spaCy vs NLTK
  - Chose spaCy for better performance and easier setup
  - Trade-off: Larger package size but better accuracy
- Frontend Framework
  - Used vanilla JavaScript for simplicity
  - Trade-off: Less structured code but no build process required
- File Storage
  - Local storage in media directory
  - Trade-off: Simple setup but not scalable for production
- Word Frequency Analysis
  - Limited to top 20 words for performance
  - Trade-off: Less comprehensive but better visualization
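
In a Django project, the local file storage choice above usually corresponds to the `MEDIA_ROOT` and `MEDIA_URL` settings. The fragment below is a minimal sketch assuming a default project layout; the values are illustrative and not taken from this project's actual settings.

```python
# settings.py (fragment): assumed values, not taken from the project.
from pathlib import Path

# Project root, following the layout Django's startproject generates.
BASE_DIR = Path(__file__).resolve().parent.parent

# URL prefix under which uploaded files are served during development.
MEDIA_URL = "/media/"
# Filesystem directory where uploaded files are written.
MEDIA_ROOT = BASE_DIR / "media"
```

Moving to cloud storage later would mainly mean swapping the storage backend rather than changing how views handle uploads.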
- Features
  - Support for PDFs with images, handwritten text, scanned documents, etc.
  - Support for more file formats (DOC, DOCX, TXT)
  - Advanced text analysis (readability scores, sentiment analysis)
  - Export analysis results
  - Batch processing of multiple files
- Technical
  - Add user authentication
  - Implement file upload progress
  - Add unit tests
  - Use Redis for caching analysis results
  - Implement cloud storage for files
- UI/UX
  - Add dark mode
  - Implement more interactive visualizations
  - Add comparison feature for multiple documents
  - Add a chatbot for real-time queries
- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
This project is licensed under the MIT License. Copyright 2025 Yukthi R.