Effortlessly extract structured data from unstructured PDF documents!
This project is an automated data extraction tool that processes PDF documents and converts their content into structured datasets. It combines OCR, Natural Language Processing (NLP), and Machine Learning, and reports 99.23% precision and 94.61% recall on extracted information.
Whether you’re working with text-based PDFs or image-heavy documents, this tool simplifies data processing for industries like finance, healthcare, and academia.
• 🔍 Advanced OCR: Extract data from scanned or image-based PDFs using Tesseract.
• 🧠 Intelligent NLP: Segment sentences, identify parts of speech, and normalize text with spaCy.
• 📊 Structured Outputs: Convert complex data (e.g., tables, graphs) into usable formats like CSV.
• 🖼 Visualizations: Generate word clouds and clustering visualizations for insights.
• 💻 User-Friendly Interface: Built with Streamlit, enabling easy uploads and data previews.
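The OCR feature above can be pictured with a short sketch. It assumes PyMuPDF (imported as fitz), Pillow, and the pytesseract wrapper are installed. The function names, the 20-character threshold, and the 300 dpi setting are hypothetical choices for illustration, not the project's actual code. The idea is to use a page's embedded text layer when there is one and fall back to Tesseract only for pages that look scanned.

```python
import io


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    """Heuristic (an assumption): a page with almost no text layer is likely a scan."""
    return len(text.strip()) < min_chars


def extract_pdf_text(pdf_path: str, dpi: int = 300) -> list:
    """Return one string per page, running OCR only where the text layer is missing."""
    # Third-party imports are kept local so needs_ocr() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            text = page.get_text()
            if needs_ocr(text):
                # Render the page to a PNG in memory and hand it to Tesseract.
                pix = page.get_pixmap(dpi=dpi)
                image = Image.open(io.BytesIO(pix.tobytes("png")))
                text = pytesseract.image_to_string(image)
            pages.append(text)
    return pages
```

Skipping OCR for pages that already carry text avoids both the rendering cost and the recognition errors OCR can introduce.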
Start by cloning the repository to your local machine:
git clone https://github.com/Harish9215/Automated_Text_Extraction-and-More.git
cd Automated_Text_Extraction-and-More
Ensure you have Python installed (preferably version 3.8 or higher). Install the required libraries by running:
pip install -r requirements.txt
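If the install step appears to succeed but the tool later fails to start, a quick environment check can help. This sketch is not part of the repository. The module names are assumptions based on the libraries this README mentions, and requirements.txt remains the authoritative list. If the project uses the pytesseract wrapper (an assumption), the Tesseract engine itself is a separate system install, which is why the script also looks for a tesseract executable.

```python
import importlib.util
import shutil
import sys

# Assumed import names; the authoritative list is requirements.txt.
ASSUMED_MODULES = ("fitz", "pytesseract", "spacy", "streamlit")


def python_ok(version=sys.version_info, minimum=(3, 8)) -> bool:
    """True if the interpreter is at least the recommended version."""
    return tuple(version[:2]) >= minimum


def missing_modules(names=ASSUMED_MODULES) -> list:
    """Return the top-level modules that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    print("Python 3.8+:", python_ok())
    print("Missing modules:", missing_modules() or "none")
    print("tesseract on PATH:", shutil.which("tesseract") is not None)
```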
Run the Streamlit interface to interact with the tool:
streamlit run src/interface.py
- Open the interface in your browser (Streamlit provides the link).
- Drag and drop your PDF file into the uploader or use the Browse button.
- Click Process to start the extraction.
• Extracted Text: Review the extracted content in the interface.
• Download Options: Save the structured data (e.g., CSV or plain text) for further use.
• Visualizations: Check out the generated word clouds and clustering visualizations for insights.
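To give a feel for what a CSV download might contain, here is a minimal sketch using only the standard library. The column names (page, sentence) and the sample rows are made up for illustration. The tool's real schema is not documented here.

```python
import csv
import io


def sentences_to_csv(sentences) -> str:
    """Serialize (page, sentence) pairs as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["page", "sentence"])
    for page, sentence in sentences:
        writer.writerow([page, sentence])
    return buffer.getvalue()


# Made-up sample rows; note the quotes and comma inside the second sentence.
rows = [(1, "Revenue rose 4% in Q2."), (2, 'The report cites "Table 3", page 7.')]
csv_text = sentences_to_csv(rows)
print(csv_text)
```

Using csv.writer rather than joining strings by hand keeps sentences that contain commas or quotes intact when the file is read back. In a Streamlit app, a string like csv_text could be passed to st.download_button, though whether the project does so is an assumption.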
• Programming Language: Python 🐍
• Tesseract OCR for text extraction.
• spaCy for NLP.
• PyMuPDF for PDF parsing.
• Streamlit for UI development.
• Other Tools: WordCloud for word clouds, TF-IDF for text vectorization, and K-Means for clustering.
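For readers new to TF-IDF, the following standard-library sketch shows the basic idea: a term scores high in a document when it is frequent there and rare elsewhere. It uses the plain tf × log(N/df) form with whitespace tokenization, which is a simplification and not necessarily the variant this project uses. The sample documents are invented.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (plain tf * idf, no smoothing)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Invented sample documents.
docs = ["invoice total amount due", "patient record amount", "invoice number due date"]
vectors = tfidf(docs)
print(sorted(vectors[0].items(), key=lambda item: -item[1]))
```

In the first document, "total" outranks "amount" because "amount" also appears in another document. Library implementations typically add smoothing and normalization, so their numbers will differ. Vectors like these are what a K-Means step would then group into clusters.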
• Precision: 99.23% ✔️
• Recall: 94.61% 🔥
• F1 Score: 96.85% ⚡
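As a reading aid for the metrics above: F1 is conventionally the harmonic mean of precision and recall. The sketch below applies that standard formula to the reported precision and recall. It is not the project's evaluation code.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Reported precision and recall, expressed as fractions.
f1 = f1_score(0.9923, 0.9461)
print(f"F1 = {f1:.2%}")
```

The formula gives roughly 96.86%, within 0.02 percentage points of the reported 96.85%. This README does not say how the reported figure was computed (for example, it may be averaged over documents), so a small difference is not surprising.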
• ✨ Improve OCR accuracy for handwritten documents.
• 🌐 Support multi-language PDF content.
• ⚡ Add real-time processing capabilities.