Harish9215/Precision_Data_Extraction_using_AI

🚀 Automated Data Extraction from PDFs 📄

Effortlessly extract structured data from unstructured PDF documents!

🌟 Overview

This project is an automated data extraction tool that processes PDF documents and converts their content into structured datasets. It combines OCR, Natural Language Processing (NLP), and Machine Learning to extract information from both text-based and scanned PDFs.

Whether you’re working with text-based PDFs or image-heavy documents, this tool simplifies data processing for industries like finance, healthcare, and academia.

✨ Features

• 🔍 Advanced OCR: Extract data from scanned or image-based PDFs using Tesseract.
• 🧠 Intelligent NLP: Segment sentences, identify parts of speech, and normalize text with spaCy.
• 📊 Structured Outputs: Convert complex data (e.g., tables, graphs) into usable formats like CSV.
• 🖼 Visualizations: Generate word clouds and clustering analysis for insights.
• 💻 User-Friendly Interface: Built with Streamlit, enabling easy uploads and data previews.

🚀 How to Use

Step 1: Clone the Repository

Start by cloning the repository to your local machine:

git clone https://github.com/Harish9215/Automated_Text_Extraction-and-More.git
cd Automated_Text_Extraction-and-More

Step 2: Set Up Your Environment

Ensure you have Python installed (preferably version 3.8 or higher). Install the required libraries by running:

pip install -r requirements.txt

Step 3: Launch the Application

Run the Streamlit interface to interact with the tool:

streamlit run src/interface.py

Step 4: Upload and Process Your PDF 🪄

  1. Open the interface in your browser (Streamlit provides the link).
  2. Drag and drop your PDF file into the uploader or use the Browse button.
  3. Click on Process to start the extraction.

Step 5: View and Download Results

• Extracted Text: Review the extracted content in the interface.
• Download Options: Save the structured data (e.g., CSV or plain text) for further use.
• Visualizations: Check out the generated word clouds and clustering visualizations for insights.
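For readers who want to post-process results outside the interface, the following sketch shows how extracted records could be serialized to CSV with Python's standard library. The record layout (`page` and `text` columns) and the helper name `rows_to_csv` are assumptions made for illustration; the tool's actual export format may differ.

```python
import csv
import io
from typing import Dict, List


def rows_to_csv(rows: List[Dict[str, str]], fieldnames: List[str]) -> str:
    """Serialize a list of records to CSV text, header row included."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample records, standing in for extracted page text.
rows = [
    {"page": "1", "text": "Invoice total: 120.50"},
    {"page": "2", "text": 'Notes, with a comma and a "quote"'},
]
csv_text = rows_to_csv(rows, ["page", "text"])
print(csv_text)
```

Using `csv.DictWriter` rather than joining strings by hand means commas and quotes inside the extracted text are escaped correctly, so the file reads back unchanged with `csv.DictReader`.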

🛠️ Tech Stack

• Programming Language: Python 🐍
• OCR: Tesseract
• NLP: spaCy
• PDF Parsing: PyMuPDF
• UI: Streamlit
• Other Tools: WordCloud, TF-IDF for text vectorization, K-Means for clustering
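To make the TF-IDF step concrete, the sketch below computes the plain textbook weighting (term frequency times the log of inverse document frequency) using only the standard library. It is a hand-rolled illustration, not the project's code: the README does not say which implementation is used, and library versions such as scikit-learn's `TfidfVectorizer` apply smoothing and normalization that this version omits. The sample documents are made up.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Weight each term by term frequency times log inverse document frequency."""
    n_docs = len(docs)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Made-up tokenized documents for illustration.
docs = [
    "invoice total amount due".split(),
    "patient record total".split(),
    "lecture notes total".split(),
]
weights = tf_idf(docs)
print(weights[0])
```

A term that appears in every document (here `total`) gets weight zero, while terms unique to one document score highest. Vectors built this way are the kind of input a K-Means step would then group into clusters.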

📈 Performance Metrics

• Precision: 99.23% ✔️
• Recall: 94.61% 🔥
• F1 Score: 96.85% ⚡
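For reference, F1 is the harmonic mean of precision and recall. The short check below recomputes it from the two rounded figures above; the function name `f1_score` is just a label for this example. The result comes out at roughly 96.86%, which agrees with the reported value to within rounding of the inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in this README.
f1 = f1_score(0.9923, 0.9461)
print(f"{f1:.2%}")
```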

📜 Future Enhancements

• ✨ Improve OCR accuracy for handwritten documents.
• 🌐 Support multi-language PDF content.
• ⚡ Add real-time processing capabilities.
