Welcome to the Qwen2-VL Document Classification Pipeline project! This repository contains a streamlined pipeline for classifying document types with the Qwen2-VL-2B-GPTQ-INT4 model.
The pipeline uses this vision-language model to classify documents into predefined categories. It accepts image and PDF inputs and produces classification reports and confusion matrices.
Key Features:
- Multi-page PDF handling with vertical merging of pages for seamless processing.
- Optimized prompt engineering for domain-specific accuracy.
- Automatic evaluation with detailed reports and visualizations.
- Simple execution in Google Colab, with no additional setup required!
Table of Contents:
- Requirements
- Setup
- Usage in Google Colab
- Folder Structure
- Pipeline Steps
- Results and Reporting
- Future Enhancements
- Contributing
Requirements:
Hardware Requirements:
- Minimum CPU RAM: 5 GB
- Minimum GPU RAM: 8 GB
- Additional storage space for input files and results.
Setup:
No local installation is required. Open the notebook in Google Colab, upload your files, and run the cells sequentially.
Usage in Google Colab:
- Open the Notebook: Open the project notebook in Google Colab.
- Upload Your Files: Place your documents in the appropriate folders and mount your Google Drive.
- Run the Notebook: Execute the cells sequentially to:
  - Load the model and processor.
  - Convert PDFs to images.
  - Classify documents and generate reports.
- Download Results: Results (Excel file, confusion matrix) are saved in the outputs directory for easy access.
Folder Structure:
Ensure your file structure follows this format for proper pipeline execution:
root_directory/
├── azure_files/
│   ├── bill_of_lading/
│   ├── customs_document/
│   ├── delivery_receipt/
│   ├── invoice/
│   └── ... (other categories)
└── outputs/
    ├── classification_results.xlsx
    └── confusion_matrix.png
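To illustrate how this layout can drive the pipeline, here is a minimal sketch that walks `azure_files/` and pairs each document with the name of its category folder. It assumes the expected label is taken from the folder name; the helper names and the set of accepted file extensions are illustrative and not taken from the project notebook.

```python
from pathlib import Path

# Assumed set of accepted extensions; the README only states "image and PDF".
SUPPORTED_EXTENSIONS = {".pdf", ".png", ".jpg", ".jpeg"}


def collect_labeled_files(root_dir):
    """Return sorted (file_path, expected_label) pairs from root_dir/azure_files.

    The expected label is assumed to be the name of the category folder.
    """
    input_dir = Path(root_dir) / "azure_files"
    pairs = []
    for category_dir in sorted(p for p in input_dir.iterdir() if p.is_dir()):
        for file_path in sorted(category_dir.iterdir()):
            if file_path.is_file() and file_path.suffix.lower() in SUPPORTED_EXTENSIONS:
                pairs.append((file_path, category_dir.name))
    return pairs


def ensure_outputs_dir(root_dir):
    """Create root_dir/outputs if it is missing and return its path."""
    outputs_dir = Path(root_dir) / "outputs"
    outputs_dir.mkdir(parents=True, exist_ok=True)
    return outputs_dir
```

In Colab, `root_dir` would typically point at a folder inside the mounted Google Drive.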
Pipeline Steps:
- Load Model and Processor: The pipeline utilizes the Qwen2-VL-2B-GPTQ-INT4 model for document classification.
- PDF Conversion and Image Merging: Multi-page PDFs are converted to vertically stacked images to ensure seamless input to the model.
- Prompt Engineering: Employs domain-specific patterns and keywords for improved classification accuracy.
- Evaluation:
  - Generates detailed confusion matrices and classification reports.
  - Produces color-coded Excel files for results.
- Visualization: Heatmaps and graphical representations provide insights into model performance.
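The page-merging and prompt steps above can be sketched roughly as follows. This is a dependency-free illustration, not the notebook's actual code: `stack_layout` only computes where each page would be pasted on a tall canvas, and the keyword hints in `CATEGORY_HINTS` are hypothetical examples, since the project's real patterns are not listed in this README.

```python
def stack_layout(page_sizes):
    """Compute canvas size and paste offsets for stacking pages top to bottom.

    page_sizes is a list of (width, height) tuples, one per rendered PDF page.
    Returns ((canvas_width, canvas_height), [(x, y), ...]).
    """
    if not page_sizes:
        raise ValueError("at least one page is required")
    canvas_width = max(width for width, _ in page_sizes)
    offsets = []
    y = 0
    for _, height in page_sizes:
        offsets.append((0, y))  # each page starts where the previous one ended
        y += height
    return (canvas_width, y), offsets


# Hypothetical keyword hints, for illustration only.
CATEGORY_HINTS = {
    "bill_of_lading": ["shipper", "consignee", "port of loading"],
    "customs_document": ["customs", "tariff", "declaration"],
    "delivery_receipt": ["received by", "delivery date"],
    "invoice": ["invoice number", "amount due"],
}


def build_prompt(category_hints):
    """Build a classification prompt listing each category with its keywords."""
    lines = ["Classify the document image into exactly one of these categories:"]
    for name, keywords in category_hints.items():
        lines.append(f"- {name}: typical keywords include {', '.join(keywords)}")
    lines.append("Answer with the category name only.")
    return "\n".join(lines)


def parse_label(response, categories, fallback="unknown"):
    """Map free-form model output to a known category name, or to fallback."""
    normalized = response.strip().lower().replace(" ", "_")
    # Check longer names first so a short name cannot shadow a longer one.
    for name in sorted(categories, key=len, reverse=True):
        if name in normalized:
            return name
    return fallback
```

With an imaging library such as Pillow, the returned canvas size and offsets could be passed to `Image.new` and `Image.paste` to build the merged image. Constraining the answer to a category name and normalizing it in `parse_label` keeps the predictions comparable with the folder-derived labels.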
Results and Reporting:
Outputs Include:
- Classification Accuracy: Per-category and overall performance.
- Confusion Matrix: Heatmap of expected vs predicted classifications.
- Excel Reports: Color-coded Excel files summarizing results.
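As a rough illustration of how these figures relate, the sketch below derives confusion-matrix counts and per-category and overall accuracy from lists of expected and predicted labels. It uses only the standard library; the notebook may well rely on a dedicated library for this, and the function names here are hypothetical.

```python
from collections import Counter


def confusion_counts(expected, predicted):
    """Count (expected_label, predicted_label) pairs."""
    if len(expected) != len(predicted):
        raise ValueError("expected and predicted must have the same length")
    return Counter(zip(expected, predicted))


def accuracy_report(expected, predicted):
    """Return (per_category_accuracy, overall_accuracy)."""
    if not expected:
        raise ValueError("no samples to evaluate")
    counts = confusion_counts(expected, predicted)
    totals = Counter(expected)  # number of documents per expected category
    per_category = {
        label: counts[(label, label)] / totals[label] for label in sorted(totals)
    }
    correct = sum(counts[(label, label)] for label in totals)
    return per_category, correct / len(expected)
```

Each diagonal entry `counts[(label, label)]` is a correct prediction; off-diagonal entries show which categories are confused with each other, which is what the heatmap visualizes.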
Future Enhancements:
- Integration with OCR for enhanced text extraction.
- Support for multilingual document classification.
- Fine-tuning with additional labeled datasets.
- Adoption of advanced models like LayoutLMv3 for complex layouts.
Contributing:
We welcome contributions from the community! To contribute:
- Fork the repository.
- Create a feature branch.
- Submit a pull request with a detailed explanation of your changes.
Start Classifying with Ease! 🚀