Enhanced Document Processor with SQL & Knowledge Graph

🚀 Latest Updates - SQL-Focused Enhancement

Key Improvements:

  1. File Hash-Based Version Control

    • Every file is hashed (SHA-256) BEFORE processing
    • Automatic detection of file changes
    • Version tracking for all document updates
    • Skip processing of unchanged files
  2. Enhanced Progress Tracking

    • Dual progress bars (overall + batch)
    • Real-time statistics (processed, skipped, failed)
    • Elapsed time tracking
    • Memory usage monitoring per file
  3. Contract Ontology Management

    • Hierarchical categorization system
    • Visual ontology tree editor
    • AI-powered auto-categorization
    • Confidence scoring for assignments
  4. Comprehensive SQL Schema

    • 10 core tables for complete data management
    • Full audit trail with processing logs
    • Daily statistics aggregation
    • Optimized indexes for 2600+ documents
  5. Knowledge Graph Enhancements

    • Filtered visualization (documents, companies, ontologies)
    • Color-coded by ontology categories
    • Relationship type visualization
    • Export capabilities

SQL Tables Overview:

  • documents - Core document storage with version tracking
  • file_hashes - Track all file versions
  • document_versions - Complete version history
  • contract_ontology - Hierarchical categorization
  • document_ontology_mapping - Document categorization
  • companies - Company management
  • document_companies - Document-company relationships
  • document_relationships - Inter-document relationships
  • processing_logs - Complete audit trail
  • processing_statistics - Performance metrics

See SQL_SCHEMA_DOCUMENTATION.md for complete schema details.
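
As a rough illustration of how two of these tables might relate, the sketch below creates simplified `documents` and `file_hashes` tables in an in-memory SQLite database. The column names, the index name, and the helper functions are assumptions made for illustration only; the authoritative definitions are in SQL_SCHEMA_DOCUMENTATION.md.

```python
import sqlite3

# Hypothetical, simplified columns: the real schema is documented in
# SQL_SCHEMA_DOCUMENTATION.md and will differ from this sketch.
SCHEMA = """
CREATE TABLE documents (
    id            INTEGER PRIMARY KEY,
    file_path     TEXT NOT NULL UNIQUE,
    current_hash  TEXT NOT NULL,
    version       INTEGER NOT NULL DEFAULT 1,
    status        TEXT NOT NULL DEFAULT 'pending'
);
CREATE TABLE file_hashes (
    id           INTEGER PRIMARY KEY,
    document_id  INTEGER NOT NULL REFERENCES documents(id),
    sha256       TEXT NOT NULL,
    version      INTEGER NOT NULL
);
CREATE INDEX idx_file_hashes_sha256 ON file_hashes(sha256);
"""


def create_schema(conn):
    """Create the two example tables on an open SQLite connection."""
    conn.executescript(SCHEMA)


def list_tables(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]
```

Keeping every hash in its own table, rather than overwriting a single column on `documents`, is one way to retain the version history that the list above describes.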


Overview

A document processing system for large-scale contract analysis, with SQL database integration, memory management, and knowledge graph capabilities.

Features

Core Capabilities

  • Large-Scale Processing: Process 2600+ contracts efficiently with memory management
  • SQL Database Integration: Store and query processed documents in PostgreSQL/MySQL/SQLite
  • Knowledge Graph: Visualize document relationships and company connections
  • Memory Management: Automatic memory optimization and garbage collection
  • Settings Persistence: Save and load configurations via JSON
  • Document Tracking: Skip already processed files, track processing status

Document Processing

  • Support for PDF, DOCX, DOC, XLSX, XLS files
  • AI-powered content analysis using OpenAI or Azure OpenAI
  • Extract contract details, vendor assessments, and technical specifications
  • Automatic relationship detection between documents

Data Governance

  • Company ID and Company Group tracking
  • Contract Work (CW) number assignment
  • Document ontology and categorization
  • Comprehensive audit trail and processing logs

Installation

  1. Install required dependencies:

     pip install -r requirements.txt

  2. Set up environment variables:

     # For OpenAI
     export OPENAI_API_KEY="your-api-key"

     # For Azure OpenAI
     export AZURE_ENDPOINT="your-endpoint"
     export AZURE_DEPLOYMENT="your-deployment"
     export AZURE_API_KEY="your-api-key"

  3. Set up database (PostgreSQL example):

     createdb contract_processor

Usage

  1. Run the application:

     python contract-processor.py

  2. Configure settings via the Settings menu:

    • Database connection (PostgreSQL/MySQL/SQLite)
    • API configuration (OpenAI/Azure)
    • Processing parameters (batch size, memory limits)
    • Company metadata defaults
  3. Process documents:

    • Select directory containing contracts
    • Enter CW number, Company ID, and Company Group
    • Choose file types to process
    • Click "Start Processing"
  4. View results:

    • Excel output in the configured output directory
    • SQL database with full processing history
    • Knowledge graph visualization in the application

Database Schema

Tables

  • documents: Stores document metadata and processing status
  • companies: Company information and groupings
  • document_relationships: Links between related documents
  • processing_logs: Audit trail of all processing activities
  • contract_ontology: Document categorization hierarchy

Key Features

  • File hash-based duplicate detection
  • Processing status tracking (pending/processing/completed/failed)
  • Contract date extraction and storage
  • Company-document relationship mapping
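
Hash-based duplicate and change detection can be sketched as follows. This is a minimal illustration, not the application's actual code: `sha256_of_file`, `classify_file`, and the `known_hashes` dictionary (standing in for a lookup against the `file_hashes` table) are hypothetical names.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large documents are never loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def classify_file(path, known_hashes):
    """Return (state, current_hash), where state is 'new', 'unchanged', or 'changed'.

    known_hashes is a hypothetical dict mapping file path -> last stored hash.
    """
    current = sha256_of_file(path)
    previous = known_hashes.get(str(path))
    if previous is None:
        return "new", current
    if previous == current:
        return "unchanged", current
    return "changed", current
```

A caller would skip files classified as `unchanged`, process `new` files, and record a new version for `changed` files before processing them.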

Memory Management

The system automatically manages memory for large-scale processing:

  • Dynamic batch size adjustment based on available memory
  • Process pool isolation for memory-intensive operations
  • Automatic garbage collection between batches
  • Configurable memory limits
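
One way to combine dynamic batch sizing with garbage collection between batches is sketched below. The function names, the 25% and 50% thresholds, and the `read_available_mb` callback are assumptions for illustration. The callback could wrap `psutil.virtual_memory().available` (converted to MB), but it is passed in so the sketch has no third-party dependency.

```python
import gc


def adjust_batch_size(current, available_mb, limit_mb, minimum=1, maximum=50):
    """Halve the batch when memory is tight, grow it by one when there is headroom.

    The 25% / 50% thresholds are illustrative assumptions.
    """
    if available_mb < 0.25 * limit_mb:
        return max(minimum, current // 2)
    if available_mb > 0.50 * limit_mb:
        return min(maximum, current + 1)
    return current


def process_in_batches(items, handle, batch_size, limit_mb, read_available_mb):
    """Process items batch by batch, collecting garbage between batches.

    read_available_mb is a caller-supplied function returning free memory in MB.
    """
    results = []
    index = 0
    while index < len(items):
        batch = items[index:index + batch_size]
        results.extend(handle(item) for item in batch)
        index += len(batch)
        gc.collect()  # release per-batch objects before sizing the next batch
        batch_size = adjust_batch_size(batch_size, read_available_mb(), limit_mb)
    return results
```

With `memory_limit_mb` set to 4096, for example, this sketch would halve a batch of 10 to 5 once less than 1024 MB is available.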

Knowledge Graph

The integrated knowledge graph provides:

  • Visual representation of document relationships
  • Company-document connections
  • Interactive graph exploration
  • Export to PNG/PDF formats
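
The underlying structure of such a graph can be sketched in plain Python as an adjacency map. This is only an illustration of the data structure, not the application's visualization code: the input tuples are hypothetical stand-ins for rows from the `document_companies` and `document_relationships` tables, and the relation labels (`belongs_to`, `amends`) are invented for the example.

```python
from collections import defaultdict


def build_graph(document_companies, document_relationships):
    """Build an undirected adjacency map of typed nodes.

    Nodes are (kind, name) tuples and each edge carries a relation label.
    document_companies: iterable of (document, company) pairs.
    document_relationships: iterable of (source_doc, target_doc, relation).
    """
    graph = defaultdict(dict)
    for doc, company in document_companies:
        a, b = ("document", doc), ("company", company)
        graph[a][b] = graph[b][a] = "belongs_to"  # hypothetical label
    for source, target, relation in document_relationships:
        a, b = ("document", source), ("document", target)
        graph[a][b] = graph[b][a] = relation
    return graph


def neighbors(graph, node, kind=None):
    """List a node's neighbours, optionally filtered by node kind."""
    return sorted(n for n in graph.get(node, {}) if kind is None or n[0] == kind)
```

Filtering by node kind mirrors the filtered views (documents, companies) that the graph offers.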

Configuration

Settings are persisted in contract_processor_settings.json:

{
  "settings": {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": true,
    "auto_detect_relationships": true
  }
}
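
Loading and saving this file could look like the sketch below. The `load_settings` and `save_settings` helpers are hypothetical, and using the example values above as fallback defaults is an assumption.

```python
import json
from pathlib import Path

# Defaults mirror the example file; treating them as fallbacks is an assumption.
DEFAULT_SETTINGS = {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": True,
    "auto_detect_relationships": True,
}


def load_settings(path="contract_processor_settings.json"):
    """Return the defaults overlaid with whatever the settings file provides."""
    settings = dict(DEFAULT_SETTINGS)
    file = Path(path)
    if file.exists():
        with file.open(encoding="utf-8") as fh:
            settings.update(json.load(fh).get("settings", {}))
    return settings


def save_settings(settings, path="contract_processor_settings.json"):
    """Write settings under the top-level "settings" key used by the file."""
    with Path(path).open("w", encoding="utf-8") as fh:
        json.dump({"settings": settings}, fh, indent=2)
```

Overlaying the file on top of defaults means a settings file that omits a key still yields a complete configuration.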

API Support

  • OpenAI: Direct integration with OpenAI API
  • Azure OpenAI: Support for Azure-hosted OpenAI services
  • Configurable model parameters and endpoints
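
A minimal sketch of choosing a provider from the environment variables listed under Installation follows. The `detect_api_config` function and the returned dictionary shape are hypothetical, and preferring Azure when both providers are configured is an assumption of this sketch.

```python
import os

AZURE_KEYS = ("AZURE_ENDPOINT", "AZURE_DEPLOYMENT", "AZURE_API_KEY")


def detect_api_config(env=None):
    """Pick a provider based on which environment variables are set.

    Azure is preferred when both providers are configured (an assumption).
    """
    env = os.environ if env is None else env
    if all(env.get(key) for key in AZURE_KEYS):
        return {
            "provider": "azure",
            "endpoint": env["AZURE_ENDPOINT"],
            "deployment": env["AZURE_DEPLOYMENT"],
            "api_key": env["AZURE_API_KEY"],
        }
    if env.get("OPENAI_API_KEY"):
        return {"provider": "openai", "api_key": env["OPENAI_API_KEY"]}
    raise RuntimeError(
        "Set OPENAI_API_KEY, or AZURE_ENDPOINT, AZURE_DEPLOYMENT and AZURE_API_KEY"
    )
```

Failing early with a clear message avoids discovering a missing key only after a batch has started.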

Error Handling

  • Comprehensive error logging
  • Graceful handling of processing failures
  • Resume capability for interrupted processing
  • Permission error handling
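
The behaviours above can be sketched as a single loop. This is an illustration under stated assumptions: `process_with_resume` is a hypothetical helper, and the `status` dictionary stands in for the processing status that the application stores in the database.

```python
import logging

logger = logging.getLogger("contract_processor")


def process_with_resume(paths, process_one, status):
    """Process each path once, recording the outcome in `status`.

    `status` is a hypothetical dict of path -> 'processing' | 'completed' | 'failed'.
    Paths already marked 'completed' are skipped, so rerunning after an
    interruption resumes where processing stopped.
    """
    for path in paths:
        if status.get(path) == "completed":
            continue
        status[path] = "processing"
        try:
            process_one(path)
        except PermissionError as exc:
            # Unreadable files are logged and skipped instead of aborting the run.
            logger.warning("No permission to read %s: %s", path, exc)
            status[path] = "failed"
        except Exception:
            logger.exception("Processing failed for %s", path)
            status[path] = "failed"
        else:
            status[path] = "completed"
    return status
```

Because a failure only marks the affected file, one bad document does not stop the rest of the batch, and failed files are retried on the next run.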

Performance

Optimized for processing large document sets:

  • Concurrent processing with configurable workers
  • Streaming file processing for large documents
  • Redis caching support (optional)
  • Neo4j integration for advanced graph operations (optional)
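
Concurrent processing with a configurable worker count can be sketched with the standard library. Note that this example uses a thread pool for simplicity, whereas the application describes process pool isolation; `process_concurrently` and its return shape are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_concurrently(paths, process_one, max_workers=4):
    """Run process_one over paths with a bounded pool of worker threads.

    Returns (results, failures): dicts keyed by path, holding return values
    and raised exceptions respectively.
    """
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(process_one, path): path for path in paths}
        for future in as_completed(futures):
            path = futures[future]
            try:
                results[path] = future.result()
            except Exception as exc:
                failures[path] = exc
    return results, failures
```

The `max_workers` argument corresponds to the `max_workers` entry in the settings file. Swapping in `ProcessPoolExecutor` would give process isolation, provided `process_one` and its arguments can be pickled.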

License

MIT License - See LICENSE file for details

Author

Martin Bacigal, 01/2025 @ https://procureai.tech