Contract Processing System - Complete Documentation

System Overview

The Contract Processing System is an AI-powered document analysis platform designed for large-scale contract processing. Key features include:

  • AI-Powered Analysis: OpenAI/Azure OpenAI integration for intelligent document extraction
  • Database Integration: Multi-database support (PostgreSQL, MySQL, SQLite) with comprehensive schema
  • Memory Management: Intelligent batching and memory monitoring for processing 2600+ documents
  • Knowledge Graph: Visual representation of document relationships and company connections
  • Version Control: File hash-based tracking to avoid reprocessing unchanged documents
  • Ontology Management: Hierarchical categorization system for contract classification
  • GUI Interface: Tkinter-based user interface with progress tracking and real-time monitoring

Architecture

Core Components

contract-processor.py          # Main application entry point
├── EnhancedDocumentProcessorApp  # GUI application
├── ContentAnalyzer              # AI analysis engine
├── DocumentProcessor            # Document processing logic
└── ResultsExporter             # Excel export functionality

database_manager.py            # Database operations and schema
├── DatabaseManager            # Core database operations
├── DatabaseConfig             # Database configuration
└── SQLAlchemy Models          # Data models (Document, Company, etc.)

settings_manager.py            # Application settings management
├── SettingsManager            # Settings persistence
├── AppSettings                # Settings data structure
└── SettingsDialog             # Settings GUI

memory_manager.py              # Memory optimization
├── MemoryMonitor              # System memory monitoring
├── BatchManager               # Intelligent batching
└── ProcessingPool             # Process pool management

extraction_templates.py        # AI extraction templates
├── ExtractionTemplateManager  # Template management
├── ExtractionTemplate         # Template definition
└── FileReadabilityChecker     # File validation

Data Flow

  1. Document Input: User selects directory with contract files
  2. File Validation: System checks file readability and calculates hashes
  3. Duplicate Detection: Compares file hashes to avoid reprocessing
  4. AI Analysis: Extracts contract details using OpenAI/Azure
  5. Database Storage: Stores results with full audit trail
  6. Relationship Detection: Identifies connections between documents
  7. Knowledge Graph: Builds visual representation of relationships
  8. Export: Generates Excel reports and database queries
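
A minimal sketch of the hashing behind steps 2–3 above. The function name and the choice of SHA-256 are illustrative assumptions; the actual implementation lives in the processing modules.

import hashlib
from pathlib import Path

def compute_file_hash(path: Path, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so large contracts never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Skip files whose hash is already recorded (illustrative cache lookup).
known_hashes = {"d2b2...": "contract_001.pdf"}   # e.g. loaded from the file_hashes table
for path in Path("contracts").glob("*.pdf"):
    if compute_file_hash(path) in known_hashes:
        continue   # unchanged file, no reprocessing needed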

Database Schema

Core Tables

documents

  • Primary document storage with version tracking
  • File metadata, processing status, and AI analysis results
  • Contract details (dates, values, deliverables)

companies

  • Company information and hierarchical relationships
  • Industry classification and metadata

contract_ontology

  • Hierarchical categorization system
  • Color coding and visualization support

document_relationships

  • Links between related documents
  • Relationship types and confidence scores

processing_logs

  • Complete audit trail of all processing activities
  • Performance metrics and error tracking

file_hashes

  • Version control through file hash tracking
  • Prevents reprocessing of unchanged files

Key Features

  • UUID Support: Cross-database compatibility
  • JSON Fields: Flexible metadata storage
  • Indexes: Optimized for large datasets
  • Foreign Keys: Referential integrity
  • Triggers: Automatic timestamp updates
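
To illustrate how these features combine, a stripped-down SQLAlchemy model might look like the following. The real models are defined in database_manager.py and sql_schema.sql, so the column names here are assumptions.

import uuid
from sqlalchemy import Column, String, JSON, DateTime, func
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Document(Base):
    __tablename__ = "documents"
    # String UUIDs keep primary keys portable across PostgreSQL, MySQL, and SQLite.
    id = Column(String(36), primary_key=True, default=lambda: str(uuid.uuid4()))
    file_path = Column(String, nullable=False, index=True)
    file_hash = Column(String(64), index=True)            # version-control hash
    analysis_results = Column(JSON)                        # flexible AI output
    updated_at = Column(DateTime, server_default=func.now(), onupdate=func.now())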

Configuration

Environment Variables

# OpenAI Configuration
OPENAI_API_KEY=your-openai-api-key

# Azure OpenAI Configuration
AZURE_ENDPOINT=your-azure-endpoint
AZURE_DEPLOYMENT=your-deployment-name
AZURE_API_KEY=your-azure-api-key
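
A sketch of how these variables might be picked up at startup with the openai Python package; the selection logic and API version shown are assumptions, not the application's actual wiring.

import os
from openai import OpenAI, AzureOpenAI

if os.getenv("AZURE_API_KEY"):
    # Prefer Azure OpenAI when an Azure key is configured.
    client = AzureOpenAI(
        api_key=os.environ["AZURE_API_KEY"],
        azure_endpoint=os.environ["AZURE_ENDPOINT"],
        api_version="2024-02-01",
    )
    model = os.environ["AZURE_DEPLOYMENT"]
else:
    client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
    model = "gpt-4"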

Settings File (contract_processor_settings.json)

{
  "settings": {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": true,
    "auto_detect_relationships": true
  }
}
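
For reference, the file can be read directly as shown below; in the application itself this is handled by SettingsManager (see the API Reference).

import json
from pathlib import Path

settings = json.loads(Path("contract_processor_settings.json").read_text())["settings"]

batch_size = settings.get("batch_size", 10)          # fall back to documented defaults
memory_limit = settings.get("memory_limit_mb", 4096)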

Usage Guide

Installation

  1. Install Dependencies:

    pip install -r requirements.txt
  2. Database Setup:

    # PostgreSQL
    createdb contract_processor
    psql contract_processor < sql_schema.sql
    
    # SQLite (automatic)
    python contract-processor.py
  3. Configuration:

    • Set environment variables for API keys
    • Configure database connection in settings
    • Set output directory and processing parameters

Running the Application

python contract-processor.py

Processing Workflow

  1. Select Directory: Choose folder containing contract files
  2. Configure Metadata: Enter CW number, Company ID, Company Group
  3. Select File Types: Choose which file types to process
  4. Start Processing: System processes files with progress tracking
  5. Review Results: View processed documents and knowledge graph
  6. Export Data: Generate Excel reports or query database

GUI Features

  • Progress Tracking: Dual progress bars (overall + batch)
  • Memory Monitoring: Real-time memory usage display
  • Settings Dialog: Comprehensive configuration options
  • Knowledge Graph: Interactive visualization of relationships
  • Ontology Editor: Hierarchical category management

AI Analysis

Supported File Types

  • PDF: Using PyPDF for text extraction
  • DOCX: Using python-docx library
  • DOC: Using mammoth for conversion
  • XLSX/XLS: Using openpyxl and pyxlsb
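
A condensed sketch of per-format extraction using the libraries above; the actual DocumentProcessor adds readability checks, streaming for large files, and error handling.

from pathlib import Path
from pypdf import PdfReader            # PDF
from docx import Document              # DOCX (python-docx)
import mammoth                         # DOC
from openpyxl import load_workbook     # XLSX

def extract_text(path: Path) -> str:
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(str(path)).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(str(path)).paragraphs)
    if suffix == ".doc":
        with path.open("rb") as fh:
            return mammoth.extract_raw_text(fh).value
    if suffix == ".xlsx":
        wb = load_workbook(path, read_only=True, data_only=True)
        return "\n".join(
            " ".join(str(cell) for cell in row if cell is not None)
            for ws in wb.worksheets
            for row in ws.iter_rows(values_only=True)
        )
    raise ValueError(f"Unsupported file type: {suffix}")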

Analysis Templates

Contract Details Template

  • Start/End dates
  • Contract duration
  • Key deliverables
  • Payment terms
  • Termination clauses

Vendor Assessment Template

  • Vendor capabilities
  • Risk factors
  • Compliance status
  • Performance metrics

Technical Specifications Template

  • Technical requirements
  • Performance metrics
  • Implementation timeline
  • Maintenance requirements
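
As an illustration, the contract-details fields above could be captured in a simple template structure; the real ExtractionTemplate class in extraction_templates.py may look different.

# Illustrative only -- the actual template definition lives in extraction_templates.py.
contract_details_template = {
    "name": "contract_details",
    "fields": [
        "start_date", "end_date", "contract_duration",
        "key_deliverables", "payment_terms", "termination_clauses",
    ],
    "instructions": "Return the requested fields as JSON; use null for anything not stated in the contract.",
}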

AI Configuration

  • Retry Logic: Exponential backoff with tenacity
  • Token Limits: Configurable per template
  • Temperature: 0.7 for balanced creativity/accuracy
  • Model Selection: GPT-4 or Azure OpenAI models
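
A sketch of the retry wrapper around a chat-completion call using tenacity; the model name, attempt count, and token limit are illustrative.

from tenacity import retry, stop_after_attempt, wait_exponential
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

@retry(stop=stop_after_attempt(5), wait=wait_exponential(multiplier=1, min=2, max=30))
def analyze(document_text: str, template_prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0.7,        # balanced creativity/accuracy, as configured above
        max_tokens=2000,        # token limit is configurable per template
        messages=[
            {"role": "system", "content": template_prompt},
            {"role": "user", "content": document_text},
        ],
    )
    return response.choices[0].message.content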

Memory Management

Optimization Strategies

  1. Dynamic Batching: Adjusts batch size based on memory usage
  2. Process Pool Isolation: Separate processes for memory-intensive operations
  3. Garbage Collection: Automatic cleanup between batches
  4. Streaming Processing: Large files processed in chunks
  5. Memory Monitoring: Real-time usage tracking

Configuration

MemoryConfig(
    memory_limit_mb=4096,
    batch_size_adjustment=True,
    min_batch_size=1,
    max_batch_size=50,
    gc_threshold_percent=80.0
)
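
A simplified illustration of how batch size could react to these limits using psutil; BatchManager's actual heuristics may differ.

import gc
import psutil

def next_batch_size(current: int, limit_mb: int = 4096,
                    min_size: int = 1, max_size: int = 50,
                    gc_threshold_percent: float = 80.0) -> int:
    """Shrink the batch when memory is tight, grow it back when there is headroom."""
    used_mb = psutil.Process().memory_info().rss / (1024 * 1024)
    if used_mb > limit_mb * gc_threshold_percent / 100:
        gc.collect()                          # reclaim memory between batches
        return max(min_size, current // 2)
    return min(max_size, current + 1)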

Performance Optimization

Database Optimization

  • Connection Pooling: Configurable pool sizes
  • Indexes: Optimized for common queries
  • Batch Operations: Efficient bulk inserts/updates
  • Async Operations: Non-blocking database operations
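
A sketch of how pooling and batched inserts combine with SQLAlchemy; the connection string and pool sizes are placeholders.

from sqlalchemy import create_engine
from sqlalchemy.orm import Session

# Connection pooling: reuse a bounded set of connections instead of opening one per query.
engine = create_engine(
    "postgresql://user:password@localhost:5432/contract_processor",
    pool_size=10, max_overflow=20, pool_pre_ping=True,
)

# Batch operations: persist a whole batch of ORM objects in one transaction.
new_documents = []   # e.g. Document model instances built for the current batch
with Session(engine) as session:
    session.add_all(new_documents)
    session.commit()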

Processing Optimization

  • Concurrent Processing: Multi-worker support
  • File Hash Caching: Avoids reprocessing
  • Memory-Efficient Parsing: Streaming for large files
  • Progress Tracking: Real-time status updates

Error Handling

Comprehensive Error Management

  • File Readability: Pre-processing validation
  • API Failures: Retry logic with exponential backoff
  • Database Errors: Connection retry and rollback
  • Memory Issues: Automatic garbage collection
  • Permission Errors: Graceful handling of access issues

Logging

  • Structured Logging: JSON format for analysis
  • Error Tracking: Detailed error messages and stack traces
  • Performance Metrics: Processing time and memory usage
  • Audit Trail: Complete processing history
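
One minimal way to produce JSON-formatted log records with the standard library; the project's own formatter may differ.

import json
import logging

class JsonFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)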

Testing

System Testing

Run the test suite to verify all components:

python test_system.py

Test Coverage

  • Database Connectivity: Connection and schema validation
  • Settings Management: Configuration loading/saving
  • Memory Management: Monitoring and optimization
  • Import Validation: All required dependencies

Deployment

Production Considerations

  1. Database: Use PostgreSQL for production workloads
  2. Memory: Ensure sufficient RAM (8GB+ recommended)
  3. Storage: Adequate space for processed documents
  4. API Limits: Monitor OpenAI/Azure rate limits
  5. Backup: Regular database backups

Scaling

  • Horizontal Scaling: Multiple processing instances
  • Database Sharding: Partition by company or date
  • Caching: Redis for frequently accessed data
  • Load Balancing: Distribute processing across nodes

Troubleshooting

Common Issues

  1. Memory Errors: Reduce batch size or increase memory limit
  2. API Timeouts: Increase timeout settings or check network
  3. Database Connection: Verify connection parameters
  4. File Permissions: Ensure read access to document directory
  5. Missing Dependencies: Install all required packages

Debug Mode

Enable debug logging for detailed troubleshooting:

import logging
logging.basicConfig(level=logging.DEBUG)

API Reference

DatabaseManager

# Initialize database
db_manager = DatabaseManager(config)
db_manager.initialize()

# Create/update document
doc, is_new = await db_manager.create_or_update_document(
    file_path=path,
    file_hash=hash,
    cw_number="CW-001"
)

# Update processing status
await db_manager.update_document_processing(
    doc.id, 'completed',
    analysis_results=results
)

SettingsManager

# Load settings
settings_manager = SettingsManager()
settings = settings_manager.settings

# Save settings
settings_manager.save_settings()

# Get configurations
db_config = settings_manager.get_database_config()
endpoint_config = settings_manager.get_endpoint_config()

MemoryMonitor

# Start monitoring
monitor = MemoryMonitor(config)
monitor.start_monitoring()

# Get memory info
info = monitor.get_memory_info()
print(f"Memory usage: {info['process_rss_mb']:.1f}MB")

Contributing

Code Style

  • Type Hints: All functions include type annotations
  • Docstrings: Comprehensive documentation
  • Error Handling: Proper exception management
  • Logging: Structured logging throughout

Development Setup

  1. Virtual Environment: Use isolated Python environment
  2. Pre-commit Hooks: Code formatting and linting
  3. Testing: Run tests before committing
  4. Documentation: Update docs for new features

License

MIT License - See LICENSE file for details

Author

Martin Bacigal, 01/2025 @ https://procureai.tech


This documentation covers the complete Contract Processing System. For specific implementation details, refer to the individual module documentation and source code.