The Contract Processing System is an AI-powered document analysis platform built for large-scale contract processing. Key features include:
- AI-Powered Analysis: OpenAI/Azure OpenAI integration for intelligent document extraction
- Database Integration: Multi-database support (PostgreSQL, MySQL, SQLite) with comprehensive schema
- Memory Management: Optimized for processing 2600+ documents with intelligent memory management
- Knowledge Graph: Visual representation of document relationships and company connections
- Version Control: File hash-based tracking to avoid reprocessing unchanged documents
- Ontology Management: Hierarchical categorization system for contract classification
- GUI Interface: Tkinter-based user interface with progress tracking and real-time monitoring
```
contract-processor.py        # Main application entry point
├── EnhancedDocumentProcessorApp  # GUI application
├── ContentAnalyzer               # AI analysis engine
├── DocumentProcessor             # Document processing logic
└── ResultsExporter               # Excel export functionality

database_manager.py          # Database operations and schema
├── DatabaseManager               # Core database operations
├── DatabaseConfig                # Database configuration
└── SQLAlchemy Models             # Data models (Document, Company, etc.)

settings_manager.py          # Application settings management
├── SettingsManager               # Settings persistence
├── AppSettings                   # Settings data structure
└── SettingsDialog                # Settings GUI

memory_manager.py            # Memory optimization
├── MemoryMonitor                 # System memory monitoring
├── BatchManager                  # Intelligent batching
└── ProcessingPool                # Process pool management

extraction_templates.py      # AI extraction templates
├── ExtractionTemplateManager     # Template management
├── ExtractionTemplate            # Template definition
└── FileReadabilityChecker        # File validation
```
- Document Input: User selects directory with contract files
- File Validation: System checks file readability and calculates hashes
- Duplicate Detection: Compares file hashes to avoid reprocessing (sketched after this list)
- AI Analysis: Extracts contract details using OpenAI/Azure
- Database Storage: Stores results with full audit trail
- Relationship Detection: Identifies connections between documents
- Knowledge Graph: Builds visual representation of relationships
- Export: Generates Excel reports and database queries
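The validation and duplicate-detection steps above hinge on content hashing. Below is a minimal sketch of that idea; the `is_already_processed` helper and the `known_hashes` store are illustrative assumptions, not the application's actual API.

```python
import hashlib
from pathlib import Path

def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large contracts never load fully into memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_already_processed(path: Path, known_hashes: set[str]) -> bool:
    """Hypothetical helper: skip files whose content hash was seen in a prior run."""
    return file_sha256(path) in known_hashes
```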
- Primary document storage with version tracking
- File metadata, processing status, and AI analysis results
- Contract details (dates, values, deliverables)
- Company information and hierarchical relationships
- Industry classification and metadata
- Hierarchical categorization system
- Color coding and visualization support
- Links between related documents
- Relationship types and confidence scores
- Complete audit trail of all processing activities
- Performance metrics and error tracking
- Version control through file hash tracking
- Prevents reprocessing of unchanged files
- UUID Support: Cross-database compatibility
- JSON Fields: Flexible metadata storage
- Indexes: Optimized for large datasets
- Foreign Keys: Referential integrity
- Triggers: Automatic timestamp updates
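As an illustration of how these features might map onto the SQLAlchemy models mentioned in the architecture overview, here is a hedged sketch of a document model; the column names and types are assumptions, not the system's real schema.

```python
import uuid
from datetime import datetime, timezone

from sqlalchemy import DateTime, JSON, String
from sqlalchemy.orm import DeclarativeBase, Mapped, mapped_column

class Base(DeclarativeBase):
    pass

class Document(Base):
    """Illustrative document model: UUID key, JSON metadata, indexed hash."""
    __tablename__ = "documents"

    # String UUIDs keep the key portable across PostgreSQL, MySQL, and SQLite
    id: Mapped[str] = mapped_column(String(36), primary_key=True,
                                    default=lambda: str(uuid.uuid4()))
    file_path: Mapped[str] = mapped_column(String(1024))
    # Indexed so duplicate detection by hash stays fast on large datasets
    file_hash: Mapped[str] = mapped_column(String(64), index=True)
    # JSON column gives flexible metadata storage on all three backends
    analysis_results: Mapped[dict] = mapped_column(JSON, default=dict)
    # ORM-level analog of the database trigger for timestamp updates
    updated_at: Mapped[datetime] = mapped_column(
        DateTime, default=lambda: datetime.now(timezone.utc),
        onupdate=lambda: datetime.now(timezone.utc))
```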
```bash
# OpenAI Configuration
OPENAI_API_KEY=your-openai-api-key

# Azure OpenAI Configuration
AZURE_ENDPOINT=your-azure-endpoint
AZURE_DEPLOYMENT=your-deployment-name
AZURE_API_KEY=your-azure-api-key
```

```json
{
  "settings": {
    "max_workers": 4,
    "batch_size": 10,
    "memory_limit_mb": 4096,
    "db_type": "postgresql",
    "db_host": "localhost",
    "db_port": 5432,
    "skip_processed_files": true,
    "auto_detect_relationships": true
  }
}
```
- Install Dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Database Setup:

  ```bash
  # PostgreSQL
  createdb contract_processor
  psql contract_processor < sql_schema.sql

  # SQLite (automatic)
  python contract-processor.py
  ```

- Configuration:
  - Set environment variables for API keys
  - Configure database connection in settings
  - Set output directory and processing parameters

Run the application:

```bash
python contract-processor.py
```

- Select Directory: Choose folder containing contract files
- Configure Metadata: Enter CW number, Company ID, Company Group
- Select File Types: Choose which file types to process
- Start Processing: System processes files with progress tracking
- Review Results: View processed documents and knowledge graph
- Export Data: Generate Excel reports or query database
- Progress Tracking: Dual progress bars (overall + batch)
- Memory Monitoring: Real-time memory usage display
- Settings Dialog: Comprehensive configuration options
- Knowledge Graph: Interactive visualization of relationships (sketched after this list)
- Ontology Editor: Hierarchical category management
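To make the knowledge-graph idea concrete, here is a minimal sketch of building and drawing a document/company graph, assuming the `networkx` and `matplotlib` libraries; the node and edge data are illustrative, and the actual GUI embeds its visualization in Tkinter rather than a standalone plot window.

```python
import matplotlib.pyplot as plt
import networkx as nx

# Illustrative relationships: (source node, target node, relationship type)
edges = [
    ("CW-001.pdf", "Acme GmbH", "party"),
    ("CW-002.pdf", "Acme GmbH", "party"),
    ("CW-002.pdf", "CW-001.pdf", "amendment"),
]

G = nx.Graph()
for src, dst, rel in edges:
    G.add_edge(src, dst, relationship=rel)

# Force-directed layout with edge labels showing relationship types
pos = nx.spring_layout(G, seed=42)
nx.draw_networkx(G, pos, node_color="lightsteelblue")
nx.draw_networkx_edge_labels(
    G, pos, edge_labels=nx.get_edge_attributes(G, "relationship"))
plt.show()
```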
- PDF: Using PyPDF for text extraction
- DOCX: Using python-docx library
- DOC: Using mammoth for conversion
- XLSX/XLS: Using openpyxl and pyxlsb
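A minimal dispatcher over the libraries listed above could look like the sketch below; the `extract_text` helper is an illustrative assumption, and error handling is omitted for brevity.

```python
from pathlib import Path

import mammoth                       # DOC
from docx import Document            # DOCX (python-docx)
from openpyxl import load_workbook   # XLSX
from pypdf import PdfReader          # PDF

def extract_text(path: Path) -> str:
    """Illustrative text-extraction dispatch keyed on file extension."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        return "\n".join(page.extract_text() or "" for page in PdfReader(path).pages)
    if suffix == ".docx":
        return "\n".join(p.text for p in Document(path).paragraphs)
    if suffix == ".doc":
        with path.open("rb") as f:
            return mammoth.extract_raw_text(f).value
    if suffix == ".xlsx":
        wb = load_workbook(path, read_only=True, data_only=True)
        return "\n".join(
            " ".join(str(cell) for cell in row if cell is not None)
            for ws in wb.worksheets for row in ws.iter_rows(values_only=True)
        )
    raise ValueError(f"Unsupported file type: {suffix}")
```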
- Start/End dates
- Contract duration
- Key deliverables
- Payment terms
- Termination clauses
- Vendor capabilities
- Risk factors
- Compliance status
- Performance metrics
- Technical requirements
- Performance metrics
- Implementation timeline
- Maintenance requirements
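For illustration, an extraction template covering fields like those above might pair a prompt with an expected-fields list; the structure below is an assumption for the sketch, not the actual `ExtractionTemplate` definition.

```python
# Hypothetical template structure; the real ExtractionTemplate class may differ.
CONTRACT_TEMPLATE = {
    "name": "contract_details",
    "fields": [
        "start_date", "end_date", "contract_duration",
        "key_deliverables", "payment_terms", "termination_clauses",
    ],
    "prompt": (
        "Extract the following fields from the contract text and "
        "return them as JSON: {fields}.\n\nContract:\n{text}"
    ),
    "max_tokens": 1024,  # token limits are configurable per template
}

prompt = CONTRACT_TEMPLATE["prompt"].format(
    fields=", ".join(CONTRACT_TEMPLATE["fields"]),
    text="<contract text>",
)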
- Retry Logic: Exponential backoff with tenacity (see the sketch after this list)
- Token Limits: Configurable per template
- Temperature: 0.7 for balanced creativity/accuracy
- Model Selection: GPT-4 or Azure OpenAI models
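A minimal sketch of this retry pattern with `tenacity`, assuming the OpenAI v1 Python client; the model name, prompt, and retry limits are placeholders rather than the application's actual values.

```python
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()  # reads OPENAI_API_KEY from the environment

@retry(wait=wait_exponential(multiplier=1, min=2, max=60),
       stop=stop_after_attempt(5))
def analyze_contract(text: str, max_tokens: int = 1024) -> str:
    """Call the model, retrying with exponential backoff on transient failures."""
    response = client.chat.completions.create(
        model="gpt-4",                 # or an Azure OpenAI deployment
        temperature=0.7,               # balanced creativity/accuracy, per the config above
        max_tokens=max_tokens,         # configurable per template
        messages=[{"role": "user",
                   "content": f"Extract contract details:\n{text}"}],
    )
    return response.choices[0].message.content
```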
- Dynamic Batching: Adjusts batch size based on memory usage (sketched after the performance lists below)
- Process Pool Isolation: Separate processes for memory-intensive operations
- Garbage Collection: Automatic cleanup between batches
- Streaming Processing: Large files processed in chunks
- Memory Monitoring: Real-time usage tracking
```python
MemoryConfig(
    memory_limit_mb=4096,
    batch_size_adjustment=True,
    min_batch_size=1,
    max_batch_size=50,
    gc_threshold_percent=80.0
)
```

- Connection Pooling: Configurable pool sizes
- Indexes: Optimized for common queries
- Batch Operations: Efficient bulk inserts/updates
- Async Operations: Non-blocking database operations
- Concurrent Processing: Multi-worker support
- File Hash Caching: Avoids reprocessing
- Memory-Efficient Parsing: Streaming for large files
- Progress Tracking: Real-time status updates
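A minimal sketch of the dynamic-batching idea, assuming the `psutil` library; the adjustment rule is illustrative, not the actual `BatchManager` logic, though the thresholds mirror the `MemoryConfig` fields above.

```python
import gc

import psutil

def next_batch_size(current: int, min_size: int = 1, max_size: int = 50,
                    gc_threshold_percent: float = 80.0) -> int:
    """Shrink the batch under memory pressure, grow it when headroom returns."""
    used_percent = psutil.virtual_memory().percent
    if used_percent >= gc_threshold_percent:
        gc.collect()                       # reclaim what we can between batches
        return max(min_size, current // 2)
    if used_percent < gc_threshold_percent / 2:
        return min(max_size, current + 1)
    return current
```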
- File Readability: Pre-processing validation
- API Failures: Retry logic with exponential backoff
- Database Errors: Connection retry and rollback
- Memory Issues: Automatic garbage collection
- Permission Errors: Graceful handling of access issues
- Structured Logging: JSON format for analysis (minimal sketch after this list)
- Error Tracking: Detailed error messages and stack traces
- Performance Metrics: Processing time and memory usage
- Audit Trail: Complete processing history
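One stdlib-only way to get JSON-formatted logs, as a hedged sketch; the field names are assumptions, and the real application may use a dedicated logging library instead.

```python
import json
import logging

class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON line for downstream analysis."""
    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logging.getLogger().addHandler(handler)
```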
Run the test suite to verify all components:
```bash
python test_system.py
```

- Database Connectivity: Connection and schema validation
- Settings Management: Configuration loading/saving
- Memory Management: Monitoring and optimization
- Import Validation: All required dependencies
- Database: Use PostgreSQL for production workloads
- Memory: Ensure sufficient RAM (8GB+ recommended)
- Storage: Adequate space for processed documents
- API Limits: Monitor OpenAI/Azure rate limits
- Backup: Regular database backups
- Horizontal Scaling: Multiple processing instances
- Database Sharding: Partition by company or date
- Caching: Redis for frequently accessed data
- Load Balancing: Distribute processing across nodes
- Memory Errors: Reduce batch size or increase memory limit
- API Timeouts: Increase timeout settings or check network
- Database Connection: Verify connection parameters
- File Permissions: Ensure read access to document directory
- Missing Dependencies: Install all required packages
Enable debug logging for detailed troubleshooting:
```python
import logging

logging.basicConfig(level=logging.DEBUG)
```

```python
# Initialize database
db_manager = DatabaseManager(config)
db_manager.initialize()

# Create/update document (awaited inside an async context)
doc, is_new = await db_manager.create_or_update_document(
    file_path=path,
    file_hash=file_hash,
    cw_number="CW-001"
)

# Update processing status
await db_manager.update_document_processing(
    doc.id, 'completed',
    analysis_results=results
)
```

```python
# Load settings
settings_manager = SettingsManager()
settings = settings_manager.settings

# Save settings
settings_manager.save_settings()

# Get configurations
db_config = settings_manager.get_database_config()
endpoint_config = settings_manager.get_endpoint_config()
```

```python
# Start monitoring
monitor = MemoryMonitor(config)
monitor.start_monitoring()

# Get memory info
info = monitor.get_memory_info()
print(f"Memory usage: {info['process_rss_mb']:.1f}MB")
```

- Type Hints: All functions include type annotations
- Docstrings: Comprehensive documentation
- Error Handling: Proper exception management
- Logging: Structured logging throughout
- Virtual Environment: Use isolated Python environment
- Pre-commit Hooks: Code formatting and linting
- Testing: Run tests before committing
- Documentation: Update docs for new features
MIT License - See LICENSE file for details
Martin Bacigal, 01/2025 @ https://procureai.tech
This documentation covers the complete Contract Processing System. For specific implementation details, refer to the individual module documentation and source code.