This document describes the SQL schema for the contract processing system, which handles 2600+ documents with version tracking, ontology management, and knowledge graph capabilities.
- File Hash-Based Version Control: Every file is hashed before processing to detect changes
- Processing Status Tracking: Complete audit trail of document processing
- Contract Ontology: Hierarchical categorization system for contracts
- Relationship Management: Track relationships between documents and companies
- Performance Statistics: Daily processing metrics and KPIs
The documents table is the main table. It stores all document information with version tracking.
Key Fields:
- id: Primary key
- document_id: UUID for unique document identification
- current_file_hash: SHA-256 hash of the current file version
- processing_status: pending, processing, completed, failed, reprocessing
- processing_version: Incremented with each file change
- contract_start_date, contract_end_date: Extracted contract dates
- analysis_results: JSON storage for AI analysis results
Processing Workflow:
- File hash calculated before processing
- Check if document exists by hash
- If file changed (different hash), create new version
- Track processing status throughout lifecycle
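The first workflow step, hashing a file before any processing, can be sketched as follows. This is a minimal illustration using Python's standard hashlib; the chunk size is an arbitrary assumption, and the function body is not taken from the actual application.

```python
import hashlib


def calculate_sha256(file_path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks.

    Chunked reads keep memory use flat even for large contract files.
    The chunk size is an illustrative assumption.
    """
    digest = hashlib.sha256()
    with open(file_path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the digest depends only on file contents, renaming or moving a file does not change its hash, while any content edit does.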
The file_hashes table tracks all file hashes for version control.
Purpose:
- Detect when files have been modified
- Maintain history of all file versions
- Enable skip-processing for unchanged files
Key Fields:
- document_id: Links to documents table
- file_hash: SHA-256 hash
- is_current: Boolean flag for current version
- calculated_at: Timestamp of hash calculation
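A minimal sketch of how these fields could support skip-processing, using an in-memory SQLite database. The column types, the `connect` and `record_hash` helpers, and the use of SQLite itself are assumptions for illustration; only the four column names come from the schema.

```python
import sqlite3


def connect():
    """Create an in-memory database with an assumed file_hashes layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE file_hashes (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            file_hash TEXT NOT NULL,
            is_current INTEGER NOT NULL DEFAULT 1,
            calculated_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )""")
    return conn


def record_hash(conn, document_id, file_hash):
    """Return False if the file is unchanged, else store the new hash and return True."""
    row = conn.execute(
        "SELECT file_hash FROM file_hashes "
        "WHERE document_id = ? AND is_current = 1",
        (document_id,),
    ).fetchone()
    if row is not None and row[0] == file_hash:
        return False  # unchanged: the caller can skip processing
    with conn:  # one transaction: demote the old hash, then add the new one
        conn.execute(
            "UPDATE file_hashes SET is_current = 0 WHERE document_id = ?",
            (document_id,),
        )
        conn.execute(
            "INSERT INTO file_hashes (document_id, file_hash) VALUES (?, ?)",
            (document_id, file_hash),
        )
    return True
```

Old rows are kept with is_current set to 0, so the table doubles as a history of every hash a document has had.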
The document_versions table holds the complete version history for each document.
Tracks:
- Version number (auto-incremented)
- What changed (content_update, metadata_update, reprocessed)
- Processing results for each version
- Who made the change and when
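The sketch below shows one way a version row could be appended. The document_id, version_number, and analysis_results columns appear in this schema; change_type, changed_by, and changed_at are hypothetical names chosen for the example, as is the `add_version` helper.

```python
import json
import sqlite3


def create_versions_table(conn):
    """Create an assumed document_versions layout (SQLite syntax)."""
    conn.execute("""
        CREATE TABLE document_versions (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            version_number INTEGER NOT NULL,
            change_type TEXT NOT NULL CHECK (change_type IN
                ('content_update', 'metadata_update', 'reprocessed')),
            analysis_results TEXT,
            changed_by TEXT NOT NULL,
            changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
            UNIQUE (document_id, version_number)
        )""")


def add_version(conn, document_id, change_type, results, changed_by):
    """Append the next version for a document and return its number."""
    with conn:  # read the latest number and insert in one transaction
        (latest,) = conn.execute(
            "SELECT COALESCE(MAX(version_number), 0) "
            "FROM document_versions WHERE document_id = ?",
            (document_id,),
        ).fetchone()
        conn.execute(
            "INSERT INTO document_versions "
            "(document_id, version_number, change_type, analysis_results, changed_by) "
            "VALUES (?, ?, ?, ?, ?)",
            (document_id, latest + 1, change_type, json.dumps(results), changed_by),
        )
    return latest + 1
```

The UNIQUE constraint guards against two writers assigning the same version number to one document.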
The contract_ontology table defines a hierarchical categorization system for contracts.
Default Structure:
ROOT (All Contracts)
├── PROCUREMENT
│   ├── PROCUREMENT.GOODS (Goods Procurement)
│   └── PROCUREMENT.SERVICES (Services Procurement)
├── LEGAL
│   ├── LEGAL.NDA (Confidentiality)
│   └── LEGAL.IP (Intellectual Property)
└── OPERATIONAL
    ├── OPERATIONAL.FACILITIES (Facilities)
    └── OPERATIONAL.IT (IT Services)
Features:
- Hierarchical structure with parent-child relationships
- Color coding for visualization
- Keywords and rules for automatic classification
- Active/inactive status for categories
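A parent-child hierarchy like this is commonly stored as a self-referencing table and walked with a recursive query. The sketch below assumes a parent_id column and an is_active flag (neither name is confirmed by this document) and loads only part of the default tree; the display names for top-level nodes are placeholders.

```python
import sqlite3


def build_ontology(conn):
    """Create an assumed contract_ontology layout and load part of the default tree."""
    conn.execute("""
        CREATE TABLE contract_ontology (
            id INTEGER PRIMARY KEY,
            category_code TEXT NOT NULL UNIQUE,
            category_name TEXT NOT NULL,
            parent_id INTEGER REFERENCES contract_ontology(id),
            is_active INTEGER NOT NULL DEFAULT 1
        )""")
    rows = [
        (1, "ROOT", "All Contracts", None),
        (2, "PROCUREMENT", "Procurement", 1),  # placeholder display name
        (3, "PROCUREMENT.GOODS", "Goods Procurement", 2),
        (4, "PROCUREMENT.SERVICES", "Services Procurement", 2),
        (5, "LEGAL", "Legal", 1),  # placeholder display name
        (6, "LEGAL.NDA", "Confidentiality", 5),
    ]
    conn.executemany(
        "INSERT INTO contract_ontology (id, category_code, category_name, parent_id) "
        "VALUES (?, ?, ?, ?)",
        rows,
    )


def descendants(conn, category_code):
    """Return the category and everything beneath it, sorted by code."""
    query = """
        WITH RECURSIVE subtree(id, category_code) AS (
            SELECT id, category_code FROM contract_ontology
            WHERE category_code = ?
            UNION ALL
            SELECT c.id, c.category_code
            FROM contract_ontology c
            JOIN subtree s ON c.parent_id = s.id
        )
        SELECT category_code FROM subtree ORDER BY category_code"""
    return [code for (code,) in conn.execute(query, (category_code,))]
```

Querying a parent category this way lets a search for PROCUREMENT also match documents filed under its subcategories.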
The document_ontology_mapping table maps documents to ontology categories.
Key Features:
- Confidence scores (0.00 to 1.00)
- Primary category designation
- Tracking of assignment method (AI, user, rule-based)
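These three features can be enforced at the database level. The sketch below is one possible layout in SQLite: document_id, ontology_id, and is_primary are names used elsewhere in this document, while confidence_score, assigned_by, the index name, and the `assign_category` helper are assumptions.

```python
import sqlite3


def create_mapping_table(conn):
    """Create an assumed document_ontology_mapping layout with constraints."""
    conn.executescript("""
        CREATE TABLE document_ontology_mapping (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            ontology_id INTEGER NOT NULL,
            confidence_score REAL NOT NULL
                CHECK (confidence_score BETWEEN 0.0 AND 1.0),
            is_primary INTEGER NOT NULL DEFAULT 0,
            assigned_by TEXT NOT NULL
                CHECK (assigned_by IN ('ai', 'user', 'rule'))
        );
        -- Partial unique index: at most one primary category per document.
        CREATE UNIQUE INDEX one_primary_per_document
            ON document_ontology_mapping (document_id)
            WHERE is_primary = 1;
    """)


def assign_category(conn, document_id, ontology_id, confidence,
                    assigned_by, primary=False):
    """Insert one mapping row; constraint violations raise sqlite3.IntegrityError."""
    with conn:
        conn.execute(
            "INSERT INTO document_ontology_mapping "
            "(document_id, ontology_id, confidence_score, is_primary, assigned_by) "
            "VALUES (?, ?, ?, ?, ?)",
            (document_id, ontology_id, confidence, int(primary), assigned_by),
        )
```

With these constraints, an out-of-range score, an unknown assignment method, or a second primary category for the same document is rejected before it reaches the table.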
Stores company information with hierarchical structure.
Features:
- Parent-subsidiary relationships
- Company groups for organization
- Industry and country metadata
Links documents to companies with roles.
Roles:
- vendor
- client
- prime_contractor
- subcontractor
- witness
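One way to keep role values consistent is to validate them in application code before writing. In the sketch below, the table name document_companies and its columns are hypothetical, and the `link_to_company` signature (with an explicit role argument) is an assumption made for the example.

```python
import sqlite3
from enum import Enum


class CompanyRole(str, Enum):
    """The roles listed above, as a closed set."""
    VENDOR = "vendor"
    CLIENT = "client"
    PRIME_CONTRACTOR = "prime_contractor"
    SUBCONTRACTOR = "subcontractor"
    WITNESS = "witness"


def create_link_table(conn):
    """Create a hypothetical link table between documents and companies."""
    conn.execute("""
        CREATE TABLE document_companies (
            document_id INTEGER NOT NULL,
            company_id INTEGER NOT NULL,
            role TEXT NOT NULL,
            PRIMARY KEY (document_id, company_id, role)
        )""")


def link_to_company(conn, document_id, company_id, role):
    """Link a document to a company; an unknown role raises ValueError."""
    role = CompanyRole(role)  # validates the string against the enum
    with conn:
        conn.execute(
            "INSERT INTO document_companies (document_id, company_id, role) "
            "VALUES (?, ?, ?)",
            (document_id, company_id, role.value),
        )
    return role
```

The composite primary key allows one company to hold several roles on the same document while preventing exact duplicates.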
Tracks relationships between documents.
Relationship Types:
- amendment
- renewal
- supersedes
- references
- related
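Typed relationships make it possible to answer questions such as "which document currently replaces this one?". The sketch below assumes each relationship is a (source_id, target_id, type) tuple in which the source supersedes the target; that direction convention and the function name are assumptions, not part of the schema described here.

```python
def latest_superseding(doc_id, relationships):
    """Follow 'supersedes' links from doc_id to the newest replacement.

    relationships: iterable of (source_id, target_id, relationship_type),
    where a 'supersedes' row means source_id replaces target_id (assumed).
    Other relationship types are ignored.
    """
    superseded_by = {
        target: source
        for source, target, kind in relationships
        if kind == "supersedes"
    }
    seen = {doc_id}
    current = doc_id
    while current in superseded_by:
        current = superseded_by[current]
        if current in seen:
            # Guard against bad data such as A supersedes B supersedes A.
            raise ValueError(f"supersedes cycle detected at document {current}")
        seen.add(current)
    return current
```

A document with no superseding relationship is returned unchanged, so the function is safe to call on any document id.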
Complete audit trail of all processing activities.
Tracks:
- Every action (created, processing_started, completed, failed)
- Duration and memory usage
- Error messages and details
- User/system that performed action
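A context manager is one convenient way to guarantee that every processing run leaves both a start entry and a completed or failed entry. This sketch appends dictionaries to a plain list instead of writing to the database, and it omits memory usage; the field names and the `logged_processing` helper are assumptions.

```python
import time
from contextlib import contextmanager


@contextmanager
def logged_processing(log, document_id, actor="system"):
    """Record processing_started, then completed or failed with a duration."""
    log.append({"document_id": document_id, "action": "processing_started",
                "actor": actor})
    started = time.perf_counter()
    try:
        yield
    except Exception as exc:
        log.append({"document_id": document_id, "action": "failed",
                    "actor": actor,
                    "duration_seconds": time.perf_counter() - started,
                    "error": str(exc)})
        raise  # the caller still sees the original error
    else:
        log.append({"document_id": document_id, "action": "completed",
                    "actor": actor,
                    "duration_seconds": time.perf_counter() - started,
                    "error": None})
```

Usage is a plain `with logged_processing(log, document_id):` block around the analysis call, so failures are logged without extra error handling at each call site.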
Daily aggregated statistics for dashboards.
Metrics:
- Total documents processed
- Success/failure counts
- Average processing time
- Unique companies discovered
- New relationships found
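The first three metrics can be derived from log rows with a single grouped query. The sketch below assumes a processing_log table with action, duration_seconds, and created_at columns; those names are hypothetical, and the company and relationship counts are left out.

```python
import sqlite3

DAILY_STATS_SQL = """
    SELECT date(created_at) AS day,
           COUNT(*) AS total_processed,
           SUM(CASE WHEN action = 'completed' THEN 1 ELSE 0 END) AS succeeded,
           SUM(CASE WHEN action = 'failed' THEN 1 ELSE 0 END) AS failed,
           AVG(duration_seconds) AS avg_duration_seconds
    FROM processing_log
    WHERE action IN ('completed', 'failed')
    GROUP BY date(created_at)
    ORDER BY day
"""


def create_log_table(conn):
    """Create a hypothetical processing_log layout."""
    conn.execute("""
        CREATE TABLE processing_log (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL,
            action TEXT NOT NULL,
            duration_seconds REAL,
            created_at TEXT NOT NULL
        )""")


def daily_statistics(conn):
    """Return one (day, total, succeeded, failed, avg_duration) row per day."""
    return conn.execute(DAILY_STATS_SQL).fetchall()
```

Only terminal actions are counted, so a document that is still processing does not inflate the daily total.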
# For each file:
file_hash = calculate_sha256(file_path)
existing = check_document_exists(file_hash)
if existing and not needs_reprocessing:
    skip_file()
else:
    add_to_processing_queue()

if new_document:
    # Create new document record
    document = create_document(file_hash, metadata)
    version = 1
else:
    # File changed - create new version
    version = increment_version()
    update_document_hash(new_hash)

# Update status
update_status('processing')
# Process content
result = analyze_document(content)
# Update with results
update_status('completed', results=result, duration=time)

# Auto-assign ontology
assign_to_ontology(document_id, detected_category)

# Link to companies
link_to_company(document_id, company_id)
# Detect relationships
find_related_documents(content_analysis)

# Update statistics
update_daily_statistics()

SELECT d.*, fh.file_hash
FROM documents d
JOIN file_hashes fh ON d.id = fh.document_id
WHERE fh.file_hash = :hash AND fh.is_current = true;

SELECT d.*, dv.version_number, dv.analysis_results
FROM documents d
JOIN document_versions dv ON d.id = dv.document_id
WHERE d.id = :doc_id
  AND dv.version_number = d.processing_version;

SELECT d.*, co.category_name
FROM documents d
JOIN document_ontology_mapping dom ON d.id = dom.document_id
JOIN contract_ontology co ON dom.ontology_id = co.id
WHERE co.category_code = :category_code
  AND dom.is_primary = true;

SELECT *,
  CASE
    WHEN contract_end_date < CURRENT_DATE THEN 'expired'
    WHEN contract_end_date < CURRENT_DATE + INTERVAL '30 days' THEN 'expiring_soon'
    ELSE 'active'
  END AS status
FROM documents
WHERE contract_end_date IS NOT NULL
ORDER BY contract_end_date;

- Efficient Processing: Skip already-processed files using hash comparison
- Version Control: Track all changes to documents over time
- Audit Trail: Complete history of who did what and when
- Scalability: Optimized indexes for 2600+ documents
- Flexibility: JSON fields for storing varied analysis results
- Relationships: Track complex document and company relationships
- Analytics: Built-in statistics for performance monitoring
The application uses this schema to:
- Prevent Duplicate Processing: Check hash before processing
- Track Changes: Automatically version documents when files change
- Organize Contracts: Use ontology for categorization
- Build Knowledge Graphs: Visualize relationships
- Monitor Performance: Track processing times and success rates
- Ensure Data Integrity: Maintain audit trails
- Always calculate file hash before processing
- Use transactions for multi-table updates
- Index frequently queried fields
- Regularly clean up old processing logs
- Monitor processing statistics for performance issues
- Use confidence scores for AI-assigned categories
- Maintain referential integrity with foreign keys
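Several of these practices can be combined in a few lines. The sketch below uses SQLite through Python's sqlite3 module, with simplified documents and file_hashes tables, to show a multi-table update wrapped in one transaction, foreign keys enabled, and an index on the hash lookup. The column types, index name, and `apply_new_hash` helper are assumptions for illustration.

```python
import sqlite3


def open_db():
    """Open an in-memory database with foreign keys on and a lookup index."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves this off by default
    conn.executescript("""
        CREATE TABLE documents (
            id INTEGER PRIMARY KEY,
            current_file_hash TEXT NOT NULL,
            processing_version INTEGER NOT NULL DEFAULT 1
        );
        CREATE TABLE file_hashes (
            id INTEGER PRIMARY KEY,
            document_id INTEGER NOT NULL REFERENCES documents(id),
            file_hash TEXT NOT NULL,
            is_current INTEGER NOT NULL DEFAULT 1
        );
        CREATE INDEX idx_file_hashes_lookup
            ON file_hashes (file_hash, is_current);
    """)
    return conn


def apply_new_hash(conn, document_id, new_hash):
    """Update both tables atomically; any failure rolls back every statement."""
    with conn:  # commits on success, rolls back if a statement raises
        conn.execute(
            "UPDATE file_hashes SET is_current = 0 WHERE document_id = ?",
            (document_id,),
        )
        conn.execute(
            "INSERT INTO file_hashes (document_id, file_hash) VALUES (?, ?)",
            (document_id, new_hash),
        )
        conn.execute(
            "UPDATE documents SET current_file_hash = ?, "
            "processing_version = processing_version + 1 WHERE id = ?",
            (new_hash, document_id),
        )
```

If the insert fails, for example because the hash is missing or the document id does not exist, the earlier update that cleared is_current is undone as well, so the two tables never disagree about the current version.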