Skip to content

Latest commit

 

History

History
257 lines (209 loc) · 7.15 KB

File metadata and controls

257 lines (209 loc) · 7.15 KB

SQL Schema Documentation for Contract Processing System

Overview

This document describes the comprehensive SQL schema designed for the contract processing system that handles 2600+ documents with version tracking, ontology management, and knowledge graph capabilities.

Key Features

  • File Hash-Based Version Control: Every file is hashed before processing to detect changes
  • Processing Status Tracking: Complete audit trail of document processing
  • Contract Ontology: Hierarchical categorization system for contracts
  • Relationship Management: Track relationships between documents and companies
  • Performance Statistics: Daily processing metrics and KPIs

Database Tables

1. documents (Core Table)

The main table storing all document information with version tracking.

Key Fields:

  • id: Primary key
  • document_id: UUID for unique document identification
  • current_file_hash: SHA-256 hash of the current file version
  • processing_status: pending, processing, completed, failed, reprocessing
  • processing_version: Incremented with each file change
  • contract_start_date, contract_end_date: Extracted contract dates
  • analysis_results: JSON storage for AI analysis results

Processing Workflow:

  1. File hash calculated before processing
  2. Check if document exists by hash
  3. If file changed (different hash), create new version
  4. Track processing status throughout lifecycle

2. file_hashes

Tracks all file hashes for version control.

Purpose:

  • Detect when files have been modified
  • Maintain history of all file versions
  • Enable skip-processing for unchanged files

Key Fields:

  • document_id: Links to documents table
  • file_hash: SHA-256 hash
  • is_current: Boolean flag for current version
  • calculated_at: Timestamp of hash calculation

3. document_versions

Complete version history for each document.

Tracks:

  • Version number (auto-incremented)
  • What changed (content_update, metadata_update, reprocessed)
  • Processing results for each version
  • Who made the change and when

4. contract_ontology

Hierarchical categorization system for contracts.

Default Structure:

ROOT (All Contracts)
├── PROCUREMENT
│   ├── PROCUREMENT.GOODS (Goods Procurement)
│   └── PROCUREMENT.SERVICES (Services Procurement)
├── LEGAL
│   ├── LEGAL.NDA (Confidentiality)
│   └── LEGAL.IP (Intellectual Property)
└── OPERATIONAL
    ├── OPERATIONAL.FACILITIES (Facilities)
    └── OPERATIONAL.IT (IT Services)

Features:

  • Hierarchical structure with parent-child relationships
  • Color coding for visualization
  • Keywords and rules for automatic classification
  • Active/inactive status for categories

5. document_ontology_mapping

Maps documents to ontology categories.

Key Features:

  • Confidence scores (0.00 to 1.00)
  • Primary category designation
  • Tracking of assignment method (AI, user, rule-based)

6. companies

Stores company information with hierarchical structure.

Features:

  • Parent-subsidiary relationships
  • Company groups for organization
  • Industry and country metadata

7. document_companies

Links documents to companies with roles.

Roles:

  • vendor
  • client
  • prime_contractor
  • subcontractor
  • witness

8. document_relationships

Tracks relationships between documents.

Relationship Types:

  • amendment
  • renewal
  • supersedes
  • references
  • related

9. processing_logs

Complete audit trail of all processing activities.

Tracks:

  • Every action (created, processing_started, completed, failed)
  • Duration and memory usage
  • Error messages and details
  • User/system that performed action

10. processing_statistics

Daily aggregated statistics for dashboards.

Metrics:

  • Total documents processed
  • Success/failure counts
  • Average processing time
  • Unique companies discovered
  • New relationships found

Processing Workflow

1. Pre-Processing Phase

# For each file:
file_hash = calculate_sha256(file_path)
existing = check_document_exists(file_hash)

if existing and not needs_reprocessing:
    skip_file()
else:
    add_to_processing_queue()

2. Document Creation/Update

if new_document:
    # Create new document record
    document = create_document(file_hash, metadata)
    version = 1
else:
    # File changed - create new version
    version = increment_version()
    update_document_hash(new_hash)

3. Processing Phase

# Update status
update_status('processing')

# Process content
result = analyze_document(content)

# Update with results
update_status('completed', results=result, duration=time)

# Auto-assign ontology
assign_to_ontology(document_id, detected_category)

4. Post-Processing

# Link to companies
link_to_company(document_id, company_id)

# Detect relationships
find_related_documents(content_analysis)

# Update statistics
update_daily_statistics()

Key SQL Queries

Check if Document Needs Processing

SELECT d.*, fh.file_hash
FROM documents d
JOIN file_hashes fh ON d.id = fh.document_id
WHERE fh.file_hash = :hash AND fh.is_current = true;

Get Document with Current Version

SELECT d.*, dv.version_number, dv.analysis_results
FROM documents d
JOIN document_versions dv ON d.id = dv.document_id
WHERE d.id = :doc_id
AND dv.version_number = d.processing_version;

Find Documents by Ontology

SELECT d.*, co.category_name
FROM documents d
JOIN document_ontology_mapping dom ON d.id = dom.document_id
JOIN contract_ontology co ON dom.ontology_id = co.id
WHERE co.category_code = :category_code
AND dom.is_primary = true;

Contract Expiration Report

SELECT *,
    CASE
        WHEN contract_end_date < CURRENT_DATE THEN 'expired'
        WHEN contract_end_date < CURRENT_DATE + INTERVAL '30 days' THEN 'expiring_soon'
        ELSE 'active'
    END as status
FROM documents
WHERE contract_end_date IS NOT NULL
ORDER BY contract_end_date;

Benefits of This Design

  1. Efficient Processing: Skip already-processed files using hash comparison
  2. Version Control: Track all changes to documents over time
  3. Audit Trail: Complete history of who did what and when
  4. Scalability: Optimized indexes for 2600+ documents
  5. Flexibility: JSON fields for storing varied analysis results
  6. Relationships: Track complex document and company relationships
  7. Analytics: Built-in statistics for performance monitoring

Usage in Application

The application uses this schema to:

  1. Prevent Duplicate Processing: Check hash before processing
  2. Track Changes: Automatically version documents when files change
  3. Organize Contracts: Use ontology for categorization
  4. Build Knowledge Graphs: Visualize relationships
  5. Monitor Performance: Track processing times and success rates
  6. Ensure Data Integrity: Maintain audit trails

Best Practices

  1. Always calculate file hash before processing
  2. Use transactions for multi-table updates
  3. Index frequently queried fields
  4. Regular cleanup of old processing logs
  5. Monitor processing statistics for performance issues
  6. Use confidence scores for AI-assigned categories
  7. Maintain referential integrity with foreign keys