This document describes the architecture of a comprehensive machine learning provenance tracking system with blockchain integration. The system provides immutable, tamper-evident audit trails for ML training processes by storing Merkle tree hashes on multiple blockchain networks.
┌─────────────────────────────────────────────────────────────┐
│                    ML Training Pipeline                     │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │ Data         │  │ Model        │  │ Training         │   │
│  │ Provenance   │  │ Provenance   │  │ Provenance       │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      ProvenanceTracker                      │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐ │
│  │ Merkle Tree     │  │ Blockchain      │  │ Provenance   │ │
│  │ Generation      │  │ Integration     │  │ Data         │ │
│  └─────────────────┘  └─────────────────┘  └──────────────┘ │
└─────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌─────────────────────────────────────────────────────────────┐
│                      BlockchainManager                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────┐   │
│  │ Ethereum     │  │ Bitcoin      │  │ IPFS             │   │
│  │ Interface    │  │ Interface    │  │ Interface        │   │
│  └──────────────┘  └──────────────┘  └──────────────────┘   │
└─────────────────────────────────────────────────────────────┘
ProvenanceTracker is the main orchestrator that coordinates all provenance tracking activities.
Key Responsibilities:
- Initialize and manage blockchain connections
- Coordinate data, model, and training tracking
- Store Merkle root hashes on blockchain networks
- Generate and verify blockchain proofs
- Generate comprehensive reports
Key Methods:
from typing import Any, Dict, Optional

class ProvenanceTracker:
    def __init__(self, base_dir="artifacts", config: Optional[Dict[str, Any]] = None): ...
    def track_data(self, train_data, test_data): ...
    def track_model(self, model): ...
    def store_merkle_on_blockchain_before_training(self, training_config): ...
    def store_merkle_on_blockchain_after_training(self, training_results): ...
    def verify_blockchain_provenance(self): ...
    def save_blockchain_report(self, output_path=None): ...
    def get_blockchain_status(self): ...

BlockchainManager manages multiple blockchain network interfaces and provides unified operations.
Key Responsibilities:
- Initialize blockchain interfaces (Ethereum, Bitcoin, IPFS)
- Store hashes on multiple networks
- Verify hashes across networks
- Handle network-specific errors and retries
Key Methods:
from typing import Any, Dict

class BlockchainManager:
    def __init__(self, config: Dict[str, Any]): ...
    def store_merkle_hash(self, merkle_root_hash, metadata, networks=None): ...
    def verify_merkle_hash(self, merkle_root_hash, transaction_ids): ...
    def get_transaction_info(self, transaction_ids): ...

Ethereum Interface:
- Connects to Ethereum nodes via Web3
- Supports smart contract interactions
- Handles gas estimation and transaction signing
Bitcoin Interface:
- Connects to Bitcoin nodes via RPC
- Uses OP_RETURN for data storage
- Supports testnet and mainnet
IPFS Interface:
- Connects to IPFS daemon via HTTP API
- Stores content-addressed data
- Provides CID-based verification
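The three interfaces above could share one small contract so that the manager can treat every network the same way. The following is a minimal sketch under stated assumptions: `NetworkInterface`, `InMemoryInterface`, and `store_on_networks` are hypothetical names that do not come from the project, and the stand-in backend keeps records in memory instead of contacting a node.

```python
from typing import Any, Dict, List, Optional, Protocol


class NetworkInterface(Protocol):
    """Hypothetical common shape for the Ethereum, Bitcoin, and IPFS interfaces."""

    def store(self, merkle_root_hash: str, metadata: Dict[str, Any]) -> str: ...

    def verify(self, merkle_root_hash: str, transaction_id: str) -> bool: ...


class InMemoryInterface:
    """Stand-in backend for illustration; a real one would talk to a node."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records: Dict[str, str] = {}

    def store(self, merkle_root_hash: str, metadata: Dict[str, Any]) -> str:
        # Fabricate a local identifier in place of a transaction ID or CID.
        tx_id = f"{self.name}-{len(self._records)}"
        self._records[tx_id] = merkle_root_hash
        return tx_id

    def verify(self, merkle_root_hash: str, transaction_id: str) -> bool:
        return self._records.get(transaction_id) == merkle_root_hash


def store_on_networks(
    interfaces: Dict[str, NetworkInterface],
    merkle_root_hash: str,
    metadata: Dict[str, Any],
    networks: Optional[List[str]] = None,
) -> Dict[str, str]:
    """Store the hash on the selected networks and return one ID per network."""
    selected = networks or list(interfaces)
    return {name: interfaces[name].store(merkle_root_hash, metadata) for name in selected}
```

With this shape, adding a network means adding one class, and the manager code that loops over networks stays unchanged.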
Components:
- MLProvenanceMerkleTree: Main Merkle tree implementation
- MerkleNode: Individual tree nodes
- HashFactory: Configurable hash algorithm support
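Configurable hash algorithm support can be as simple as a lookup from an algorithm name to a hash constructor. This is a sketch only, not the project's `HashFactory`: `get_hasher` and `hash_bytes` are hypothetical helpers, the standard-library algorithms listed are an arbitrary selection, and the `blake3` branch assumes the optional third-party `blake3` package is installed.

```python
import hashlib

# Standard-library algorithms this sketch accepts (an arbitrary selection).
_STDLIB_ALGORITHMS = {"sha256", "sha512", "sha3_256", "blake2b"}


def get_hasher(name: str = "sha256"):
    """Return a fresh hash object for the given algorithm name."""
    name = name.lower()
    if name == "blake3":
        # blake3 is not part of hashlib; this import assumes the optional
        # third-party `blake3` package is available.
        from blake3 import blake3
        return blake3()
    if name not in _STDLIB_ALGORITHMS:
        raise ValueError(f"unsupported hash algorithm: {name}")
    return hashlib.new(name)


def hash_bytes(data: bytes, algorithm: str = "sha256") -> str:
    """Hash raw bytes and return the hex digest."""
    hasher = get_hasher(algorithm)
    hasher.update(data)
    return hasher.hexdigest()
```

Whatever algorithm is chosen has to be recorded with the stored root, since a verifier that recomputes the tree with a different algorithm will never match.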
Tree Structure:
Root Hash
├── Data Node
│   ├── Training Data Hash
│   └── Test Data Hash
├── Model Node
│   ├── Architecture Hash
│   └── Weights Hash
└── Training Node
    ├── Epoch 1 Node
    │   ├── Model State Hash
    │   ├── Metrics Hash
    │   └── Privacy Metrics Hash
    ├── Epoch 2 Node
    │   ├── Model State Hash
    │   ├── Metrics Hash
    │   └── Privacy Metrics Hash
    └── ... (subsequent epochs)
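The tree structure above can be reduced to two operations: hashing a leaf and hashing an interior node from its children. The sketch below assumes SHA-256 and a one-byte prefix that separates leaves from interior nodes; both are choices made for this illustration, and `leaf_hash`, `node_hash`, and `build_root` are hypothetical names, not the `MLProvenanceMerkleTree` API.

```python
import hashlib
from typing import Iterable, List, Tuple


def leaf_hash(data: bytes) -> str:
    """Hash a leaf; the 0x00 prefix keeps leaves distinct from interior nodes."""
    return hashlib.sha256(b"\x00" + data).hexdigest()


def node_hash(child_hashes: List[str]) -> str:
    """Hash an interior node from the ordered hex digests of its children."""
    hasher = hashlib.sha256(b"\x01")
    for child in child_hashes:
        hasher.update(bytes.fromhex(child))
    return hasher.hexdigest()


def build_root(
    train: bytes,
    test: bytes,
    architecture: bytes,
    weights: bytes,
    epochs: Iterable[Tuple[bytes, bytes, bytes]],
) -> str:
    """Compute the root for the Data / Model / Training layout shown above.

    Each epoch is a (model_state, metrics, privacy_metrics) tuple of bytes.
    """
    data_node = node_hash([leaf_hash(train), leaf_hash(test)])
    model_node = node_hash([leaf_hash(architecture), leaf_hash(weights)])
    epoch_nodes = [
        node_hash([leaf_hash(state), leaf_hash(metrics), leaf_hash(privacy)])
        for state, metrics, privacy in epochs
    ]
    training_node = node_hash(epoch_nodes)
    return node_hash([data_node, model_node, training_node])
```

Changing any leaf, or adding or removing an epoch, changes the root, which is why storing the root alone is enough to detect tampering anywhere in the tree.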
Pre-Training Phase:
Data + Model → Merkle Tree → Root Hash → Blockchain Storage
- Data Tracking: Generate hashes for training and test data
- Model Tracking: Generate hashes for model architecture
- Merkle Tree Construction: Build initial tree with data and model nodes
- Blockchain Storage: Store root hash on configured networks
- Transaction Recording: Store transaction IDs for verification
Training Phase:
Training Process → Epoch Updates → Merkle Tree Updates
- Epoch Tracking: Track metrics, model state, and privacy budget
- Tree Updates: Add epoch nodes to Merkle tree
- Hash Generation: Generate new root hash after each epoch
- Local Storage: Store updated tree locally
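The per-epoch update described above amounts to appending one epoch hash and recomputing the root. A minimal sketch, assuming SHA-256, length-prefixed inputs, and JSON with sorted keys as the canonical form for metrics; `EpochLog`, `digest`, and `canonical` are hypothetical names used only for this illustration.

```python
import hashlib
import json
from typing import Any, Dict, List


def digest(*parts: bytes) -> str:
    """Hash several byte strings as one unit."""
    hasher = hashlib.sha256()
    for part in parts:
        # Length-prefix each part so different splits of the same bytes differ.
        hasher.update(len(part).to_bytes(8, "big"))
        hasher.update(part)
    return hasher.hexdigest()


def canonical(obj: Dict[str, Any]) -> bytes:
    """Serialize a metrics dict so that key order does not affect the hash."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")


class EpochLog:
    """Keeps fixed data/model hashes plus one hash per completed epoch."""

    def __init__(self, data_hash: str, model_hash: str) -> None:
        self.data_hash = data_hash
        self.model_hash = model_hash
        self.epoch_hashes: List[str] = []

    def add_epoch(
        self,
        model_state: bytes,
        metrics: Dict[str, Any],
        privacy_metrics: Dict[str, Any],
    ) -> str:
        """Record one epoch and return the new root hash."""
        epoch_hash = digest(model_state, canonical(metrics), canonical(privacy_metrics))
        self.epoch_hashes.append(epoch_hash)
        return self.root()

    def root(self) -> str:
        training_hash = digest(*(h.encode("ascii") for h in self.epoch_hashes))
        return digest(
            self.data_hash.encode("ascii"),
            self.model_hash.encode("ascii"),
            training_hash.encode("ascii"),
        )
```

Canonical serialization matters here: two runs that log the same metrics in a different key order should produce the same epoch hash.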
Post-Training Phase:
Final Model + Results → Merkle Tree → Root Hash → Blockchain Storage
- Final State: Capture final model state and metrics
- Tree Completion: Complete Merkle tree with all epochs
- Blockchain Storage: Store final root hash on networks
- Verification: Verify both pre-training and post-training hashes
Verification Phase:
Stored Hashes → Blockchain Verification → Integrity Report
- Hash Retrieval: Retrieve stored hashes from blockchain
- Local Verification: Recompute hashes locally
- Cross-Network Verification: Verify across multiple networks
- Report Generation: Generate comprehensive verification report
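The verification steps above come down to comparing a locally recomputed root with the value retrieved from each network. A sketch under stated assumptions: the retrieval step is left out, `stored_roots` is a hypothetical mapping from network name to the hex root read back from that network, and the report layout is illustrative rather than the project's report format.

```python
import hmac
from typing import Any, Dict


def verify_roots(recomputed_root: str, stored_roots: Dict[str, str]) -> Dict[str, bool]:
    """Compare the local root with the root retrieved from each network."""
    expected = recomputed_root.lower()
    return {
        network: hmac.compare_digest(expected, stored.lower())
        for network, stored in stored_roots.items()
    }


def build_report(recomputed_root: str, stored_roots: Dict[str, str]) -> Dict[str, Any]:
    """Summarize per-network results; an empty result set counts as unverified."""
    per_network = verify_roots(recomputed_root, stored_roots)
    return {
        "merkle_root": recomputed_root,
        "networks": per_network,
        "verified": bool(per_network) and all(per_network.values()),
    }
```

Treating an empty set of networks as unverified avoids reporting success when nothing was actually checked.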
IPFS:
- Type: Decentralized storage network
- Storage Method: Content-addressed storage
- Advantages: No transaction fees, decentralized, high availability while the content is pinned by at least one node
- Use Case: Development, testing, backup storage
Ethereum:
- Type: Smart contract platform
- Storage Method: Smart contract state
- Advantages: Immutable, programmable, global consensus
- Use Case: Production environments, regulatory compliance
Bitcoin:
- Type: Cryptocurrency blockchain
- Storage Method: OP_RETURN transactions
- Advantages: Maximum security, long-term stability
- Use Case: High-security requirements, long-term storage
For production Ethereum deployments, a smart contract can be used:
contract MLProvenance {
    mapping(bytes32 => bool) public storedHashes;
    mapping(bytes32 => uint256) public timestamps;
    mapping(bytes32 => string) public metadata;

    event HashStored(bytes32 indexed merkleRoot, string metadata, uint256 timestamp);

    function storeHash(bytes32 merkleRoot, string memory metadataStr) public {
        storedHashes[merkleRoot] = true;
        timestamps[merkleRoot] = block.timestamp;
        metadata[merkleRoot] = metadataStr;
        emit HashStored(merkleRoot, metadataStr, block.timestamp);
    }

    function verifyHash(bytes32 merkleRoot) public view returns (bool) {
        return storedHashes[merkleRoot];
    }
}

Blockchain Configuration:
{
  "blockchain": {
    "enabled": true,
    "networks": ["ipfs", "ethereum"],
    "ipfs": {
      "enabled": true,
      "url": "http://localhost:5001",
      "timeout": 30,
      "retry_attempts": 3
    },
    "ethereum": {
      "enabled": true,
      "rpc_url": "http://127.0.0.1:8545",
      "private_key": "your_private_key",
      "contract_address": null,
      "gas_limit": 300000,
      "gas_price": "auto"
    },
    "storage_options": {
      "store_before_training": true,
      "store_after_training": true,
      "store_epoch_checkpoints": false
    }
  }
}

Training Configuration:
config = {
    "epochs": 5,
    "batch_size": 64,
    "learning_rate": 0.001,
    "hash_algorithm": "blake3",
    "blockchain": {
        "networks": ["ipfs", "ethereum"],
        "ipfs": {"url": "http://localhost:5001"},
        "ethereum": {
            "rpc_url": "http://127.0.0.1:8545",
            "private_key": "your_private_key"
        }
    }
}

Key Management:
- Store private keys in environment variables
- Use different keys for development and production
- Implement proper access controls
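One way to keep the private key out of configuration files is to resolve it from the environment just before the blockchain connection is created. A minimal sketch, assuming a hypothetical environment variable name `ETH_PRIVATE_KEY` and a hypothetical helper `inject_private_key`; the nested config keys follow the configuration examples in this document.

```python
import copy
import os
from typing import Any, Dict, Mapping, Optional


def inject_private_key(
    config: Dict[str, Any],
    env_var: str = "ETH_PRIVATE_KEY",  # hypothetical variable name
    environ: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Return a copy of config with the Ethereum key read from the environment."""
    environ = os.environ if environ is None else environ
    key = environ.get(env_var)
    if not key:
        # Fail early instead of falling back to a placeholder from the file.
        raise RuntimeError(f"{env_var} is not set")
    resolved = copy.deepcopy(config)
    ethereum = resolved.setdefault("blockchain", {}).setdefault("ethereum", {})
    ethereum["private_key"] = key
    return resolved
```

Returning a copy leaves the loaded configuration free of secrets, so it can still be logged or written into a report.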
Network Security:
- Use HTTPS for RPC endpoints
- Validate blockchain responses
- Implement retry mechanisms with exponential backoff
Data Privacy:
- Only store hashes, not raw data
- Consider metadata sensitivity
- Implement access controls for blockchain data
Network Selection:
- IPFS: Fastest, no fees, good for development
- Ethereum: Medium speed, gas fees, production-ready
- Bitcoin: Slowest, low fees, maximum security
Optimization Strategies:
- Cache verification results
- Batch operations when possible
- Use appropriate gas limits for Ethereum
- Implement connection pooling
Network Failures:
- Automatic retry with exponential backoff
- Fallback to local storage if blockchain unavailable
- Graceful degradation of functionality
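Retry with exponential backoff and a local fallback can be combined in one small helper. This is a sketch under stated assumptions: `store_with_retry` is a hypothetical function, the `store` callable stands for any network-specific store operation that raises `ConnectionError` on failure, and the JSON-lines fallback file is an illustrative format.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


def store_with_retry(
    store: Callable[[str], str],
    merkle_root: str,
    fallback_path: Path,
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Try to store the root; on repeated failure, record it locally instead."""
    for attempt in range(attempts):
        try:
            return store(merkle_root)
        except ConnectionError:
            if attempt < attempts - 1:
                # Delay doubles on each retry: base, 2*base, 4*base, ...
                sleep(base_delay * 2 ** attempt)
    # All attempts failed: keep the hash locally so it can be submitted later.
    with fallback_path.open("a", encoding="utf-8") as handle:
        handle.write(json.dumps({"merkle_root": merkle_root, "pending": True}) + "\n")
    return None
```

A `None` return tells the caller that the hash is only stored locally, so training can continue while the report marks the blockchain record as pending.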
Data Integrity:
- Hash verification before and after storage
- Cross-network verification
- Comprehensive error reporting
Monitoring:
- Blockchain status monitoring
- Transaction success tracking
- Performance metrics collection
src/ml_provenance/
├── provenance/
│   ├── blockchain.py      # Blockchain integration
│   ├── tracker.py         # Main provenance tracker
│   ├── merkle_tree.py     # Merkle tree implementation
│   ├── verifier.py        # Verification system
│   └── hash_config.py     # Hash algorithm configuration
├── training/
│   └── train.py           # Training with blockchain integration
└── utils/
    └── ...                # Utility functions
configs/
├── blockchain_config.json # Blockchain configuration
└── training_config_*.json # Training configurations
scripts/
├── setup_local_geth.sh    # Local Ethereum setup
├── demo_blockchain_provenance.py # Blockchain demo
└── ...                    # Other utility scripts
Planned Features:
- Multi-signature support: Require multiple signatures for critical operations
- Time-locked contracts: Automatic verification at specific intervals
- Cross-chain verification: Verify hashes across different blockchain networks
- Zero-knowledge proofs: Privacy-preserving verification
- Automated compliance: Regulatory compliance reporting
Integration Opportunities:
- CI/CD pipelines: Automated blockchain verification in deployment
- Model registries: Integration with ML model registries
- Audit systems: Integration with external audit systems
- Legal frameworks: Compliance with data governance regulations
This architecture provides a robust, scalable foundation for blockchain-enabled ML provenance tracking with support for multiple networks and comprehensive verification capabilities.