phanhongan/ml-systems-evaluation

🏭 ML Systems Evaluation Framework

An evaluation framework for Industrial AI systems, applying Site Reliability Engineering (SRE) principles to machine learning evaluation.

🎯 The Problem

Industrial AI systems face unique challenges that traditional ML evaluation approaches don't address:

🛡️ Safety-Critical Requirements

  • 🚨 Zero Tolerance Failures: Aircraft landing systems, medical diagnostics, autonomous vehicles where errors can be catastrophic
  • 📋 Regulatory Standards: Aviation (DO-178C), healthcare, financial services require continuous compliance validation
  • 🌊 Environmental Constraints: Underwater devices, extreme temperatures, harsh conditions that affect system reliability

💰 Business-Critical Operations

  • 💸 Immediate Financial Impact: Manufacturing quality control, fraud detection where failures cost millions instantly
  • ⚡ Real-time Decision Making: Supply chain optimization, trading systems where delays cause cascading failures
  • 👥 Public Safety: Systems where failures affect public safety, requiring continuous monitoring and rapid response

❌ Traditional ML Evaluation Gaps

  • 🚫 No Safety Validation: Standard ML evaluation doesn't assess catastrophic failure scenarios
  • 📜 Missing Regulatory Compliance: No built-in validation against industry-specific standards
  • 🌡️ Inadequate Environmental Monitoring: Doesn't account for harsh operating conditions
  • 📊 No Business Impact Metrics: Technical metrics don't connect to business outcomes

🎯 Why This Framework Matters

🔧 SRE Principles for Industrial AI

This framework treats Industrial AI systems as critical infrastructure, applying SRE concepts:

  • 🛡️ Safety-First Error Budgets: Acceptable failure rates with zero tolerance for catastrophic failures
  • 📋 Regulatory SLOs: Service Level Objectives that include compliance requirements
  • 🌊 Environmental Observability: Monitoring that accounts for harsh operating conditions
  • 💰 Business-Critical Reliability: Focus on preventing immediate financial and safety impacts
  • 🚨 Rapid Incident Response: Structured approach to handling safety-critical and business-critical failures
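To make the "safety-first error budget" idea concrete, here is a minimal sketch in Python. It is not the framework's implementation: the function name, the `BudgetStatus` fields, and the rule that a single catastrophic failure blocks releases are assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BudgetStatus:
    remaining_fraction: float  # share of the error budget still unspent (negative when overspent)
    release_allowed: bool      # False once the budget is spent or any catastrophic failure occurred


def error_budget_status(slo_target: float, total: int, failed: int, catastrophic: int) -> BudgetStatus:
    """Hypothetical helper: ordinary failures spend a budget, catastrophic ones are never tolerated."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    # The SLO allows (1 - target) of all requests to fail within the window.
    allowed_failures = (1.0 - slo_target) * total
    remaining = 1.0 - failed / allowed_failures
    # Zero tolerance: one catastrophic failure blocks releases regardless of remaining budget.
    release_allowed = catastrophic == 0 and remaining > 0.0
    return BudgetStatus(remaining_fraction=remaining, release_allowed=release_allowed)


status = error_budget_status(slo_target=0.99, total=10_000, failed=25, catastrophic=0)
print(f"{status.remaining_fraction:.2f}", status.release_allowed)
```

With a 99% target over 10,000 requests, 100 failures are allowed, so 25 failures leave roughly three quarters of the budget. Setting `catastrophic=1` flips `release_allowed` to `False` even though the budget is mostly intact.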

🔄 Continuous Evaluation Lifecycle

Evaluation isn't just a final checkpoint; it's a continuous feedback mechanism that informs every stage of the ML lifecycle:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Development   │    │   Deployment    │    │   Production    │
│                 │    │                 │    │                 │
│ • Model Design  │───▶│ • A/B Testing   │───▶│ • Real-time     │
│ • Data Pipeline │    │ • Canary Deploy │    │ • Monitoring    │
│ • Architecture  │    │ • Rollback Plan │    │ • Alerting      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                       ▲                       │
         │                       │                       ▼
         └───────────────────────┼───────────────────────┘
                                 │
                    ┌─────────────────┐
                    │   Evaluation    │
                    │                 │
                    │ • Performance   │
                    │ • Reliability   │
                    │ • Safety        │
                    │ • Compliance    │
                    └─────────────────┘

🔍 How Evaluation Informs the ML Lifecycle

🔬 Development Phase Insights

  • 🏗️ Model Architecture: Performance bottlenecks reveal design flaws
  • 📊 Data Quality: Drift detection identifies training data issues
  • 🔧 Feature Engineering: Model degradation points to feature relevance changes
  • ⚙️ Hyperparameter Tuning: Real-world performance guides optimization

🚀 Deployment Phase Validation

  • 🧪 A/B Testing: Structured comparison of model versions
  • 🐦 Canary Deployments: Gradual rollout with continuous evaluation
  • 🔄 Rollback Triggers: Automatic reversion based on SLO violations
  • 📈 Infrastructure Scaling: Performance metrics guide resource allocation
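A rollback trigger of the kind listed above can be sketched as a comparison of canary metrics against SLO thresholds. The `Slo` class, the metric names, and the "missing metric counts as a violation" rule are hypothetical and only illustrate the idea. They are not the framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str               # metric name (illustrative)
    threshold: float        # boundary value for the metric
    higher_is_better: bool  # True for accuracy-like metrics, False for latency-like ones

    def is_met(self, value: float) -> bool:
        return value >= self.threshold if self.higher_is_better else value <= self.threshold


def violated_slos(slos: list[Slo], canary_metrics: dict[str, float]) -> list[str]:
    """Return the names of SLOs the canary violates; a missing metric is treated as a violation."""
    violations = []
    for slo in slos:
        value = canary_metrics.get(slo.name)
        if value is None or not slo.is_met(value):
            violations.append(slo.name)
    return violations


def should_roll_back(slos: list[Slo], canary_metrics: dict[str, float]) -> bool:
    """Roll back as soon as any SLO is violated."""
    return bool(violated_slos(slos, canary_metrics))


slos = [Slo("accuracy", 0.95, True), Slo("p99_latency_ms", 500.0, False)]
print(should_roll_back(slos, {"accuracy": 0.97, "p99_latency_ms": 640.0}))
```

Treating an absent metric as a violation is a deliberately conservative choice for safety-critical systems: a canary that stops reporting should not be promoted.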

🏭 Production Phase Monitoring

  • 🚨 Real-time Alerts: Immediate notification of SLO violations
  • 📈 Trend Analysis: Long-term performance degradation detection
  • 🚨 Incident Response: Structured approach to ML system failures
  • 📊 Capacity Planning: Resource needs based on usage patterns

🔄 Feedback Loop Benefits

  • 🔄 Model Retraining: Triggered by drift detection and performance degradation
  • 📊 Data Pipeline Updates: Informed by data quality issues
  • 🏗️ Architecture Evolution: Driven by scalability and reliability needs
  • 📈 Process Improvement: Continuous refinement of ML operations
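As an illustration of drift-triggered retraining, the sketch below computes a Population Stability Index (PSI) between a reference sample and a current sample of one feature. PSI is a common drift statistic, but the source does not say which statistic the framework uses. The function names, the ten-bin layout, and the 0.2 threshold (a widely used rule of thumb) are assumptions.

```python
import math
from collections.abc import Sequence


def population_stability_index(reference: Sequence[float], current: Sequence[float], bins: int = 10) -> float:
    """PSI over equal-width bins spanning the reference sample's range."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(reference), max(reference)
    if hi == lo:
        raise ValueError("reference sample has no spread")
    width = (hi - lo) / bins

    def proportions(sample: Sequence[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values into the edge bins
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / len(sample), 1e-6) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def needs_retraining(reference: Sequence[float], current: Sequence[float], threshold: float = 0.2) -> bool:
    """Assumed policy: flag retraining when PSI exceeds the threshold."""
    return population_stability_index(reference, current) > threshold


reference = [i / 100 for i in range(100)]
shifted = [0.5 + i / 200 for i in range(100)]
print(needs_retraining(reference, reference), needs_retraining(reference, shifted))
```

Identical samples yield a PSI of zero, while the shifted sample concentrates in the upper bins and exceeds the threshold.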

💼 Business Impact

🛡️ Risk Mitigation

  • 🚨 Prevent Catastrophic Failures: Early detection of safety-critical system issues
  • 📋 Regulatory Compliance: Continuous validation for regulated industries
  • 🛡️ Brand Protection: Avoid public incidents that damage reputation
  • 💰 Financial Loss Prevention: Catch issues before they impact revenue

🏆 Operational Excellence

  • 🔧 Proactive Maintenance: Fix issues before they become incidents
  • ⚡ Resource Optimization: Right-size infrastructure based on actual usage
  • 👥 Team Efficiency: Automated monitoring reduces manual oversight
  • 📊 Data-Driven Decisions: Metrics guide strategic ML investments

🚀 Competitive Advantage

  • ⚡ Faster Iteration: Rapid feedback enables quick model improvements
  • 🏆 Higher Quality: Continuous evaluation maintains performance standards
  • 🤝 Customer Trust: Reliable ML systems build user confidence
  • 💡 Innovation Velocity: Safe experimentation with new ML approaches

✨ Key Features

  • 🛡️ Safety-Critical Evaluation: Zero tolerance for catastrophic failures with specialized safety metrics
  • 📋 Regulatory Compliance: Built-in validation against industry standards (DO-178C for aviation)
  • 🌊 Environmental Monitoring: Specialized collectors for harsh operating conditions
  • 💰 Business-Critical Reliability: SRE principles applied to systems with immediate financial impact
  • 🤖 LLM-Powered Intelligence: Pattern recognition, natural language configuration, and report enhancement
  • 🤖 Autonomous Agents: Future-ready architecture for proactive monitoring, alerting, and scheduling
  • 🔌 Extensible Architecture: Plugin-based collectors and evaluators for domain-specific requirements
  • ⚡ Real-time & Batch: Online and offline evaluation for continuous monitoring
  • 📋 Standards Enforcement: Configurable quality gates with regulatory compliance checks

🎯 Supported Industries

The framework supports multiple industrial sectors with ready-to-use configurations and industry-specific requirements. Each industry has its own directory with detailed examples and documentation:

🐟 Aquaculture

  • Species Classification: Sonar-based fish species identification and environmental hazard detection
  • Key Features: Environmental monitoring, regulatory compliance, resource optimization
  • Examples: examples/industries/aquaculture/

✈️ Aviation

  • Safety-Critical Systems: Aircraft landing and flight control assistance
  • Key Features: DO-178C compliance, sub-500ms response times, environmental adaptation
  • Examples: examples/industries/aviation/

🔒 Cybersecurity

  • Agentic Security Operations: Multi-agent AI workflows for alert triage, investigation, and response
  • Key Features: Cost-optimized LLM integration, RAG-powered threat intelligence, explainable AI decisions, multi-TB data processing
  • Examples: examples/industries/cybersecurity/

⚡ Energy

  • Energy Optimization Recommendations: ML-driven recommendations for facility energy consumption and cost reduction
  • Key Features: Real-time energy monitoring, cost reduction tracking, multi-facility support
  • Examples: examples/industries/energy/

🏭 Manufacturing

  • Predictive Maintenance: Equipment failure prediction with VAE anomaly detection
  • Demand Forecasting: Supply chain optimization and production planning
  • Key Features: ISO compliance, cost optimization, real-time monitoring
  • Examples: examples/industries/manufacturing/

🚒 Maritime

  • Collision Avoidance: Vessel collision detection and navigation safety
  • Key Features: COLREGs compliance, real-time alerts, multi-vessel tracking
  • Examples: examples/industries/maritime/

🔬 Semiconductor

  • Digital Twins: Manufacturing process monitoring and yield prediction
  • Key Features: Real-time process control, quality metrics, equipment monitoring
  • Examples: examples/industries/semiconductor/

📋 Additional Examples

See the examples/industries/ directory for complete configuration files covering all scenarios. Each industry directory contains detailed README files with specific use cases, requirements, and implementation guidance.

For an overview of all examples, templates, and tutorials, see the examples/ directory.

🏗️ Architecture

The framework follows a hybrid architecture that combines deterministic components with LLM-powered intelligence:

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Data Sources  │    │   Collectors    │    │   Evaluators    │
│                 │    │                 │    │                 │
│ • Logs          │───▶│ • Online        │───▶│ • Reliability   │
│ • Metrics       │    │ • Offline       │    │ • Performance   │
│ • Telemetry     │    │ • Custom        │    │ • Safety        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                        │
                                                        ▼
                                              ┌─────────────────┐
                                              │    Reports      │
                                              │                 │
                                              │ • SLI/SLO       │
                                              │ • Incidents     │
                                              │ • Trends        │
                                              └─────────────────┘
                                                        │
                                                        ▼
                                              ┌─────────────────┐
                                              │   LLM Layer     │
                                              │                 │
                                              │ • Analysis      │
                                              │ • Assistant     │
                                              │ • Enhancement   │
                                              └─────────────────┘
                                                        │
                                                        ▼
                                              ┌─────────────────┐
                                              │     Agents      │
                                              │                 │
                                              │ • RL Agent      │
                                              │ • Monitoring*   │
                                              │ • Alerting*     │
                                              └─────────────────┘

📋 Architecture Details: For detailed technical architecture information, component interactions, and implementation specifics, see Architecture Overview.

🚀 Quick Start

📦 Installation

# Clone the repository
git clone <repository-url>
cd ml-systems-evaluation

# Install in development mode
uv sync --group dev

# (Optional) Activate the virtual environment created by uv (uv has no `shell` subcommand)
source .venv/bin/activate

# For production installs (main dependencies only)
# uv sync --group main

Note: This project uses UV for dependency management and packaging. See pyproject.toml for the full, up-to-date list of dependencies. For detailed installation instructions, see docs/user-guides/installation.md.

🎯 Getting Started (For Industrial ML Engineers)

# 1. Create a new configuration file
ml-eval create-config --output my-system.yaml --system-name "My ML System" --industry manufacturing

# 2. Validate your configuration
ml-eval validate my-system.yaml

# 3. Run health check on your system
ml-eval health my-system.yaml

# 4. Collect data from your system
ml-eval collect my-system.yaml --output collected-data.json

# 5. Evaluate your system metrics
ml-eval evaluate my-system.yaml --data collected-data.json --output results.json

# 6. Generate reports
ml-eval report my-system.yaml --results results.json --output reports.json

# 7. Run complete evaluation pipeline
ml-eval run my-system.yaml --output complete-results.json
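Steps 2 through 6 above can also be chained from Python, which is convenient in CI. The sketch below only assembles the `ml-eval` invocations shown in this section and runs them with `subprocess`. The helper names are hypothetical, and it assumes `ml-eval` is on `PATH` and exits with a non-zero status when a step fails.

```python
import subprocess
from pathlib import Path


def pipeline_commands(config: str, workdir: Path) -> list[list[str]]:
    """Build the validate, health, collect, evaluate, and report commands for one config file."""
    data = str(workdir / "collected-data.json")
    results = str(workdir / "results.json")
    reports = str(workdir / "reports.json")
    return [
        ["ml-eval", "validate", config],
        ["ml-eval", "health", config],
        ["ml-eval", "collect", config, "--output", data],
        ["ml-eval", "evaluate", config, "--data", data, "--output", results],
        ["ml-eval", "report", config, "--results", results, "--output", reports],
    ]


def run_pipeline(config: str, workdir: Path = Path(".")) -> None:
    """Run each step in order; check=True raises CalledProcessError at the first failing step."""
    for cmd in pipeline_commands(config, workdir):
        subprocess.run(cmd, check=True)


# Example (requires ml-eval to be installed):
# run_pipeline("my-system.yaml")
```

Passing each command as a list avoids shell quoting issues with paths or system names that contain spaces.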

⚡ Quick Commands

# Show all available commands
ml-eval --help

# Create new configurations for different industries
ml-eval create-config --output aviation-system.yaml --system-name "Aircraft Landing System" --industry aviation --criticality safety_critical
ml-eval create-config --output security-system.yaml --system-name "Security Operations" --industry cybersecurity --criticality business_critical

# Validate configurations (use existing example files)
ml-eval validate examples/industries/aviation/aircraft-landing.yaml
ml-eval validate examples/industries/maritime/collision-avoidance.yaml
ml-eval validate examples/industries/manufacturing/predictive-maintenance.yaml
ml-eval validate examples/industries/semiconductor/etching-digital-twins.yaml
ml-eval validate examples/industries/cybersecurity/security-operations.yaml

# Run health checks
ml-eval health examples/industries/aviation/aircraft-landing.yaml

# List configured components
ml-eval list-collectors examples/industries/manufacturing/predictive-maintenance.yaml
ml-eval list-evaluators examples/industries/cybersecurity/security-operations.yaml
ml-eval list-reports examples/industries/aviation/aircraft-landing.yaml

# Run evaluations
ml-eval run examples/industries/aviation/aircraft-landing.yaml --output aviation-results.json
ml-eval run examples/industries/cybersecurity/security-operations.yaml --output security-results.json

🔧 Core Components

📊 Collectors

  • ⚡ OnlineCollector: Real-time metrics from running systems
  • 📁 OfflineCollector: Historical data from logs and databases
  • 🌊 EnvironmentalCollector: Specialized monitoring for harsh conditions (temperature, pressure, etc.)
  • 📋 RegulatoryCollector: Compliance metrics for industry standards
  • 🔌 CustomCollector: Extensible interface for domain-specific metrics
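The shape of a plugin-style custom collector might look like the sketch below. The `BaseCollector` class, its `collect()` method, and the config keys are hypothetical stand-ins. The framework's actual collector interface is documented in the Architecture Overview and may differ.

```python
from abc import ABC, abstractmethod
from typing import Any


class BaseCollector(ABC):
    """Hypothetical base class; not the framework's real interface."""

    def __init__(self, config: dict[str, Any]) -> None:
        self.config = config

    @abstractmethod
    def collect(self) -> dict[str, float]:
        """Return a flat mapping of metric name to value."""


class TankTemperatureCollector(BaseCollector):
    """Illustrative domain-specific collector that summarizes temperature readings."""

    def collect(self) -> dict[str, float]:
        # A real collector would query sensors; readings come from config here to stay self-contained.
        readings = self.config["readings_celsius"]
        return {
            "temperature_mean_c": sum(readings) / len(readings),
            "temperature_max_c": max(readings),
        }


collector = TankTemperatureCollector({"readings_celsius": [10.0, 12.0, 14.0]})
print(collector.collect())
```

Keeping the return type a flat name-to-value mapping lets evaluators consume metrics from any collector without knowing its domain.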

🔍 Evaluators

  • 🛡️ ReliabilityEvaluator: SLI/SLO compliance and error budgets with safety thresholds
  • 🚨 SafetyEvaluator: Critical system safety validation with zero-tolerance checks
  • 📋 RegulatoryEvaluator: Compliance validation against industry standards
  • 🌊 EnvironmentalEvaluator: Performance assessment under harsh conditions
  • 📈 DriftEvaluator: Data and model drift detection with business impact assessment

🤖 LLM Integration Layer

  • 🤖 LLMAnalysisEngine: Pattern recognition and drift detection
  • 🤖 LLMAssistantEngine: Natural language configuration and troubleshooting assistance
  • 🤖 LLMEnhancementEngine: Report enhancement and business impact translation

🤖 Autonomous Agents

  • 🤖 RLAgent: Adaptive decision-making with LLM integration and safety constraints
  • 🤖 MonitoringAgent 🚧: Autonomous real-time monitoring and health checks (planned)
  • 🤖 AlertingAgent 🚧: Alert prioritization and routing (planned)

📋 Agent Details: For comprehensive agent implementation details, RL loop architecture, and usage examples, see Architecture Overview.

📊 Reports

  • 🛡️ ReliabilityReport: Error budgets, SLO compliance, incident analysis
  • 🚨 SafetyReport: Safety-critical metrics and compliance status
  • 📋 RegulatoryReport: Compliance validation and audit trails
  • 💰 BusinessImpactReport: Technical metrics connected to business outcomes

🔧 SRE Integration

📋 Service Level Objectives (SLOs)

For SLO configuration guidance, see the SLO Configuration Guide. The framework supports:

  • 🛡️ Safety-Critical SLOs: Zero-tolerance thresholds for catastrophic failures
  • 💰 Business-Critical SLOs: Performance targets with immediate financial impact
  • 🌊 Environmental SLOs: Adaptation to harsh operating conditions
  • 📋 Regulatory SLOs: Compliance with industry standards (DO-178C, COLREGs, etc.)

🚨 Error Budget Policies

  • 🛡️ Safety-First Alerts: Immediate notification for safety-critical budget violations
  • 📋 Regulatory Compliance: Automatic audit trail for compliance violations
  • 💰 Business Impact Assessment: Connect budget exhaustion to financial impact
  • 🌊 Environmental Adaptation: Adjust thresholds based on operating conditions
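One way to express a safety-first alerting policy is through the error budget burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below is illustrative only. The function names, the alert labels, and the paging thresholds (1x for safety-critical systems, 10x otherwise) are assumptions, not values taken from the framework.

```python
def burn_rate(slo_target: float, total: int, failed: int) -> float:
    """Observed error rate divided by the error rate the SLO allows; 1.0 means spending exactly on budget."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total <= 0:
        raise ValueError("total must be positive")
    return (failed / total) / (1.0 - slo_target)


def alert_level(rate: float, safety_critical: bool) -> str:
    """Assumed policy: safety-critical systems page as soon as the budget burns faster than allowed."""
    page_at = 1.0 if safety_critical else 10.0  # illustrative thresholds
    if rate >= page_at:
        return "page"
    if rate >= 1.0:
        return "ticket"
    return "ok"


rate = burn_rate(slo_target=0.999, total=100_000, failed=200)
print(alert_level(rate, safety_critical=True), alert_level(rate, safety_critical=False))
```

With a 99.9% target, 200 failures in 100,000 requests burn the budget at about twice the allowed rate. Under the assumed thresholds this pages for a safety-critical system and opens a ticket for a business-critical one.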

🔧 Additional Features

🛡️ Safety-Critical Development

The framework enables Industrial AI development with built-in safety and compliance:

  • Safety-First SLOs: Zero-tolerance thresholds for catastrophic failures
  • Real-time Validation: Continuous safety validation during development
  • Regulatory Compliance: Built-in validation against industry standards (DO-178C, etc.)
  • Emergency Protocols: Automatic system shutdown for safety violations

🤖 LLM-Powered Intelligence

Enhanced analysis and decision support capabilities:

  • Pattern Recognition: Drift detection and anomaly identification
  • Natural Language Configuration: Generate configurations from plain English requirements
  • Report Enhancement: Add business context and insights to technical reports
  • Smart Troubleshooting: AI-powered problem diagnosis and solution recommendations

🤖 Autonomous Agents

Currently Available:

  • 🤖 RL Agent: Adaptive decision-making, resource allocation, and threshold optimization with LLM integration

Planned Capabilities:

  • Proactive Monitoring: Autonomous system health monitoring and issue detection
  • Alert Management: Smart alert prioritization and context-aware notifications
  • Dynamic Scheduling: Autonomous task scheduling and resource optimization

🔌 Extensibility

  • Custom Collectors: Domain-specific data collection interfaces
  • Custom Evaluators: Specialized evaluation logic for industry requirements
  • Custom Reports: Tailored reporting formats and outputs
  • LLM Integration: Support for multiple LLM providers and custom models

📋 Code Examples: For detailed code examples, usage patterns, and implementation guides, see Architecture Overview and Getting Started Guide.

🛠️ Development

📚 Documentation

The project includes comprehensive documentation in a simplified format:

Quick Start with Documentation:

# Build Sphinx documentation
make docs-sphinx

# Serve documentation locally
make docs-sphinx-serve

# View built documentation
open docs_sphinx/build/html/index.html

📁 Project Structure

See PROJECT_STRUCTURE.md for the most up-to-date and detailed project structure.

🔧 Modular Architecture

The framework is designed with a modular architecture for easy maintenance and extension:

  • core/: Central framework components with type safety and validation
  • collectors/: Modular data collection with industrial focus
  • evaluators/: Specialized evaluation engines for different aspects
  • reports/: Reporting for different stakeholders
  • llm/: LLM integration layer with analysis, assistant, and enhancement engines
  • agents/: Autonomous agents (RL Agent implemented, monitoring/alerting planned)
  • cli/: User-friendly command-line interface for system engineers
  • config/: Configuration management for complex systems

📋 Technical Details: For component interfaces, data flow diagrams, and extension points, see Architecture Overview.

👨‍💻 Developer-Friendly Features

The refactored framework provides several developer-friendly features:

🏭 Industry-Specific Templates

  • Ready-to-use configurations for the 7 industrial sectors listed under Supported Industries
  • Multiple template types per industry
  • Industry-specific SLOs with appropriate safety and compliance standards

🖥️ Industrial-Focused CLI

  • Clear, industry-specific help messages
  • Step-by-step guidance tailored for ML engineers in industrial sectors
  • Detailed examples with explanations for each industry use case
  • Error messages with actionable suggestions

🔧 Modular Design

  • Easy to add new commands or templates
  • Clear separation of concerns
  • Maintainable codebase with modular CLI architecture
  • Extensible architecture for custom requirements

🏭 Industrial Focus

  • Safety-critical and business-critical system support
  • Regulatory compliance templates (DO-178C for aviation safety systems)
  • Environmental monitoring for harsh conditions
  • Business impact assessment and reporting

🧪 Running Tests

# Using Makefile (recommended)
make test
make test-verbose
make test-coverage

# Or manually
pytest tests/ -v
pytest tests/safety/ -v  # Safety-critical tests
pytest tests/industry/ -v  # Industry-specific tests

Note: For detailed testing instructions, see docs/developer/testing.md

🀝 Contributing

We welcome contributions! Please see our Development Guide for information about:

  • πŸ”§ Code Quality Tools: Ruff
  • πŸ§ͺ Testing Practices: Unit, integration, and end-to-end tests
  • πŸ”„ Development Workflow: Setup, coding standards, and CI/CD
  • πŸ“ Code Style Guidelines: Python style, naming conventions, documentation

⚑ Quick Development Setup

# Using Makefile (recommended)
make install-dev
make check
make test
make build

# Or manually
uv sync --group dev
uv run ruff check .
uv run ruff format .
uv run pytest
uv build

📏 Code Quality Standards

The project enforces strict code quality standards:

  • 🦊 Ruff: Fast linting, formatting, and import sorting with Black-compatible settings

All code must pass these checks before merging.

📄 License

MIT License - see LICENSE file for details.
