An evaluation framework for Industrial AI systems, applying Site Reliability Engineering (SRE) principles to machine learning evaluation.
Industrial AI systems face unique challenges that traditional ML evaluation approaches don't address:
- Zero Tolerance Failures: Aircraft landing systems, medical diagnostics, autonomous vehicles where errors can be catastrophic
- Regulatory Standards: Aviation (DO-178C), healthcare, financial services require continuous compliance validation
- Environmental Constraints: Underwater devices, extreme temperatures, harsh conditions that affect system reliability
- Immediate Financial Impact: Manufacturing quality control, fraud detection where failures cost millions instantly
- Real-time Decision Making: Supply chain optimization, trading systems where delays cause cascading failures
- Public Safety: Systems where failures affect public safety, requiring continuous monitoring and rapid response
- No Safety Validation: Standard ML evaluation doesn't assess catastrophic failure scenarios
- Missing Regulatory Compliance: No built-in validation against industry-specific standards
- Inadequate Environmental Monitoring: Doesn't account for harsh operating conditions
- No Business Impact Metrics: Technical metrics don't connect to business outcomes
This framework treats Industrial AI systems as critical infrastructure, applying SRE concepts:
- Safety-First Error Budgets: Acceptable failure rates with zero tolerance for catastrophic failures
- Regulatory SLOs: Service Level Objectives that include compliance requirements
- Environmental Observability: Monitoring that accounts for harsh operating conditions
- Business-Critical Reliability: Focus on preventing immediate financial and safety impacts
- Rapid Incident Response: Structured approach to handling safety-critical and business-critical failures
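As a rough illustration of the safety-first error budget idea, the sketch below computes how much of a budget remains and treats any catastrophic event as an immediate failure. The `ErrorBudget` class and its field names are hypothetical and are not taken from the framework's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical error-budget record; not part of the framework's API."""

    slo_target: float         # e.g. 0.999 means 99.9% of events must succeed
    total_events: int         # events observed in the SLO window
    failed_events: int        # ordinary (non-catastrophic) failures
    catastrophic_events: int  # failures with safety consequences

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of events the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_events

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0.0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0 if self.failed_events else 1.0
        return 1.0 - self.failed_events / self.allowed_failures

    def is_healthy(self) -> bool:
        # Safety-first rule: any catastrophic event fails the check,
        # no matter how much ordinary budget is left.
        return self.catastrophic_events == 0 and self.remaining_fraction > 0.0


budget = ErrorBudget(slo_target=0.999, total_events=100_000,
                     failed_events=40, catastrophic_events=0)
print(round(budget.remaining_fraction, 3), budget.is_healthy())
```

Keeping catastrophic events in a separate counter means they never "spend" budget; they simply override it, which is one way to express zero tolerance alongside an ordinary failure allowance.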
Evaluation isn't just a final checkpoint; it's a continuous feedback mechanism that informs every stage of the ML lifecycle:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Development   │    │   Deployment    │    │   Production    │
│                 │    │                 │    │                 │
│ • Model Design  │───▶│ • A/B Testing   │───▶│ • Real-time     │
│ • Data Pipeline │    │ • Canary Deploy │    │ • Monitoring    │
│ • Architecture  │    │ • Rollback Plan │    │ • Alerting      │
└─────────────────┘    └─────────────────┘    └─────────────────┘
         ▲                      ▲                      │
         │                      │                      ▼
         └──────────────────────┼──────────────────────┘
                                │
                       ┌─────────────────┐
                       │   Evaluation    │
                       │                 │
                       │ • Performance   │
                       │ • Reliability   │
                       │ • Safety        │
                       │ • Compliance    │
                       └─────────────────┘
- Model Architecture: Performance bottlenecks reveal design flaws
- Data Quality: Drift detection identifies training data issues
- Feature Engineering: Model degradation points to feature relevance changes
- Hyperparameter Tuning: Real-world performance guides optimization
- A/B Testing: Structured comparison of model versions
- Canary Deployments: Gradual rollout with continuous evaluation
- Rollback Triggers: Automatic reversion based on SLO violations
- Infrastructure Scaling: Performance metrics guide resource allocation
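To make the rollback trigger idea concrete, here is a minimal sketch that compares a canary against the current baseline and lists the reasons a rollback would fire. The `CanarySnapshot` type, the `rollback_reasons` function, and both thresholds are assumptions chosen for illustration; they do not describe how the framework itself decides.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CanarySnapshot:
    """Metrics for one evaluation window (hypothetical shape)."""

    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def rollback_reasons(
    canary: CanarySnapshot,
    baseline: CanarySnapshot,
    max_error_rate: float = 0.01,          # assumed absolute SLO ceiling
    max_latency_regression: float = 1.2,   # assumed: canary may be 20% slower
) -> List[str]:
    """Return the SLO violations that should trigger a rollback."""
    reasons = []
    if canary.error_rate > max_error_rate:
        reasons.append(
            f"error rate {canary.error_rate:.3f} exceeds {max_error_rate:.3f}"
        )
    if canary.p99_latency_ms > baseline.p99_latency_ms * max_latency_regression:
        reasons.append(
            f"p99 latency {canary.p99_latency_ms:.0f} ms regressed past "
            f"{max_latency_regression:.1f}x baseline"
        )
    return reasons


baseline = CanarySnapshot(error_rate=0.002, p99_latency_ms=180.0)
canary = CanarySnapshot(error_rate=0.020, p99_latency_ms=400.0)
for reason in rollback_reasons(canary, baseline):
    print("ROLLBACK:", reason)
```

Returning the list of reasons rather than a bare boolean keeps the decision auditable, which matters when a rollback has to be explained after the fact.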
- Real-time Alerts: Immediate notification of SLO violations
- Trend Analysis: Long-term performance degradation detection
- Incident Response: Structured approach to ML system failures
- Capacity Planning: Resource needs based on usage patterns
- Model Retraining: Triggered by drift detection and performance degradation
- Data Pipeline Updates: Informed by data quality issues
- Architecture Evolution: Driven by scalability and reliability needs
- Process Improvement: Continuous refinement of ML operations
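One common way to turn "drift detection" into a retraining trigger is the Population Stability Index (PSI). The sketch below is a standard-library-only illustration; the function names, the equal-width binning, and the 0.2 threshold (a widely used rule of thumb) are assumptions, not the framework's drift evaluator.

```python
import math
from typing import List, Sequence


def population_stability_index(
    reference: Sequence[float],
    current: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI between a reference sample and a current sample of one feature."""
    if not reference or not current:
        raise ValueError("both samples must be non-empty")

    lo, hi = min(reference), max(reference)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range go to the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # eps keeps empty bins from producing log(0) or division by zero.
        return [max(c / len(values), eps) for c in counts]

    ref_p, cur_p = proportions(reference), proportions(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref_p, cur_p))


def needs_retraining(
    reference: Sequence[float],
    current: Sequence[float],
    threshold: float = 0.2,  # assumed rule-of-thumb cut-off
) -> bool:
    return population_stability_index(reference, current) > threshold


training = [i / 100 for i in range(100)]          # spread over 0.00..0.99
production = [0.5 + i / 200 for i in range(100)]  # shifted to 0.50..0.995
print(needs_retraining(training, production))
```

In practice the threshold and binning would be tuned per feature; the point here is only that a single scalar drift score can gate an automated retraining job.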
- Prevent Catastrophic Failures: Early detection of safety-critical system issues
- Regulatory Compliance: Continuous validation for regulated industries
- Brand Protection: Avoid public incidents that damage reputation
- Financial Loss Prevention: Catch issues before they impact revenue
- Proactive Maintenance: Fix issues before they become incidents
- Resource Optimization: Right-size infrastructure based on actual usage
- Team Efficiency: Automated monitoring reduces manual oversight
- Data-Driven Decisions: Metrics guide strategic ML investments
- Faster Iteration: Rapid feedback enables quick model improvements
- Higher Quality: Continuous evaluation maintains performance standards
- Customer Trust: Reliable ML systems build user confidence
- Innovation Velocity: Safe experimentation with new ML approaches
- Safety-Critical Evaluation: Zero tolerance for catastrophic failures with specialized safety metrics
- Regulatory Compliance: Built-in validation against industry standards (DO-178C for aviation)
- Environmental Monitoring: Specialized collectors for harsh operating conditions
- Business-Critical Reliability: SRE principles applied to systems with immediate financial impact
- LLM-Powered Intelligence: Pattern recognition, natural language configuration, and report enhancement
- Autonomous Agents: Future-ready architecture for proactive monitoring, alerting, and scheduling
- Extensible Architecture: Plugin-based collectors and evaluators for domain-specific requirements
- Real-time & Batch: Online and offline evaluation for continuous monitoring
- Standards Enforcement: Configurable quality gates with regulatory compliance checks
The framework supports multiple industrial sectors with ready-to-use configurations and industry-specific requirements. Each industry has its own directory with detailed examples and documentation:
- Species Classification: Sonar-based fish species identification and environmental hazard detection
- Key Features: Environmental monitoring, regulatory compliance, resource optimization
- Examples:
examples/industries/aquaculture/
- Safety-Critical Systems: Aircraft landing and flight control assistance
- Key Features: DO-178C compliance, sub-500ms response times, environmental adaptation
- Examples:
examples/industries/aviation/
- Agentic Security Operations: Multi-agent AI workflows for alert triage, investigation, and response
- Key Features: Cost-optimized LLM integration, RAG-powered threat intelligence, explainable AI decisions, multi-TB data processing
- Examples:
examples/industries/cybersecurity/
- Energy Optimization Recommendations: ML-driven recommendations for facility energy consumption and cost reduction
- Key Features: Real-time energy monitoring, cost reduction tracking, multi-facility support
- Examples:
examples/industries/energy/
- Predictive Maintenance: Equipment failure prediction with VAE anomaly detection
- Demand Forecasting: Supply chain optimization and production planning
- Key Features: ISO compliance, cost optimization, real-time monitoring
- Examples:
examples/industries/manufacturing/
- Collision Avoidance: Vessel collision detection and navigation safety
- Key Features: COLREGs compliance, real-time alerts, multi-vessel tracking
- Examples:
examples/industries/maritime/
- Digital Twins: Manufacturing process monitoring and yield prediction
- Key Features: Real-time process control, quality metrics, equipment monitoring
- Examples:
examples/industries/semiconductor/
See the examples/industries/ directory for complete configuration files covering all scenarios. Each industry directory contains detailed README files with specific use cases, requirements, and implementation guidance.
For an overview of all examples, templates, and tutorials, see the examples/ directory.
The framework follows a hybrid architecture that combines deterministic components with LLM-powered intelligence:
┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│  Data Sources   │    │   Collectors    │    │   Evaluators    │
│                 │    │                 │    │                 │
│ • Logs          │───▶│ • Online        │───▶│ • Reliability   │
│ • Metrics       │    │ • Offline       │    │ • Performance   │
│ • Telemetry     │    │ • Custom        │    │ • Safety        │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │     Reports     │
                                              │                 │
                                              │ • SLI/SLO       │
                                              │ • Incidents     │
                                              │ • Trends        │
                                              └─────────────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │    LLM Layer    │
                                              │                 │
                                              │ • Analysis      │
                                              │ • Assistant     │
                                              │ • Enhancement   │
                                              └─────────────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │     Agents      │
                                              │                 │
                                              │ • RL Agent      │
                                              │ • Monitoring*   │
                                              │ • Alerting*     │
                                              └─────────────────┘
Architecture Details: For detailed technical architecture information, component interactions, and implementation specifics, see Architecture Overview.
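The deterministic part of the flow (collect, evaluate, report) can be pictured with a few plain functions. This is only a conceptual sketch: `run_pipeline`, the stage signatures, and the metric names are invented for illustration and do not mirror the framework's actual classes.

```python
from typing import Callable, Dict, List

Metrics = Dict[str, float]
Collector = Callable[[], Metrics]
Evaluator = Callable[[Metrics], Dict[str, bool]]


def run_pipeline(collectors: List[Collector], evaluators: List[Evaluator]) -> dict:
    """Collect metrics, evaluate them, and assemble a plain-dict report."""
    metrics: Metrics = {}
    for collect in collectors:
        metrics.update(collect())  # later collectors override duplicate keys

    checks: Dict[str, bool] = {}
    for evaluate in evaluators:
        checks.update(evaluate(metrics))

    return {"metrics": metrics, "checks": checks, "passed": all(checks.values())}


# Example stages with made-up values.
def latency_collector() -> Metrics:
    return {"p99_latency_ms": 420.0, "error_rate": 0.002}


def performance_evaluator(m: Metrics) -> Dict[str, bool]:
    return {"latency_under_500ms": m["p99_latency_ms"] < 500.0}


def reliability_evaluator(m: Metrics) -> Dict[str, bool]:
    return {"error_rate_under_1pct": m["error_rate"] < 0.01}


report = run_pipeline(
    [latency_collector],
    [performance_evaluator, reliability_evaluator],
)
print(report["passed"], report["checks"])
```

Because each stage only exchanges plain dictionaries, an LLM or agent layer can consume the report without the deterministic stages depending on it.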
# Clone the repository
git clone <repository-url>
cd ml-systems-evaluation
# Install in development mode
uv sync --group dev
# (Optional) Activate the virtual environment that uv creates in .venv
source .venv/bin/activate
# For production installs (main dependencies only)
# uv sync --group main
Note: This project uses UV for dependency management and packaging. See pyproject.toml for the full, up-to-date list of dependencies. For detailed installation instructions, see docs/user-guides/installation.md.
# 1. Create a new configuration file
ml-eval create-config --output my-system.yaml --system-name "My ML System" --industry manufacturing
# 2. Validate your configuration
ml-eval validate my-system.yaml
# 3. Run health check on your system
ml-eval health my-system.yaml
# 4. Collect data from your system
ml-eval collect my-system.yaml --output collected-data.json
# 5. Evaluate your system metrics
ml-eval evaluate my-system.yaml --data collected-data.json --output results.json
# 6. Generate reports
ml-eval report my-system.yaml --results results.json --output reports.json
# 7. Run complete evaluation pipeline
ml-eval run my-system.yaml --output complete-results.json
# Show all available commands
ml-eval --help
# Create new configurations for different industries
ml-eval create-config --output aviation-system.yaml --system-name "Aircraft Landing System" --industry aviation --criticality safety_critical
ml-eval create-config --output security-system.yaml --system-name "Security Operations" --industry cybersecurity --criticality business_critical
# Validate configurations (use existing example files)
ml-eval validate examples/industries/aviation/aircraft-landing.yaml
ml-eval validate examples/industries/maritime/collision-avoidance.yaml
ml-eval validate examples/industries/manufacturing/predictive-maintenance.yaml
ml-eval validate examples/industries/semiconductor/etching-digital-twins.yaml
ml-eval validate examples/industries/cybersecurity/security-operations.yaml
# Run health checks
ml-eval health examples/industries/aviation/aircraft-landing.yaml
# List configured components
ml-eval list-collectors examples/industries/manufacturing/predictive-maintenance.yaml
ml-eval list-evaluators examples/industries/cybersecurity/security-operations.yaml
ml-eval list-reports examples/industries/aviation/aircraft-landing.yaml
# Run evaluations
ml-eval run examples/industries/aviation/aircraft-landing.yaml --output aviation-results.json
ml-eval run examples/industries/cybersecurity/security-operations.yaml --output security-results.json
- OnlineCollector: Real-time metrics from running systems
- OfflineCollector: Historical data from logs and databases
- EnvironmentalCollector: Specialized monitoring for harsh conditions (temperature, pressure, etc.)
- RegulatoryCollector: Compliance metrics for industry standards
- CustomCollector: Extensible interface for domain-specific metrics
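A domain-specific collector could look roughly like the following. `BaseCollector`, its `collect()` method, and `EnclosureTemperatureCollector` are hypothetical stand-ins; the framework's real collector interface may differ, so check the Architecture Overview before relying on this shape.

```python
from abc import ABC, abstractmethod
from typing import Callable, Dict


class BaseCollector(ABC):
    """Hypothetical base class; the framework's real interface may differ."""

    name: str = "base"

    @abstractmethod
    def collect(self) -> Dict[str, float]:
        """Return a flat mapping of metric name to value."""


class EnclosureTemperatureCollector(BaseCollector):
    """Samples a temperature sensor several times and summarizes the readings."""

    name = "enclosure_temperature"

    def __init__(self, read_celsius: Callable[[], float], samples: int = 5) -> None:
        if samples < 1:
            raise ValueError("samples must be at least 1")
        self._read = read_celsius
        self._samples = samples

    def collect(self) -> Dict[str, float]:
        readings = [self._read() for _ in range(self._samples)]
        return {
            "temperature_c_mean": sum(readings) / len(readings),
            "temperature_c_max": max(readings),
        }


# A fake sensor stands in for real hardware in this example.
fake_readings = iter([40.0, 42.0, 41.0])
collector = EnclosureTemperatureCollector(lambda: next(fake_readings), samples=3)
print(collector.collect())
```

Injecting the sensor as a callable keeps the collector testable without hardware, which is useful when the real device sits underwater or in a furnace.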
- ReliabilityEvaluator: SLI/SLO compliance and error budgets with safety thresholds
- SafetyEvaluator: Critical system safety validation with zero-tolerance checks
- RegulatoryEvaluator: Compliance validation against industry standards
- EnvironmentalEvaluator: Performance assessment under harsh conditions
- DriftEvaluator: Data and model drift detection with business impact assessment
- LLMAnalysisEngine: Pattern recognition and drift detection
- LLMAssistantEngine: Natural language configuration and troubleshooting assistance
- LLMEnhancementEngine: Report enhancement and business impact translation
- RLAgent: Adaptive decision-making with LLM integration and safety constraints
- MonitoringAgent: Autonomous real-time monitoring and health checks (planned)
- AlertingAgent: Alert prioritization and routing (planned)
Agent Details: For comprehensive agent implementation details, RL loop architecture, and usage examples, see Architecture Overview.
- ReliabilityReport: Error budgets, SLO compliance, incident analysis
- SafetyReport: Safety-critical metrics and compliance status
- RegulatoryReport: Compliance validation and audit trails
- BusinessImpactReport: Technical metrics connected to business outcomes
For SLO configuration guidance, see the SLO Configuration Guide. The framework supports:
- Safety-Critical SLOs: Zero-tolerance thresholds for catastrophic failures
- Business-Critical SLOs: Performance targets with immediate financial impact
- Environmental SLOs: Adaptation to harsh operating conditions
- Regulatory SLOs: Compliance with industry standards (DO-178C, COLREGs, etc.)
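The difference between a safety-critical and a business-critical SLO can be shown in a few lines. In this sketch the `SLO` dataclass and `slo_met` function are hypothetical; only the `safety_critical` and `business_critical` labels come from the CLI's `--criticality` option shown earlier.

```python
from dataclasses import dataclass
from enum import Enum


class Criticality(Enum):
    SAFETY_CRITICAL = "safety_critical"
    BUSINESS_CRITICAL = "business_critical"


@dataclass(frozen=True)
class SLO:
    """Hypothetical SLO definition, not the framework's configuration schema."""

    name: str
    target: float  # required success ratio, e.g. 0.9995
    criticality: Criticality


def slo_met(slo: SLO, good_events: int, total_events: int) -> bool:
    if total_events == 0:
        # Assumption: an empty window violates nothing. A stricter policy
        # could treat missing data as a failure for safety-critical systems.
        return True
    if slo.criticality is Criticality.SAFETY_CRITICAL:
        # Zero tolerance: a single bad event is a violation.
        return good_events == total_events
    return good_events / total_events >= slo.target


landing = SLO("landing_guidance", 0.9999, Criticality.SAFETY_CRITICAL)
triage = SLO("alert_triage", 0.99, Criticality.BUSINESS_CRITICAL)
print(slo_met(landing, 999, 1000), slo_met(triage, 995, 1000))
```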
- Safety-First Alerts: Immediate notification for safety-critical budget violations
- Regulatory Compliance: Automatic audit trail for compliance violations
- Business Impact Assessment: Connect budget exhaustion to financial impact
- Environmental Adaptation: Adjust thresholds based on operating conditions
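Budget-based alerting is often expressed as a burn rate: the observed error rate divided by the error rate the SLO allows. The sketch below assumes that approach; the function names are hypothetical, and the 14.4 paging threshold is borrowed from common SRE practice for a one-hour window rather than from this framework.

```python
def burn_rate(error_rate: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0 to leave a budget")
    return error_rate / budget


def alert_level(
    error_rate: float,
    slo_target: float,
    safety_critical: bool,
    page_at: float = 14.4,   # assumed fast-burn paging threshold
    ticket_at: float = 1.0,  # assumed: budget is being overspent
) -> str:
    # Safety-first rule: any error on a safety-critical system pages at once.
    if safety_critical and error_rate > 0:
        return "page"
    rate = burn_rate(error_rate, slo_target)
    if rate >= page_at:
        return "page"
    if rate >= ticket_at:
        return "ticket"
    return "ok"


print(alert_level(0.002, 0.999, safety_critical=False))   # burning ~2x budget
print(alert_level(0.0001, 0.999, safety_critical=True))   # safety-critical
```

Separating the safety-critical branch from the burn-rate math keeps the zero-tolerance rule from being weakened by threshold tuning.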
The framework enables Industrial AI development with built-in safety and compliance:
- Safety-First SLOs: Zero-tolerance thresholds for catastrophic failures
- Real-time Validation: Continuous safety validation during development
- Regulatory Compliance: Built-in validation against industry standards (DO-178C, etc.)
- Emergency Protocols: Automatic system shutdown for safety violations
Enhanced analysis and decision support capabilities:
- Pattern Recognition: Drift detection and anomaly identification
- Natural Language Configuration: Generate configurations from plain English requirements
- Report Enhancement: Add business context and insights to technical reports
- Smart Troubleshooting: AI-powered problem diagnosis and solution recommendations
Currently Available:
- RL Agent: Adaptive decision-making, resource allocation, and threshold optimization with LLM integration
Planned Capabilities:
- Proactive Monitoring: Autonomous system health monitoring and issue detection
- Alert Management: Smart alert prioritization and context-aware notifications
- Dynamic Scheduling: Autonomous task scheduling and resource optimization
- Custom Collectors: Domain-specific data collection interfaces
- Custom Evaluators: Specialized evaluation logic for industry requirements
- Custom Reports: Tailored reporting formats and outputs
- LLM Integration: Support for multiple LLM providers and custom models
Code Examples: For detailed code examples, usage patterns, and implementation guides, see Architecture Overview and Getting Started Guide.
The project includes comprehensive documentation in a simplified format:
- Markdown Documentation: Primary documentation with user guides, developer guides, and industry-specific examples
- Sphinx Documentation: Auto-generated API documentation and navigation
- Development Guide: Development setup and contribution guidelines
Quick Start with Documentation:
# Build Sphinx documentation
make docs-sphinx
# Serve documentation locally
make docs-sphinx-serve
# View built documentation
open docs_sphinx/build/html/index.html
See PROJECT_STRUCTURE.md for the most up-to-date and detailed project structure.
The framework is designed with a modular architecture for easy maintenance and extension:
- core/: Central framework components with type safety and validation
- collectors/: Modular data collection with industrial focus
- evaluators/: Specialized evaluation engines for different aspects
- reports/: Reporting for different stakeholders
- llm/: LLM integration layer with analysis, assistant, and enhancement engines
- agents/: Autonomous agents (RL Agent implemented, monitoring/alerting planned)
- cli/: User-friendly command-line interface for system engineers
- config/: Configuration management for complex systems
Technical Details: For component interfaces, data flow diagrams, and extension points, see Architecture Overview.
The refactored framework provides several developer-friendly features:
- Ready-to-use configurations for 7 industrial sectors
- Multiple template types per industry
- Industry-specific SLOs with appropriate safety and compliance standards
- Clear, industry-specific help messages
- Step-by-step guidance tailored for ML engineers in industrial sectors
- Detailed examples with explanations for each industry use case
- Error messages with actionable suggestions
- Easy to add new commands or templates
- Clear separation of concerns
- Maintainable codebase with modular CLI architecture
- Extensible architecture for custom requirements
- Safety-critical and business-critical system support
- Regulatory compliance templates (DO-178C for aviation safety systems)
- Environmental monitoring for harsh conditions
- Business impact assessment and reporting
# Using Makefile (recommended)
make test
make test-verbose
make test-coverage
# Or manually
pytest tests/ -v
pytest tests/safety/ -v # Safety-critical tests
pytest tests/industry/ -v # Industry-specific tests
Note: For detailed testing instructions, see docs/developer/testing.md.
We welcome contributions! Please see our Development Guide for information about:
- Code Quality Tools: Ruff
- Testing Practices: Unit, integration, and end-to-end tests
- Development Workflow: Setup, coding standards, and CI/CD
- Code Style Guidelines: Python style, naming conventions, documentation
# Using Makefile (recommended)
make install-dev
make check
make test
make build
# Or manually
uv sync --extra dev
uv run ruff check .
uv run ruff format .
uv run pytest
uv build
The project enforces strict code quality standards:
- Ruff: Fast linting, formatting, and import sorting with Black-compatible settings
All code must pass these checks before merging.
MIT License - see LICENSE file for details.