A production-grade streaming pipeline that detects persistent regime shifts in infrastructure metrics using statistical validation instead of opaque ML models.
Note: This project is an independent signals engineering library and is not associated with BlackIce Forensics, Databricks, or other security vendors.
Requires Python 3.10+.
# Install via pip
pip install blackice

Use RegimeDetector to embed stability logic directly into your Python services.
from blackice import RegimeDetector, RegimeState
# 1. Initialize detector (defaults: window=60, sigma=3.0)
detector = RegimeDetector()
# 2. Feed data points (e.g., from Kafka/Prometheus)
# Returns a DetectionEvent object
event = detector.update(45.5)
# 3. Act on findings
if event.is_anomaly:
    print(f"Instability detected: {event.reason}")
    print(f"Severity: {event.zscore:.2f}σ")
if event.state == RegimeState.SHIFTED:
    # Use explicit Enums for type-safety
    # (trigger_pager is a placeholder for your own alerting hook)
    trigger_pager(f"Regime shift confirmed: {event.duration}s")

Run the full analysis pipeline on your own data.
blackice --data <logs.csv> --machine <server_id> --report

Universal Input Requirements:
- Input: A CSV file with columns ['machine_id', 'timestamp', 'cpu_util', 'mem_util'].
- Output: Generates an incident report at reports/analysis_<server_id>.md.
- Logic: Filters noise using the persistence logic defined in configs/default.yaml.
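
Before running the CLI on your own data, it can help to confirm that the CSV header carries the four required columns. The following is a minimal sketch using only the standard library. The helper name `missing_columns` is hypothetical and is not part of the blackice package.

```python
import csv
import io

# Column names taken from the input requirements listed above.
REQUIRED_COLUMNS = ("machine_id", "timestamp", "cpu_util", "mem_util")


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header.

    Hypothetical pre-flight helper, not a blackice API.
    """
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []  # None for an empty file
    return [name for name in REQUIRED_COLUMNS if name not in header]


# In-memory sample that lacks the mem_util column.
sample = io.StringIO("machine_id,timestamp,cpu_util\nm_1,0,45.5\n")
print(missing_columns(sample))  # ['mem_util']
```

For a real file, pass an open file handle, for example `open("logs.csv", newline="")`.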
Optimized Parameters, Deterministic Runtime.
BLACKICE includes an Offline Learning module that optimizes configuration parameters (Decision Boundaries) using historical data. This allows the system to adapt to specific infrastructure characteristics without running opaque models in production.
Use the training script to find the optimal z_threshold and window_size for your data:
# Finds best parameters using Grid Search over 48 combinations
python train_model.py data/machine_usage.csv --output configs/learned.yaml

What happens:
- Objective Function: Minimizes a loss function where False_Alerts are penalized 5x more than Detection_Latency.
- Optimizer: Runs the pipeline on historical data with varying parameters (Grid Search).
- Output: Generates a production-ready learned.yaml config file.
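
The idea behind the objective and the grid search can be sketched in a few lines. This is an illustration under stated assumptions, not the code in train_model.py. The weighted-sum form of the loss, the `evaluate` callback, the toy evaluator, and the 8 x 6 grid are all hypothetical. Only the 5x weighting, the two parameter names, and the total of 48 combinations come from the description above.

```python
from itertools import product

FALSE_ALERT_WEIGHT = 5.0  # from the text: false alerts cost 5x more than latency
LATENCY_WEIGHT = 1.0


def loss(false_alerts, detection_latency):
    """Weighted sum; the additive form is an assumption of this sketch."""
    return FALSE_ALERT_WEIGHT * false_alerts + LATENCY_WEIGHT * detection_latency


def grid_search(evaluate, z_thresholds, window_sizes):
    """Try every (z_threshold, window_size) pair and keep the lowest loss.

    `evaluate` is a hypothetical callback that replays historical data and
    returns (false_alerts, detection_latency) for one parameter pair.
    """
    best_params, best_loss = None, float("inf")
    for z, w in product(z_thresholds, window_sizes):
        false_alerts, latency = evaluate(z, w)
        current = loss(false_alerts, latency)
        if current < best_loss:
            best_params = {"z_threshold": z, "window_size": w}
            best_loss = current
    return best_params, best_loss


def toy_evaluate(z, w):
    """Toy stand-in for replaying the pipeline on historical data."""
    false_alerts = max(0.0, 4.0 - z) * 10  # fewer alerts at higher thresholds
    latency = w / 10 + z * 2               # slower detection as z and w grow
    return false_alerts, latency


z_grid = [2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5]  # 8 values (illustrative)
w_grid = [20, 30, 60, 90, 120, 180]                # 6 values, 48 pairs total
best_params, best_loss = grid_search(toy_evaluate, z_grid, w_grid)
print(best_params, best_loss)  # {'z_threshold': 4.0, 'window_size': 20} 10.0
```

Because the search is exhaustive and the evaluator is deterministic, rerunning it on the same historical data yields the same config.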
Then run the detector with the learned config:
blackice --config configs/learned.yaml ...

Infrastructure monitoring is plagued by alert fatigue. Traditional threshold-based alerting generates noise from transient spikes, while complex ML models introduce opacity and drift.
BLACKICE addresses this by shifting focus from point anomalies to persistent regime shifts. It acknowledges that infrastructure data is inherently noisy and bursty. Instead of training complex models to predict every spike, BLACKICE uses rigorous statistical persistence validation to distinguish between:
- Transient Instability: Burstiness that returns to baseline (filtered out).
- Structural Deviation: Shifts that persist beyond a confidence interval (reported).
The system avoids "AI magic" in favor of deterministic, explainable signal processing that can be debugged by an SRE at 3 AM.
BLACKICE is engineered as a streaming processing pipeline, not a batch analysis script. It operates in constant memory per metric tracker: usage is bounded by the window size (O(window_size)) and does not grow with stream length.
- Baseline Modeling: Welford's algorithm for numerically stable, streaming mean/variance without history retention.
- Deviation Detection: Real-time z-score computation against rolling baselines.
- Persistence Validation: Deterministic filter requiring deviations to sustain magnitude and duration thresholds to trigger state changes.
- State Machine: Formal transition logic (NORMAL → UNSTABLE → SHIFTED) providing clean audit trails.
- RegimeDetector: High-level facade that encapsulates the entire engine into a single, easy-to-use API.
- Metrics/Reporting: Generates label-free stability metrics and automated incident analysis files.
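
To make the baseline and deviation steps concrete, here is a minimal sketch of Welford's online update followed by a z-score. The class name `RunningStats` and its methods are hypothetical and do not mirror baseline.py or deviation.py. This version also keeps a cumulative baseline, so the rolling-window behavior mentioned above is not shown.

```python
import math


class RunningStats:
    """Welford's online mean/variance; stores three numbers, no history."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)  # uses the already-updated mean

    @property
    def std(self):
        """Sample standard deviation; 0.0 until two points have been seen."""
        if self.count < 2:
            return 0.0
        return math.sqrt(self._m2 / (self.count - 1))

    def zscore(self, x):
        """Distance of x from the baseline mean, in standard deviations."""
        s = self.std
        return 0.0 if s == 0.0 else (x - self.mean) / s


stats = RunningStats()
for value in [10.0, 12.0, 11.0, 9.0, 13.0]:
    stats.update(value)

print(round(stats.mean, 3), round(stats.std, 3))  # 11.0 1.581
print(round(stats.zscore(20.0), 2))               # 5.69
```

Each update is O(1) in time and memory, which is what makes this style of baseline suitable for unbounded streams.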
- Streaming-First Design: Processes infinite streams chunk-by-chunk using minimal resources.
- Constant Memory: O(window_size) memory complexity regardless of dataset size.
- Label-Free Metrics: Quality metrics (detection latency, spike rejection) computed without ground truth labels.
- Noise Rejection: Aggressive persistence layer filters 80-90% of transient noise typical in cloud workloads.
- Automated Reporting: Instantly generates production-grade Markdown incident reports.
- CLI-Driven: Unix-philosophy operational interface.
blackice/
├── configs/ # YAML configuration (Learned & Defaults)
├── notebooks/ # Analysis and visualization
├── reports/ # Generated incident reports
├── scripts/ # Testing & CI Verification
│   ├── test_blackice.py # Integration suite
│   ├── test_cli_smoke.py # Operational smoke tests
│   └── test_determinism.py # O(1) Invariant checks
├── src/
│   └── blackice/ # Core library
│       ├── learning/ # [NEW] Offline ML Module
│       │   ├── objective.py # Loss Function (SRE-weighted)
│       │   └── optimizer.py # Grid Search Trainer
│       ├── baseline.py # Streaming statistics
│       ├── cli.py # Production CLI entry point
│       ├── detector.py # High-level RegimeDetector API
│       ├── deviation.py # Signal detection
│       ├── metrics.py # Stability metrics
│       ├── persistence.py # Noise filtering logic
│       ├── pipeline.py # Orchestration
│       └── state.py # Regime state machine
├── train_model.py # [NEW] ML Training Entrypoint
└── pyproject.toml # Project Metadata & Dependencies
BLACKICE generates structured incident reports designed for engineering transparency.
- Executive Summary: Immediate text verdict (HEALTHY/UNHEALTHY) based on confirmed shifts.
- Signal Summary: Detailed breakdown of CPU/Memory behavior patterns.
- Detection Statistics: Tables showing total events vs. confirmed shifts (often 100% rejection rate for stable but bursty machines).
- Infra Interpretation: Automated diagnosis of workload characteristics (e.g., "High variance but stable").
A "HEALTHY" verdict often accompanies high instability counts. This is correct behavior: it shows that the system classified thousands of spikes as non-critical noise, preventing thousands of false pages.
- Conservative by Design: We prefer missing a subtle shift over waking an engineer for a false positive.
- Deterministic > Probabilistic: Given the same input, the system must produce the exact same state transitions.
- Explainability > Complexity: Every regime shift has a clear reason (e.g., "Deviation persisted for >10 points"), traceable back to specific timestamps.
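
These principles can be illustrated with a small persistence filter that maps a stream of z-scores to states and attaches a plain-text reason to each decision. This is a sketch under assumptions, not the logic in persistence.py or state.py. The names `State`, `PersistenceFilter`, and `run_filter` are hypothetical, as is the rule that a single in-range point resets the state to NORMAL. The defaults of 3.0 sigma and 10 points echo values mentioned elsewhere in this README.

```python
from enum import Enum


class State(Enum):
    """Hypothetical stand-in for the library's regime states."""
    NORMAL = "normal"
    UNSTABLE = "unstable"
    SHIFTED = "shifted"


class PersistenceFilter:
    """Confirms a shift only after a sustained run of large deviations."""

    def __init__(self, z_threshold=3.0, min_points=10):
        self.z_threshold = z_threshold
        self.min_points = min_points
        self.run_length = 0  # consecutive points beyond the threshold

    def update(self, zscore):
        if abs(zscore) >= self.z_threshold:
            self.run_length += 1
        else:
            self.run_length = 0  # assumption: one calm point resets the run

        if self.run_length == 0:
            return State.NORMAL, "Within baseline"
        if self.run_length > self.min_points:
            return State.SHIFTED, f"Deviation persisted for >{self.min_points} points"
        return State.UNSTABLE, f"Deviation seen for {self.run_length} points"


def run_filter(zscores):
    """Replay a z-score sequence through a fresh filter; return the states."""
    f = PersistenceFilter()
    return [f.update(z)[0] for z in zscores]


spike = run_filter([0.5, 4.0, 4.2, 3.8, 0.3])  # short burst, then recovery
shift = run_filter([4.0] * 11)                 # sustained deviation

print(State.SHIFTED in spike)  # False
print(shift[-1])               # State.SHIFTED
```

The filter has no randomness and no hidden state beyond `run_length`, so replaying the same input always produces the same transitions.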
- Deterministic output for identical input streams
- O(1) update time per data point
- Bounded memory usage (O(window_size))
- No Online Training (Learning is decoupled & offline)
- Learned Configs, Deterministic Execution
- Not a forecasting system
- Not a root-cause analysis engine
- Not a replacement for TSDBs or Prometheus
For Contributors: To set up a development environment and run tests, please see DEVELOPMENT.md.
Created by Mihir Maru.
