Quickstart

Get from zero to your first detected anomaly in five commands.

Prerequisites

  • Python 3.11+
  • A dbt project with run_results.json or a JSONL file of pipeline runs

Installation

pip install pipeline-anomaly-detector

For development (tests, notebook):

pip install "pipeline-anomaly-detector[dev]"

Step 1 — Collect pipeline runs

From a dbt project:

pad collect \
  --source dbt \
  --dbt-dir "./target" \
  --since 2024-01-01 \
  --output runs.jsonl
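
Under the hood, collecting from dbt amounts to parsing the run_results.json artifact. A minimal sketch, assuming dbt's documented artifact schema (a top-level "results" list whose entries carry "unique_id", "status", and "execution_time"):

```python
import json

def dbt_results_to_runs(path):
    """Yield one run record per model from a dbt run_results.json artifact.

    Assumes dbt's documented artifact schema; pad's actual collector may
    extract more fields than these three.
    """
    with open(path) as f:
        artifact = json.load(f)
    for result in artifact.get("results", []):
        yield {
            "run_id": result["unique_id"],
            "status": result["status"],
            "duration_s": result["execution_time"],
        }
```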

From a generic JSONL file:

pad collect \
  --source generic \
  --input my_pipeline_runs.jsonl \
  --output runs.jsonl
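
The generic source expects one JSON object per line. A hypothetical example of producing such a file — the field names (run_id, pipeline, duration_s, rows, status) are illustrative assumptions, not pad's documented input schema:

```python
import json

# Hypothetical run records; adapt the fields to whatever your
# pipelines actually emit.
runs = [
    {"run_id": "etl_2024_01_01", "pipeline": "nightly_etl",
     "duration_s": 312.4, "rows": 1204332, "status": "success"},
    {"run_id": "etl_2024_01_02", "pipeline": "nightly_etl",
     "duration_s": 2980.1, "rows": 1198554, "status": "success"},
]

with open("my_pipeline_runs.jsonl", "w") as f:
    for run in runs:
        f.write(json.dumps(run) + "\n")
```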

From Airflow:

pad collect \
  --source airflow \
  --airflow-db "sqlite:///~/airflow/airflow.db" \
  --since 2024-01-01 \
  --output runs.jsonl

Step 2 — Train an anomaly detector

pad train \
  --input runs.jsonl \
  --detector ensemble \
  --output ./models

This trains an EnsembleDetector (ZScore + IsolationForest) on your collected runs and saves the model to ./models/.
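
For intuition, a weighted z-score + IsolationForest ensemble can be sketched in a few lines. This is an illustrative toy, not pad's actual EnsembleDetector; the weighting scheme and score scaling are assumptions:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

class TinyEnsemble:
    """Toy weighted ensemble: z-score baseline + IsolationForest."""

    def __init__(self, zscore_weight=0.5):
        self.w = zscore_weight
        self.forest = IsolationForest(random_state=0)

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        self.mean_ = X.mean(axis=0)
        self.std_ = X.std(axis=0) + 1e-9  # guard against zero variance
        self.forest.fit(X)
        return self

    def score(self, X):
        X = np.asarray(X, dtype=float)
        # Z-score component: mean absolute deviation in standard units.
        z = np.abs((X - self.mean_) / self.std_).mean(axis=1)
        # IsolationForest: higher score_samples means more normal, so negate.
        iso = -self.forest.score_samples(X)
        return self.w * z + (1 - self.w) * iso
```

Higher scores mean more anomalous: an extreme run should score well above a typical one.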

Available detector types:

Value              Description
-----              -----------
ensemble           Weighted combination of ZScore + IsolationForest (recommended)
zscore             Fast, interpretable z-score baseline
isolation_forest   sklearn IsolationForest; handles non-linear anomalies

Step 3 — Score a batch of new runs

pad score-batch \
  --input new_runs.jsonl \
  --model ./models/global_ensemble_20240115T120000Z.joblib \
  --db scores.db

The results are printed to the terminal and persisted to scores.db.
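
A sketch of what persisting scores to SQLite might look like. The scores table name and columns here are assumptions for illustration, not pad's actual scores.db schema:

```python
import sqlite3

def persist_scores(db_path, scored_runs):
    """Save (run_id, score, is_anomaly) rows to a SQLite database.

    Schema is illustrative only; pad's real scores.db may differ.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scores ("
        "run_id TEXT PRIMARY KEY, score REAL, is_anomaly INTEGER)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO scores VALUES (?, ?, ?)",
        [(r["run_id"], r["score"], int(r["is_anomaly"])) for r in scored_runs],
    )
    conn.commit()
    conn.close()
```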


Step 4 — Explain an anomaly

pad explain \
  --run-id anomaly_duration_000 \
  --model ./models/global_ensemble_20240115T120000Z.joblib \
  --db scores.db

This prints a Rich panel with the anomaly score bar, contributing features, and is_anomaly status.
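
For intuition, here is a plain-text stand-in for the score bar. The real output is a Rich panel; the 0–1 score range and 0.5 anomaly threshold are assumptions:

```python
def score_bar(score, width=20, threshold=0.5):
    """Render an anomaly score in [0, 1] as a text bar.

    A plain-text stand-in for the Rich panel pad prints; the threshold
    is an assumed default.
    """
    filled = round(max(0.0, min(1.0, score)) * width)
    flag = "ANOMALY" if score >= threshold else "ok"
    return f"[{'#' * filled}{'.' * (width - filled)}] {score:.2f} {flag}"
```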


Step 5 — List saved models

pad models list --store-dir ./models
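
A minimal stand-in for the listing step, assuming models are saved as .joblib files in the store directory (as the model path in Step 3 suggests):

```python
from pathlib import Path

def list_models(store_dir="./models"):
    """Return sorted .joblib model filenames in the store directory.

    Illustrative only; `pad models list` may show richer metadata.
    """
    return sorted(p.name for p in Path(store_dir).glob("*.joblib"))
```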

What's next?