AEFDI · teehari79 · Oct 5, 2025 · Oct 5, 2025 · Oct 6, 2025 · Oct 6, 2025
diff --git a/README.md b/README.md
@@ -36,18 +36,54 @@ To install the `energy-fault-detector` package, run: `pip install energy-fault-d
 
 
 ## Quick fault detection
-For a quick demo on a specific dataset, run:
+The `quick_fault_detector` CLI now supports dedicated training and prediction workflows:
 
-```quick_fault_detector <path_to_your_dataset.csv>```
+- **Train and evaluate a model** (default mode). This trains a new autoencoder, evaluates it on the provided
+  test slice, and reports where the model artefacts were stored.
 
-For more options, run ```quick_fault_detector -h```.
+  ```bash
+  quick_fault_detector <path_to_training_data.csv> --mode train [--options options.yaml]
+  ```
 
-For an example using one of the CARE2Compare datasets, run:
-```quick_fault_detector <path_to_c2c_dataset.csv> --c2c_example```
+- **Run predictions with an existing model**. Supply the dataset to score alongside the directory that contains the
+  saved model files returned from a previous training run.
+
+  ```bash
+  quick_fault_detector <path_to_evaluation_data.csv> --mode predict --model_path <path_to_saved_model> [--options options.yaml]
+  ```
+
+  The `--model_path` argument is mandatory in predict mode.
+
+Prediction artefacts (anomaly scores, reconstructions, and detected events) are written to the directory specified by
+`--results_dir` (defaults to `./results`). For an example using one of the CARE2Compare datasets, run:
+
+```bash
+quick_fault_detector <path_to_c2c_dataset.csv> --c2c_example
+```
 
 For more information, have a look at the notebook [Quick Failure Detection](./notebooks/Example%20-%20Quick%20Failure%20Detection.ipynb)
 
 
+## REST prediction API
+
+The project ships with a lightweight FastAPI application that exposes the prediction workflow via HTTP. The service
+resolves models by name and version using the directory structure described in
+[`energy_fault_detector/api/service_config.yaml`](energy_fault_detector/api/service_config.yaml).
+
+Start the API with:
+
+```bash
+uvicorn energy_fault_detector.api.app:app --reload
+```
+
+By default the service reads its configuration from the bundled `service_config.yaml`. Provide the
+`EFD_SERVICE_CONFIG` environment variable to point to a custom YAML file when you want to adapt the model root
+directory, override default ignore patterns, or tweak other runtime parameters. Predictions are triggered with a `POST`
+request to `/predict` and expect a JSON payload containing at least the `model_name` and `data_path` fields. Optional
+fields such as `model_version`, `ignore_features`, and `asset_name` refine which artefacts are used and how the results
+are stored.
+
+
 ## Fault detection in 5 lines of code
 
 ```python

diff --git a/asset_dataset_splitter.py b/asset_dataset_splitter.py
@@ -0,0 +1,224 @@
+"""Utility for splitting asset datasets into train and prediction CSV files.
+
+The module provides a command line interface that accepts the path to a
+directory of ``.csv`` files. The delimiter is auto-detected, with ``;`` as
+the fallback. Each file is expected to contain records for one or more
+assets identified by the ``asset_id`` column.
+
+The script creates up to two output files per asset (empty splits are skipped):
+
+``train_<asset>.csv``
+    Contains all records marked as ``train`` in the ``train_test`` column and
+    all records whose ``status_type_id`` is either 0 or 2.
+
+``predict_<asset>.csv``
+    Contains all records whose ``status_type_id`` is 1, 3, 4 or 5 and all
+    records whose ``train_test`` value is ``prediction`` regardless of the
+    ``status_type_id`` value.
+
+Before saving, the script removes helper columns as well as columns that
+contain ``_max``, ``_min`` or ``_std`` from the resulting files.
+"""
+
+from __future__ import annotations
+
+import argparse
+import csv
+from pathlib import Path
+from typing import Dict, Iterable, MutableMapping
+
+import pandas as pd
+
+
+
+STATUS_TYPE_NORMAL = {"0", "2"}
+STATUS_TYPE_ANOMALY = {"1", "3", "4", "5"}
+DROP_COLUMNS = {
+    "asset_id",
+    "train_test",
+    "train_test_bool",
+    "status_type_id",
+    "status_type_bool",
+}
+DROP_COLUMN_SUBSTRINGS = ("_max", "_min", "_std")
+
+
+def _detect_delimiter(csv_file: Path) -> str:
+    """Return the detected delimiter for ``csv_file`` with ``;`` as fallback."""
+
+    with open(csv_file, "r", encoding="utf-8", errors="ignore") as handle:
+        sample = handle.read(2048)
+        sniffer = csv.Sniffer()
+        try:
+            dialect = sniffer.sniff(sample)
+            return dialect.delimiter
+        except csv.Error:
+            return ";"
+
+
+def _iter_asset_frames(
+    csv_file: Path, *, chunksize: int | None = 100_000
+) -> Iterable[tuple[str, pd.DataFrame]]:
+    """Yield (asset_id, dataframe) pairs from ``csv_file`` without loading everything into memory."""
+
+    delimiter = _detect_delimiter(csv_file)
+    reader = pd.read_csv(csv_file, sep=delimiter, dtype=str, chunksize=chunksize)
+
+    # ``pd.read_csv`` returns ``DataFrame`` when ``chunksize`` is ``None``.
+    if isinstance(reader, pd.DataFrame):
+        reader = [reader]
+
+    for chunk in reader:
+        if chunk.empty:
+            continue
+
+        if "asset_id" not in chunk.columns:
+            raise ValueError(
+                f"File '{csv_file}' does not contain required column 'asset_id'."
+            )
+
+        for asset_id, group in chunk.groupby("asset_id", sort=False):
+            yield str(asset_id), group.reset_index(drop=True)
+
+
+def _clean_columns(df: pd.DataFrame) -> pd.DataFrame:
+    """Remove helper columns and columns containing ``_max``/``_min``/``_std``."""
+
+    to_drop = [col for col in df.columns if col in DROP_COLUMNS]
+    to_drop.extend(
+        col for col in df.columns if any(substr in col for substr in DROP_COLUMN_SUBSTRINGS)
+    )
+    return df.drop(columns=to_drop, errors="ignore")
+
+
+def _get_lowercase_series(df: pd.DataFrame, column: str) -> pd.Series:
+    series = df.get(column)
+    if series is None:
+        return pd.Series("", index=df.index, dtype=object)
+    return series.fillna("").astype(str).str.lower()
+
+
+def _get_status_series(df: pd.DataFrame) -> pd.Series:
+    series = df.get("status_type_id")
+    if series is None:
+        return pd.Series("", index=df.index, dtype=object)
+    return series.fillna("").astype(str)
+
+
+def _build_train_frame(df: pd.DataFrame) -> pd.DataFrame:
+    """Filter rows belonging to the training split."""
+
+    train_series = _get_lowercase_series(df, "train_test")
+    status_series = _get_status_series(df)
+
+    is_train = train_series == "train"
+    is_normal_status = status_series.isin(STATUS_TYPE_NORMAL)
+    return df[is_train | is_normal_status].copy()
+
+
+def _build_predict_frame(df: pd.DataFrame) -> pd.DataFrame:
+    """Filter rows belonging to the prediction split."""
+
+    train_series = _get_lowercase_series(df, "train_test")
+    status_series = _get_status_series(df)
+
+    is_prediction = train_series == "prediction"
+    is_anomaly_status = status_series.isin(STATUS_TYPE_ANOMALY)
+    return df[is_prediction | is_anomaly_status].copy()
+
+
+def _append_asset_frames(
+    asset_id: str,
+    df: pd.DataFrame,
+    output_dir: Path,
+    header_written: MutableMapping[Path, bool],
+) -> None:
+    """Append data for ``asset_id`` to the corresponding train/predict CSV files."""
+
+    if df.empty:
+        return
+
+    train_df = _clean_columns(_build_train_frame(df))
+    predict_df = _clean_columns(_build_predict_frame(df))
+
+    if not train_df.empty:
+        train_path = output_dir / f"train_{asset_id}.csv"
+        train_df.to_csv(
+            train_path,
+            sep=";",
+            index=False,
+            mode="a",
+            header=not header_written.get(train_path, False),
+        )
+        header_written[train_path] = True
+
+    if not predict_df.empty:
+        predict_path = output_dir / f"predict_{asset_id}.csv"
+        predict_df.to_csv(
+            predict_path,
+            sep=";",
+            index=False,
+            mode="a",
+            header=not header_written.get(predict_path, False),
+        )
+        header_written[predict_path] = True
+
+
+def split_asset_datasets(
+    input_dir: Path,
+    output_dir: Path | None = None,
+    *,
+    chunksize: int | None = 100_000,
+) -> None:
+    """Split datasets per asset into train and prediction CSV files.
+
+    Args:
+        input_dir: Directory containing the source ``.csv`` files.
+        output_dir: Optional directory to store the results. When ``None`` the
+            input directory is used.
+        chunksize: Maximum number of rows per chunk read from the source files.
+            ``None`` loads the entire file at once.
+    """
+
+    if not input_dir.is_dir():
+        raise NotADirectoryError(f"Input directory '{input_dir}' does not exist or is not a directory")
+
+    output_dir = output_dir or input_dir
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    # Remove previously generated files to avoid appending to stale data.
+    for existing_file in output_dir.glob("train_*.csv"):
+        existing_file.unlink()
+    for existing_file in output_dir.glob("predict_*.csv"):
+        existing_file.unlink()
+
+    header_written: Dict[Path, bool] = {}
+
+    for csv_file in sorted(input_dir.glob("*.csv")):
+        for asset_id, asset_df in _iter_asset_frames(csv_file, chunksize=chunksize):
+            _append_asset_frames(asset_id, asset_df, output_dir, header_written)
+
+
+def _parse_args() -> argparse.Namespace:
+    parser = argparse.ArgumentParser(description="Split asset datasets into train and prediction files")
+    parser.add_argument(
+        "input_dir",
+        type=Path,
+        help="Path to the directory containing the source CSV files",
+    )
+    parser.add_argument(
+        "--output-dir",
+        type=Path,
+        default=None,
+        help="Optional output directory. Defaults to the input directory.",
+    )
+    return parser.parse_args()
+
+
+def main() -> None:
+    args = _parse_args()
+    split_asset_datasets(args.input_dir, args.output_dir)
+
+
+if __name__ == "__main__":
+    main()
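
The masks in `_build_train_frame` and `_build_predict_frame` combine the `train_test` label and the `status_type_id` with a logical OR, so a single record can end up in both output files or in neither. The snippet below is a minimal plain-Python sketch of that rule for one record. `target_files` is a hypothetical helper used only for illustration and is not part of the module.

```python
from __future__ import annotations

# Same status sets as in asset_dataset_splitter.py.
STATUS_TYPE_NORMAL = {"0", "2"}
STATUS_TYPE_ANOMALY = {"1", "3", "4", "5"}


def target_files(train_test: str, status_type_id: str) -> set[str]:
    """Return the splits ("train", "predict") a single record would be written to."""
    label = train_test.lower()  # the module lowercases the train_test column
    targets: set[str] = set()
    if label == "train" or status_type_id in STATUS_TYPE_NORMAL:
        targets.add("train")
    if label == "prediction" or status_type_id in STATUS_TYPE_ANOMALY:
        targets.add("predict")
    return targets


# A training record with an anomalous status satisfies both masks.
both = target_files("train", "1")
# A record with no label and no known status satisfies neither mask.
neither = target_files("", "")
```

Under this reading, rows labelled `train` that carry an anomalous status are duplicated into `predict_<asset>.csv`, which may or may not be intended when the training file is meant to contain normal behaviour only.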
diff --git a/docs/config_example.yaml b/docs/config_example.yaml
@@ -63,4 +63,11 @@ root_cause_analysis:  # (optional) if not specified, no root_cause_analysis (ARC
   alpha: 0.8
   init_x_bias: recon
   num_iter: 200
-
+  ignore_features:
+    # Patterns apply to both anomaly scoring and ARCANA.
+    # List exact column names or use shell-style wildcards such as "windspeed*".
+    - wind_speed_59_avg
+    - power_58_avg
+    - wind_speed_60_avg
+    - wind_speed_61_avg
+    - power_62_avg
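
The `ignore_features` entries accept exact column names or shell-style wildcards. The matching code is not part of this diff, so the following is only a sketch of how such patterns could be resolved against data frame columns using the standard-library `fnmatch` module. `resolve_ignored_columns` is a hypothetical name, and the package's actual implementation may differ.

```python
from __future__ import annotations

import logging
from fnmatch import fnmatchcase

logger = logging.getLogger(__name__)


def resolve_ignored_columns(
    columns: list[str], patterns: list[str]
) -> tuple[list[str], list[str]]:
    """Return (ignored columns in their original order, patterns that matched nothing)."""
    ignored = [col for col in columns if any(fnmatchcase(col, pat) for pat in patterns)]
    unmatched = [pat for pat in patterns if not any(fnmatchcase(col, pat) for col in columns)]
    for pat in unmatched:
        # Mirrors the documented behaviour of warning about patterns without a match.
        logger.warning("ignore_features pattern %r did not match any column", pat)
    return ignored, unmatched


columns = ["wind_speed_59_avg", "wind_speed_60_avg", "power_58_avg", "sensor_1_avg"]
ignored, unmatched = resolve_ignored_columns(
    columns, ["wind_speed_*", "power_58_avg", "windspeed*"]
)
```

With these sample columns, `windspeed*` matches nothing because the column names contain an underscore (`wind_speed_...`), so a pattern copied from the documentation example would need to be adapted to the actual headers.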
diff --git a/docs/usage_examples.rst b/docs/usage_examples.rst
@@ -98,6 +98,30 @@ algorithm. An example:
 .. include:: config_example.yaml
    :literal:
 
+To keep specific sensors out of both anomaly scoring *and* ARCANA, edit the configuration YAML that you pass to
+:py:obj:`Config <energy_fault_detector.config.Config>` (for example ``configs/base_config.yaml`` or a copy of it). Inside
+the ``root_cause_analysis`` section, add or update the optional ``ignore_features`` list. Any column names listed here are
+zeroed before the anomaly score is calculated and remain fixed during ARCANA optimisation, so they never trigger
+anomalies and are never suggested as a root cause.
+
+Column names must match the data frame headers used during training or prediction. You can either provide the exact
+column name or use Unix shell-style wildcards to target multiple columns at once (for example ``windspeed*`` matches
+``windspeed_avg`` and ``windspeed_peak``). If a configured pattern does not match any columns, a warning is logged so you
+can correct the entry.
+
+.. code-block:: yaml
+
+   root_cause_analysis:
+     ignore_features:
+       - windspeed*
+       - output_power
+
+When the configuration above is used, wind speed signals such as ``windspeed_avg`` are kept for model training but their
+reconstruction error is ignored for anomaly scores and in ARCANA runs. The same YAML file can be supplied when calling
+:py:meth:`FaultDetector.fit <energy_fault_detector.fault_detector.FaultDetector.fit>` for training or
+:py:meth:`FaultDetector.predict <energy_fault_detector.fault_detector.FaultDetector.predict>` for inference so the same
+set of exclusions is applied in both stages.
+
 To update the configuration 'on the fly' (for example for hyperparameter optimization), you provide a new
 configuration dictionary via the ``update_config`` method: