@jverbus jverbus commented Mar 8, 2025

Refactor for Extended Isolation Forest (v4.0) – Branch Summary

Overview

This branch reorganizes our Isolation Forest codebase into a clearer, more extensible structure and bumps the version to 4.0 because the refactor may introduce non-backward-compatible changes. It also standardizes our ScalaDoc formatting. These changes set the stage for adding Extended Isolation Forest support in the future.

Key Changes

New Core Package

  • SharedTrainLogic
    Centralizes tree-building and sampling logic that can be shared across different Isolation Forest variants.

  • IsolationTreeBase & NodeBase
    Introduced IsolationTreeBase.scala and new NodeBase traits to unify and simplify the structure of tree logic.

  • File Moves
    Moved core functionality—e.g., BaggedPoint.scala, Utils.scala, and model read/write code—into a dedicated core package.
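
To illustrate why centralizing tree-building helps, here is a minimal conceptual sketch in Python (not the library's Scala code). It assumes the only thing that differs between Isolation Forest variants is how a split is chosen, so a single `build_tree` routine can serve all of them. Every name here (`ExternalNode`, `InternalNode`, `build_tree`, `axis_aligned_split`, `path_length`) is hypothetical and chosen for illustration; the average path length `c(n)` follows the formula in Liu et al.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Union

Point = Sequence[float]
GoesLeft = Callable[[Point], bool]
# A split chooser inspects the points at a node and returns a routing test.
SplitChooser = Callable[[List[Point], random.Random], GoesLeft]


@dataclass
class ExternalNode:
    """Leaf: records how many training points ended up here."""
    num_instances: int


@dataclass
class InternalNode:
    """Branch: routes a point left or right via a variant-specific test."""
    left: "Node"
    right: "Node"
    goes_left: GoesLeft


Node = Union[ExternalNode, InternalNode]


def avg_path_length(n: int) -> float:
    """c(n) from Liu et al.: 2*H(n-1) - 2*(n-1)/n, with H(i) ~ ln(i) + Euler's constant."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * (math.log(n - 1) + 0.5772156649) - 2.0 * (n - 1) / n


def build_tree(points: List[Point], height_limit: int,
               choose_split: SplitChooser, rng: random.Random,
               depth: int = 0) -> Node:
    """Shared tree-building loop; only choose_split is variant-specific."""
    if depth >= height_limit or len(points) <= 1:
        return ExternalNode(len(points))
    goes_left = choose_split(points, rng)
    left = [p for p in points if goes_left(p)]
    right = [p for p in points if not goes_left(p)]
    if not left or not right:
        # Simplification for this sketch: stop instead of retrying another split.
        return ExternalNode(len(points))
    return InternalNode(
        build_tree(left, height_limit, choose_split, rng, depth + 1),
        build_tree(right, height_limit, choose_split, rng, depth + 1),
        goes_left,
    )


def axis_aligned_split(points: List[Point], rng: random.Random) -> GoesLeft:
    """Standard Isolation Forest split: random feature, random threshold."""
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    value = rng.uniform(lo, hi)
    return lambda p: p[dim] < value


def path_length(node: Node, point: Point, depth: int = 0) -> float:
    """Depth at which the point lands, plus c(n) for the unbuilt subtree."""
    if isinstance(node, ExternalNode):
        return depth + avg_path_length(node.num_instances)
    child = node.left if node.goes_left(point) else node.right
    return path_length(child, point, depth + 1)
```

Under this assumption, a variant with a different split rule (for example, the random-hyperplane splits used by Extended Isolation Forest) would only need to supply another `SplitChooser`, while tree construction and path-length evaluation stay shared.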

Refined Parameters

  • Param Trait Renaming
    IsolationForestParams was renamed to IsolationForestParamsBase for a more modular design.

  • Duplicate Logic Removal
    Eliminated redundant logic in IsolationForest.scala by delegating to SharedTrainLogic.

Build & Testing Updates

  • versions.gradle Introduction
    Centralized Spark/Scala version management to simplify cross-version builds.

  • Directory Restructuring
    Adjusted test package layout, placing relevant tests under the core directory.

  • Integration Tests for ONNX Conversion

    • Two-part approach:
      1. A Scala test trains a model on the mammography dataset, scores it, and exports both the Avro+metadata model and the flattened CSV of Spark outlier scores.
      2. A separate Python test (in the isolation-forest-onnx module) loads those artifacts, converts the model to ONNX, and compares the ONNX inferences to Spark’s scores.
    • This end-to-end test ensures that ONNX predictions match Spark predictions to within a small tolerance (<1e-5).
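
The comparison step of the Python test can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the actual test in the isolation-forest-onnx module: the CSV column name `outlierScore`, the helper names `read_scores` and `assert_scores_match`, and the inline sample data are all hypothetical, and the ONNX inference itself is represented by a plain list of scores.

```python
import csv
import io
from typing import List, Sequence


def read_scores(csv_text: str, score_column: str = "outlierScore") -> List[float]:
    """Parse one score per row from CSV text (column name is an assumption)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [float(row[score_column]) for row in reader]


def assert_scores_match(spark_scores: Sequence[float],
                        onnx_scores: Sequence[float],
                        tolerance: float = 1e-5) -> None:
    """Fail if any pair of scores differs by `tolerance` or more."""
    if len(spark_scores) != len(onnx_scores):
        raise ValueError("score lists must have the same length")
    for i, (spark, onnx) in enumerate(zip(spark_scores, onnx_scores)):
        diff = abs(spark - onnx)
        if diff >= tolerance:
            raise AssertionError(
                f"row {i}: Spark={spark} ONNX={onnx} diff={diff} >= {tolerance}"
            )


# Stand-in data: in the real test these would come from the exported CSV
# and from running the converted model.
spark_csv = "outlierScore\n0.412345\n0.587654\n0.731200\n"
onnx_scores = [0.4123451, 0.5876539, 0.7312002]

spark_scores = read_scores(spark_csv)
assert_scores_match(spark_scores, onnx_scores)  # passes: all diffs are below 1e-5
```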

Version Bump to 4.0.*

Reflects the potential for non-backward-compatible changes introduced by this refactor.

ScalaDoc Standardization

Updated all ScalaDoc comments to a consistent multi-line style for better readability.

Validation

Comparison with Expected Values from Literature

| Dataset | Expected Mean AUROC (from Liu et al.) | New Observed Mean AUROC |
| --- | --- | --- |
| Http (KDDCUP99) | 1.00 | 0.99967 ± 0.00011 |
| ForestCover | 0.88 | 0.882 ± 0.006 |
| Mulcross | 0.97 | 0.9910 ± 0.0009 |
| Smtp (KDDCUP99) | 0.88 | 0.9099 ± 0.0014 |
| Shuttle | 1.00 | 0.99708 ± 0.00018 |
| Mammography | 0.86 | 0.8649 ± 0.0015 |
| Annthyroid | 0.82 | 0.813 ± 0.004 |
| Satellite | 0.71 | 0.717 ± 0.008 |
| Pima | 0.67 | 0.668 ± 0.004 |
| Breastw | 0.99 | 0.9864 ± 0.0003 |
| Arrhythmia | 0.80 | 0.8064 ± 0.0019 |
| Ionosphere | 0.85 | 0.8443 ± 0.0002 |
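
For readers who want to reproduce figures of this kind, here is a minimal sketch of how a mean AUROC and a spread could be computed over repeated runs. It is an assumption-laden illustration: the description above does not say how many runs were used or whether the ± value is a standard deviation or a standard error, so this sketch simply uses the sample standard deviation. The function names `auroc` and `mean_and_spread` are hypothetical.

```python
import statistics
from typing import Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the probability that a random outlier (label 1) scores
    higher than a random inlier (label 0); ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one outlier and one inlier")
    wins = 0.0
    for p in pos:  # O(len(pos) * len(neg)); fine for a sketch
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_and_spread(values: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation (assumed meaning of the ± value)."""
    return statistics.fmean(values), statistics.stdev(values)


# One AUROC per training run (made-up labels and scores for illustration).
labels = [0, 0, 1, 1]
run_scores = [
    [0.10, 0.40, 0.35, 0.80],
    [0.20, 0.30, 0.60, 0.70],
]
per_run = [auroc(labels, s) for s in run_scores]
mean, spread = mean_and_spread(per_run)
```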

ONNX Integration Tests

As described above, we now have ONNX integration tests that verify end-to-end compatibility. Spark inference and inference using the same model converted to ONNX are in close agreement. On the mammography.csv dataset, aggregating over all scored instances, we see the following:

Spark vs ONNX: maxDiff=0.000000238154, minDiff=0.000000000001 avgDiff=0.000000089145, medianDiff=0.000000064966.
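
A summary line like the one above could be produced along these lines. This is a minimal sketch assuming the statistics are taken over per-instance absolute differences (the description does not state this explicitly); the function name `diff_summary` is hypothetical.

```python
import statistics
from typing import Dict, Sequence


def diff_summary(spark_scores: Sequence[float],
                 onnx_scores: Sequence[float]) -> Dict[str, float]:
    """Summarize absolute per-instance differences between two score lists."""
    if len(spark_scores) != len(onnx_scores):
        raise ValueError("score lists must have the same length")
    diffs = [abs(s - o) for s, o in zip(spark_scores, onnx_scores)]
    return {
        "maxDiff": max(diffs),
        "minDiff": min(diffs),
        "avgDiff": statistics.fmean(diffs),
        "medianDiff": statistics.median(diffs),
    }


# Made-up scores for illustration only.
summary = diff_summary([0.5, 0.25, 0.75], [0.5, 0.5, 0.25])
line = "Spark vs ONNX: " + ", ".join(f"{k}={v:.12f}" for k, v in summary.items())
```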


Conclusion:
The library’s internal refactor and the new ONNX integration tests provide greater clarity and stronger test coverage, and they confirm strong agreement between Spark-based Isolation Forest scores and ONNX inference.

jverbus added 10 commits March 7, 2025 01:45
…ing files to their place in the new structure and refactored the code so that it builds and tests pass.
… bump to v4.0

Core extraction & file renaming
- Introduce IsolationTreeBase.scala and SharedTrainLogic.scala in core to unify tree building, sampling, and threshold logic.
- Rename IsolationForestParams → IsolationForestParamsBase.
- Move BaggedPoint.scala, Utils.scala, and model read/write code under core package.
- Nodes.scala adopts new NodeBase traits for external/internal nodes.
- Remove duplicate logic from IsolationForest.scala in favor of SharedTrainLogic.

Build & testing updates
- Introduce versions.gradle for uniform Scala & Spark version management.
- Adjust test package layout for new core directory structure.

Bump version to 4.0.*
- The refactor potentially introduces non-backward-compatible changes, requiring a major version increment.
- Split the integration tests into two parts:
  - A Scala test that trains an Isolation Forest on the mammography dataset, scores the data, and exports both the model (in Avro/metadata format) and the flattened CSV scores.
  - A Python test that loads the exported artifacts, converts the model to ONNX, runs inference via onnxruntime, and compares the outlier scores against Spark's.
- Set environment variables for Spark/Scala versions to coordinate across tests.

This commit adds critical integration test coverage and ensures consistency between Spark and ONNX predictions.
…n before isolation-forest-onnx (python) tests.

@angie-z angie-z left a comment

Great work!

@jverbus jverbus merged commit cdc4bf4 into master Mar 8, 2025
13 checks passed
@jverbus jverbus deleted the refactor_for_eif branch March 10, 2025 00:05