Refactor for Extended Isolation Forest (v4.0) – Branch Summary
Overview
This branch reorganizes our Isolation Forest codebase into a clearer, more extensible structure and bumps the version to 4.0 due to non-backward-compatible changes. It also standardizes our ScalaDoc formatting. These changes are setting the stage for a future addition of Extended Isolation Forest capability.
Key Changes
New Core Package
SharedTrainLogic
Centralizes tree-building and sampling logic that can be shared across different Isolation Forest variants.
IsolationTreeBase & NodeBase
Introduced IsolationTreeBase.scala and new NodeBase traits to unify and simplify the structure of tree logic.
File Moves
Moved core functionality (e.g., BaggedPoint.scala, Utils.scala, and model read/write code) into a dedicated core package.
Refined Parameters
Param Trait Renaming
IsolationForestParams was renamed to IsolationForestParamsBase for a more modular design.
Duplicate Logic Removal
Eliminated redundant logic in IsolationForest.scala by delegating to SharedTrainLogic.
Build & Testing Updates
versions.gradle Introduction
Centralized Spark/Scala version management to simplify cross-version builds.
Directory Restructuring
Adjusted test package layout, placing relevant tests under the core directory.
Integration Tests for ONNX Conversion
isolation-forest-onnx module) loads those artifacts, converts the model to ONNX, and compares the ONNX inferences to Spark’s scores.
Version Bump to 4.0.*
Reflects the potential for non-backward-compatible changes introduced by this refactor.
ScalaDoc Standardization
Updated all ScalaDoc comments to adhere to a consistent multi-line style, improving readability and consistency.
Validation
Comparison with Expected Values from Literature
ONNX Integration Tests
As described above, we now have ONNX integration tests that verify end-to-end compatibility. Spark inference and inference using the same model converted to ONNX are in close agreement. On the mammography.csv dataset, we see the following when aggregating over all scored instances:
Spark vs ONNX: maxDiff=0.000000238154, minDiff=0.000000000001 avgDiff=0.000000089145, medianDiff=0.000000064966.
Conclusion:
The library’s internal refactor and the new ONNX integration tests provide greater clarity and stronger test coverage, and they confirm strong agreement between Spark-based Isolation Forest scores and ONNX inference.
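For readers who want to see how difference statistics like the Spark vs ONNX line under ONNX Integration Tests could be computed, here is a minimal sketch. It is not the integration test from this branch: the function name diff_summary and the sample scores are hypothetical, and it assumes the two score lists are already aligned instance by instance.

```python
from statistics import mean, median


def diff_summary(spark_scores, onnx_scores):
    """Summarize absolute per-instance differences between two aligned score lists.

    Hypothetical helper for illustration; it is not part of the library.
    """
    if len(spark_scores) != len(onnx_scores):
        raise ValueError("score lists must have the same length")
    if not spark_scores:
        raise ValueError("score lists must not be empty")
    # Absolute difference for each scored instance.
    diffs = [abs(s - o) for s, o in zip(spark_scores, onnx_scores)]
    return {
        "maxDiff": max(diffs),
        "minDiff": min(diffs),
        "avgDiff": mean(diffs),
        "medianDiff": median(diffs),
    }


# Made-up scores for illustration only; these are not values from mammography.csv.
spark = [0.50, 0.25, 0.75, 1.00]
onnx = [0.50, 0.375, 0.75, 0.75]

summary = diff_summary(spark, onnx)
print(" ".join(f"{name}={value:.12f}" for name, value in summary.items()))
```

A test built on this idea would typically assert that maxDiff stays below a chosen tolerance; the appropriate tolerance is a judgment call and is not specified in this summary.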