Skip to content

Load reference datasets through metric config#12

Open
jb3rndt wants to merge 2 commits intomainfrom
feat/reference-loading
Open

Load reference datasets through metric config#12
jb3rndt wants to merge 2 commits intomainfrom
feat/reference-loading

Conversation

@jb3rndt
Copy link
Copy Markdown
Collaborator

@jb3rndt jb3rndt commented Mar 30, 2026

Right now, if a metric requires a reference dataset (e.g., ground truth), it must be specified in the data loading config. The loader will then use the loading parameters of the main dataset to load the reference dataset. However, 1. not every metric requires a reference dataset and 2. this way of passing this dataset is inflexible (see the validity_outOfVocabulary metric that may receive a Set of words instead of a dataset) and lacks additional csv parsing parameters.

Thus, this PR removes the automatic loading of reference datasets and instead requires those to be defined by a metric config.

Additionally, a few python typing issues in the profiling submodule have been fixed.

Copilot AI review requested due to automatic review settings March 30, 2026 15:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR shifts “reference dataset” handling away from the data loader/orchestrator and into metric-specific configuration, by removing the orchestrator’s automatic reference loading and updating metric assess() signatures accordingly.

Changes:

  • Removed orchestrator-managed reference dataset loading/passing; metrics are expected to source reference data via metric config instead.
  • Added a metric config model for validity_outOfVocabulary to accept a reference vocabulary (DataFrame/set/None).
  • Performed small refactors in profiling utilities (typing tweaks, to_numpy() usage, serialization import changes).

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 15 comments.

Show a summary per file
File Description
metis/utils/data_profiling/single_column/value_distribution/quartiles.py Adjusted return typing for quartiles/IQR.
metis/utils/data_profiling/single_column/value_distribution/histogram.py Switched quantile extraction to to_numpy().
metis/utils/data_profiling/single_column/patterns_and_data_types/numeric_precision.py Refactored decimal-string conversion; introduced an accidental duplicate-return bug.
metis/utils/data_profiling/single_column/domain_classification/domain.py Minor cleanup; changed Series name handling.
metis/utils/data_profiling/single_column/cardinalities/null_values.py Simplified return values for null stats (but changed scalar types).
metis/utils/data_profiling/single_column/cardinalities/distinct_values.py Simplified return values (but changed scalar types).
metis/profiling/importers/jaccard_importer.py Removed unused import; tightened profile typing.
metis/profiling/data_profile_manager.py Moved numpy/pandas/dataketch imports to module scope; simplified deserialization path.
metis/metric/validity/validity_outOfVocabulary_config.py New config model to supply the reference vocabulary.
metis/metric/validity/validity_outOfVocabulary.py Loads reference vocabulary from metric config instead of an assess() arg.
metis/metric/timeliness/timeliness_heinrich.py Removed reference from signature (docstring not updated).
metis/metric/minimality/minimality_duplicateCount.py Removed unused reference from signature.
metis/metric/metric.py Removed reference from base signature (docstring still mentions it).
metis/metric/correctness/correctness_heinrich.py Removed reference from signature but implementation still uses reference (runtime break).
metis/metric/consistency/consistency_ruleBasedPipino.py Removed reference from signature (docstring not updated).
metis/metric/consistency/consistency_ruleBasedHinrichs.py Removed reference from signature (docstring not updated).
metis/metric/consistency/consistency_countFDViolations.py Removed unused reference from signature.
metis/metric/completeness/completeness_nullRatio.py Removed reference from signature (docstring not updated).
metis/metric/completeness/completeness_nullAndDMVRatio.py Removed reference from signature (docstring not updated).
metis/dq_orchestrator.py Removed reference dataframe loading and stopped passing reference into metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@jb3rndt jb3rndt requested a review from lisehr April 3, 2026 08:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants