Load reference datasets through metric config by jb3rndt · Pull Request #12 · HPI-Information-Systems/Metis

jb3rndt · 2026-03-30T15:37:04Z

Right now, if a metric requires a reference dataset (e.g., ground truth), it must be specified in the data loading config. The loader then uses the loading parameters of the main dataset to load the reference dataset. However, (1) not every metric requires a reference dataset, and (2) this way of passing the dataset is inflexible (see the validity_outOfVocabulary metric, which may receive a set of words instead of a dataset) and does not allow separate CSV parsing parameters for the reference dataset.

Thus, this PR removes the automatic loading of reference datasets and instead requires those to be defined by a metric config.
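
To make the idea concrete, here is a minimal sketch of a metric config that carries its own reference dataset description, including CSV parsing parameters that are independent of the main dataset. The names `ReferenceCsvConfig`, `MetricConfig`, and `load_reference` are hypothetical and are not taken from the Metis code base; the sketch also uses the standard library `csv` module rather than whatever loader the project actually uses.

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass(frozen=True)
class ReferenceCsvConfig:
    """Hypothetical per-metric description of a reference CSV file."""
    path: str
    delimiter: str = ","
    quotechar: str = '"'
    encoding: str = "utf-8"


@dataclass(frozen=True)
class MetricConfig:
    """Hypothetical metric config; the reference is optional by design."""
    name: str
    reference: Optional[ReferenceCsvConfig] = None


def load_reference(config: MetricConfig) -> Optional[list]:
    """Return the reference rows for one metric, or None if it declares none."""
    ref = config.reference
    if ref is None:
        # Metrics that need no ground truth simply skip loading.
        return None
    with Path(ref.path).open(newline="", encoding=ref.encoding) as handle:
        reader = csv.DictReader(
            handle, delimiter=ref.delimiter, quotechar=ref.quotechar
        )
        return list(reader)
```

With this shape, a metric that needs ground truth can point at a semicolon-separated file even when the main dataset is comma-separated, while a metric without a reference leaves the field unset.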

Additionally, a few Python typing issues in the profiling submodule have been fixed.

Copilot

Pull request overview

This PR shifts “reference dataset” handling away from the data loader/orchestrator and into metric-specific configuration, by removing the orchestrator’s automatic reference loading and updating metric assess() signatures accordingly.

Changes:

- Removed orchestrator-managed reference dataset loading/passing; metrics are expected to source reference data via metric config instead.
- Added a metric config model for validity_outOfVocabulary to accept a reference vocabulary (DataFrame/set/None).
- Performed small refactors in profiling utilities (typing tweaks, to_numpy() usage, serialization import changes).
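
As a rough illustration of the second change, the sketch below shows a config object that supplies a reference vocabulary to an out-of-vocabulary check. `OutOfVocabularyConfig` and `out_of_vocabulary_ratio` are hypothetical names, the DataFrame case is omitted to keep the example dependency-free, and raising on a missing vocabulary is an assumption of this sketch; the actual metric may fall back to a default instead.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class OutOfVocabularyConfig:
    """Hypothetical stand-in for the metric's config model."""
    reference: Optional[Iterable[str]] = None

    def vocabulary(self) -> frozenset:
        """Normalize the configured reference into a frozenset of words."""
        if self.reference is None:
            # Assumption of this sketch: no implicit default vocabulary.
            raise ValueError("an explicit reference vocabulary is required")
        return frozenset(self.reference)


def out_of_vocabulary_ratio(values: Iterable[str], config: OutOfVocabularyConfig) -> float:
    """Fraction of values that do not occur in the reference vocabulary."""
    vocab = config.vocabulary()
    items = list(values)
    if not items:
        return 0.0
    misses = sum(1 for value in items if value not in vocab)
    return misses / len(items)
```

The metric receives only the data plus its own config, so the orchestrator no longer needs to know whether a reference exists or what form it takes.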

Reviewed changes

Copilot reviewed 21 out of 21 changed files in this pull request and generated 15 comments.


| File | Description |
| --- | --- |
| metis/utils/data_profiling/single_column/value_distribution/quartiles.py | Adjusted return typing for quartiles/IQR. |
| metis/utils/data_profiling/single_column/value_distribution/histogram.py | Switched quantile extraction to `to_numpy()`. |
| metis/utils/data_profiling/single_column/patterns_and_data_types/numeric_precision.py | Refactored decimal-string conversion; introduced an accidental duplicate-return bug. |
| metis/utils/data_profiling/single_column/domain_classification/domain.py | Minor cleanup; changed Series name handling. |
| metis/utils/data_profiling/single_column/cardinalities/null_values.py | Simplified return values for null stats (but changed scalar types). |
| metis/utils/data_profiling/single_column/cardinalities/distinct_values.py | Simplified return values (but changed scalar types). |
| metis/profiling/importers/jaccard_importer.py | Removed unused import; tightened profile typing. |
| metis/profiling/data_profile_manager.py | Moved numpy/pandas/datasketch imports to module scope; simplified deserialization path. |
| metis/metric/validity/validity_outOfVocabulary_config.py | New config model to supply the reference vocabulary. |
| metis/metric/validity/validity_outOfVocabulary.py | Loads reference vocabulary from metric config instead of an `assess()` arg. |
| metis/metric/timeliness/timeliness_heinrich.py | Removed `reference` from signature (docstring not updated). |
| metis/metric/minimality/minimality_duplicateCount.py | Removed unused `reference` from signature. |
| metis/metric/metric.py | Removed `reference` from base signature (docstring still mentions it). |
| metis/metric/correctness/correctness_heinrich.py | Removed `reference` from signature but implementation still uses `reference` (runtime break). |
| metis/metric/consistency/consistency_ruleBasedPipino.py | Removed `reference` from signature (docstring not updated). |
| metis/metric/consistency/consistency_ruleBasedHinrichs.py | Removed `reference` from signature (docstring not updated). |
| metis/metric/consistency/consistency_countFDViolations.py | Removed unused `reference` from signature. |
| metis/metric/completeness/completeness_nullRatio.py | Removed `reference` from signature (docstring not updated). |
| metis/metric/completeness/completeness_nullAndDMVRatio.py | Removed `reference` from signature (docstring not updated). |
| metis/dq_orchestrator.py | Removed reference dataframe loading and stopped passing `reference` into metrics. |
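
The "runtime break" noted for correctness_heinrich.py is the kind of failure that only appears when the function is called. The sketch below reproduces that failure mode in isolation; the three functions are hypothetical, use a simple match ratio, and do not reproduce the actual Heinrich correctness metric or its code.

```python
def assess_before(data, reference):
    """Old shape: the reference is passed in by the caller."""
    matches = sum(1 for a, b in zip(data, reference) if a == b)
    return matches / len(data)


def assess_after(data):
    """Parameter removed, body unchanged: `reference` is now an undefined name."""
    matches = sum(1 for a, b in zip(data, reference) if a == b)
    return matches / len(data)


def assess_fixed(data, config):
    """One possible fix: read the reference from the metric's own config.

    `config` is a plain dict here purely for illustration.
    """
    reference = config["reference"]
    matches = sum(1 for a, b in zip(data, reference) if a == b)
    return matches / len(data)
```

Python resolves names at call time, so defining `assess_after` succeeds and the `NameError` surfaces only on the first call, provided no module-level `reference` happens to exist. A static checker or a test that exercises the metric would catch this before it reaches users.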


metis/metric/completeness/completeness_nullAndDMVRatio.py

metis/metric/consistency/consistency_ruleBasedPipino.py

metis/utils/data_profiling/single_column/cardinalities/null_values.py

metis/utils/data_profiling/single_column/value_distribution/quartiles.py

metis/utils/data_profiling/single_column/domain_classification/domain.py

metis/metric/validity/validity_outOfVocabulary.py

metis/metric/timeliness/timeliness_heinrich.py

metis/metric/consistency/consistency_ruleBasedHinrichs.py

metis/utils/data_profiling/single_column/cardinalities/null_values.py

metis/metric/completeness/completeness_nullRatio.py


Copilot

Pull request overview

Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.


Copilot AI review requested due to automatic review settings March 30, 2026 15:37

Copilot started reviewing on behalf of jb3rndt March 30, 2026 15:37

Copilot AI reviewed Mar 30, 2026


jb3rndt added 2 commits April 1, 2026 19:38

Remove reference loading so this can be more freely handled in each m…

dce1b89

…etrics individual config

Resolve typing issues in profiling module

726b98f

jb3rndt force-pushed the feat/reference-loading branch from 2971390 to 726b98f Compare April 1, 2026 17:38

jb3rndt requested a review from Copilot April 1, 2026 17:43

Copilot started reviewing on behalf of jb3rndt April 1, 2026 17:44

Copilot AI reviewed Apr 1, 2026


jb3rndt requested a review from lisehr April 3, 2026 08:37


jb3rndt wants to merge 2 commits into main from feat/reference-loading
