Open
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
This PR shifts “reference dataset” handling away from the data loader/orchestrator and into metric-specific configuration, by removing the orchestrator’s automatic reference loading and updating metric assess() signatures accordingly.
Changes:
- Removed orchestrator-managed reference dataset loading/passing; metrics are expected to source reference data via metric config instead.
- Added a metric config model for
validity_outOfVocabularyto accept a reference vocabulary (DataFrame/set/None). - Performed small refactors in profiling utilities (typing tweaks,
to_numpy()usage, serialization import changes).
Reviewed changes
Copilot reviewed 21 out of 21 changed files in this pull request and generated 15 comments.
Show a summary per file
| File | Description |
|---|---|
| metis/utils/data_profiling/single_column/value_distribution/quartiles.py | Adjusted return typing for quartiles/IQR. |
| metis/utils/data_profiling/single_column/value_distribution/histogram.py | Switched quantile extraction to to_numpy(). |
| metis/utils/data_profiling/single_column/patterns_and_data_types/numeric_precision.py | Refactored decimal-string conversion; introduced an accidental duplicate-return bug. |
| metis/utils/data_profiling/single_column/domain_classification/domain.py | Minor cleanup; changed Series name handling. |
| metis/utils/data_profiling/single_column/cardinalities/null_values.py | Simplified return values for null stats (but changed scalar types). |
| metis/utils/data_profiling/single_column/cardinalities/distinct_values.py | Simplified return values (but changed scalar types). |
| metis/profiling/importers/jaccard_importer.py | Removed unused import; tightened profile typing. |
| metis/profiling/data_profile_manager.py | Moved numpy/pandas/dataketch imports to module scope; simplified deserialization path. |
| metis/metric/validity/validity_outOfVocabulary_config.py | New config model to supply the reference vocabulary. |
| metis/metric/validity/validity_outOfVocabulary.py | Loads reference vocabulary from metric config instead of an assess() arg. |
| metis/metric/timeliness/timeliness_heinrich.py | Removed reference from signature (docstring not updated). |
| metis/metric/minimality/minimality_duplicateCount.py | Removed unused reference from signature. |
| metis/metric/metric.py | Removed reference from base signature (docstring still mentions it). |
| metis/metric/correctness/correctness_heinrich.py | Removed reference from signature but implementation still uses reference (runtime break). |
| metis/metric/consistency/consistency_ruleBasedPipino.py | Removed reference from signature (docstring not updated). |
| metis/metric/consistency/consistency_ruleBasedHinrichs.py | Removed reference from signature (docstring not updated). |
| metis/metric/consistency/consistency_countFDViolations.py | Removed unused reference from signature. |
| metis/metric/completeness/completeness_nullRatio.py | Removed reference from signature (docstring not updated). |
| metis/metric/completeness/completeness_nullAndDMVRatio.py | Removed reference from signature (docstring not updated). |
| metis/dq_orchestrator.py | Removed reference dataframe loading and stopped passing reference into metrics. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
metis/utils/data_profiling/single_column/cardinalities/null_values.py
Outdated
Show resolved
Hide resolved
metis/utils/data_profiling/single_column/value_distribution/quartiles.py
Outdated
Show resolved
Hide resolved
metis/utils/data_profiling/single_column/domain_classification/domain.py
Outdated
Show resolved
Hide resolved
2971390 to
726b98f
Compare
Contributor
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 21 out of 21 changed files in this pull request and generated no new comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Right now, if a metric requires a reference dataset (e.g., ground truth), it must be specified in the data loading config. The loader will then use the loading parameters of the main dataset to load the reference dataset. However, 1. not every metric requires a reference dataset and 2. this way of passing this dataset is inflexible (see the validity_outOfVocabulary metric that may receive a Set of words instead of a dataset) and lacks additional csv parsing parameters.
Thus, this PR removes the automatic loading of reference datasets and instead requires those to be defined by a metric config.
Additionally, a few python typing issues in the profiling submodule have been fixed.