feat: SQL-backed feature validation framework #136
base: main
Conversation
…idations

- Add validation interfaces (FeatureValidationError, FeatureValidator, SqlFeatureCorrector)
- Implement several validations (LawAndOrderValidator)
- Retrofit the existing checks in SqlBackedFeature.compute and in feature_from_query to raise the new FeatureValidationError
- Add an algorithm (validator_loop) that tries to fix validation problems using a callback

Fixes #376.
…eded for categorical features to work correctly
Force-pushed from 59489f2 to a7291d5.
…p schemas in it, so that they are definitely deleted when the process exits.
@leonidb I fixed the categorical feature bug, and made the change I wanted to make to DuckdbManager. This PR is now ready for review.
```
- No errors are raised
- Feature does not always return null, or always NaN, or always the same value
- Feature does not access input or secondary columns that it does not declare
- Output is the same per row regarding of the (natural) ordering of the input table
```
typo: regardless
```
The index_column_name MAY NOT shadow a column that appears in the 'real' primary input table, even if the query
doesn't use that column. This restriction may be lifted in the future.

Note that `primary_table` is a string, not a DuckdbName, because it is register()ed with the connection
```
NIT: registered*
```
Does not check if the feature depends on the natural ordering of any of the secondary tables.
Reordering the secondary tables without making a full copy of each or defining a query that reads and fully sorts each
is an unsolved problem. And making a copy of each of them is too expensive in the general case. So we don't validate it
for now.
```
is this even a requirement?
```python
async def _test_row_by_row(self, feature: Feature, input: Dataset, conn: DuckDBPyConnection,
                           all_rows_result: pl.Series) -> None:
    row_indexes = set(random.Random(42).sample(range(input.data.height),
                                               min(self.rows_to_compute_individually, input.data.height)))
```
should we have a minimum input.data.height for the validation? We don't want validation to pass just because it was called on an input of height 1.
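A minimal sketch of the kind of guard being suggested (the threshold and message are assumptions, not from this PR):

```python
# Hypothetical guard: refuse to validate inputs too small for the
# row-by-row check to be meaningful. The threshold is an assumption.
MIN_ROWS_FOR_VALIDATION = 10

if input.data.height < MIN_ROWS_FOR_VALIDATION:
    raise FeatureValidationError(
        f"need at least {MIN_ROWS_FOR_VALIDATION} rows to validate, "
        f"got {input.data.height}")
```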
```python
args = tuple(args_by_name[name] for name in feature.params.names)
row_result = await self._acompute(feature, args, conn)
if row_result != all_rows_result[index] and not \
        (isinstance(row_result, float) and math.isnan(row_result) and math.isnan(all_rows_result[index])):
```
why the special case? Is there a different representation of NaN between acompute and acompute_batch?
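For context, the branch is most likely there not because the two code paths represent NaN differently, but because NaN never compares equal to itself; a minimal illustration:

```python
import math

# IEEE-754 NaN compares unequal to everything, including itself, so the
# plain `row_result != all_rows_result[index]` check would report a
# spurious mismatch whenever both sides are NaN.
nan = float("nan")
assert nan != nan       # equality alone can never detect "both are NaN"
assert math.isnan(nan)  # hence the explicit math.isnan() branch
```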
```python
for index in row_indexes:
    args_by_name = input.data.row(index, named=True)
    args = tuple(args_by_name[name] for name in feature.params.names)
    row_result = await self._acompute(feature, args, conn)
```
why not in parallel?
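One hedged sketch of how the loop could be parallelized; it assumes _acompute is safe to call concurrently on the same connection, which may not hold for a single DuckDB connection (and could be why the original is sequential):

```python
import asyncio

# Hypothetical parallel variant of the loop above; all names besides
# asyncio.gather come from the diff.
async def compute_one(index: int):
    args_by_name = input.data.row(index, named=True)
    args = tuple(args_by_name[name] for name in feature.params.names)
    return index, await self._acompute(feature, args, conn)

row_results = await asyncio.gather(*(compute_one(i) for i in row_indexes))
```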
```python
max_global_retries: int
max_local_retries: int
corrector: SqlFeatureCorrector
validators: Sequence[FeatureValidator]
```
apparently research uses a different corrector per validator - maybe we can put the corrector inside the FeatureValidator class? What do you think?

This design also works, but the corrector needs to support all errors - I do not know which is better...
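A sketch of the alternative being discussed, with hypothetical shapes (the real interfaces are the ones in this PR):

```python
from typing import Protocol

# Hypothetical shape of the proposal: each validator carries the corrector
# that can fix the errors it raises, instead of the loop holding one
# corrector that must handle every validator's errors.
class FeatureValidator(Protocol):
    corrector: "SqlFeatureCorrector"

    async def validate(self, feature: "Feature") -> None:
        """Raises FeatureValidationError if the feature is invalid."""
```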
```python
    PSEUDOCOLUMN = 2
    NONE = 3


def test_rowid_nature(conn: DuckDBPyConnection, name: str) -> RowidNature:
```
where is this used?
What does this PR do?
Add a framework and implementation for validating both black-box and SQL-query-based features.
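As a rough sketch of the loop's shape (hypothetical signatures inferred from the names in this PR; the real implementation is in validator_loop.py and also has a global retry budget):

```python
# Hypothetical sketch: run each validator and, on FeatureValidationError,
# ask the corrector callback for a repaired feature, then re-validate.
async def validator_loop(feature, validators, corrector, max_local_retries):
    for validator in validators:
        for attempt in range(max_local_retries + 1):
            try:
                await validator.validate(feature)
                break  # this validator passed; move to the next one
            except FeatureValidationError as error:
                if attempt == max_local_retries:
                    raise  # local retry budget exhausted
                feature = await corrector.correct(feature, error)
    return feature
```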
Changes
- Add validation interfaces (`FeatureValidationError`, `FeatureValidator`, `SqlFeatureCorrector`)
- Implement several validations (`LawAndOrderValidator`)
- Retrofit the existing checks in `SqlBackedFeature.compute` and in `feature_from_query` to raise the new `FeatureValidationError`
- Add an algorithm (`validator_loop.py`) that tries to fix validation problems using a callback

Remaining work and other notes
- A new wrapper function conforming to the `FeatureFromQueryCtor` protocol that would, I assume, call `categorical_feature_from_query` and then (if it succeeded) call the resulting feature on some dataset to collect the values it returns and `attrs.evolve()` the feature to have a correct categories list. Without this, subsequent validations will fail because `categorical_feature_from_query` by itself returns a feature with a deliberately wrong categories list, since it doesn't compute it on any data. Any errors raised in this new wrapper function would need to be surfaced as FeatureValidationErrors. A sketch of what I have in mind follows below.

Please let me know if this is a reasonable plan or if something else is needed.
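Roughly (hypothetical wrapper name and compute API; only `attrs.evolve` and `categorical_feature_from_query` are from this PR):

```python
import attrs

# Hypothetical wrapper for the plan above: build the categorical feature,
# run it once to discover the categories it actually produces, then evolve
# the feature so its categories list is correct before validation.
async def categorical_feature_with_real_categories(query, dataset):
    feature = categorical_feature_from_query(query)  # categories wrong by design
    values = await feature.compute(dataset)          # assumed compute API
    categories = sorted({v for v in values if v is not None})
    return attrs.evolve(feature, categories=categories)
```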
- (The validator takes a `rows_to_compute_individually` argument because doing it for all rows is likely to be expensive.)

Related Issues
Fixes SparkBeyond/aoa#376