
Conversation

@danarmak (Collaborator) commented Dec 23, 2025

What does this PR do?

Add a framework and implementation for validating both black-box and SQL-query-based features.

Changes

  • Add validation interfaces (FeatureValidationError, FeatureValidator, SqlFeatureCorrector); a rough sketch of their shape follows this list
  • Implement several validations (LawAndOrderValidator)
  • Retrofit the existing checks in SqlBackedFeature.compute and in feature_from_query to raise the new FeatureValidationError
  • Add an algorithm (validator_loop.py) that tries to fix validation problems using a callback
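To make the list above concrete, here is a rough sketch of how the new interfaces might be shaped. The class and protocol names come from this PR; the method names and signatures are assumptions and the actual code may differ:

```python
from typing import Protocol


class FeatureValidationError(Exception):
    """Raised when a feature fails validation; the message is meant to be
    actionable by a corrector."""


class FeatureValidator(Protocol):
    async def validate(self, feature: "Feature", input: "Dataset",
                       conn: "DuckDBPyConnection") -> None:
        """Raise FeatureValidationError if the feature is invalid on this input."""
        ...


class SqlFeatureCorrector(Protocol):
    async def correct(self, feature: "Feature",
                      error: FeatureValidationError) -> "Feature":
        """Return a new candidate feature, e.g. with a rewritten SQL query."""
        ...
```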

Remaining work and other notes

  1. To build a categorical feature and use it with this validation framework, you will need to write a new function matching the FeatureFromQueryCtor protocol. I assume it would call categorical_feature_from_query and then, if that succeeded, run the resulting feature on some dataset to collect the values it returns, and attrs.evolve() the feature to have a correct categories list. Without this, subsequent validations will fail, because categorical_feature_from_query by itself returns a feature with a deliberately wrong categories list (it doesn't compute it on any data). Any errors raised in this new wrapper function would need to be surfaced as FeatureValidationErrors. (A rough sketch follows these notes.)
    Please let me know if this is a reasonable plan or if something else is needed.
  2. Query timeouts are a separate ticket (which I'll work on soon); until then non-terminating queries will remain a problem. (Yes, you can write a non-terminating query on small input data if you try.)
  3. I unified the Law and Order validators to reduce the number of feature computes.
  4. The caller of the validator is responsible for choosing which data, and how much of it, to pass in; the validator will not sample it further (but it does have a rows_to_compute_individually argument, because computing every row individually is likely to be too expensive).
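As a sketch of the wrapper proposed in note 1: everything here other than categorical_feature_from_query, attrs.evolve() and FeatureValidationError is a made-up name, and the batch-compute API is an assumption.

```python
import attrs


async def categorical_feature_from_query_validated(query: str, input: "Dataset",
                                                   conn: "DuckDBPyConnection") -> "Feature":
    # Hypothetical FeatureFromQueryCtor-compatible wrapper: build the feature,
    # run it once to discover the categories it actually produces, then fix up
    # the deliberately wrong placeholder categories list.
    try:
        feature = categorical_feature_from_query(query)
        values = await feature.acompute_batch(input, conn)      # assumed batch-compute API
        categories = sorted({v for v in values if v is not None})
        return attrs.evolve(feature, categories=categories)
    except FeatureValidationError:
        raise
    except Exception as exc:
        # Surface anything else as a validation error so the validator loop can react.
        raise FeatureValidationError(str(exc)) from exc
```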

Related Issues

Fixes SparkBeyond/aoa#376

…idations

@danarmak danarmak marked this pull request as draft December 23, 2025 15:56
@danarmak danarmak requested a review from leonidb December 23, 2025 15:56
…eded for categorical features to work correctly
@danarmak danarmak force-pushed the feat/376-feature-validations branch from 59489f2 to a7291d5 on December 24, 2025 11:02
…p schemas in it, so that they are definitely deleted when the process exits.
@danarmak danarmak marked this pull request as ready for review December 24, 2025 11:28
@danarmak (Collaborator, Author) commented:

@leonidb I fixed the categorical feature bug, and made the change I wanted to make to DuckdbManager. This PR is now ready for review.

- No errors are raised
- Feature does not always return null, or always NaN, or always the same value
- Feature does not access input or secondary columns that it does not declare
- Output is the same per row regarding of the (natural) ordering of the input table
Collaborator:

typo: regardless

The index_column_name MAY NOT shadow a column that appears in the 'real' primary input table, even if the query
doesn't use that column. This restriction may be lifted in the future.
Note that `primary_table` is a string, not a DuckdbName, because it is register()ed with the connection
Contributor:

NIT registered*
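(For context: DuckDB's register() takes a plain view name plus a Python object, which is presumably why a bare string rather than a DuckdbName is used here. A minimal standalone example:)

```python
import duckdb
import polars as pl

conn = duckdb.connect()
df = pl.DataFrame({"x": [1, 2, 3]})

conn.register("primary_table", df)   # the view name is just a string
print(conn.execute("SELECT sum(x) FROM primary_table").fetchone())  # (6,)
```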

Does not check if the feature depends on the natural ordering of any of the secondary tables.
Reordering the secondary tables without making a full copy of each or defining a query that reads and fully sorts each
is an unsolved problem. And making a copy of each of them is too expensive in the general case. So we don't validate it
for now.
Contributor:

is this even a requirement?
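(For context, the ordering check that is performed on the primary table could look roughly like the sketch below; the compute APIs, acompute_batch and with_data, are assumed names, and NaN handling is omitted:)

```python
import random
import polars as pl


async def check_primary_table_order_independence(feature, input, conn) -> None:
    # Compute on the natural ordering, then on a permuted copy of the primary
    # table, undo the permutation and compare element-wise.
    n = input.data.height
    original = await feature.acompute_batch(input, conn)

    perm = list(range(n))
    random.Random(42).shuffle(perm)
    permuted_data = input.data.select(pl.all().gather(perm))
    permuted = await feature.acompute_batch(input.with_data(permuted_data), conn)

    # permuted[j] was computed for original row perm[j]; realign before comparing.
    realigned = [None] * n
    for j, i in enumerate(perm):
        realigned[i] = permuted[j]

    if list(original) != realigned:
        raise FeatureValidationError("feature output depends on the natural row ordering")
```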


async def _test_row_by_row(self, feature: Feature, input: Dataset, conn: DuckDBPyConnection,
all_rows_result: pl.Series) -> None:
row_indexes = set(random.Random(42).sample(range(input.data.height), min(self.rows_to_compute_individually, input.data.height)))
Contributor:

should we have a minimum input.data.height for the validation?
we don't want validation to pass just because it was called on an input of height 1.
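(For illustration, a guard like the following would enforce a floor; MIN_VALIDATION_ROWS is a made-up name and threshold:)

```python
MIN_VALIDATION_ROWS = 20  # made-up threshold

if input.data.height < MIN_VALIDATION_ROWS:
    raise ValueError(f"validation needs at least {MIN_VALIDATION_ROWS} rows, "
                     f"got {input.data.height}")
```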

args = tuple(args_by_name[name] for name in feature.params.names)
row_result = await self._acompute(feature, args, conn)
if row_result != all_rows_result[index] and not \
(isinstance(row_result, float) and math.isnan(row_result) and math.isnan(all_rows_result[index])):
Contributor:

why the special case? is there a different representation of NaN between acompute and acompute_batch?
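(Most likely this is not a representation difference but the usual IEEE-754 rule that NaN never compares equal to itself, which would make matching NaN results look like mismatches without the isnan check:)

```python
import math

a = float("nan")
b = float("nan")

assert a != b                          # NaN is never equal to anything, even itself
assert math.isnan(a) and math.isnan(b)
# So when both the per-row result and the batch result are NaN, a plain !=
# would wrongly report a mismatch; hence the explicit isnan special case.
```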

for index in row_indexes:
args_by_name = input.data.row(index, named=True)
args = tuple(args_by_name[name] for name in feature.params.names)
row_result = await self._acompute(feature, args, conn)
Contributor:

why not in parallel?
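(If the per-row computes are independent, something like asyncio.gather could run them concurrently, as sketched below; whether a single DuckDB connection can safely be shared across concurrent computes is a separate question:)

```python
import asyncio


async def _compute_rows_concurrently(self, feature, input, conn, row_indexes):
    # Sketch: launch all per-row computes at once instead of awaiting them one by one.
    async def compute_one(index: int):
        args_by_name = input.data.row(index, named=True)
        args = tuple(args_by_name[name] for name in feature.params.names)
        return index, await self._acompute(feature, args, conn)

    results = await asyncio.gather(*(compute_one(i) for i in row_indexes))
    return dict(results)
```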

max_global_retries: int
max_local_retries: int
corrector: SqlFeatureCorrector
validators: Sequence[FeatureValidator]
Contributor:

apparently research uses a different corrector per validator; maybe we can put the corrector inside the FeatureValidator class? what do you think?
This design also works, but then the corrector needs to support all error types; I don't know which is better...
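For reference, a rough sketch of how a loop over these fields might behave; this is not the actual validator_loop code, and the validate/correct method names are assumptions:

```python
async def run_validation(self, feature: "Feature", input: "Dataset",
                         conn: "DuckDBPyConnection") -> "Feature":
    # Sketch: run each validator; on failure ask the (single, shared) corrector
    # for a new candidate, bounded by per-error and overall retry budgets.
    global_retries = 0
    for validator in self.validators:
        local_retries = 0
        while True:
            try:
                await validator.validate(feature, input, conn)
                break
            except FeatureValidationError as error:
                if (local_retries >= self.max_local_retries
                        or global_retries >= self.max_global_retries):
                    raise
                feature = await self.corrector.correct(feature, error)
                local_retries += 1
                global_retries += 1
    return feature
```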

PSEUDOCOLUMN = 2
NONE = 3

def test_rowid_nature(conn: DuckDBPyConnection, name: str) -> RowidNature:
Contributor:

where is this used?
