-
Notifications
You must be signed in to change notification settings - Fork 2
[core] FlatVersioningEngine #381
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This stack of pull requests is managed by Graphite. Learn more about stacking. |
0b0e3fb to
2c7c93a
Compare
Coverage Report (Python 3.11) •
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Coverage Report (Python 3.13) •
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Test Results (Python 3.10)1 969 tests 1 947 ✅ 6m 13s ⏱️ Results for commit 8554f08. ♻️ This comment has been updated with latest results. |
Coverage Report (Python 3.10) •
|
|||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR introduces a dict-based versioning engine that avoids struct field access operations for databases that don't natively support them (PostgreSQL, SQLite). The approach uses flattened column naming conventions (e.g., metaxy_provenance_by_field__field1) instead of creating actual struct columns, enabling compatibility with SQL databases that lack struct support.
Key changes:
- Adds
DictBasedVersioningEnginebase class with flattened column approach - Adds
IbisDictBasedVersioningEngineimplementation for Ibis backends - Extends
VersioningEnginebase class withaccess_provenance_field()method for backend-agnostic field access
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| tests/versioning/test_dict_based_engine_feasibility.py | Comprehensive feasibility tests verifying dict-based approach works as struct replacement |
| src/metaxy/versioning/dict_based.py | New implementation of dict-based versioning engine with flattened column approach |
| src/metaxy/versioning/engine.py | Adds access_provenance_field() method and updates call sites to use new abstraction |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
danielgafni
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In principle this is doing the right thing, but we need better architecture to avoid that much code duplication.
This basically re-implements the entire versioning engine.
Maybe let's think if it's possible to only alter some code paths instead of redoing everything from scratch? We need to share code to reduce risks in such a core component.
4adbeef to
205f19a
Compare
what do you think about this new version with more re-use? |
480cefc to
f11bf5c
Compare
danielgafni
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, I think the implementation is good, but this cannot be merged until the docstrings are cleaned up (I'd encourage you to remove unnecessary docstrings, especially from trivial methods).
And please rename dict to flat everywhere, not just the (ex) mixin.
f11bf5c to
c4f3f9e
Compare
fcafca9 to
7aab718
Compare
danielgafni
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All good except the weird getattr
Let's fix it and I'll approve the PR, but let's not merge it yet until the whole stack is ready (maybe you will want to get back to it and change something)
On a side note, I really think this is a bad use case for inheritance.
We are basically overwriting the same action done differently in the child class.
It should not be this way.
We should have two separate classes, a struct-based engine and a flat one.
The flat should not be a child class of the struct one.
But feel free to keep it as is for now, unless you really want to get this detail right (which maybe would be nice).
7aab718 to
58bd2c4
Compare
I changed the class hierarchy. |
58bd2c4 to
fc32523
Compare
danielgafni
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Alright, just some typing issues, otherwise looking good
| sample_components = [] | ||
| for field_name in field_names: | ||
| expr = self.access_provenance_field(struct_column, field_name) | ||
| sample_components.append(expr.fill_null("")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the fill_null is for cases when the set of fields has changed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ensures each field hash contribution is a string even when the upstream field is null/missing.
fc32523 to
f89ce09
Compare
danielgafni
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another Frame you've missed
I really don't want to be talking to unfiltered AI, that's a waste of time
3dc5b8a to
8796207
Compare
|
Check again - I hope it is fine now. |
2bd72e4 to
c8de19a
Compare
79f6e97 to
316a4cd
Compare
…oning - Add _preprocess_samples_for_resolve_update hook - Add _post_process_resolve_update_result hook - Add _post_process_polars_frame hook for type conversions - Support building struct from flattened columns in write_metadata - Handle data_version derivation from flattened columns These hooks enable stores without native struct support (like PostgreSQL with JSONB) to convert between flattened and struct representations.
316a4cd to
8554f08
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| struct_column=METAXY_DATA_VERSION_BY_FIELD, | ||
| hash_column=METAXY_DATA_VERSION, | ||
| ) | ||
| from metaxy.models.constants import METAXY_PROVENANCE |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Redundant import: METAXY_PROVENANCE is already imported at the top of the file (line 44). This duplicate import should be removed.
| from metaxy.models.constants import METAXY_PROVENANCE |
| ) | ||
| for field_name, source_col in field_columns.items() | ||
| } | ||
| return df.rename(rename_map) # ty: ignore[invalid-argument-type] |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Type annotation comment contains a typo: "ty: ignore" should be "type: ignore" to properly suppress type checker warnings.
| return df.rename(rename_map) # ty: ignore[invalid-argument-type] | |
| return df.rename(rename_map) # type: ignore[invalid-argument-type] |
| for field_name in sorted(field_names) | ||
| ] | ||
| sample_components = [] | ||
| for field_name in sorted(field_names): |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The field_names list is already sorted on line 656, but then sorted again on line 662 in the loop. The second sort is redundant and can be removed for better performance.
| [ | ||
| nw.col(col).alias(col.split("__", 1)[1]) | ||
| for col in prov_flat | ||
| if "__" in col |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The condition if "__" in col on line 1537 is redundant because the list comprehension on line 1526-1528 already filters for columns that start with {METAXY_PROVENANCE_BY_FIELD}__, which by definition contain "__". This extra check can be removed.
| if "__" in col |
| if METAXY_DATA_VERSION_BY_FIELD not in columns: | ||
| df = df.with_columns( | ||
| nw.col(METAXY_PROVENANCE_BY_FIELD).alias(METAXY_DATA_VERSION_BY_FIELD) | ||
| has_flattened = any( | ||
| col.startswith(f"{METAXY_DATA_VERSION_BY_FIELD}__") | ||
| or col.startswith(f"{METAXY_PROVENANCE_BY_FIELD}__") | ||
| for col in columns | ||
| ) | ||
| df = df.with_columns(nw.col(METAXY_PROVENANCE).alias(METAXY_DATA_VERSION)) | ||
| elif METAXY_DATA_VERSION not in columns: | ||
| if not has_flattened and METAXY_PROVENANCE_BY_FIELD in columns: | ||
| df = df.with_columns( | ||
| nw.col(METAXY_PROVENANCE_BY_FIELD).alias( | ||
| METAXY_DATA_VERSION_BY_FIELD | ||
| ) | ||
| ) | ||
|
|
||
| if METAXY_DATA_VERSION not in columns: | ||
| df = self.hash_struct_version_column( | ||
| plan=self._resolve_feature_plan(feature_key), | ||
| df=df, | ||
| struct_column=METAXY_DATA_VERSION_BY_FIELD, | ||
| hash_column=METAXY_DATA_VERSION, | ||
| ) |
Copilot
AI
Dec 23, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Logic issue: When flattened provenance columns exist (metaxy_provenance_by_field__*) but flattened data_version columns (metaxy_data_version_by_field__*) don't exist, the code skips creating the data_version struct/columns (lines 1596-1601) but then tries to hash from the non-existent METAXY_DATA_VERSION_BY_FIELD column on line 1607.
The fix should either:
- Copy the flattened provenance columns to flattened data_version column names before line 1603, or
- Rename the columns in the hash_struct_version_column call to use provenance columns instead when data_version columns are missing

No description provided.