Collaborator

@geoHeil geoHeil commented Nov 26, 2025

No description provided.

Collaborator Author

geoHeil commented Nov 26, 2025

This stack of pull requests is managed by Graphite. Learn more about stacking.

@geoHeil geoHeil mentioned this pull request Nov 26, 2025
@geoHeil geoHeil marked this pull request as ready for review November 26, 2025 14:11
@geoHeil geoHeil mentioned this pull request Nov 26, 2025
@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch from 0b0e3fb to 2c7c93a Compare November 26, 2025 14:13
Contributor

github-actions bot commented Nov 26, 2025

Coverage

Coverage Report (Python 3.12)
File | Stmts | Miss | Cover | Missing
__init__.py | 20 | 19 | 5% | 1, 3–5, 10–11, 23, 30, 36, 41, 47–48, 63, 66, 88, 108, 112–113, 116
metadata_store
   base.py | 497 | 162 | 67% | 3, 5–9, 11–15, 17–19, 25–27, 32, 36, 48–50, 55–57, 64, 67, 86, 88, 93, 98, 104, 109, 120, 148, 155, 158, 160, 230–231, 244–245, 258, 387, 448, 471, 557–558, 567–568, 577–578, 587, 601–602, 606–607, 611, 628, 645, 747, 796, 825, 827–828, 883, 960, 1043–1044, 1087–1088, 1091–1092, 1109, 1139–1140, 1145–1146, 1167, 1172, 1184, 1189, 1221, 1238, 1256, 1267, 1281, 1305, 1319, 1323, 1336, 1346–1347, 1371, 1398, 1433, 1467, 1526, 1529–1530, 1540, 1542, 1625, 1641, 1648, 1656, 1662, 1666, 1686, 1733, 1762, 1787, 1821, 1847, 1860, 1862, 1919, 1998, 2000, 2003–2004, 2007–2008, 2012, 2022, 2035, 2037, 2039–2041, 2046–2047, 2050–2051, 2053–2055, 2057, 2060–2061, 2064–2067, 2070, 2079–2080, 2082–2083, 2086, 2089, 2091–2093, 2097–2098, 2101–2103, 2106, 2108, 2112
versioning
   __init__.py | 4 | 4 | 0% | 20, 24–25, 33
   engine.py | 211 | 53 | 74% | 1, 3–7, 9–10, 12–13, 18–23, 31, 47, 54–55, 57, 59–60, 67–68, 73, 82, 99–104, 108, 122, 140, 176, 242, 305, 427, 433, 470, 569, 604, 608, 632, 677, 730, 749, 752, 774, 778, 783
   flat_engine.py | 25 | 1 | 96% | 80
   ibis.py | 59 | 2 | 96% | 72, 179
   polars.py | 47 | 19 | 59% | 3–4, 6–9, 11–12, 18, 34–35, 38, 60, 84–85, 119, 152–153, 179
TOTAL | 8649 | 3414 | 60% |

Tests Skipped Failures Errors Time
1969 22 💤 0 ❌ 0 🔥 6m 16s ⏱️

Contributor

github-actions bot commented Nov 26, 2025

Coverage

Coverage Report (Python 3.11)
File | Stmts | Miss | Cover | Missing
__init__.py | 20 | 19 | 5% | 1, 3–5, 10–11, 23, 30, 36, 41, 47–48, 63, 66, 88, 108, 112–113, 116
metadata_store
   base.py | 497 | 162 | 67% | 3, 5–9, 11–15, 17–19, 25–27, 32, 36, 48–50, 55–57, 64, 67, 86, 88, 93, 98, 104, 109, 120, 148, 155, 158, 160, 230–231, 244–245, 258, 387, 448, 471, 557–558, 567–568, 577–578, 587, 601–602, 606–607, 611, 628, 645, 747, 796, 825, 827–828, 883, 960, 1043–1044, 1087–1088, 1091–1092, 1109, 1139–1140, 1145–1146, 1167, 1172, 1184, 1189, 1221, 1238, 1256, 1267, 1281, 1305, 1319, 1323, 1336, 1346–1347, 1371, 1398, 1433, 1467, 1526, 1529–1530, 1540, 1542, 1625, 1641, 1648, 1656, 1662, 1666, 1686, 1733, 1762, 1787, 1821, 1847, 1860, 1862, 1919, 1998, 2000, 2003–2004, 2007–2008, 2012, 2022, 2035, 2037, 2039–2041, 2046–2047, 2050–2051, 2053–2055, 2057, 2060–2061, 2064–2067, 2070, 2079–2080, 2082–2083, 2086, 2089, 2091–2093, 2097–2098, 2101–2103, 2106, 2108, 2112
versioning
   __init__.py | 4 | 4 | 0% | 20, 24–25, 33
   engine.py | 211 | 53 | 74% | 1, 3–7, 9–10, 12–13, 18–23, 31, 47, 54–55, 57, 59–60, 67–68, 73, 82, 99–104, 108, 122, 140, 176, 242, 305, 427, 433, 470, 569, 604, 608, 632, 677, 730, 749, 752, 774, 778, 783
   flat_engine.py | 25 | 1 | 96% | 80
   ibis.py | 59 | 2 | 96% | 72, 179
   polars.py | 47 | 19 | 59% | 3–4, 6–9, 11–12, 18, 34–35, 38, 60, 84–85, 119, 152–153, 179
TOTAL | 8649 | 3414 | 60% |

Tests Skipped Failures Errors Time
1969 22 💤 0 ❌ 0 🔥 6m 7s ⏱️

Contributor

github-actions bot commented Nov 26, 2025

Coverage

Coverage Report (Python 3.13)
File | Stmts | Miss | Cover | Missing
__init__.py | 20 | 19 | 5% | 1, 3–5, 10–11, 23, 30, 36, 41, 47–48, 63, 66, 88, 108, 112–113, 116
metadata_store
   base.py | 497 | 162 | 67% | 3, 5–9, 11–15, 17–19, 25–27, 32, 36, 48–50, 55–57, 64, 67, 86, 88, 93, 98, 104, 109, 120, 148, 155, 158, 160, 230–231, 244–245, 258, 387, 448, 471, 557–558, 567–568, 577–578, 587, 601–602, 606–607, 611, 628, 645, 747, 796, 825, 827–828, 883, 960, 1043–1044, 1087–1088, 1091–1092, 1109, 1139–1140, 1145–1146, 1167, 1172, 1184, 1189, 1221, 1238, 1256, 1267, 1281, 1305, 1319, 1323, 1336, 1346–1347, 1371, 1398, 1433, 1467, 1526, 1529–1530, 1540, 1542, 1625, 1641, 1648, 1656, 1662, 1666, 1686, 1733, 1762, 1787, 1821, 1847, 1860, 1862, 1919, 1998, 2000, 2003–2004, 2007–2008, 2012, 2022, 2035, 2037, 2039–2041, 2046–2047, 2050–2051, 2053–2055, 2057, 2060–2061, 2064–2067, 2070, 2079–2080, 2082–2083, 2086, 2089, 2091–2093, 2097–2098, 2101–2103, 2106, 2108, 2112
versioning
   __init__.py | 4 | 4 | 0% | 20, 24–25, 33
   engine.py | 211 | 53 | 74% | 1, 3–7, 9–10, 12–13, 18–23, 31, 47, 54–55, 57, 59–60, 67–68, 73, 82, 99–104, 108, 122, 140, 176, 242, 305, 427, 433, 470, 569, 604, 608, 632, 677, 730, 749, 752, 774, 778, 783
   flat_engine.py | 25 | 1 | 96% | 80
   ibis.py | 59 | 2 | 96% | 72, 179
   polars.py | 47 | 19 | 59% | 3–4, 6–9, 11–12, 18, 34–35, 38, 60, 84–85, 119, 152–153, 179
TOTAL | 8649 | 3414 | 60% |

Tests Skipped Failures Errors Time
1969 22 💤 0 ❌ 0 🔥 6m 27s ⏱️

Contributor

github-actions bot commented Nov 26, 2025

Test Results (Python 3.10)

1 969 tests   1 947 ✅  6m 13s ⏱️
    1 suites     22 💤
    1 files        0 ❌

Results for commit 8554f08.

♻️ This comment has been updated with latest results.

Contributor

github-actions bot commented Nov 26, 2025

Coverage

Coverage Report (Python 3.10)
File | Stmts | Miss | Cover | Missing
__init__.py | 20 | 19 | 5% | 1, 3–5, 10–11, 23, 30, 36, 41, 47–48, 63, 66, 88, 108, 112–113, 116
metadata_store
   base.py | 497 | 162 | 67% | 3, 5–9, 11–15, 17–19, 25–27, 32, 36, 48–50, 55–57, 64, 67, 86, 88, 93, 98, 104, 109, 120, 148, 155, 158, 160, 230–231, 244–245, 258, 387, 448, 471, 557–558, 567–568, 577–578, 587, 601–602, 606–607, 611, 628, 645, 747, 796, 825, 827–828, 883, 960, 1043–1044, 1087–1088, 1091–1092, 1109, 1139–1140, 1145–1146, 1167, 1172, 1184, 1189, 1221, 1238, 1256, 1267, 1281, 1305, 1319, 1323, 1336, 1346–1347, 1371, 1398, 1433, 1467, 1526, 1529–1530, 1540, 1542, 1625, 1641, 1648, 1656, 1662, 1666, 1686, 1733, 1762, 1787, 1821, 1847, 1860, 1862, 1919, 1998, 2000, 2003–2004, 2007–2008, 2012, 2022, 2035, 2037, 2039–2041, 2046–2047, 2050–2051, 2053–2055, 2057, 2060–2061, 2064–2067, 2070, 2079–2080, 2082–2083, 2086, 2089, 2091–2093, 2097–2098, 2101–2103, 2106, 2108, 2112
versioning
   __init__.py | 4 | 4 | 0% | 20, 24–25, 33
   engine.py | 211 | 53 | 74% | 1, 3–7, 9–10, 12–13, 18–23, 31, 47, 54–55, 57, 59–60, 67–68, 73, 82, 99–104, 108, 122, 140, 176, 242, 305, 427, 433, 470, 569, 604, 608, 632, 677, 730, 749, 752, 774, 778, 783
   flat_engine.py | 25 | 1 | 96% | 80
   ibis.py | 59 | 2 | 96% | 72, 179
   polars.py | 47 | 19 | 59% | 3–4, 6–9, 11–12, 18, 34–35, 38, 60, 84–85, 119, 152–153, 179
TOTAL | 8649 | 3414 | 60% |

Tests Skipped Failures Errors Time
1969 22 💤 0 ❌ 0 🔥 6m 13s ⏱️

Contributor

Copilot AI left a comment

Pull request overview

This PR introduces a dict-based versioning engine that avoids struct field access operations for databases that don't natively support them (PostgreSQL, SQLite). The approach uses flattened column naming conventions (e.g., metaxy_provenance_by_field__field1) instead of creating actual struct columns, enabling compatibility with SQL databases that lack struct support.

Key changes:

  • Adds DictBasedVersioningEngine base class with flattened column approach
  • Adds IbisDictBasedVersioningEngine implementation for Ibis backends
  • Extends VersioningEngine base class with access_provenance_field() method for backend-agnostic field access
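
The flattened naming convention described in the overview can be illustrated with a small sketch on plain Python dicts. The helper names `flatten_row` and `unflatten_row` are hypothetical and not part of the PR; only the column name and the `__` separator come from the overview above.

```python
# Hypothetical helpers; only the column name and "__" separator come from the PR.
STRUCT_COLUMN = "metaxy_provenance_by_field"
SEPARATOR = "__"


def flatten_row(row: dict, struct_column: str = STRUCT_COLUMN) -> dict:
    """Replace a nested per-field mapping with one flat column per field."""
    flat = {key: value for key, value in row.items() if key != struct_column}
    for field_name, value in row.get(struct_column, {}).items():
        flat[f"{struct_column}{SEPARATOR}{field_name}"] = value
    return flat


def unflatten_row(row: dict, struct_column: str = STRUCT_COLUMN) -> dict:
    """Rebuild the nested mapping from columns that carry the flat prefix."""
    prefix = f"{struct_column}{SEPARATOR}"
    nested = {}
    out = {}
    for key, value in row.items():
        if key.startswith(prefix):
            nested[key[len(prefix):]] = value
        else:
            out[key] = value
    out[struct_column] = nested
    return out


example_row = {"sample_uid": 1, STRUCT_COLUMN: {"field1": "abc", "field2": "def"}}
example_flat = flatten_row(example_row)
```

Because every flat column is an ordinary scalar column, a backend without struct support can store and select it directly.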

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
tests/versioning/test_dict_based_engine_feasibility.py Comprehensive feasibility tests verifying dict-based approach works as struct replacement
src/metaxy/versioning/dict_based.py New implementation of dict-based versioning engine with flattened column approach
src/metaxy/versioning/engine.py Adds access_provenance_field() method and updates call sites to use new abstraction

Collaborator

@danielgafni danielgafni left a comment

In principle this is doing the right thing, but we need better architecture to avoid that much code duplication.

This basically re-implements the entire versioning engine.

Maybe let's think if it's possible to only alter some code paths instead of redoing everything from scratch? We need to share code to reduce risks in such a core component.

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch 2 times, most recently from 4adbeef to 205f19a Compare November 27, 2025 12:54
@geoHeil geoHeil requested a review from danielgafni November 27, 2025 13:34
Collaborator Author

geoHeil commented Nov 27, 2025

> In principle this is doing the right thing, but we need better architecture to avoid that much code duplication.
>
> This basically re-implements the entire versioning engine.
>
> Maybe let's think if it's possible to only alter some code paths instead of redoing everything from scratch? We need to share code to reduce risks in such a core component.

What do you think about this new version with more reuse?

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch 3 times, most recently from 480cefc to f11bf5c Compare November 28, 2025 09:44
Collaborator

@danielgafni danielgafni left a comment

Alright, I think the implementation is good, but this cannot be merged until the docstrings are cleaned up (I'd encourage you to remove unnecessary docstrings, especially from trivial methods).

And please rename dict to flat everywhere, not just the (ex) mixin.

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch from f11bf5c to c4f3f9e Compare November 29, 2025 08:03
@geoHeil geoHeil requested a review from danielgafni November 29, 2025 08:12
@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch 4 times, most recently from fcafca9 to 7aab718 Compare November 30, 2025 10:53
Collaborator

@danielgafni danielgafni left a comment

All good except the weird getattr

Let's fix it and I'll approve the PR, but let's not merge it yet until the whole stack is ready (maybe you will want to get back to it and change something)


On a side note, I really think this is a bad use case for inheritance.
We are basically overwriting the same action done differently in the child class.
It should not be this way.
We should have two separate classes, a struct-based engine and a flat one.
The flat should not be a child class of the struct one.
But feel free to keep it as is for now, unless you really want to get this detail right (which maybe would be nice).
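
The sibling-class layout suggested here might look roughly like the sketch below. The class names `BaseEngine`, `StructEngine`, and `FlatEngine` are hypothetical, and rows are plain dicts rather than dataframes; only `access_provenance_field` is a name taken from this PR, and its real signature may differ.

```python
from abc import ABC, abstractmethod

COLUMN = "metaxy_provenance_by_field"


class BaseEngine(ABC):  # hypothetical name
    """Holds shared logic; subclasses only decide how a field is accessed."""

    @abstractmethod
    def access_provenance_field(self, row: dict, struct_column: str, field_name: str):
        ...

    def sample_components(self, row: dict, struct_column: str, field_names) -> list:
        # Shared code path: identical for both representations.
        return [
            self.access_provenance_field(row, struct_column, name)
            for name in sorted(field_names)
        ]


class StructEngine(BaseEngine):  # hypothetical name
    def access_provenance_field(self, row, struct_column, field_name):
        return row[struct_column][field_name]


class FlatEngine(BaseEngine):  # hypothetical name; a sibling, not a child of StructEngine
    def access_provenance_field(self, row, struct_column, field_name):
        return row[f"{struct_column}__{field_name}"]


struct_row = {COLUMN: {"b": "h2", "a": "h1"}}
flat_row = {f"{COLUMN}__a": "h1", f"{COLUMN}__b": "h2"}
struct_result = StructEngine().sample_components(struct_row, COLUMN, ["b", "a"])
flat_result = FlatEngine().sample_components(flat_row, COLUMN, ["b", "a"])
```

Neither engine overrides behavior inherited from the other, so a change to the struct path cannot silently alter the flat path.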

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch from 7aab718 to 58bd2c4 Compare December 1, 2025 07:15
@geoHeil geoHeil requested a review from danielgafni December 1, 2025 07:36
Collaborator Author

geoHeil commented Dec 1, 2025

> All good except the weird getattr
>
> Let's fix it and I'll approve the PR, but let's not merge it yet until the whole stack is ready (maybe you will want to get back to it and change something)
>
> On a side note, I really think this is a bad use case for inheritance. We are basically overwriting the same action done differently in the child class. It should not be this way. We should have two separate classes, a struct-based engine and a flat one. The flat should not be a child class of the struct one. But feel free to keep it as is for now, unless you really want to get this detail right (which maybe would be nice).

I changed the class hierarchy.

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch from 58bd2c4 to fc32523 Compare December 3, 2025 08:22
Collaborator

@danielgafni danielgafni left a comment

Alright, just some typing issues, otherwise looking good

sample_components = []
for field_name in field_names:
    expr = self.access_provenance_field(struct_column, field_name)
    sample_components.append(expr.fill_null(""))
Collaborator

I guess the fill_null is for cases when the set of fields has changed?

Collaborator Author

@geoHeil geoHeil Dec 4, 2025

It ensures each field's hash contribution is a string even when the upstream field is null or missing.
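
A minimal pure-Python sketch of why the null fill matters before hashing. The `"|"` separator and SHA-256 are assumptions for illustration, not the hashing scheme used by the engine. In plain Python a `None` component makes `str.join` raise `TypeError`; in SQL, string concatenation with a null typically yields null, which would blank out the whole hash input.

```python
import hashlib

SEPARATOR = "|"  # assumption for illustration only


def hash_components(components) -> str:
    """Hash the joined components, treating a missing value as an empty string."""
    # Without this fill, SEPARATOR.join would raise TypeError on None.
    filled = ["" if value is None else value for value in components]
    joined = SEPARATOR.join(filled)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


with_null = hash_components(["a", None, "c"])
with_empty = hash_components(["a", "", "c"])
with_value = hash_components(["a", "b", "c"])
```

A side effect of this choice is that a null field and an empty-string field hash identically.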

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch from fc32523 to f89ce09 Compare December 4, 2025 08:35
@geoHeil geoHeil requested a review from danielgafni December 4, 2025 08:41
@danielgafni danielgafni changed the title Dict-Based Versioning Engine [core] FlatVersioningEngine Dec 4, 2025
Collaborator

@danielgafni danielgafni left a comment

Another Frame you've missed


I really don't want to be talking to unfiltered AI, that's a waste of time

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch 2 times, most recently from 3dc5b8a to 8796207 Compare December 4, 2025 11:34
@geoHeil geoHeil requested a review from danielgafni December 4, 2025 12:02
Collaborator Author

geoHeil commented Dec 4, 2025

Check again - I hope it is fine now.

@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch 7 times, most recently from 2bd72e4 to c8de19a Compare December 10, 2025 13:33
@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch 3 times, most recently from 79f6e97 to 316a4cd Compare December 19, 2025 09:02
…oning

- Add _preprocess_samples_for_resolve_update hook
- Add _post_process_resolve_update_result hook
- Add _post_process_polars_frame hook for type conversions
- Support building struct from flattened columns in write_metadata
- Handle data_version derivation from flattened columns

These hooks enable stores without native struct support (like PostgreSQL with JSONB)
to convert between flattened and struct representations.
@geoHeil geoHeil force-pushed the 11-26-dict-based_versioning_engine branch from 316a4cd to 8554f08 Compare December 23, 2025 06:40
@geoHeil geoHeil requested a review from Copilot December 23, 2025 06:56
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 5 comments.


struct_column=METAXY_DATA_VERSION_BY_FIELD,
hash_column=METAXY_DATA_VERSION,
)
from metaxy.models.constants import METAXY_PROVENANCE

Copilot AI Dec 23, 2025

Redundant import: METAXY_PROVENANCE is already imported at the top of the file (line 44). This duplicate import should be removed.

Suggested change
- from metaxy.models.constants import METAXY_PROVENANCE

)
for field_name, source_col in field_columns.items()
}
return df.rename(rename_map) # ty: ignore[invalid-argument-type]

Copilot AI Dec 23, 2025

Type annotation comment contains a typo: "ty: ignore" should be "type: ignore" to properly suppress type checker warnings.

Suggested change
- return df.rename(rename_map) # ty: ignore[invalid-argument-type]
+ return df.rename(rename_map) # type: ignore[invalid-argument-type]

for field_name in sorted(field_names)
]
sample_components = []
for field_name in sorted(field_names):

Copilot AI Dec 23, 2025

The field_names list is already sorted on line 656, but then sorted again on line 662 in the loop. The second sort is redundant and can be removed for better performance.

[
    nw.col(col).alias(col.split("__", 1)[1])
    for col in prov_flat
    if "__" in col

Copilot AI Dec 23, 2025

The condition if "__" in col on line 1537 is redundant because the list comprehension on line 1526-1528 already filters for columns that start with {METAXY_PROVENANCE_BY_FIELD}__, which by definition contain "__". This extra check can be removed.

Suggested change
- if "__" in col

Comment on lines 1590 to 1609
if METAXY_DATA_VERSION_BY_FIELD not in columns:
df = df.with_columns(
nw.col(METAXY_PROVENANCE_BY_FIELD).alias(METAXY_DATA_VERSION_BY_FIELD)
has_flattened = any(
col.startswith(f"{METAXY_DATA_VERSION_BY_FIELD}__")
or col.startswith(f"{METAXY_PROVENANCE_BY_FIELD}__")
for col in columns
)
df = df.with_columns(nw.col(METAXY_PROVENANCE).alias(METAXY_DATA_VERSION))
elif METAXY_DATA_VERSION not in columns:
if not has_flattened and METAXY_PROVENANCE_BY_FIELD in columns:
df = df.with_columns(
nw.col(METAXY_PROVENANCE_BY_FIELD).alias(
METAXY_DATA_VERSION_BY_FIELD
)
)

if METAXY_DATA_VERSION not in columns:
df = self.hash_struct_version_column(
plan=self._resolve_feature_plan(feature_key),
df=df,
struct_column=METAXY_DATA_VERSION_BY_FIELD,
hash_column=METAXY_DATA_VERSION,
)

Copilot AI Dec 23, 2025

Logic issue: When flattened provenance columns exist (metaxy_provenance_by_field__*) but flattened data_version columns (metaxy_data_version_by_field__*) don't exist, the code skips creating the data_version struct/columns (lines 1596-1601) but then tries to hash from the non-existent METAXY_DATA_VERSION_BY_FIELD column on line 1607.

The fix should either:

  1. Copy the flattened provenance columns to flattened data_version column names before line 1603, or
  2. Rename the columns in the hash_struct_version_column call to use provenance columns instead when data_version columns are missing
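
The first option could be sketched as follows on a plain dict row. The function name `backfill_flat_data_version` is hypothetical, and the real code operates on dataframes rather than dicts, so this only illustrates the intended column copying.

```python
# Column prefixes follow the names used in the comment above.
PROVENANCE_PREFIX = "metaxy_provenance_by_field__"
DATA_VERSION_PREFIX = "metaxy_data_version_by_field__"


def backfill_flat_data_version(row: dict) -> dict:
    """Copy each flat provenance column to its data_version twin when the twin is absent."""
    out = dict(row)
    for column, value in row.items():
        if column.startswith(PROVENANCE_PREFIX):
            field_name = column[len(PROVENANCE_PREFIX):]
            # setdefault keeps an existing data_version value untouched.
            out.setdefault(DATA_VERSION_PREFIX + field_name, value)
    return out


only_provenance = backfill_flat_data_version({"metaxy_provenance_by_field__a": "h1"})
already_set = backfill_flat_data_version(
    {
        "metaxy_provenance_by_field__a": "h1",
        "metaxy_data_version_by_field__a": "custom",
    }
)
```

With the data_version columns in place, the later hashing step always has an input to read, whichever representation the store uses.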

3 participants