-
Notifications
You must be signed in to change notification settings - Fork 7
Open
Description
I had a helpful discussion about this with @TomNicholas today in the Distributed Arrays Working Group meeting. Here are the high level notes:
- What if in the long term, we made
xr.DataArrayandxr.Datasetcompliant with the Python Dataframe Protocol?- In this formulation, a
xr.Datasetwhose variables had consistent dimensions could be thought of as a table, wherexr.Variables are treated as chunked columns in the dataframe spec. - xarray would be a sort of "holder" of array references that could be lazily evaluated by a dataframe library that adheres to the Data API, where the iterating and computing would be part of that dataframe library's implementation (and likely, corresponding logical plan for evaluating df operations).
- In this formulation, a
- Since this would be a big change to Xarray, and further, it's likely not to work, how could this library prototype this feature ahead of time? Ideally, it would be nice if upon
import xarray_sql, allxr.Datasets available to the user suddenly had the dunder methods that made them compliant to the dataframe data API.- Tom recommended playing around with Custom accessors as an Xarray-native way to achieve this sort of functionality.
- If an
xr.Dataset(with consistent dimensions) was also a__dataframe__, then it would be really easy to integrate it into DataFusion/SQL.- PyArrow specifically provides a
__dataframe__interchange method,from_dataframe
- PyArrow specifically provides a
- It is an open question if this type of integration is better suited to be in Python on top of Xarray or in Rust on top of the raw data format. In both Tom and my view, it seems like we should do both and figure out a good integration point between them.
- Couldn't most of the value of Zarr --> Rust --> DataFusion be achieved with xarray_sql + Xarray backed by the zarrs-python IO layer? We'll have to evaluate this empirically.
- There will be a large swath of users who will need the performance gains provided by a direct integration of DataFusion with (virtual) Zarr in Rust with SQL/Python wrappers on top. At the same time, there will be a large swath of users who need the interoperability ("joinability") of melding any data loadable in Xarray via a SQL interface.
xref: pydata/xarray#10135
Edit: TIL The Dataframe API and the Dataframe protocol are related, but distinct things! That explains my confusion. Just updated the link with the right term + docs.
Metadata
Metadata
Assignees
Labels
No labels