Dataframe Collection (User Collections) #469
Conversation
… SQLGlot bug discovered along the way [RUN CI]
… into John/df_collection
…run sf][run postgres][run s3]
john-sanchez31 left a comment
Some comments and guides for the review
> new_ancestry, new_child, ancestry_names = self.split_partition_ancestry(
>     parent, partition_ancestor
> )
> case UnqualifiedGeneratedCollection():

These changes are related to supporting PARTITION with GeneratedCollection.
> Note: This covers the NOT standard representations
> (Datetime, NumericType, UnknownType) meaning that the expression
> changes in different dialects.

This method is implemented in all transform bindings.
pyproject.toml (Outdated)

> # Note: sqlite is included in the standard library, so it doesn't need to be listed here.
> # There is a bug in some unit tests when run with sqlglot>=26.8.0
> - dependencies = ["pytz", "sqlglot==26.7.0", "pandas>=2.0.0", "jupyterlab"]
> + dependencies = ["pytz", "sqlglot==26.7.0", "pandas>=2.0.0", "jupyterlab", "pyarrow"]

pyarrow is used to map the types in a DataFrame.

It should be pinned to the version you used during implementation, to avoid issues later on.
> @pytest.fixture(scope="session")
> def get_postgres_defog_graphs() -> graph_fetcher:

Now Postgres has its own defog graph. This is needed because a UDF implementation was required for dealership_adv13, which is different for each dialect. Previously Postgres shared a graph with SQLite, but no longer.
> @pytest.fixture(scope="session")
> def get_dialect_defog_graphs(

test_defog_until_sql used to use the standard graph. This made the SQL refsol inconsistent with what was actually being executed; for example, in the query I could see the SQLite UDF implementation in all dialects (which is wrong). This fixture allows each dialect to use its own graph. With this new fixture, most of the defog refsol files were changed.
| "synonyms": ["purchase record", "vehicle sale", "car purchase"] | ||
| } | ||
| ], | ||
| "functions": [ |
There was a problem hiding this comment.
UDF used for dealership_adv13
| "synonyms": ["purchase record", "vehicle sale", "car purchase"] | ||
| } | ||
| ], | ||
| "functions": [ |
There was a problem hiding this comment.
This file is the same than defog_graphs.json except for this UDF which is necessary
> @@ -756,9 +756,6 @@ def test_graph_structure_defog(defog_graphs: graph_fetcher, graph_name: str) ->
>     order_sensitive=True,
> ),
> id="dealership_adv8",

These tests needed GeneratedCollection; now they are executed.
> Test executing the TPC-H custom queries from the original code generation on
> MySQL.
> """

MySQL doesn't support infinity values, so we skip the test.
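A minimal sketch of what such a skip could look like in pytest (the test name and skip mechanism here are illustrative assumptions, not the actual code in this PR):

```python
import pytest

# Hypothetical sketch: skip TPC-H custom-query tests that produce infinity
# values when running against MySQL, since MySQL cannot store +/-inf.
@pytest.mark.skip(reason="MySQL does not support infinity values")
def test_tpch_custom_queries_mysql():
    ...
```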
> @@ -1,5 +1,5 @@
> SELECT
>   title COLLATE utf8mb4_bin AS title
> FROM main.publication

All these changes are related to the usage of the new graph for defog.
… mysql][run sf][run postgres][run s3]" This reverts commit 03d37d6.
… sf][run postgres]
documentation/dsl.md (Outdated)

> Supported Signatures:
> - `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.

This is unnecessary here since there is only 1 signature, as opposed to range_collection, which has several.
documentation/dsl.md (Outdated)

> It takes in the following arguments:
>
> - `name`: The name of the dataframe collection.
> - `dataframe`: The panda dataframe containing the corresponding data

We should explicitly note which types are currently supported in the DataFrame.

+1
Also, let's document the infinity behavior.

Oh, and also stuff like None vs NaN, and NaT.
Shoot, speaking of... did we test NaT (not-a-time)?
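For concreteness, an illustrative pandas snippet (not from this PR) showing the values being discussed: NaN and infinity in a float column, None in an object column, and NaT in a datetime column.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "num": [1.5, np.nan, np.inf],                      # float column: NaN and infinity
        "txt": ["a", None, "b"],                           # object column: None
        "ts": [pd.Timestamp("2024-01-01"), pd.NaT, None],  # datetime column: NaT (None becomes NaT)
    }
)
print(df.dtypes)  # num: float64, txt: object, ts: datetime64[ns]
print(df.isna())  # NaN, None, and NaT all count as missing; np.inf does not
```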
> # Windows functions
> def dataframe_collection_window_functions():

Let's also add one where the `per=...` refers to the generated collection.
hadia206 left a comment

Good work John. Please see my comments below.
documentation/dsl.md (Outdated)

> <!-- TOC --><a name="dataframe_collection"></a>
> ### `pydough.dataframe_collection`
>
> The `dataframe_collection` creates a collection within a specified pandas dataframe. This is useful for building datasets dynamically.

Suggested change:
> - The `dataframe_collection` creates a collection within a specified pandas dataframe. This is useful for building datasets dynamically.
> + The `dataframe_collection` creates a collection from a specified Pandas DataFrame. This is useful for building datasets dynamically.
documentation/dsl.md (Outdated)

> It takes in the following arguments:
>
> - `name`: The name of the dataframe collection.
> - `dataframe`: The panda dataframe containing the corresponding data

Suggested change:
> - - `dataframe`: The panda dataframe containing the corresponding data
> + - `dataframe`: Pandas DataFrame containing the corresponding data.
documentation/dsl.md (Outdated)

> Supported Signatures:
> - `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.
> @@ -4,7 +4,10 @@
>
> __all__ = ["range_collection"]

Missing dataframe_collection?
pydough/user_collections/README.md (Outdated)

> - `DataframeGeneratedCollection`: Class used to create a dataframe collection using the given dataframe and name.
>   - `name`: The name of the dataframe collection.
>   - `dataframe`: The pandas dataframe containing all data (rows and columns).

Suggested change:
> - - `dataframe`: The pandas dataframe containing all data (rows and columns).
> + - `dataframe`: The Pandas DataFrame containing all data (rows and columns).
pydough/user_collections/README.md (Outdated)

> - Returns: An instance of `RangeGeneratedCollection`.
> - `dataframe_collection`: Function to create a dataframe collection with the specified parameters.
>   - `name`: The name of the dataframe collection.
>   - `dataframe`: The pandas dataframe.

Suggested change:
> - - `dataframe`: The pandas dataframe.
> + - `dataframe`: The Pandas DataFrame.
pydough/user_collections/README.md (Outdated)

> The user collections module provides a way to create collections that are not part of the static metadata graph but can be generated dynamically based on user input or code. The most common user collection are integer range collections and Pandas DataFrame collections.
> - The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`. (No newline at end of file)
> + The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`.
> + The dataframe collection, generates a collection based on the given pandas dataframe. The `DataframeGeneratedCollection` class

Suggested change:
> - The dataframe collection, generates a collection based on the given pandas dataframe. The `DataframeGeneratedCollection` class
> + The dataframe collection, generates a collection based on the given Pandas Dataframe. The `DataframeGeneratedCollection` class
pydough/user_collections/README.md (Outdated)

> The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`. (No newline at end of file)
> The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`.
> The dataframe collection, generates a collection based on the given pandas dataframe. The `DataframeGeneratedCollection` class
> allows user to create a collection by specifying the dataframe and name. The `dataframe_collection` function is an effective API

For consistency with the range description:

Suggested change:
> - allows user to create a collection by specifying the dataframe and name. The `dataframe_collection` function is an effective API
> + allows user to create a collection by specifying the dataframe and name. The `dataframe_collection` function is a convenient API
> ) -> UnqualifiedGeneratedCollection:
>     """
>     Implementation of the `pydough.dataframe_collection` function, which provides
>     a way to create a collection of pandas dataframe in PyDough.

Suggested change:
> - a way to create a collection of pandas dataframe in PyDough.
> + a way to create a collection of Pandas DataFrame in PyDough.
knassre-bodo left a comment

A few more comments based on some of Hadia's feedback, plus a few things I spotted while taking second/third looks at some of those files.
> Supported Signatures:
> - `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.
>
> #### Example

Should we perhaps also have a more involved example that does more things with the collection?

Reminder to update the example here.
> - table_alias.append("columns", exp.to_identifier(f"_col_{i}"))
> + # PyDough change: adjust the formula to match the
> + # dialect's
> + if isinstance(dialect, SQLite):
> +     table_alias.append(
> +         "columns", exp.to_identifier(f"column{i + 1}")
> +     )
> + else:
> +     table_alias.append(
> +         "columns", exp.to_identifier(f"_col_{i}")
> +     )

For context, this naming convention matters a lot when handling SQLite, and SQLGlot uses the wrong one, so it glitches out; but if we use the SQLite convention for other dialects, it causes SQLGlot to glitch out in other places. This is not an ideal fix, but it's much more straightforward than taking SQLGlot apart to actually fix the issue.
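To illustrate the convention difference (an illustrative snippet, not part of the PR): SQLite names the columns of a bare `VALUES` table `column1`, `column2`, ..., which is why the SQLite branch above uses `column{i + 1}` while the other branch keeps the `_col_{i}` aliases.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# SQLite exposes the columns of a VALUES table as column1, column2, ...
cur = con.execute("SELECT * FROM (VALUES (1, 'a'), (2, 'b'))")
print([d[0] for d in cur.description])  # ['column1', 'column2']
```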
> @property
> def unique_column_names(self) -> list[str]:
>     return list(dict.fromkeys(self.columns))

Aww heck, I knew I forgot something in the spec: we need the option to specify subsets of the columns that are unique (just like we do in metadata). It can be an optional argument to pydough.dataframe_collection with type `list[str | list[str]]`, where the strings are the column names. For instance, in your dsl.md example, it would be ["color", "idx"] since both columns are unique.
We should then have tests on the behavior of window functions / correlation with DataFrames that use this argument.
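A hedged sketch of what the requested argument could look like, reusing the color/idx example from dsl.md; the `unique=` parameter is hypothetical and not implemented in this PR.

```python
import pandas as pd
import pydough  # assumed import; keyword arguments used to avoid depending on parameter order

df = pd.DataFrame({"color": ["red", "green", "blue"], "idx": [1, 2, 3]})

# Hypothetical: `unique` lists the columns (or groups of columns) whose values
# are unique per row, mirroring the unique-property metadata. Not implemented yet.
colors = pydough.dataframe_collection(name="colors", dataframe=df, unique=["color", "idx"])

# A composite key could hypothetically be expressed as a nested list:
# pydough.dataframe_collection(name="colors", dataframe=df, unique=[["color", "idx"]])
```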
| "r = pydough.range_collection('tbl', 'v', 0, 500, 13).CALCULATE(first_digit=INTEGER(STRING(v)[:1]))\n" | ||
| "result = r.PARTITION(name='digits', by=first_digit).CALCULATE(first_digit, n=COUNT(tbl))", | ||
| "TPCH", |
There was a problem hiding this comment.
These new tests cover some categories of edge cases John discovered while testing DataFrames which we realized hadn't been covered/handled yet for simple range (hence all the weird HybridTree changes).
> elif len(dataframe[col]) == 0:
>     raise TypeError(
>         f"Column '{col}' is empty. All columns must have at least one value."
>     )

Oh shoot, that reminds me... we should have some tests for DataFrames with 1+ columns but no rows. There are a lot of edge cases to consider there, e.g. what happens if you cross join it with something else.
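An illustrative way (not from the PR) to build such an input: a DataFrame with typed columns but zero rows, which is the shape the requested tests would exercise.

```python
import pandas as pd

# Columns with explicit dtypes but no rows; type inference cannot rely on values here.
empty_df = pd.DataFrame(
    {
        "color": pd.Series(dtype="string"),
        "idx": pd.Series(dtype="int64"),
    }
)
assert len(empty_df) == 0 and list(empty_df.columns) == ["color", "idx"]
```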
> match field_type:
>     case _ if pa.types.is_null(field_type):

If all of these are going to be `case _ if ...:`, we should just be doing a bunch of `if ... elif ... elif ...` instead.
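For illustration, a sketch of the suggested if/elif shape. The PyDough type names come from the surrounding diff; the helper name is made up, and the import path and the mapping for null columns are assumptions rather than what the PR actually does.

```python
import pyarrow as pa

# Assumption: PyDough's type classes are importable like this; adjust to the real module.
from pydough.types import BooleanType, NumericType, StringType, UnknownType


def _map_arrow_type(field_type: pa.DataType):
    """Hypothetical helper showing the if/elif structure suggested above."""
    if pa.types.is_null(field_type):
        return UnknownType()  # placeholder; the actual mapping is under discussion in this review
    elif pa.types.is_boolean(field_type):
        return BooleanType()
    elif pa.types.is_integer(field_type) or pa.types.is_floating(field_type):
        return NumericType()
    elif pa.types.is_string(field_type) or pa.types.is_large_string(field_type):
        return StringType()
    else:
        raise ValueError(f"Unsupported Arrow type for dataframe collections: {field_type}")
```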
> case _ if pa.types.is_binary(field_type) or pa.types.is_large_binary(
>     field_type
> ):
>     return StringType()

Shouldn't this be rejected for now?
> case _:
>     return UnknownType()

Are there cases where this happens that are valid? If so, what are they? (They should be documented in comments.) Otherwise, we should explicitly ban things with clear error messages.

Same question. Let's be clear about what those unknowns are.
…[run sf][run postgres][run mysql][run s3]
hadia206 left a comment

You addressed a lot of things. Almost there :)
> Supported Signatures:
> - `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.
>
> #### Example

Reminder to update the example here.
> and self.name == other.name
> and self.columns == other.columns

Do we need these, if we already check self.dataframe.equals on the line below it?
> except pa.ArrowInvalid:
>     raise ValueError(
>         f"Mixed types in column '{col}'. All values in a column must be of the same type."

Let's generalize the error message here. I believe this error is not just for mixed types; it can appear with other type conversion issues.
An example I have seen online: "ArrowInvalid: Invalid null value" in PyArrow means the data doesn't match the expected type.
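One possible generalization (illustrative only; the helper name and exact wording are up to the author) that wraps `pa.ArrowInvalid` without assuming the cause is mixed types:

```python
import pyarrow as pa


def _to_arrow_column(col: str, values: list) -> pa.Array:
    """Sketch: surface ArrowInvalid with a cause-agnostic message."""
    try:
        return pa.array(values)
    except pa.ArrowInvalid as e:
        raise ValueError(
            f"Could not convert column '{col}' to a single Arrow type "
            f"(e.g. mixed types, invalid null values, or other conversion issues): {e}"
        ) from e
```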
| raise ValueError("Structs are not supported for dataframe collections") | ||
|
|
||
| elif pa_types.is_dictionary(field_type): | ||
| raise ValueError("Dictionaries are not supported for dataframe collections") |
There was a problem hiding this comment.
what about tuples? Do we support that?
> The supported PyDough types for `dataframe_collection` are:
> - `NumericType`: includes float, integer, infinity, Nan.
> - `BooleanType`: includes classic true and false.

Suggested change:
> - - `BooleanType`: includes classic true and false.
> + - `BooleanType`: True and False.
> The supported PyDough types for `dataframe_collection` are:
> - `NumericType`: includes float, integer, infinity, Nan.
> - `BooleanType`: includes classic true and false.
> - `StringType`: includes all aphanumeric caracters.

Removed and fixed the typo.

Suggested change:
> - - `StringType`: includes all aphanumeric caracters.
> + - `StringType`: alphanumeric characters.
> - `NumericType`: includes float, integer, infinity, Nan.
> - `BooleanType`: includes classic true and false.
> - `StringType`: includes all aphanumeric caracters.
> - `Datetype`: includes date and datetime.

Timestamp? NaT?

Suggested change:
> - - `Datetype`: includes date and datetime.
> + - `Datetype`: date and datetime.
> - float64
> - decimal
> - bool
> - datetime64

date? What about timestamp?
> def dataframe_collection(
>     name: str, dataframe: pd.DataFrame
> else:
>     return UnknownType()

Let's clarify what these are, if any are still missing.
See my comment in the dsl.md as well.
Resolves #162