
Dataframe Collection (User Collections) #469

Open

john-sanchez31 wants to merge 39 commits into main from John/df_collection

Conversation

@john-sanchez31
Contributor

Resolves #162

@john-sanchez31 marked this pull request as draft January 8, 2026 14:43
@john-sanchez31 self-assigned this Jan 8, 2026
@john-sanchez31 added labels: enhancement (New feature or request), extensibility (Increasing situations in which PyDough works), effort - high (major issue that will require multiple steps or complex design), testing (Alters the testing/CI process for PyDough)
@john-sanchez31 marked this pull request as ready for review January 27, 2026 18:57
Contributor Author

@john-sanchez31 left a comment

Some comments and guidance for the review:

new_ancestry, new_child, ancestry_names = self.split_partition_ancestry(
parent, partition_ancestor
)
case UnqualifiedGeneratedCollection():
Contributor Author

These changes are related to supporting Partition with GeneratedCollection.


Note: This covers the non-standard representations
(Datetime, NumericType, UnknownType), meaning that the expression
changes across dialects.
Contributor Author

This method is implemented in all transform bindings

pyproject.toml Outdated
# Note: sqlite is included in the standard library, so it doesn't need to be listed here.
# There is a bug in some unit tests when run with sqlglot>=26.8.0
dependencies = ["pytz", "sqlglot==26.7.0", "pandas>=2.0.0", "jupyterlab"]
dependencies = ["pytz", "sqlglot==26.7.0", "pandas>=2.0.0", "jupyterlab", "pyarrow"]
Contributor Author

pyarrow is used to map the DataFrame column types to PyDough types
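For illustration, a minimal sketch of this kind of mapping with pyarrow (the bucketing logic here is an assumption, not the PR's exact code):

import pandas as pd
import pyarrow as pa

df = pd.DataFrame({"idx": [1, 2, 3], "color": ["red", "green", "blue"]})

# Infer an Arrow schema from the DataFrame, then bucket each field's type.
schema = pa.Schema.from_pandas(df, preserve_index=False)
for field in schema:
    if pa.types.is_integer(field.type) or pa.types.is_floating(field.type):
        kind = "NumericType"
    elif pa.types.is_string(field.type):
        kind = "StringType"
    elif pa.types.is_boolean(field.type):
        kind = "BooleanType"
    elif pa.types.is_timestamp(field.type) or pa.types.is_date(field.type):
        kind = "DatetimeType"
    else:
        kind = "UnknownType"
    print(f"{field.name}: {kind}")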

Contributor

It should be pinned to the version you used for the implementation, to avoid issues later on.



@pytest.fixture(scope="session")
def get_postgres_defog_graphs() -> graph_fetcher:
Contributor Author

Now Postgres has its own defog graph. This is needed because a UDF implementation was required for dealership_adv13, and that implementation differs per dialect. Previously Postgres shared a graph with SQLite, but no longer.



@pytest.fixture(scope="session")
def get_dialect_defog_graphs(
Contributor Author

test_defog_until_sql used the standard graph, which made the SQL refsols inconsistent with what was actually being executed. For example, the SQLite UDF implementation showed up in the queries for every dialect (which is wrong). This new fixture lets each dialect use its own graph; as a result, most of the defog refsol files changed.
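A minimal sketch of the idea behind such a fixture (file layout, loader, and JSON structure here are assumptions, not the PR's exact code):

import json

import pytest


@pytest.fixture(scope="session")
def get_dialect_defog_graphs():
    # Each dialect resolves to its own metadata graph file, so the generated
    # SQL refsols reflect that dialect's UDF implementations.
    def fetch(dialect: str, graph_name: str) -> dict:
        path = f"tests/test_metadata/defog_graphs_{dialect}.json"  # hypothetical path
        with open(path) as f:
            # Stand-in: real code would build a PyDough graph object, and the
            # JSON structure is assumed to be keyed by graph name.
            return json.load(f)[graph_name]

    return fetch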

"synonyms": ["purchase record", "vehicle sale", "car purchase"]
}
],
"functions": [
Contributor Author

UDF used for dealership_adv13

"synonyms": ["purchase record", "vehicle sale", "car purchase"]
}
],
"functions": [
Contributor Author

This file is the same as defog_graphs.json except for this UDF, which is necessary here.

@@ -756,9 +756,6 @@ def test_graph_structure_defog(defog_graphs: graph_fetcher, graph_name: str) ->
order_sensitive=True,
),
id="dealership_adv8",
Contributor Author

These tests needed GeneratedCollection; now they are executed.

Test executing the TPC-H custom queries from the original code generation on
MySQL.
"""

Contributor Author

MySQL doesn't support infinity values, so we skip the test.
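For illustration, the usual pytest shape for such a skip (test name and reason text are placeholders):

import pytest


@pytest.mark.skip(reason="MySQL cannot store +/-Infinity in FLOAT/DOUBLE columns")
def test_tpch_infinity_mysql():
    ...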

@@ -1,5 +1,5 @@
SELECT
title COLLATE utf8mb4_bin AS title
FROM main.publication
Contributor Author

All these changes are related to using the new graph for defog.

@john-sanchez31 requested review from a team, hadia206, juankx-bodo and knassre-bodo and removed request for a team January 27, 2026 21:34
Comment on lines 1587 to 1588
Supported Signatures:
- `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.
Contributor

This is unnecessary here since there is only 1 signature, as opposed to range_collection which has several.

Contributor

+1

It takes in the following arguments:

- `name`: The name of the dataframe collection.
- `dataframe`: The panda dataframe containing the corresponding data
Contributor

We should explicitly note which types are currently supported in the DataFrame.

Contributor

+1
Also, let's document the infinity behavior

Contributor

Oh, and also stuff like None vs NaN, and NaT.

Shoot, speaking of... did we test NaT (not-a-time)?
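For reference, how pandas represents these missing values (a quick standalone check):

import numpy as np
import pandas as pd

s = pd.Series([1.0, None, np.nan])  # None coerces to NaN in a float column
t = pd.Series(pd.to_datetime(["2026-01-08", None]))  # missing datetimes become NaT
print(s.isna().tolist())  # [False, True, True]
print(t.tolist())         # [Timestamp('2026-01-08 00:00:00'), NaT]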



# Window functions
def dataframe_collection_window_functions():
Contributor

Let's also add one where the per=... refers to the generated collection.

Contributor

@hadia206 left a comment

Good work John. Please see my comment below.

<!-- TOC --><a name="dataframe_collection"></a>
### `pydough.dataframe_collection`

The `dataframe_collection` creates a collection within a specified pandas dataframe. This is useful for building datasets dynamically.
Contributor

Suggested change
The `dataframe_collection` creates a collection within a specified pandas dataframe. This is useful for building datasets dynamically.
The `dataframe_collection` creates a collection from a specified Pandas DataFrame. This is useful for building datasets dynamically.

It takes in the following arguments:

- `name`: The name of the dataframe collection.
- `dataframe`: The panda dataframe containing the corresponding data
Contributor

Suggested change
- `dataframe`: The panda dataframe containing the corresponding data
- `dataframe`: Pandas DataFrame containing the corresponding data.

Comment on lines 1587 to 1588
Supported Signatures:
- `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.
Contributor

+1

@@ -4,7 +4,10 @@

__all__ = ["range_collection"]
Contributor

Missing dataframe_collection?


- `DataframeGeneratedCollection`: Class used to create a dataframe collection using the given dataframe and name.
- `name`: The name of the dataframe collection.
- `dataframe`: The pandas dataframe containing all data (rows and columns).
Contributor

Suggested change
- `dataframe`: The pandas dataframe containing all data (rows and columns).
- `dataframe`: The Pandas DataFrame containing all data (rows and columns).

- Returns: An instance of `RangeGeneratedCollection`.
- `dataframe_collection`: Function to create a dataframe collection with the specified parameters.
- `name`: The name of the dataframe collection.
- `dataframe`: The pandas dataframe.
Contributor

Suggested change
- `dataframe`: The pandas dataframe.
- `dataframe`: The Pandas DataFrame.

The user collections module provides a way to create collections that are not part of the static metadata graph but can be generated dynamically based on user input or code. The most common user collection are integer range collections and Pandas DataFrame collections.
The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`. No newline at end of file
The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`.
The dataframe collection, generates a collection based on the given pandas dataframe. The `DataframeGeneratedCollection` class
Contributor

Suggested change
The dataframe collection, generates a collection based on the given pandas dataframe. The `DataframeGeneratedCollection` class
The dataframe collection, generates a collection based on the given Pandas Dataframe. The `DataframeGeneratedCollection` class

The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`. No newline at end of file
The range collection, generates a sequence of numbers. The `RangeGeneratedCollection` class allows users to define a range collection by specifying the start, end, and step values. The `range_collection` function is a convenient API to create instances of `RangeGeneratedCollection`.
The dataframe collection, generates a collection based on the given pandas dataframe. The `DataframeGeneratedCollection` class
allows user to create a collection by specifying the dataframe and name. The `dataframe_collection` function is an effective API
Contributor

for consistency with range description

Suggested change
allows user to create a collection by specifying the dataframe and name. The `dataframe_collection` function is an effective API
allows user to create a collection by specifying the dataframe and name. The `dataframe_collection` function is a convenient API

) -> UnqualifiedGeneratedCollection:
"""
Implementation of the `pydough.dataframe_collection` function, which provides
a way to create a collection of pandas dataframe in PyDough.
Contributor

Suggested change
a way to create a collection of pandas dataframe in PyDough.
a way to create a collection of Pandas DataFrame in PyDough.

Contributor

@knassre-bodo left a comment

A few more comments based on some of Hadia's feedback + a few things I spotted while taking second/third looks in some of those files.

Supported Signatures:
- `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.

#### Example
Contributor

Should we perhaps also have a more involved example that does more things with the collection?
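For illustration, a more involved example might look something like this (a sketch; the WHERE/CALCULATE chain is assumed to behave as it does for other collections):

import pandas as pd
import pydough

df = pd.DataFrame({"color": ["red", "green", "blue"], "idx": [0, 1, 2]})
colors = pydough.dataframe_collection("colors", df)

# Inside a PyDough-enabled context (e.g. a PyDough Jupyter cell) one could
# then filter and derive new columns from the collection, along the lines of:
#   result = colors.WHERE(idx > 0).CALCULATE(color, doubled=idx * 2)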

Contributor

Reminder to update the example here

Comment on lines -282 to +292
table_alias.append("columns", exp.to_identifier(f"_col_{i}"))
# PyDough change: adjust the formula to match the
# dialect's
if isinstance(dialect, SQLite):
table_alias.append(
"columns", exp.to_identifier(f"column{i + 1}")
)
else:
table_alias.append(
"columns", exp.to_identifier(f"_col_{i}")
)
Contributor

For context, this naming convention matters a lot when handling SQLite: SQLGlot uses the wrong one, so it glitches out, but if we use the SQLite convention for other dialects it causes SQLGlot to glitch out in other places. This is not an ideal fix, but it's much more straightforward than taking SQLGlot apart to actually fix the issue.
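For reference, SQLite's convention can be checked directly from Python:

import sqlite3

# SQLite names the columns of a VALUES table column1, column2, ..., which is
# why the alias must follow that convention for the SQLite dialect.
con = sqlite3.connect(":memory:")
cur = con.execute("SELECT * FROM (VALUES (1, 'a'), (2, 'b'))")
print([d[0] for d in cur.description])  # ['column1', 'column2']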


@property
def unique_column_names(self) -> list[str]:
return list(dict.fromkeys(self.columns))
Contributor

Aww heck, I knew I forgot something in the spec: we need the option to specify subsets of the columns that are unique (just like we do in metadata). It can be an optional argument in pydough.dataframe_collection with type list[str | list[str]], where the strings are the column names. For instance, in your dsl.md example, it would be ["color", "idx"] since both columns are unique.

We should then have tests on the behavior of window functions / correlation with DataFrames that use this argument.
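A sketch of what that could look like from the caller's side (the keyword name unique_properties is hypothetical):

import pandas as pd
import pydough

df = pd.DataFrame({"color": ["red", "green", "blue"], "idx": [0, 1, 2]})

# Both "color" and "idx" are individually unique in this frame; the keyword
# name below is hypothetical, pending the actual implementation.
colors = pydough.dataframe_collection("colors", df, unique_properties=["color", "idx"])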

Contributor

Reminder to do this^

Comment on lines +3267 to +3269
"r = pydough.range_collection('tbl', 'v', 0, 500, 13).CALCULATE(first_digit=INTEGER(STRING(v)[:1]))\n"
"result = r.PARTITION(name='digits', by=first_digit).CALCULATE(first_digit, n=COUNT(tbl))",
"TPCH",
Contributor

These new tests cover some categories of edge cases John discovered while testing DataFrames which we realized hadn't been covered/handled yet for simple range (hence all the weird HybridTree changes).

Comment on lines 108 to 111
elif len(dataframe[col]) == 0:
raise TypeError(
f"Column '{col}' is empty. All columns must have at least one value."
)
Contributor

Oh shoot that reminds me... we should have some tests for DataFrames with 1+ columns but no rows. There's a lot of edge cases to consider there, e.g. what happens if you cross join it with something else.
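For reference, the analogous pandas semantics (a zero-row frame cross-joined with anything stays zero rows):

import pandas as pd

empty = pd.DataFrame({"a": pd.Series([], dtype="int64")})
other = pd.DataFrame({"b": [1, 2]})
# Cross join with a zero-row frame yields an empty frame with columns a, b.
print(empty.merge(other, how="cross"))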

Comment on lines 124 to 125
match field_type:
case _ if pa.types.is_null(field_type):
Contributor

If all of these are going to be case _ if ...:, we should just be doing a bunch of if ... elif ... elif ... instead.

Comment on lines 143 to 146
case _ if pa.types.is_binary(field_type) or pa.types.is_large_binary(
field_type
):
return StringType()
Contributor

Shouldn't this be rejected for now?

Comment on lines 169 to 170
case _:
return UnknownType()
Contributor

Are there cases where this happens that are valid? If so, what are they? (These should be documented in comments.) Otherwise, we should explicitly ban things with clear error messages.

Contributor

Same question. Let's be clear about what those unknowns are.

Contributor

@hadia206 left a comment

You addressed a lot of things. Almost there :)

Supported Signatures:
- `dataframe_collection(dataframe, name)`: generates collection with the given datafram and name.

#### Example
Contributor

Reminder to update the example here

Comment on lines +99 to +100
and self.name == other.name
and self.columns == other.columns
Contributor

Do we need these, if we already check self.dataframe.equals on the line below?
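For reference, DataFrame.equals already compares column labels, not just values:

import pandas as pd

# equals() returns False when column labels differ, even if the values match:
print(pd.DataFrame({"a": [1]}).equals(pd.DataFrame({"b": [1]})))  # False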


except pa.ArrowInvalid:
raise ValueError(
f"Mixed types in column '{col}'. All values in a column must be of the same type."
Contributor

Let's generalize the error message here. I believe this error is not just for mixed types; it can appear with other type conversion issues.

Example I have seen online:
ArrowInvalid: Invalid null value (in PyArrow this means the data doesn't match the expected type)
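For reference, a mixed-type column is only one of the ways to trigger it:

import pyarrow as pa

try:
    pa.array([1, "a"])  # inference picks int64 from the first value, then fails
except pa.ArrowInvalid as e:
    print(type(e).__name__, e)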

raise ValueError("Structs are not supported for dataframe collections")

elif pa_types.is_dictionary(field_type):
raise ValueError("Dictionaries are not supported for dataframe collections")
Contributor

What about tuples? Do we support those?


The supported PyDough types for `dataframe_collection` are:
- `NumericType`: includes float, integer, infinity, Nan.
- `BooleanType`: includes classic true and false.
Contributor

Suggested change
- `BooleanType`: includes classic true and false.
- `BooleanType`: True and False.

The supported PyDough types for `dataframe_collection` are:
- `NumericType`: includes float, integer, infinity, Nan.
- `BooleanType`: includes classic true and false.
- `StringType`: includes all aphanumeric caracters.
Contributor

Removed and fixed the typo

Suggested change
- `StringType`: includes all aphanumeric caracters.
- `StringType`: alphanumeric characters.

- `NumericType`: includes float, integer, infinity, Nan.
- `BooleanType`: includes classic true and false.
- `StringType`: includes all aphanumeric caracters.
- `Datetype`: includes date and datetime.
Contributor

Timestamp? NaT?

Suggested change
- `Datetype`: includes date and datetime.
- `Datetype`: date and datetime.

- float64
- decimal
- bool
- datetime64
Contributor

date?
what about timestamp?



def dataframe_collection(
name: str, dataframe: pd.DataFrame
Contributor

This is missing Kian's request here

Comment on lines +200 to +201
else:
return UnknownType()
Contributor

Let's clarify what these are, if any are still missing.
See my comment in the dsl.md as well.

Development

Successfully merging this pull request may close these issues.

Add support for user-created collections in PyDough
