Dataframe Collection (User Collections) #469
Conversation
… SQLGlot bug discovered along the way [RUN CI]
… into John/df_collection
hadia206
left a comment
You addressed a lot of things. Almost there :)
documentation/dsl.md
| Supported Signatures:
| - `dataframe_collection(dataframe, name)`: generates a collection with the given dataframe and name.
| #### Example
Reminder to update the example here
| and self.name == other.name
| and self.columns == other.columns
Do we need these, if we already check `self.dataframe.equals` in the line below?
I think we still need the `self.name == other.name`, right? Or do we only care about what's inside the dataframe and not the name itself?
Comparison is usually on the content; the name is just an identifier.
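For reference, a minimal pandas sketch of content-only comparison (plain pandas, standing in for the PyDough class under review, which compares `self.dataframe.equals(other.dataframe)`):

```python
import pandas as pd

# Two frames with identical contents compare equal via DataFrame.equals,
# regardless of any external "name" attached to a collection wrapper.
a = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
b = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
print(a.equals(b))  # True
```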
| except pa.ArrowInvalid:
| raise ValueError(
| f"Mixed types in column '{col}'. All values in a column must be of the same type."
Let's generalize the error message here. I believe this error is not just for mixed types; it can appear with other type conversion issues. An example I have seen online: `ArrowInvalid: Invalid null value`, which in PyArrow means the data doesn't match the expected type.
| raise ValueError("Structs are not supported for dataframe collections")
| elif pa_types.is_dictionary(field_type):
| raise ValueError("Dictionaries are not supported for dataframe collections")
What about tuples? Do we support those?
There is no tuple type in PyArrow; tuples usually fall into `list`, `large_list`, `fixed_size_list` (added), or `struct`.
documentation/dsl.md
| The supported PyDough types for `dataframe_collection` are:
| - `NumericType`: includes float, integer, infinity, Nan.
| - `BooleanType`: includes classic true and false.
Suggested change:
```diff
- - `BooleanType`: includes classic true and false.
+ - `BooleanType`: True and False.
```
documentation/dsl.md
| The supported PyDough types for `dataframe_collection` are:
| - `NumericType`: includes float, integer, infinity, Nan.
| - `BooleanType`: includes classic true and false.
| - `StringType`: includes all aphanumeric caracters.
Removed and fixed the typo.
Suggested change:
```diff
- - `StringType`: includes all aphanumeric caracters.
+ - `StringType`: alphanumeric characters.
```
documentation/dsl.md
| - `NumericType`: includes float, integer, infinity, Nan.
| - `BooleanType`: includes classic true and false.
| - `StringType`: includes all aphanumeric caracters.
| - `Datetype`: includes date and datetime.
Timestamp? NaT?
Suggested change:
```diff
- - `Datetype`: includes date and datetime.
+ - `Datetype`: date and datetime.
```
| def dataframe_collection(
| name: str, dataframe: pd.DataFrame
| else:
| return UnknownType()
Let's clarify what these are, if any are still missing. See my comment in dsl.md as well.
hadia206
left a comment
I have one question about the unique properties.
| def equals(self, other) -> bool:
| return (
| isinstance(other, DataframeGeneratedCollection)
| and self.name == other.name
| # Scans are unchanged, and their uniqueness is based on the unique sets
| # of the underlying table.
| case Scan():
| case Scan() | GeneratedTable():
Because `GeneratedTable` now has unique_sets based on the `unique_column_names` argument, both are managed the same way (triggering optimizations and SQL generation that take those unique columns into account).
| def dataframe_collection(
| name: str, dataframe: pd.DataFrame
| name: str, dataframe: pd.DataFrame, unique_column_names: list[str | list[str]]
Shouldn't this be optional? As it stands, the user has to specify unique column names, which is not intuitive.
Also, I have another request (but that can be a followup if it's too much): another optional argument specifying some column names, to be able to create a collection from those columns only.
My understanding was that this argument is mandatory, because PyDough needs at least one primary key. Regarding the second request, I'll spend some time (not more than 2 hours) trying to implement it; if I think it'll take too long, I'll let you know.
I don't think unique is mandatory.
@knassre-bodo ^?
Discussed offline: it's mandatory.
| if not all(col in dataframe.columns for col in unique_flatten_columns):
| raise ValueError(
| "Not existing column from 'unique_column_names' in the dataframe."
Add the column name to the message.
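A hypothetical sketch of the requested fix (the helper name and message wording are illustrative, not PyDough's actual code): collect the missing names so the error tells the user exactly which columns are absent.

```python
import pandas as pd

def check_unique_columns_exist(
    dataframe: pd.DataFrame, unique_flatten_columns: list[str]
) -> None:
    # Gather every offending name instead of failing anonymously.
    missing = [
        col for col in unique_flatten_columns if col not in dataframe.columns
    ]
    if missing:
        raise ValueError(
            "The following column(s) from 'unique_column_names' are missing "
            f"in the dataframe: {', '.join(missing)}"
        )

# Example: 'id' is not a column of the dataframe below.
df = pd.DataFrame({"key": [1, 2], "name": ["a", "b"]})
try:
    check_unique_columns_exist(df, ["key", "id"])
except ValueError as exc:
    print(exc)
```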
| )
| if not isinstance(unique_column_names, list):
| raise TypeError(
| f"Expected 'unique_column_names' to be a list of string, got {type(unique_column_names).__name__}"
It could also be a list of lists of strings.
knassre-bodo
left a comment
Just a few more things to tinker with, but LGTM!!!
| - `name`: The name of the dataframe collection.
| - `dataframe`: The Pandas dataframe containing all data (rows and columns).
Don't forget the uniqueness stuff in the docs
| - `dataframe_collection`: Function to create a dataframe collection with the specified parameters.
| - `name`: The name of the dataframe collection.
| - `dataframe`: The Pandas dataframe.
| - Returns: An instance of `DataframeGeneratedCollection`.
| PMSPS=DEFAULT_TO(COUNT(filtered_sales), 0),
| PMSR=DEFAULT_TO(SUM(filtered_sales.sale_price), 0),
Both COUNT and SUM should already default to 0 in this case. Does the test work without `DEFAULT_TO`?
| class_df = pd.DataFrame(
| {
| "key": [15112, 15122, 15150, 15210, 15251],
| "class_name": [
| "Programming Fundamentals",
| "Imperative Programming",
| "Functional Programming",
| "Parallel Algorithms",
| "Theoretical CS",
| ],
| "language": ["Python", "C", "SML", "SML", None],
| }
| )
If you have repeated DataFrames, you can define them in a common location elsewhere in this file.
| .WHERE(tid == teacher_1)
| .PARTITION(name="classes", by=(first_name, last_name))
| .CALCULATE(first_name, last_name, n_teachers=COUNT(teachers))
| )
A few more tests to try with any of the DataFrames:
- Just a single collection that gets partitioned on some of its columns that are already unique (e.g. partition `teacher_tbl` by the first/last name) -> check to make sure the `GROUP BY` is optimized out.
- Something with a more correlated join. E.g. for each class, how many different classes are taught in the same language:

```
other_classes_same_language = CROSS(
    classes
    .WHERE((language == original_language) & (key != original_key))
)
result = (
    classes
    .CALCULATE(original_language=language, original_class=key)
    .CALCULATE(
        class_name,
        language,
        n_other_classes=COUNT(other_classes_same_language)
    )
)
```

If this is truly correlated, the `classes` table should show up three times in the final relational plan/SQL. If not, then the correlations are being optimized out and a different query is needed.
knassre-bodo
left a comment
LGTM, letting Hadia do final approval
hadia206
left a comment
Great work John.
I have some minor comments; please address them before merging.
documentation/dsl.md
| - `name`: The name of the dataframe collection.
| - `dataframe`: Pandas DataFrame containing the corresponding data.
| - `unique_column_names`: List of strings or list of strings
This should be a list of strings or a list of lists of strings.
I'd phrase it as "list of elements that are either strings or lists of strings", since it can have both in the same list, e.g. `["A", ["B", "C"]]`.
documentation/dsl.md
| If provided, indicates all columns from the original dataframe that will be in the
| final dataframe collection.
| **Note**: All columns in `unique_column_names` must be included in `filter_columns`; otherwise, an error will be raised.
This is a followup: could we have the ability to include only some of the unique columns? For example, if the unique columns are `["column1", ["column2", "column3"]]`, the filter columns could include `column1` only, or `column2` and `column3` without `column1`.
Sure, I'll add this to the followup GitHub issue so we don't forget later.
pydough/user_collections/README.md
| - `DataframeGeneratedCollection`: Class used to create a dataframe collection using the given dataframe and name.
| - `name`: The name of the dataframe collection.
| - `dataframe`: The Pandas dataframe containing all data (rows and columns).
| - `unique_column_names`: List of string or list or string
Same here: missing the list-of-lists case.
| raise TypeError(
| f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
This error message is misleading if I pass a list of integers, for example. Generalize the error message, or split the check and have one error message for the `isinstance(..., list)` check and one for the all-elements-are-strings check.
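A hedged sketch of the split-check option (a standalone helper, not the PR's actual code), with one accurate message per failure mode:

```python
def check_filter_columns(filter_columns) -> None:
    # First failure mode: the argument is not a list at all.
    if not isinstance(filter_columns, list):
        raise TypeError(
            f"Expected 'filter_columns' to be a list, got {type(filter_columns).__name__}"
        )
    # Second failure mode: the list contains non-string elements.
    bad = [col for col in filter_columns if not isinstance(col, str)]
    if bad:
        raise TypeError(
            f"Expected all elements of 'filter_columns' to be strings, got: {bad!r}"
        )

# A list of integers now gets an accurate message instead of a misleading one.
try:
    check_filter_columns([1, 2, 3])
except TypeError as exc:
    print(exc)
```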
knassre-bodo
left a comment
Adding on a few more test requests based on Hadia's review
documentation/dsl.md
| If provided, indicates all columns from the original dataframe that will be in the
| final dataframe collection.
Suggested change:
```diff
- If provided, indicates all columns from the original dataframe that will be in the
- final dataframe collection.
+ If provided, indicates a subset of the columns from the original dataframe that will be in the
+ final dataframe collection, and the order they will be in. If omitted, indicates that
+ all of the columns should be included in the same order they are currently present
```
documentation/dsl.md
| `(list [str | list[ str ]])` representing the unique properties for the dataframe
| collection. For example: ["column1", ["column2", "column3"]] indicates `column1`
| is a unique property and the combination of column2 and column3 is also unique.
| - `filter_columns`(optional): List of filter/selected columns from the dataframe.
Suggested change:
```diff
- - `filter_columns`(optional): List of filter/selected columns from the dataframe.
+ - `filter_columns` (optional): List of filter/selected columns from the dataframe.
```
Can we perhaps use a different name, like `column_subset` or `chosen_columns`?
| re.escape(
| "The following column(s) from 'unique_column_names' are missing in `filter_columns`: id"
| ),
| id="dataframe_collection_bad_9",
Some other bad behaviors to test:
- Missing unique columns
- The filtered columns include repeats
- The dataframe columns are not valid PyDough column names
- The unique columns is an empty list, or contains an empty list (e.g. `[]` or `["A", []]`)
| pytest.param(
| dataframe_collection_bad_5,
| None,
| re.escape(
| "Arrays in column 'col1', are not supported for dataframe collections"
| ),
| id="dataframe_collection_bad_5",
| ),
| pytest.param(
Note: we should have a regular test with a column that is a bad type but is NOT included in the filtered columns, which means it does NOT raise an error (e.g. column `x` is an array type but is not included in the filtered columns), since the type checking/inference only matters for the columns we are using.
| def simple_dataframe_collection_1():
I don't see any non-error tests where the filtered columns come into play. Let's include a test with DataFrame columns `[A, B, C, D, E]` and `filter_columns` as `["D", "C", "A"]`, then make sure that the answer only has the 3 columns we want (in the correct order).
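The requested check can be mirrored with plain pandas column selection (here standing in for `dataframe_collection`, which is PyDough API and assumed unavailable outside the project):

```python
import pandas as pd

# Source frame with columns A..E; selecting ["D", "C", "A"] should
# yield exactly those three columns, in that order.
df = pd.DataFrame({c: [1, 2] for c in ["A", "B", "C", "D", "E"]})
filter_columns = ["D", "C", "A"]
selected = df[filter_columns]
print(list(selected.columns))  # ['D', 'C', 'A']
```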
| if not isinstance(filter_columns, list) and all(
| isinstance(col, str) for col in filter_columns
| ):
| raise TypeError(
| f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
| )
Instead, we can just use the predicates from `error_utils.py`. Add `from pydough.errors.error_utils import NonEmptyListOf, is_string` to the top, then replace this entire chunk with the following:

Suggested change:
```diff
- if not isinstance(filter_columns, list) and all(
-     isinstance(col, str) for col in filter_columns
- ):
-     raise TypeError(
-         f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
-     )
+ NonEmptyListOf(is_string).verify(filter_columns, "filter_columns")
```
| @staticmethod
| def valid_unique_column_names(unique_columns_name: list[str | list[str]]) -> bool:
We already have something like this in `error_utils.py`, used for metadata, called `unique_properties_predicate`. Calling `unique_properties_predicate.verify(unique_columns_names, "unique_columns_names")` verifies that `unique_columns_names` is a non-empty list of objects that are either strings or non-empty lists of strings, and if not, raises an exception with the name "unique_columns_names" inside the message.
Resolves #162