Dataframe Collection (User Collections) #469

Merged: john-sanchez31 merged 55 commits into main from John/df_collection on Feb 20, 2026

Conversation

@john-sanchez31 (Contributor):

Resolves #162

@john-sanchez31 john-sanchez31 marked this pull request as draft January 8, 2026 14:43
@john-sanchez31 john-sanchez31 self-assigned this Jan 8, 2026
@john-sanchez31 john-sanchez31 added enhancement New feature or request extensibility Increasing situations in which PyDough works effort - high major issue that will require multiple steps or complex design testing Alters the testing/CI process for PyDough labels Jan 8, 2026
@hadia206 (Contributor) left a comment:

You addressed a lot of things. Almost there :)

Supported Signatures:
- `dataframe_collection(dataframe, name)`: generates a collection with the given dataframe and name.

#### Example
Contributor:

Reminder to update the example here

Comment on lines 99 to 100
and self.name == other.name
and self.columns == other.columns
Contributor:

Do we need these? If we already check self.dataframe.equals in the line below it?

Contributor Author:

I think we still need the self.name == other.name right? Or do we care just what's inside the dataframe and not the name itself?

Contributor:

Comparison is usually on the content. The name is just an identifier


except pa.ArrowInvalid:
raise ValueError(
f"Mixed types in column '{col}'. All values in a column must be of the same type."
Contributor:

Let's generalize the error message here. I believe this error is not just for mixed types; it can appear with other type conversion issues as well.

Example I have seen online
ArrowInvalid: Invalid null value in PyArrow means data doesn't match the expected type
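A minimal sketch of the generalized handling being suggested. `ArrowInvalid` below is a local stand-in for `pyarrow.lib.ArrowInvalid` so the snippet is self-contained, and `convert_column` is a hypothetical helper, not PyDough's actual code:

```python
# Stand-in for pyarrow.lib.ArrowInvalid; the real code would catch the
# pyarrow exception directly.
class ArrowInvalid(Exception):
    pass


def convert_column(col: str, do_convert):
    """Run a column conversion, re-raising ArrowInvalid with a general message."""
    try:
        return do_convert()
    except ArrowInvalid as e:
        # ArrowInvalid is not only raised for mixed types (e.g. invalid
        # nulls also trigger it), so the message names the column but does
        # not assert a specific cause; the original error text is preserved.
        raise ValueError(
            f"Unable to convert column '{col}' to a supported PyArrow type: {e}"
        ) from e
```

Chaining with `from e` keeps the original PyArrow diagnostics visible while presenting a column-specific message to the user.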

raise ValueError("Structs are not supported for dataframe collections")

elif pa_types.is_dictionary(field_type):
raise ValueError("Dictionaries are not supported for dataframe collections")
Contributor:

what about tuples? Do we support that?

@john-sanchez31 (Contributor Author), Feb 10, 2026:

There is no tuple type in PyArrow; tuples usually fall into list, large_list, fixed_size_list (added), or struct.


The supported PyDough types for `dataframe_collection` are:
- `NumericType`: includes float, integer, infinity, Nan.
- `BooleanType`: includes classic true and false.
Contributor:

Suggested change
- `BooleanType`: includes classic true and false.
- `BooleanType`: True and False.

The supported PyDough types for `dataframe_collection` are:
- `NumericType`: includes float, integer, infinity, Nan.
- `BooleanType`: includes classic true and false.
- `StringType`: includes all aphanumeric caracters.
Contributor:

Removed and fixed the typo

Suggested change
- `StringType`: includes all aphanumeric caracters.
- `StringType`: alphanumeric characters.

- `NumericType`: includes float, integer, infinity, Nan.
- `BooleanType`: includes classic true and false.
- `StringType`: includes all aphanumeric caracters.
- `Datetype`: includes date and datetime.
Contributor:

Timestamp? NaT?

Suggested change
- `Datetype`: includes date and datetime.
- `Datetype`: date and datetime.



def dataframe_collection(
name: str, dataframe: pd.DataFrame
Contributor:

This is missing Kian's request here

Comment on lines 200 to 201
else:
return UnknownType()
Contributor:

Let's clarify what are these if any are still missing.
See my comment in the dsl.md as well.

@hadia206 (Contributor) left a comment:

I have one question about the unique properties.

def equals(self, other) -> bool:
return (
isinstance(other, DataframeGeneratedCollection)
and self.name == other.name
Contributor:

reminder to remove this

# Scans are unchanged, and their uniqueness is based on the unique sets
# of the underlying table.
case Scan():
case Scan() | GeneratedTable():
Contributor:

Why moved it here?

Contributor Author:

Because GeneratedTable now has unique_sets based on the unique_column_names argument, both are managed the same way (triggering optimizations and SQL generation that take those unique columns into account).


def dataframe_collection(
name: str, dataframe: pd.DataFrame
name: str, dataframe: pd.DataFrame, unique_column_names: list[str | list[str]]
Contributor:

Shouldn't this be optional? Because now it means the user has to specify unique column names, which is not intuitive.
Also, I have another request (but that can be a followup if it's too much): another optional argument specifying some column names, to be able to create a collection from those columns only.

Contributor Author:

My understanding was that this argument is mandatory, because PyDough needs at least one primary key. Regarding the second request, I'll spend some time (not more than 2 hours) trying to implement it; if I think it'll take too much, I'll let you know.

Contributor:

I don't think unique is mandatory.
@knassre-bodo ^?

Contributor:

Discussed offline, it's mandatory


if not all(col in dataframe.columns for col in unique_flatten_columns):
raise ValueError(
"Not existing column from 'unique_column_names' in the dataframe."
Contributor:

Add column name to the message
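A sketch of the suggested fix, naming the offending column(s) in the message. `dataframe_columns` stands in for `dataframe.columns`, and the function name is hypothetical:

```python
def check_unique_columns_exist(dataframe_columns, unique_flatten_columns):
    """Raise if any flattened unique column is absent from the dataframe."""
    missing = [c for c in unique_flatten_columns if c not in dataframe_columns]
    if missing:
        # Naming the exact missing column(s) makes the error actionable,
        # unlike a generic "not existing column" message.
        raise ValueError(
            "The following column(s) from 'unique_column_names' are missing "
            f"in the dataframe: {', '.join(missing)}"
        )
```

Collecting all missing names before raising also reports every problem at once instead of one per run.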

)
if not isinstance(unique_column_names, list):
raise TypeError(
f"Expected 'unique_column_names' to be a list of string, got {type(unique_column_names).__name__}"
Contributor:

it could also be a list of list of strings
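A sketch of a validator accepting the full shape discussed here — a list whose elements are strings or non-empty lists of strings, e.g. `["A", ["B", "C"]]`. (The PR later replaces hand-rolled checks like this with `unique_properties_predicate` from `error_utils.py`; this is just an illustration of the required shape.)

```python
def validate_unique_column_names(unique_column_names):
    """Check unique_column_names is a non-empty list of str or list[str]."""
    if not isinstance(unique_column_names, list) or not unique_column_names:
        raise TypeError(
            "Expected 'unique_column_names' to be a non-empty list of strings "
            f"or lists of strings, got {type(unique_column_names).__name__}"
        )
    for entry in unique_column_names:
        if isinstance(entry, str):
            continue
        # A compound key is a non-empty list of strings.
        if (isinstance(entry, list) and entry
                and all(isinstance(c, str) for c in entry)):
            continue
        raise TypeError(
            "Each element of 'unique_column_names' must be a string or a "
            f"non-empty list of strings, got {entry!r}"
        )
```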

@knassre-bodo (Contributor) left a comment:

Just a few more things to tinker with, but LGTM!!!

Comment on lines +19 to +20
- `name`: The name of the dataframe collection.
- `dataframe`: The Pandas dataframe containing all data (rows and columns).
Contributor:

Don't forget the uniqueness stuff in the docs

Comment on lines 30 to 33
- `dataframe_collection`: Function to create a dataframe collection with the specified parameters.
- `name`: The name of the dataframe collection.
- `dataframe`: The Pandas dataframe.
- Returns: An instance of `DataframeGeneratedCollection`.
Contributor:

Same here

Comment on lines 864 to 865
PMSPS=DEFAULT_TO(COUNT(filtered_sales), 0),
PMSR=DEFAULT_TO(SUM(filtered_sales.sale_price), 0),
Contributor:

Both COUNT and SUM should be defaulting to 0 in this case already. Does the test work without DEFAULT_TO?

Comment on lines 763 to 775
class_df = pd.DataFrame(
{
"key": [15112, 15122, 15150, 15210, 15251],
"class_name": [
"Programming Fundamentals",
"Imperative Programming",
"Functional Programming",
"Parallel Algorithms",
"Theoretical CS",
],
"language": ["Python", "C", "SML", "SML", None],
}
)
Contributor:

If you have repeat DataFrames, you can define them elsewhere in this file in a common location

.WHERE(tid == teacher_1)
.PARTITION(name="classes", by=(first_name, last_name))
.CALCULATE(first_name, last_name, n_teachers=COUNT(teachers))
)
Contributor:

A few more tests to try with any of the DataFrames:

  • Just a single collection that gets partitioned on some of its columns that are already unique (e.g. partition teacher_tbl by the first/last name) -> check to make sure the GROUP BY is optimized out.
  • Something with a more correlated join. E.g. for each class, how many different classes are taught in the same language:
other_classes_same_language = CROSS(
  classes
  .WHERE((language == original_language) & (key != original_key))
)
result = (
 classes
 .CALCULATE(original_language=language, original_class=key)
 .CALCULATE(
  class_name,
  language,
  n_other_classes=COUNT(other_classes_same_language)
 )
)

^ If this is truly correlated, the classes table should show up three times in the final relational plan/sql. If not, then the correlations are being optimized out and a different query is needed.

@knassre-bodo (Contributor) left a comment:

LGTM, letting Hadia do final approval

@hadia206 (Contributor) left a comment:

Great work John.
I have some minor comments, please address before merging.


- `name`: The name of the dataframe collection.
- `dataframe`: Pandas DataFrame containing the corresponding data.
- `unique_column_names`: List of strings or list of strings
Contributor:

This should be list of strings or list of list of strings

Contributor:

I'd phrase it as List of elements that are either strings or lists of strings (since it can have both in the same list, e.g. ["A", ["B", "C"]])

If provided, indicates all columns from the original dataframe that will be in the
final dataframe collection.

**Note**: All columns in `unique_column_names` must be included in `filter_columns`; otherwise, an error will be raised.
Contributor:

This is a followup: could we have the ability to include only some of the unique columns?
For example, if the unique columns are ["column1", ["column2", "column3"]],
in the filter columns I could include column1 only, or column2 and column3 without column1.

Contributor Author:

Sure, I'll add this to the followup github issue so we don't forget later

- `DataframeGeneratedCollection`: Class used to create a dataframe collection using the given dataframe and name.
- `name`: The name of the dataframe collection.
- `dataframe`: The Pandas dataframe containing all data (rows and columns).
- `unique_column_names`: List of string or list or string
Contributor:

Same missing list of list

Comment on lines 87 to 88
raise TypeError(
f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
Contributor:

This error message is misleading if I passed a list of integers for example.
Generalize the error message or split the check and have one error message for list and one for all(col, str).

@knassre-bodo (Contributor) left a comment:

Adding on a few more test requests based on Hadia's review

Comment on lines 1593 to 1594
If provided, indicates all columns from the original dataframe that will be in the
final dataframe collection.
Contributor:

Suggested change
If provided, indicates all columns from the original dataframe that will be in the
final dataframe collection.
If provided, indicates a subset of the columns from the original dataframe that will be in the
final dataframe collection, and the order they will be in. If omitted, indicates that
all of the columns should be included in the same order they are currently present

`(list[str | list[str]])` representing the unique properties for the dataframe
collection. For example: ["column1", ["column2", "column3"]] indicates `column1`
is a unique property and the combination of column2 and column3 is also unique.
- `filter_columns`(optional): List of filter/selected columns from the dataframe.
Contributor:

Suggested change
- `filter_columns`(optional): List of filter/selected columns from the dataframe.
- `filter_columns` (optional): List of filter/selected columns from the dataframe.

Contributor:

Can we perhaps use a different name like column_subset or chosen_columns?

re.escape(
"The following column(s) from 'unique_column_names' are missing in `filter_columns`: id"
),
id="dataframe_collection_bad_9",
Contributor:

Some other bad behaviors to test:

  • Missing unique columns
  • The filtered columns include repeats
  • The dataframe columns are not valid PyDough column names
  • The unique columns is an empty list, or contains an empty list (e.g. [] or ["A", []])
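One of the bad behaviors listed above (repeated filtered columns) can be checked with a small helper; this is a hypothetical sketch, not PyDough's implementation:

```python
def check_no_repeated_columns(filter_columns):
    """Raise if filter_columns contains duplicate entries, naming them."""
    seen, repeats = set(), []
    for col in filter_columns:
        if col in seen and col not in repeats:
            repeats.append(col)
        seen.add(col)
    if repeats:
        # Report every repeated name once, in first-repeat order.
        raise ValueError(
            f"'filter_columns' contains repeated column(s): {', '.join(repeats)}"
        )
```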

Comment on lines +5544 to +5552
pytest.param(
dataframe_collection_bad_5,
None,
re.escape(
"Arrays in column 'col1', are not supported for dataframe collections"
),
id="dataframe_collection_bad_5",
),
pytest.param(
Contributor:

Note: we should have a regular test with a column that is a bad type but is NOT included in the filtered columns, which means it does NOT have an error (e.g. column x is an array type, but not included in the filtered columns) since the type checking/inference only matters for the columns we are using.

Comment on lines +155 to +157


def simple_dataframe_collection_1():
Contributor:

I don't see any non-error tests where the filtered columns come into play. Lets include a test with DataFrame columns [A, B, C, D, E] and the filter_columns as ["D", "C", "A"], then make sure that the answer only has the 3 columns we want (in the correct order).
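The selection-and-ordering behavior the requested test targets can be sketched with a plain dict of column name to values as a stand-in for a DataFrame (`apply_filter_columns` is a hypothetical helper):

```python
def apply_filter_columns(data: dict, filter_columns: list) -> dict:
    """Keep only the requested columns, in the requested order."""
    # dict insertion order is preserved, so iterating filter_columns
    # determines both the subset and the final column order.
    return {col: data[col] for col in filter_columns}
```

With columns A through E and `filter_columns=["D", "C", "A"]`, the result should hold exactly those three columns in that order.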

Comment on lines 84 to 89
if not isinstance(filter_columns, list) and all(
isinstance(col, str) for col in filter_columns
):
raise TypeError(
f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
)
Contributor:

Instead, can just use the predicates from error_utils.py. Add from pydough.errors.error_utils import NonEmptyListOf, is_string to the top, then replace this entire chunk with the following:

Suggested change
if not isinstance(filter_columns, list) and all(
isinstance(col, str) for col in filter_columns
):
raise TypeError(
f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
)
NonEmptyListOf(is_string).verify(filter_columns, "filter_columns")

Comment on lines 229 to 230
@staticmethod
def valid_unique_column_names(unique_columns_name: list[str | list[str]]) -> bool:
Contributor:

We already have something like this in error_utils.py used for metadata, called unique_properties_predicate. Calling unique_properties_predicate.verify(unique_columns_names, "unique_columns_names") verifies that unique_columns_names is a non-empty list of objects that are either strings or non-empty lists of strings, and if not raises an exception with the name "unique_columns_names" inside the message

@john-sanchez31 john-sanchez31 merged commit 4f69828 into main Feb 20, 2026
22 checks passed
@john-sanchez31 john-sanchez31 deleted the John/df_collection branch February 20, 2026 21:55


Successfully merging this pull request may close these issues: Add support for user-created collections in PyDough.