Dataframe Collection (User Collections) #469
Conversation
… SQLGlot bug discovered along the way [RUN CI]
… into John/df_collection
hadia206
left a comment
You addressed a lot of things. Almost there :)
documentation/dsl.md
| Supported Signatures:
| - `dataframe_collection(dataframe, name)`: generates a collection with the given dataframe and name.
| #### Example
Reminder to update the example here
| and self.name == other.name
| and self.columns == other.columns
Do we need these, if we already check `self.dataframe.equals` in the line below?
I think we still need the `self.name == other.name`, right? Or do we only care about what's inside the dataframe and not the name itself?
Comparison is usually on the content; the name is just an identifier.
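For reference, a minimal pandas sketch of content-only comparison (plain pandas, standing in for the PyDough class under review, which compares `self.dataframe.equals(other.dataframe)`):

```python
import pandas as pd

# Two frames with identical contents compare equal via DataFrame.equals,
# regardless of any external "name" attached to a collection wrapper.
a = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
b = pd.DataFrame({"x": [1, 2], "y": ["a", "b"]})
print(a.equals(b))  # True
```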
| except pa.ArrowInvalid:
| raise ValueError(
| f"Mixed types in column '{col}'. All values in a column must be of the same type."
Let's generalize the error message here. I believe this error is not just for mixed types; it can appear with other type conversion issues. An example I have seen online: `ArrowInvalid: Invalid null value`, which in PyArrow means the data doesn't match the expected type.
| raise ValueError("Structs are not supported for dataframe collections")
| elif pa_types.is_dictionary(field_type):
| raise ValueError("Dictionaries are not supported for dataframe collections")
What about tuples? Do we support those?
There is no tuple type in PyArrow; tuples usually fall into `list`, `large_list`, `fixed_size_list` (added), or `struct`.
documentation/dsl.md
| The supported PyDough types for `dataframe_collection` are:
| - `NumericType`: includes float, integer, infinity, Nan.
| - `BooleanType`: includes classic true and false.
Suggested change:
```diff
- - `BooleanType`: includes classic true and false.
+ - `BooleanType`: True and False.
```
documentation/dsl.md
| The supported PyDough types for `dataframe_collection` are:
| - `NumericType`: includes float, integer, infinity, Nan.
| - `BooleanType`: includes classic true and false.
| - `StringType`: includes all aphanumeric caracters.
Removed and fixed the typo.
Suggested change:
```diff
- - `StringType`: includes all aphanumeric caracters.
+ - `StringType`: alphanumeric characters.
```
documentation/dsl.md
| - `NumericType`: includes float, integer, infinity, Nan.
| - `BooleanType`: includes classic true and false.
| - `StringType`: includes all aphanumeric caracters.
| - `Datetype`: includes date and datetime.
Timestamp? NaT?
Suggested change:
```diff
- - `Datetype`: includes date and datetime.
+ - `Datetype`: date and datetime.
```
| def dataframe_collection(
| name: str, dataframe: pd.DataFrame
| else:
| return UnknownType()
Let's clarify what these are, if any are still missing. See my comment in dsl.md as well.
hadia206
left a comment
I have one question about the unique properties.
| def equals(self, other) -> bool:
| return (
| isinstance(other, DataframeGeneratedCollection)
| and self.name == other.name
| # Scans are unchanged, and their uniqueness is based on the unique sets
| # of the underlying table.
| case Scan():
| case Scan() | GeneratedTable():
Because `GeneratedTable` now has unique_sets based on the `unique_column_names` argument, both are managed the same way (triggering optimizations and SQL generation that take those unique columns into account).
| def dataframe_collection(
| name: str, dataframe: pd.DataFrame
| name: str, dataframe: pd.DataFrame, unique_column_names: list[str | list[str]]
Shouldn't this be optional? As it stands, the user has to specify unique column names, which is not intuitive.
Also, I have another request (but that can be a followup if it's too much): another optional argument specifying some column names, to be able to create a collection from those columns only.
My understanding was that this argument is mandatory, because PyDough needs at least one primary key. Regarding the second request, I'll spend some time (not more than 2 hours) trying to implement it; if I think it'll take too long, I'll let you know.
I don't think unique is mandatory.
@knassre-bodo ^?
Discussed offline: it's mandatory.
| if not all(col in dataframe.columns for col in unique_flatten_columns):
| raise ValueError(
| "Not existing column from 'unique_column_names' in the dataframe."
Add the column name to the message.
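A hypothetical sketch of the requested fix (the helper name and message wording are illustrative, not PyDough's actual code): collect the missing names so the error tells the user exactly which columns are absent.

```python
import pandas as pd

def check_unique_columns_exist(
    dataframe: pd.DataFrame, unique_flatten_columns: list[str]
) -> None:
    # Gather every offending name instead of failing anonymously.
    missing = [
        col for col in unique_flatten_columns if col not in dataframe.columns
    ]
    if missing:
        raise ValueError(
            "The following column(s) from 'unique_column_names' are missing "
            f"in the dataframe: {', '.join(missing)}"
        )

# Example: 'id' is not a column of the dataframe below.
df = pd.DataFrame({"key": [1, 2], "name": ["a", "b"]})
try:
    check_unique_columns_exist(df, ["key", "id"])
except ValueError as exc:
    print(exc)
```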
| )
| if not isinstance(unique_column_names, list):
| raise TypeError(
| f"Expected 'unique_column_names' to be a list of string, got {type(unique_column_names).__name__}"
It could also be a list of lists of strings.
knassre-bodo
left a comment
Just a few more things to tinker with, but LGTM!!!
| - `name`: The name of the dataframe collection.
| - `dataframe`: The Pandas dataframe containing all data (rows and columns).
Don't forget the uniqueness stuff in the docs
| - `dataframe_collection`: Function to create a dataframe collection with the specified parameters.
| - `name`: The name of the dataframe collection.
| - `dataframe`: The Pandas dataframe.
| - Returns: An instance of `DataframeGeneratedCollection`.
| PMSPS=DEFAULT_TO(COUNT(filtered_sales), 0),
| PMSR=DEFAULT_TO(SUM(filtered_sales.sale_price), 0),
Both COUNT and SUM should already default to 0 in this case. Does the test work without `DEFAULT_TO`?
| class_df = pd.DataFrame(
| {
| "key": [15112, 15122, 15150, 15210, 15251],
| "class_name": [
| "Programming Fundamentals",
| "Imperative Programming",
| "Functional Programming",
| "Parallel Algorithms",
| "Theoretical CS",
| ],
| "language": ["Python", "C", "SML", "SML", None],
| }
| )
If you have repeated DataFrames, you can define them in a common location elsewhere in this file.
| .WHERE(tid == teacher_1)
| .PARTITION(name="classes", by=(first_name, last_name))
| .CALCULATE(first_name, last_name, n_teachers=COUNT(teachers))
| )
A few more tests to try with any of the DataFrames:
- Just a single collection that gets partitioned on some of its columns that are already unique (e.g. partition `teacher_tbl` by the first/last name) -> check to make sure the `GROUP BY` is optimized out.
- Something with a more correlated join. E.g. for each class, how many different classes are taught in the same language:

```
other_classes_same_language = CROSS(
    classes
    .WHERE((language == original_language) & (key != original_key))
)
result = (
    classes
    .CALCULATE(original_language=language, original_class=key)
    .CALCULATE(
        class_name,
        language,
        n_other_classes=COUNT(other_classes_same_language)
    )
)
```

If this is truly correlated, the `classes` table should show up three times in the final relational plan/SQL. If not, then the correlations are being optimized out and a different query is needed.
knassre-bodo
left a comment
LGTM, letting Hadia do final approval
hadia206
left a comment
Great work John.
I have some minor comments; please address them before merging.
documentation/dsl.md
| - `name`: The name of the dataframe collection.
| - `dataframe`: Pandas DataFrame containing the corresponding data.
| - `unique_column_names`: List of strings or list of strings
This should be a list of strings or a list of lists of strings.
I'd phrase it as "list of elements that are either strings or lists of strings", since it can have both in the same list, e.g. `["A", ["B", "C"]]`.
documentation/dsl.md
| If provided, indicates all columns from the original dataframe that will be in the
| final dataframe collection.
| **Note**: All columns in `unique_column_names` must be included in `filter_columns`; otherwise, an error will be raised.
This is a followup: could we have the ability to include only some of the unique columns? For example, if the unique columns are `["column1", ["column2", "column3"]]`, the filter columns could include `column1` only, or `column2` and `column3` without `column1`.
Sure, I'll add this to the followup GitHub issue so we don't forget later.
pydough/user_collections/README.md
| - `DataframeGeneratedCollection`: Class used to create a dataframe collection using the given dataframe and name.
| - `name`: The name of the dataframe collection.
| - `dataframe`: The Pandas dataframe containing all data (rows and columns).
| - `unique_column_names`: List of string or list or string
Same here: missing the list-of-lists case.
| raise TypeError(
| f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
This error message is misleading if I pass a list of integers, for example. Generalize the error message, or split the check and have one error message for the `isinstance(..., list)` check and one for the all-elements-are-strings check.
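A hedged sketch of the split-check option (a standalone helper, not the PR's actual code), with one accurate message per failure mode:

```python
def check_filter_columns(filter_columns) -> None:
    # First failure mode: the argument is not a list at all.
    if not isinstance(filter_columns, list):
        raise TypeError(
            f"Expected 'filter_columns' to be a list, got {type(filter_columns).__name__}"
        )
    # Second failure mode: the list contains non-string elements.
    bad = [col for col in filter_columns if not isinstance(col, str)]
    if bad:
        raise TypeError(
            f"Expected all elements of 'filter_columns' to be strings, got: {bad!r}"
        )

# A list of integers now gets an accurate message instead of a misleading one.
try:
    check_filter_columns([1, 2, 3])
except TypeError as exc:
    print(exc)
```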
knassre-bodo
left a comment
Adding on a few more test requests based on Hadia's review
documentation/dsl.md
| If provided, indicates all columns from the original dataframe that will be in the
| final dataframe collection.
Suggested change:
```diff
- If provided, indicates all columns from the original dataframe that will be in the
- final dataframe collection.
+ If provided, indicates a subset of the columns from the original dataframe that will be in the
+ final dataframe collection, and the order they will be in. If omitted, indicates that
+ all of the columns should be included in the same order they are currently present
```
documentation/dsl.md
| `(list [str | list[ str ]])` representing the unique properties for the dataframe
| collection. For example: ["column1", ["column2", "column3"]] indicates `column1`
| is a unique property and the combination of column2 and column3 is also unique.
| - `filter_columns`(optional): List of filter/selected columns from the dataframe.
Suggested change:
```diff
- - `filter_columns`(optional): List of filter/selected columns from the dataframe.
+ - `filter_columns` (optional): List of filter/selected columns from the dataframe.
```
Can we perhaps use a different name, like `column_subset` or `chosen_columns`?
| re.escape(
| "The following column(s) from 'unique_column_names' are missing in `filter_columns`: id"
| ),
| id="dataframe_collection_bad_9",
Some other bad behaviors to test:
- Missing unique columns
- The filtered columns include repeats
- The dataframe columns are not valid PyDough column names
- The unique columns is an empty list, or contains an empty list (e.g. `[]` or `["A", []]`)
| pytest.param(
| dataframe_collection_bad_5,
| None,
| re.escape(
| "Arrays in column 'col1', are not supported for dataframe collections"
| ),
| id="dataframe_collection_bad_5",
| ),
| pytest.param(
Note: we should have a regular test with a column that is a bad type but is NOT included in the filtered columns, which means it does NOT raise an error (e.g. column `x` is an array type but is not included in the filtered columns), since the type checking/inference only matters for the columns we are using.
| def simple_dataframe_collection_1():
I don't see any non-error tests where the filtered columns come into play. Let's include a test with DataFrame columns `[A, B, C, D, E]` and `filter_columns` as `["D", "C", "A"]`, then make sure that the answer only has the 3 columns we want (in the correct order).
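The requested check can be mirrored with plain pandas column selection (here standing in for `dataframe_collection`, which is PyDough API and assumed unavailable outside the project):

```python
import pandas as pd

# Source frame with columns A..E; selecting ["D", "C", "A"] should
# yield exactly those three columns, in that order.
df = pd.DataFrame({c: [1, 2] for c in ["A", "B", "C", "D", "E"]})
filter_columns = ["D", "C", "A"]
selected = df[filter_columns]
print(list(selected.columns))  # ['D', 'C', 'A']
```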
| if not isinstance(filter_columns, list) and all(
| isinstance(col, str) for col in filter_columns
| ):
| raise TypeError(
| f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
| )
Instead, we can just use the predicates from `error_utils.py`. Add `from pydough.errors.error_utils import NonEmptyListOf, is_string` to the top, then replace this entire chunk with the following:

Suggested change:
```diff
- if not isinstance(filter_columns, list) and all(
-     isinstance(col, str) for col in filter_columns
- ):
-     raise TypeError(
-         f"Expected 'filter_columns' to a list of string, got {type(filter_columns).__name__}"
-     )
+ NonEmptyListOf(is_string).verify(filter_columns, "filter_columns")
```
| @staticmethod
| def valid_unique_column_names(unique_columns_name: list[str | list[str]]) -> bool:
We already have something like this in `error_utils.py`, used for metadata, called `unique_properties_predicate`. Calling `unique_properties_predicate.verify(unique_columns_names, "unique_columns_names")` verifies that `unique_columns_names` is a non-empty list of objects that are either strings or non-empty lists of strings, and if not, raises an exception with the name "unique_columns_names" inside the message.
Resolves #162