21 changes: 17 additions & 4 deletions src/lenskit/data/builder.py
@@ -1,3 +1,3 @@
# This file is part of LensKit.
# Copyright (C) 2018-2023 Boise State University.
# Copyright (C) 2023-2025 Drexel University.
@@ -265,10 +265,23 @@
             duplicates:
Review comment (Member):
We should update the docstring to document the kinds of attributes supported, limitations, etc., along with the index logic.

This should be in the main body of the docstring (before Args:), not in the argument documentation, for readability.
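A rough sketch of the kind of docstring body that suggestion points at; the wording and the exact set of supported attribute types are illustrative, not the final text:

        Add entities from a Pandas data frame or Arrow table.  Scalar
        columns are stored as scalar attributes and list-typed columns
        as list attributes; the {cls}_id column (or the data frame's
        index) supplies the entity IDs.

        Args:
            ...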

                 How to handle duplicate entity IDs.
         """
-        if isinstance(source, pd.DataFrame):  # pragma: nocover
-            raise NotImplementedError()
-        if isinstance(source, pa.Table):  # pragma: nocover
-            raise NotImplementedError()
+        if isinstance(source, pd.DataFrame):
+            source = pa.Table.from_pandas(source)
Review comment (Member):
There is an interesting and challenging edge case here that we need to clearly document and/or design for.

Right now, this works because your test case names the index item_id, which then turns into a column when we do from_pandas.

However, if the client provides a data frame that has no item_id column and an index with a different name, we need to figure out what to do. Do we want to use the index? Do we want to throw an error?

I think we probably want to use the Pandas index, with the following logic (sketched below):

  1. If the data frame has a column named {cls}_id, use that column as the entity IDs, and ignore the index.
  2. Otherwise, assume the index has the entity IDs.

Implementing this logic will require this line to be a little more aware of Pandas data frames, and also require tests for each of the different conditions. Importantly, for case (1), from_pandas will create a new attribute named after the index (index by default), and we don't want to include that.

The input cases we will need to test for correct behavior:

  1. Input has an index named {cls}_id (the current test)
  2. Input has an index named something else, and no column named {cls}_id
  3. Input has a column named {cls}_id

This isn't a problem for PyArrow input, because Arrow tables do not have indices.
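A minimal sketch of that logic, assuming the cls and source variables from this diff; preserve_index and rename_axis behave as in current PyArrow/pandas:

        if isinstance(source, pd.DataFrame):
            id_col = f"{cls}_id"
            if id_col in source.columns:
                # cases 1 and 3: the explicit ID column wins; drop the index
                # so from_pandas does not add it as a spurious attribute
                source = pa.Table.from_pandas(source, preserve_index=False)
            else:
                # case 2: treat the index as the entity IDs
                source = pa.Table.from_pandas(source.rename_axis(id_col).reset_index())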

+        if isinstance(source, pa.Table):
+            entity_col = source.column(cls + "_id")
+            self.add_entities(cls, entity_col)
+
+            for col_name in source.column_names:
+                if not col_name.endswith("_id"):
Review comment (Member):
We should only exclude the {cls}_id column; we want any other _id columns to result in an error, not be silently ignored.
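A sketch of that stricter check, assuming the loop variables from this diff and LensKit's DataError; the message text is invented:

            for col_name in source.column_names:
                if col_name == f"{cls}_id":
                    continue  # the entity ID column itself is not an attribute
                if col_name.endswith("_id"):
                    # any other *_id column is probably a mistake; fail loudly
                    raise DataError(f"unexpected ID column {col_name} for entity class {cls}")
                ...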

+                    col_type = source.column(col_name).type
+
+                    if any([pa.types.is_list(col_type),
+                            pa.types.is_large_list(col_type),
+                            pa.types.is_fixed_size_list(col_type)]):
+                        self.add_list_attribute(cls, col_name, entity_col, source.column(col_name))
+                    else:
+                        self.add_scalar_attribute(cls, col_name, entity_col, source.column(col_name))

CI check failure (GitHub Actions lint annotation): Ruff (E501) src/lenskit/data/builder.py:283:101: Line too long (101 > 100)
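One way to get those calls under the 100-character limit is to bind the column once; a sketch, not the committed fix:

                    col = source.column(col_name)
                    col_type = col.type
                    if any([pa.types.is_list(col_type),
                            pa.types.is_large_list(col_type),
                            pa.types.is_fixed_size_list(col_type)]):
                        self.add_list_attribute(cls, col_name, entity_col, col)
                    else:
                        self.add_scalar_attribute(cls, col_name, entity_col, col)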
+            return
+
         self._validate_entity_name(cls)

37 changes: 37 additions & 0 deletions tests/data/test_builder_entities.py
@@ -1,3 +1,3 @@
# This file is part of LensKit.
# Copyright (C) 2018-2023 Boise State University.
# Copyright (C) 2023-2025 Drexel University.
@@ -5,13 +5,15 @@
 # SPDX-License-Identifier: MIT

 # pyright: strict
 import numpy as np
+import pyarrow as pa
+import pandas as pd

 from pytest import raises

 from lenskit.data import DatasetBuilder
 from lenskit.diagnostics import DataError
 from lenskit.testing import ml_test_dir

CI check failure (GitHub Actions lint annotation): Ruff (I001) tests/data/test_builder_entities.py:8:1: Import block is un-sorted or un-formatted
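The ordering Ruff's isort rule expects within that group, assuming its default alphabetical sorting, would be:

import numpy as np
import pandas as pd
import pyarrow as pa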


def test_empty_builder():
@@ -143,3 +145,38 @@
     assert ds.item_count == 0
     assert ds.user_count == 8
     assert np.all(ds.users.ids() == ["a", "b", "x", "y", "z", "q", "r", "s"])


+def test_add_entities_with_dataframe():
+    dsb = DatasetBuilder()
+
+    items = pd.read_csv(ml_test_dir / "movies.csv")
+    items = items.rename(columns={"movieId": "item_id"}).set_index("item_id")
+
+    genres = items["genres"].str.split("|")
+    items["genres"] = genres
+
+    dsb.add_entities("item", items)
+
+    ds = dsb.build()
+
+    assert ds.entities("item").attribute("title").is_scalar
+    assert ds.entities("item").attribute("genres").is_list
Review comment (Member) on lines +163 to +164:
We should test that a few item IDs have the correct titles, too. It's possible for the code to set up the structures in the right format, but not align them correctly, and the tests should check for that.
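A possible check along those lines; the pandas() accessor on the attribute and the specific MovieLens titles are assumptions, not verified against this test data:

+    titles = ds.entities("item").attribute("title").pandas()
+    assert titles.loc[1] == "Toy Story (1995)"
+    assert titles.loc[3] == "Grumpier Old Men (1995)"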



+def test_add_entities_with_arrow_table():
+    dsb = DatasetBuilder()
+
+    items = pd.read_csv(ml_test_dir / "movies.csv")
+    items = items.rename(columns={"movieId": "item_id"}).set_index("item_id")
+
+    genres = items["genres"].str.split("|")
+    items["genres"] = genres
+    table = pa.Table.from_pandas(items)
+
+    dsb.add_entities("item", table)
+
+    ds = dsb.build()
+
+    assert ds.entities("item").attribute("title").is_scalar
+    assert ds.entities("item").attribute("genres").is_list
Review comment (Member) on lines +181 to +182: Same as above.
