21 changes: 17 additions & 4 deletions src/lenskit/data/builder.py
@@ -1,3 +1,3 @@
# This file is part of LensKit.
# Copyright (C) 2018-2023 Boise State University.
# Copyright (C) 2023-2025 Drexel University.
@@ -265,10 +265,23 @@
             duplicates:
Review comment (Member):
We should update the docstring to document the kinds of attributes supported, limitations, etc., along with the index logic.

This should be in the main body of the docstring (before Args:), not in the argument documentation, for readability.
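A rough sketch of the kind of docstring body that suggestion points at; the wording and the exact set of supported attribute types are illustrative, not the final text:

        Add entities from a Pandas data frame or Arrow table.  Scalar
        columns are stored as scalar attributes and list-typed columns
        as list attributes; the {cls}_id column (or the data frame's
        index) supplies the entity IDs.

        Args:
            ...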

                 How to handle duplicate entity IDs.
         """
-        if isinstance(source, pd.DataFrame):  # pragma: nocover
-            raise NotImplementedError()
-        if isinstance(source, pa.Table):  # pragma: nocover
-            raise NotImplementedError()
+        if isinstance(source, pd.DataFrame):
+            source = pa.Table.from_pandas(source)
Review comment (Member):
There is an interesting and challenging edge case here that we need to clearly document and/or design for.

Right now, this works because your test case names the index item_id, which then turns into a column when we do from_pandas.

However, if the client provides a data frame that has no item_id column and an index with a different name, we need to figure out what to do. Do we want to use the index? Do we want to throw an error?

I think we probably want to use the Pandas index, with the following logic (sketched below):

  1. If the data frame has a column named {cls}_id, use that column as the entity IDs, and ignore the index.
  2. Otherwise, assume the index has the entity IDs.

Implementing this logic will require this line to be a little more aware of Pandas data frames, and also require tests for each of the different conditions. Importantly, for case (1), from_pandas will create a new attribute named after the index (index by default), and we don't want to include that.

The input cases we will need to test for correct behavior:

  1. Input has an index named {cls}_id (the current test)
  2. Input has an index named something else, and no column named {cls}_id
  3. Input has a column named {cls}_id

This isn't a problem for PyArrow input, because Arrow tables do not have indices.
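A minimal sketch of that logic, assuming the cls and source variables from this diff; preserve_index and rename_axis behave as in current PyArrow/pandas:

        if isinstance(source, pd.DataFrame):
            id_col = f"{cls}_id"
            if id_col in source.columns:
                # cases 1 and 3: the explicit ID column wins; drop the index
                # so from_pandas does not add it as a spurious attribute
                source = pa.Table.from_pandas(source, preserve_index=False)
            else:
                # case 2: treat the index as the entity IDs
                source = pa.Table.from_pandas(source.rename_axis(id_col).reset_index())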

+        if isinstance(source, pa.Table):
+            entity_col = source.column(cls + "_id")
+            self.add_entities(cls, entity_col)
+
+            for col_name in source.column_names:
+                if not col_name.endswith("_id"):
Review comment (Member):
We should only exclude the {cls}_id column; we want any other _id columns to result in an error, not be silently ignored.
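A sketch of that stricter check, assuming the loop variables from this diff and LensKit's DataError; the message text is invented:

            for col_name in source.column_names:
                if col_name == f"{cls}_id":
                    continue  # the entity ID column itself is not an attribute
                if col_name.endswith("_id"):
                    # any other *_id column is probably a mistake; fail loudly
                    raise DataError(f"unexpected ID column {col_name} for entity class {cls}")
                ...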

+                    col_type = source.column(col_name).type
+
+                    if any([pa.types.is_list(col_type),
+                            pa.types.is_large_list(col_type),
+                            pa.types.is_fixed_size_list(col_type)]):
+                        self.add_list_attribute(cls, col_name, entity_col, source.column(col_name))
+                    else:
+                        self.add_scalar_attribute(cls, col_name, entity_col, source.column(col_name))

CI check failure (GitHub Actions lint annotation): Ruff (E501) src/lenskit/data/builder.py:283:101: Line too long (101 > 100)
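One way to get those calls under the 100-character limit is to bind the column once; a sketch, not the committed fix:

                    col = source.column(col_name)
                    col_type = col.type
                    if any([pa.types.is_list(col_type),
                            pa.types.is_large_list(col_type),
                            pa.types.is_fixed_size_list(col_type)]):
                        self.add_list_attribute(cls, col_name, entity_col, col)
                    else:
                        self.add_scalar_attribute(cls, col_name, entity_col, col)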
+            return
+
         self._validate_entity_name(cls)

37 changes: 37 additions & 0 deletions tests/data/test_builder_entities.py
@@ -1,3 +1,3 @@
# This file is part of LensKit.
# Copyright (C) 2018-2023 Boise State University.
# Copyright (C) 2023-2025 Drexel University.
@@ -5,13 +5,15 @@
 # SPDX-License-Identifier: MIT

 # pyright: strict
 import numpy as np
+import pyarrow as pa
+import pandas as pd

 from pytest import raises

 from lenskit.data import DatasetBuilder
 from lenskit.diagnostics import DataError
 from lenskit.testing import ml_test_dir

CI check failure (GitHub Actions lint annotation): Ruff (I001) tests/data/test_builder_entities.py:8:1: Import block is un-sorted or un-formatted
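The ordering Ruff's isort rule expects within that group, assuming its default alphabetical sorting, would be:

import numpy as np
import pandas as pd
import pyarrow as pa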


def test_empty_builder():
@@ -143,3 +145,38 @@
     assert ds.item_count == 0
     assert ds.user_count == 8
     assert np.all(ds.users.ids() == ["a", "b", "x", "y", "z", "q", "r", "s"])


+def test_add_entities_with_dataframe():
+    dsb = DatasetBuilder()
+
+    items = pd.read_csv(ml_test_dir / "movies.csv")
+    items = items.rename(columns={"movieId": "item_id"}).set_index("item_id")
+
+    genres = items["genres"].str.split("|")
+    items["genres"] = genres
+
+    dsb.add_entities("item", items)
+
+    ds = dsb.build()
+
+    assert ds.entities("item").attribute("title").is_scalar
+    assert ds.entities("item").attribute("genres").is_list
Review comment (Member) on lines +163 to +164:
We should test that a few item IDs have the correct titles, too. It's possible for the code to set up the structures in the right format, but not align them correctly, and the tests should check for that.
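A possible check along those lines; the pandas() accessor on the attribute and the specific MovieLens titles are assumptions, not verified against this test data:

+    titles = ds.entities("item").attribute("title").pandas()
+    assert titles.loc[1] == "Toy Story (1995)"
+    assert titles.loc[3] == "Grumpier Old Men (1995)"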



+def test_add_entities_with_arrow_table():
+    dsb = DatasetBuilder()
+
+    items = pd.read_csv(ml_test_dir / "movies.csv")
+    items = items.rename(columns={"movieId": "item_id"}).set_index("item_id")
+
+    genres = items["genres"].str.split("|")
+    items["genres"] = genres
+    table = pa.Table.from_pandas(items)
+
+    dsb.add_entities("item", table)
+
+    ds = dsb.build()
+
+    assert ds.entities("item").attribute("title").is_scalar
+    assert ds.entities("item").attribute("genres").is_list
Review comment (Member) on lines +181 to +182: Same as above.
