
[python-package] removed _json_default_with_numpy private function#7145

Open
daguirre11 wants to merge 9 commits intolightgbm-org:masterfrom
daguirre11:basic-json-default-with-numpy-test

@daguirre11 (Contributor) commented Jan 31, 2026

Hi 👋 ,

I believe this private function should be removed instead of tested.

  • Two functions invoke _json_default_with_numpy: model_to_string and dump_model.
  • Each of these functions calls json.dumps(self.pandas_categorical, default=_json_default_with_numpy).
  • self.pandas_categorical is only set to a value other than None if the data argument X is given as a pandas DataFrame.
  • The np.bool_, np.floating, and np.integer types checked in the isinstance() condition of _json_default_with_numpy are all converted to their corresponding Python types, so self.pandas_categorical goes from None to {}. This works because pandas automatically converts NumPy scalars to pandas dtypes during DataFrame construction.
  • Lastly, the next if condition in _json_default_with_numpy, which handles np.ndarray, is unreachable: a column of arrays is not an allowed dtype for a pandas DataFrame, so the existing dtype checks reject it first.

data used:

    X = pd.DataFrame({
        'np_bool_col': [np.bool_(True), np.bool_(False), np.bool_(True)],
        'regular_col': [np.uint8(1), np.uint16(2), np.uint8(3)],
        'np_float_col': [np.float64(1.23), np.float64(4.56), np.float64(7.89)],
        'np_array_col': [
            np.array([1, 2, 3]),
            np.array([4, 5, 6]),
            np.array([7, 8, 9]),
        ],
    })

    def _check_for_bad_pandas_dtypes(pandas_dtypes_series: pd_Series) -> None:
        bad_pandas_dtypes = [
            f"{column_name}: {pandas_dtype}"
            for column_name, pandas_dtype in pandas_dtypes_series.items()
            if not _is_allowed_numpy_dtype(pandas_dtype.type)
        ]
        if bad_pandas_dtypes:
>           raise ValueError(
                f"pandas dtypes must be int, float or bool.\nFields with bad pandas dtypes: {', '.join(bad_pandas_dtypes)}"
            )
E           ValueError: pandas dtypes must be int, float or bool.
E           Fields with bad pandas dtypes: np_array_col: object

.venv/lib/python3.14/site-packages/lightgbm/basic.py:791: ValueError
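
For the three scalar columns in the data above, a quick check along these lines shows which dtypes pandas infers at construction time. This is only a sketch: the name dtype_kinds is illustrative, and because the exact integer width may differ between platforms or pandas versions, it looks only at the dtype kind.

```python
import numpy as np
import pandas as pd

# Same scalar columns as in the data above, without the array column.
X = pd.DataFrame({
    "np_bool_col": [np.bool_(True), np.bool_(False), np.bool_(True)],
    "regular_col": [np.uint8(1), np.uint16(2), np.uint8(3)],
    "np_float_col": [np.float64(1.23), np.float64(4.56), np.float64(7.89)],
})

# dtype.kind is "b" for bool, "i" or "u" for integers, and "f" for floats.
dtype_kinds = {name: dtype.kind for name, dtype in X.dtypes.items()}
print(dtype_kinds)
```

None of these columns ends up with object dtype, which is the dtype the array column triggers in the traceback above.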

I also checked _json_default_with_numpy by manually inputting a self.pandas_categorical that actually invokes the function.

    categorical_json = json.dumps(        
        {
            "feature_index": 0,
            "test_np_bool": np.bool_(True),
            "test_np_int64": np.int64(42), 
            "test_np_array": np.array([1,2,3])
        }, 
        default=_json_default_with_numpy,
    )

which results in the following pandas_categorical key-value pair in the model dump JSON:
pandas_categorical:{"feature_index": 0, "test_np_bool": true, "test_np_int64": 42, "test_np_array": [1, 2, 3]}

However, as explained above, it is not possible for self.pandas_categorical to hold a value like this.
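
The manual check can be reproduced without LightGBM using a stand-in for the private function. The json_default_sketch function below is hypothetical: it is written from the description in this thread (an isinstance() check for NumPy scalars, then one for np.ndarray) and is not copied from the library.

```python
import json

import numpy as np


def json_default_sketch(obj):
    """Hypothetical stand-in, not LightGBM's actual implementation."""
    # NumPy scalars: .item() returns the equivalent built-in Python value.
    if isinstance(obj, (np.bool_, np.floating, np.integer)):
        return obj.item()
    # Arrays: .tolist() returns nested lists of built-in Python values.
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = {
    "feature_index": 0,
    "test_np_bool": np.bool_(True),
    "test_np_int64": np.int64(42),
    "test_np_array": np.array([1, 2, 3]),
}

with_default = json.dumps(payload, default=json_default_sketch)

# Without a default hook, json.dumps() rejects the NumPy values.
try:
    json.dumps(payload)
    needs_default = False
except TypeError:
    needs_default = True
```

The hook only matters when NumPy objects actually reach json.dumps(), which is the case this PR argues cannot occur for self.pandas_categorical.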

If I am wrong, please explain what I am missing 😃

@jameslamb (Member) left a comment

Thank you very much for the investigation!

I'll need to take a little time to read through what you've shared here. The expectations around Dataset.pandas_categorical are a little unclear in the codebase, and I'll try to improve that.

I can share that I looked through the git blame tonight and it seems this _json_default_with_numpy() function has been in lightgbm for 9 years (#247), and its addition didn't generate any discussion about why it was necessary. @wxchan added this but isn't active in LightGBM or on GitHub any more, so I don't think they'll be able to help us understand it.

I'll look at this shortly. Two other notes while I do that:

  1. please do update your git config so your commits will be tied to your GitHub account (#7143 (comment))
  2. in the future, share code links as raw links instead of wrapped in markdown like [here](link), so they'll be rendered directly in the GitHub UI like this:

https://github.com/microsoft/LightGBM/blob/74fa3863461854dee80722d4c1ccc4db696801aa/python-package/lightgbm/basic.py#L530-L537

@daguirre11 daguirre11 closed this Feb 1, 2026
@daguirre11 daguirre11 reopened this Feb 1, 2026
@daguirre11 (Contributor, Author)

@jameslamb is there anything else that needs to be done? I updated my config as well as added my local mac machine emails to my github profile.

@daguirre11 daguirre11 requested a review from jameslamb February 17, 2026 16:46
@jameslamb (Member) left a comment

Thanks so much for investigating, sorry it took a while for me to review!

I tested some combinations tonight and I agree with you! The category values get converted to base types like float and int by the time they are written to Dataset.pandas_categorical, and therefore don't cause any problems for JSON serialization.
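
A minimal sketch of that conversion, assuming the category values are collected by calling list() on each column's categories (an assumption about how Dataset.pandas_categorical is populated, not something confirmed in this thread). The column names and values are made up for illustration.

```python
import json

import numpy as np
import pandas as pd

# Categorical columns whose categories come from NumPy arrays.
df = pd.DataFrame({
    "np_float": pd.Categorical(np.array([1.5, 2.5, 1.5], dtype=np.float64)),
    "np_int": pd.Categorical(np.array([3, 1, 3], dtype=np.int64)),
})

# Assumed collection pattern: iterating a pandas Index yields built-in
# Python scalars rather than NumPy scalars.
categories = [list(df[col].cat.categories) for col in df.columns]

# No default= hook is needed because every value is a plain float or int.
serialized = json.dumps(categories)
print(serialized)
```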

And I don't think an np.ndarray could ever reach this code.

I've pushed a unit test that confirms this. I'd like to see how that goes in CI here (especially the job that covers old numpy and pandas versions).

If everything passes, I'd be happy to merge this 😁

@jameslamb (Member)

> is there anything else that needs to be done? I updated my config as well as added my local mac machine emails to my github profile.

Sorry, forgot to answer this question. Commits look great now, thanks for fixing that.

)

# confirm that the array dtypes also become the category dtypes
assert df["np_float"].dtype.categories.dtype == np.float32

Member

This is failing in a few CI jobs (notably Windows jobs on AppVeyor):

FAILED tests/python_package_test/test_basic.py::test_pandas_categorical_json_serialization_works - AssertionError: assert dtype('float64') == <class 'numpy.float32'>

(build link)

It might be slightly too strict. @daguirre11 if you figure out a better pattern for this test (or notice any other issues I've introduced) please feel free to push updates here. Otherwise, I'll look at this again some time in the next few days.
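
One possible looser pattern, offered only as a sketch: check that the categories have some floating-point dtype instead of requiring exactly np.float32. The DataFrame construction here is an assumption, since the test's setup is not shown in this thread, and whether the looser check still covers what the test is meant to confirm is a judgment call for the reviewers.

```python
import numpy as np
import pandas as pd

# Assumed setup: a categorical column built from a float32 array.
df = pd.DataFrame({
    "np_float": pd.Categorical(np.array([1.5, 2.5, 1.5], dtype=np.float32)),
})

categories_dtype = df["np_float"].dtype.categories.dtype

# Accept any float width, since the failing jobs reported float64 here.
is_floating = np.issubdtype(categories_dtype, np.floating)
assert is_floating
```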
