[python-package] removed _json_default_with_numpy private function#7145
[python-package] removed _json_default_with_numpy private function#7145daguirre11 wants to merge 9 commits intolightgbm-org:masterfrom
Conversation
There was a problem hiding this comment.
Thank you very much for the investigation!
I'll need to take a little time to read through what you've shared here. The expectations around Dataset.pandas_categorical are a little unclear in the codebase, I'll try to improve that.
I can share that I looked through the git blame tonight and it seems this _json_default_with_numpy() function has been in lightgbm for 9 years (#247), and its addition didn't generate any discussion about why it was necessary. @wxchan added this but isn't active in LightGBM or on GitHub any more, so I don't think they'll be able to help us understand it.
I'll look at this shortly. Two other notes while I do that:
- please do update your git config so your commits will be tied to your GitHub account (#7143 (comment))
- in the future, share code links as raw links instead of wrapped in markdown like
[here](link), so they'll be rendered directly in the GitHub UI like this:
|
@jameslamb is there anything else that needs to be done? I updated my config as well as added my local mac machine emails to my github profile. |
jameslamb
left a comment
There was a problem hiding this comment.
Thanks so much for investigating, sorry it took a while for me to review!
I tested some combinations tonight and I agree with you! The category values get converted to base types like float and int by the time they are written to Dataset.pandas_categorical, and therefore don't cause any problems for JSON serialization.
And I don't think an np.ndarray could ever reach this code.
I've pushed a unit test that confirms this. I'd like to see how that goes in CI here (especially the job that covers old numpy and pandas versions).
If everything passes, I'd be happy to merge this 😁
Sorry, forgot to answer this question. Commits look great now, thanks for fixing that. |
| ) | ||
|
|
||
| # confirm that the array dtypes also become the category dtypes | ||
| assert df["np_float"].dtype.categories.dtype == np.float32 |
There was a problem hiding this comment.
This is failing in a few CI jobs (notably Windows jobs on AppVeyor):
FAILED tests/python_package_test/test_basic.py::test_pandas_categorical_json_serialization_works - AssertionError: assert dtype('float64') == <class 'numpy.float32'>
It might be slightly too strict. @daguirre11 if you figure out a better pattern for this test (or notice any other issues I've introduced) please feel free to push updates here. Otherwise, I'll look at this again some time in the next few days.
Hi 👋 ,
I believe this private function should be removed instead of tested.
_json_default_with_numpy,model_to_stringhere anddump_modelhere .json.dumps(self.pandas_categorical, default=_json_default_with_numpy).self.pandas_categoricalis only value other thanNoneif theXdata argument is given as a a pandas DataFrame here.np.bool_,np.floating, andnp.integerthat are in theisinstance()if condition in_json_default_with_numpyhere can all be converted to their appropriate python types resulting inself.pandas_categorical = None->self.pandas_categorical = {}. This is possible because Pandas automatically converts NumPy scalars to pandas dtypes during DataFrame construction of the DataFrame._json_default_with_numpyregardingnp.ndarrayhere is not reachable in the code because it is not allow dtype for a pandas DataFrame based on the instilled checks.data used:
I also checked
_json_default_with_numpyby manually inputting aself.pandas_categoricalthat actually invokes the function.which results in a model dump json pandas categorical key value pair
pandas_categorical:{"feature_index": 0, "test_np_bool": true, "test_np_int64": 42, "test_np_array": [1, 2, 3]}However, as explained before it is not possible for self.pandas_categorical to have a value like this.
If I am wrong please explain to me what I am missing 😃