
Saving and loading for Fit objects #126

Open

Jfeatherstone wants to merge 12 commits into master from saving_loading

Conversation

@Jfeatherstone
Collaborator

This PR implements robust saving and loading features as brought up in #124, and implements (optional) parallelization for the xmin calculation.

  • Implements the __hash__ and __eq__ functions for Fit objects, allowing equality to be tested between different instances. Equality is defined in a reasonable way based on the values of the data and fitting parameters, so that two distinct instances compare equal exactly when they contain the same information.
  • Implements the save and load functions for Fit objects, which allow for caching results and loading them back in easily. These functions currently support two formats: Python pickle files and hdf5. Both behave identically in all of my testing. Pickling is of course standard practice for Python, and hdf5 is implemented in case users want a more universal file type (one that can be loaded in another language, for example). The format keyword to these functions (also inferred automatically from the file extension) controls which is used; hdf5 is the default since it gives smaller files.
  • Tests have been added in testing/test_saving_loading.py to confirm that this hashing works properly, and that the saving and loading works as intended.
  • In order to make sure a Fit object is able to be cached, there are some (minor) restrictions on the form of constraint functions. I've added a discussion of this in the documentation, including best practices and how to avoid issues.
  • The pickle format requires using dill instead of the standard library pickle. Of course it is better to have fewer dependencies, but pickling arbitrary (local) functions isn't possible without this. I've been debating introducing this library for a while now since it is needed for parallelizing the xmin calculation (for exactly the same reason it is needed for saving and loading). With the h5py library needed for that format, this update adds two new dependencies.
  • xmin calculation parallelization was already templated in v2.0, and with the addition of dill, it now works. Note that you'll only really see a speed-up from using more cores if you're fitting a costly function (e.g. a truncated power law). Power laws will take pretty much the same amount of time (if not more) with more processes, since the MLE expression is so quick that most of the time is spent communicating and storing values anyway.

@keflavich (Collaborator) left a comment

There's a lot to review here, and my comments are mostly nitpicky, but:

  • It would be helpful to separate different concepts as much as possible into their own PRs. So, documentation updating should be in a different PR than saving. I think saving and parallelizing are intricately linked, though, so they have to go together?
  • I can't open the .ipynb file. If it's meant to include rendered graphics, I recommend we make a separate repository for notebooks-with-rendered-graphics (in the powerlaw-dev org) to avoid bloating the code base.

...

The saving and loading functions currently support two different file formats:
pickle and hdf5. A pickle file is Python's way of serializing an object,
Collaborator:
link to numpy/h5py documentation here?

-----------
- Original paper for the library: http://arxiv.org/abs/1305.0215
- Source code: https://github.com/jeffalstott/powerlaw
- Source code: https://github.com/powerlaw-devs/powerlaw
Collaborator:

these changes are great & important - but maybe they should go in a separate PR?


# Try and grab the current version of the package from the _version.py file.
# This is for comparing with saved/loaded Fit objects. If we can't find that
# file, we have to work without it.
Collaborator:

shouldn't this just raise an exception about being a bad installation?

Collaborator Author:

The reason I didn't raise an error here is that the _version.py file is only generated when you properly install the library (run pip install . in the cloned repository, or equivalent), so it won't exist right after you clone the repository. I wanted it to be possible to test new features, branches, etc. without fully installing the package; the only solution I could think of was to manually read the _version.py file and continue gracefully when it doesn't exist.

Please let me know if there is a better way to achieve this, or if you think requiring installation is fine.

Collaborator:

I'd recommend requiring installation.

If you want to have a test version running, pip install -e . means you don't have to reinstall every time you want to run a test

Collaborator Author:

Hmm, how about a warning if the package isn't installed, but not an error? I think we don't lose any functionality by not installing other than not being able to compare versions of cached files, so I'd rather not put a full error there.

Maybe this is bad practice, but I find myself using local packages that aren't installed through pip sometimes, especially if I want to compare different versions of the same package (where the stable version is installed through pip, and the experimental one is local and just directly added to the Python path).

Collaborator:

I think that is bad practice and can have some really awkward consequences. I'm mostly parroting what others have instructed me to do in large collaborations, but pip install -e . works perfectly whenever I have a package with appropriate installation instructions like this one.

That said, I'm not going to be overly rigid about this. I don't think it's going to break anyone's code to have this extra approach.

Collaborator Author:

Good to know :) I will look into alternative ways to do this for my own personal development, but yeah hopefully the warning should be enough to deter people from using this approach unless they really have no other option.

@Jfeatherstone
Collaborator Author

It would be helpful to separate different concepts as much as possible into their own PRs. So, documentation updating should be in a different PR than saving. I think saving and parallelizing are intricately linked, though, so they have to go together?

I put the documentation and code changes together since I thought it might make things easier to review, though I can separate them in the future. And indeed, parallelizing should be included in this, even if it doesn't seem related at first.

I can't open the .ipynb file. If it's meant to include rendered graphics, I recommend we make a separate repository for notebooks-with-rendered-graphics (in the powerlaw-dev org) to avoid bloating the code base

Yeah, now that I think about it, having the documentation figures generated in a Jupyter notebook is not a good idea at all. Probably better to just have a regular Python script. I plan to submit a PR about the documentation soon (including a GitHub Action to generate and host it on GitHub Pages), so I will address this issue then.

@Jfeatherstone
Collaborator Author

I think the tolerances on the generation-fitting unit tests are a little too low, as one of the tests failed but then passed on a re-run. This doesn't relate to this PR so I'll address it in another one, but all tests are passing now.

@keflavich (Collaborator) left a comment

one more minor comment from me. I'd still recommend splitting the .ipynb & relinking stuff into a different PR prior to merging this.

Jfeatherstone and others added 2 commits March 4, 2026 10:34
Cleaned up header formatting

Co-authored-by: Adam Ginsburg <keflavich@gmail.com>
@Jfeatherstone
Collaborator Author

Ok I've reset all the changes about the repository links and the Jupyter notebook now. The notebook already exists in the master branch (was included in v2.0) so the file still exists here, but this PR doesn't change anything about it. As above, I'll fix it in a PR about the documentation.

@Jfeatherstone
Collaborator Author

Thanks for reviewing this, and sorry that it was a little messy :)
