Conversation
… do some more bug checking
keflavich
left a comment
There's a lot to review here, and my comments are mostly nitpicky, but:
- It would be helpful to separate different concepts as much as possible into their own PRs. So, documentation updating should be in a different PR than saving. I think saving and parallelizing are intricately linked, though, so they have to go together?
- I can't open the .ipynb file. If it's meant to include rendered graphics, I recommend we make a separate repository for notebooks-with-rendered-graphics (in the powerlaw-dev org) to avoid bloating the code base
The saving and loading functions currently support two different file formats:
pickle and hdf5. A pickle file is Python's way of serializing an object,
link to numpy/h5py documentation here?
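For reference, a minimal sketch of the pickle side of the two formats quoted above; the hdf5 route (via `h5py`) is analogous but adds a dependency, so only the standard-library route is shown, with a toy dict standing in for the real fit object:

```python
import pickle

# Toy stand-in for a fit result; the real Fit object holds data and
# fitted parameters, but any picklable object round-trips the same way.
results = {"alpha": 2.5, "xmin": 1.0, "data": [1.0, 2.0, 3.0]}

blob = pickle.dumps(results)   # serialize the object to bytes
restored = pickle.loads(blob)  # rebuild an equivalent object from the bytes

assert restored == results      # equal content...
assert restored is not results  # ...but a distinct instance
```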
-----------
- Original paper for the library: http://arxiv.org/abs/1305.0215
- Source code: https://github.com/jeffalstott/powerlaw
- Source code: https://github.com/powerlaw-devs/powerlaw
these changes are great & important - but maybe they should go in a separate PR?
# Try and grab the current version of the package from the _version.py file.
# This is for comparing with saved/loaded Fit objects. If we can't find that
# file, we have to work without it.
shouldn't this just raise an exception about being a bad installation?
The reason I didn't raise an error here is that the `_version.py` file is only generated when you properly install the library (run `pip install .` in the cloned repository, or equivalent), so it won't exist right after you clone the repository. I wanted it to be possible to test new features, branches, etc. without fully installing the package, and the only solution I could think of was to manually read the `_version.py` file, but continue when it doesn't exist.
Please let me know if there is a better way to achieve this, or if you think requiring installation is fine.
I'd recommend requiring installation.
If you want to have a test version running, `pip install -e .` means you don't have to reinstall every time you want to run a test.
Hmm, how about a warning if the package isn't installed, but not an error? I think we don't lose any functionality by not installing other than not being able to compare versions of cached files, so I'd rather not put a full error there.
Maybe this is bad practice, but I find myself using local packages that aren't installed through pip sometimes, especially if I want to compare different versions of the same package (where the stable version is installed through pip, and the experimental one is local and just directly added to the Python path).
I think that is bad practice and can have some really awkward consequences. I'm mostly parroting what others have instructed me to do in large collaborations, but `pip install -e .` works perfectly whenever I have a package with appropriate installation instructions like this one.
That said, I'm not going to be overly rigid about this. I don't think it's going to break anyone's code to have this extra approach.
Good to know :) I will look into alternative ways to do this for my own personal development, but yeah hopefully the warning should be enough to deter people from using this approach unless they really have no other option.
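As a concrete sketch of the warn-but-continue compromise settled on above (`get_installed_version` is a hypothetical helper name; the real code reads `_version.py` directly, whereas this sketch uses the standard-library package metadata instead):

```python
import warnings
from importlib import metadata

def get_installed_version(package):
    """Return the installed version of `package`, or None with a warning.

    Hypothetical helper illustrating the warn-don't-raise approach: if the
    package was never installed (e.g. the repo was only cloned and added to
    the path), version checks on cached files are skipped instead of failing.
    """
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        warnings.warn(
            f"{package} does not appear to be installed; "
            "saved/loaded objects cannot be version-checked."
        )
        return None
```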
I put the documentation and code changes together since I thought it might make things easier to review, though I can separate them in the future. And indeed, parallelizing should be included in this, even if it doesn't seem related at first.
Yeah, now that I think about it, having the documentation figures generated in a Jupyter notebook is not a good idea at all. It would probably be better to just have a regular Python script. I plan to submit a PR about the documentation soon (including a GitHub Action to generate and host it on GitHub Pages), so I will address this issue then.
…changes in readme
…t match standard python ones
I think the tolerances on the generation-fitting unit tests are a little too strict, as one of the tests failed but then passed on a re-run. This doesn't relate to this PR, so I'll address it in another one, but all tests are passing now.
keflavich
left a comment
one more minor comment from me. I'd still recommend splitting the .ipynb & relinking stuff into a different PR prior to merging this.
Cleaned up header formatting
Co-authored-by: Adam Ginsburg <keflavich@gmail.com>
Ok, I've now reverted all the changes to the repository links and the Jupyter notebook. The notebook already exists in the master branch (it was included in v2.0), so the file is still present here, but this PR no longer changes anything about it. As above, I'll fix it in a PR about the documentation.
Thanks for reviewing this, and sorry that it was a little messy :)
This PR implements robust saving and loading features, as brought up in #124, and adds (optional) parallelization for the xmin calculation.
- Added `__hash__` and `__eq__` functions for `Fit` objects, allowing equality to be tested between different instances. This equality is defined in a reasonable way based on the values of the data and fitting parameters, with emphasis on being able to compare whether a `Fit` object contains exactly the same information, regardless of whether the instance itself is different.
- Added `save` and `load` functions for `Fit` objects, which allow for caching results and loading them in easily. These functions currently support two formats: Python pickle files and hdf5. Both behave pretty much identically in all of my testing. Pickling is of course standard practice for Python, and hdf5 is implemented in case users want a more universal file type (for example, one that can be loaded in another language). The `format` keyword to these functions (also automatically inferred from the file extension) controls which is used; hdf5 is the default since it gives smaller files.
- Added tests in `testing/test_saving_loading.py` to confirm that this hashing works properly, and that the saving and loading work as intended.
- So that a `Fit` object can be cached, there are some (minor) restrictions on the form of constraint functions. I've added a discussion of this to the documentation, including best practices and how to avoid issues.
- This update uses `dill` instead of the standard library `pickle`. Of course it is better to have fewer dependencies, but pickling arbitrary (local) functions isn't possible without it. I've been debating introducing this library for a while now, since it is needed for parallelizing the xmin calculation (for exactly the same reason it is needed for saving and loading). Together with the `h5py` library needed for the hdf5 format, this update adds two new dependencies.
- Parallelization of the xmin calculation: with `dill`, it now works. Note that you'll only really see a speed-up from using more cores if you're fitting a costly function (e.g. truncated power law). Power laws will take pretty much the same amount of time (if not more) with more processes, since the MLE expression is so quick that most of the time is spent communicating and storing values anyway.
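To make the hashing/equality and extension-inferred saving described above concrete, here is a hypothetical mock-up; `FitSketch` is illustrative, not the powerlaw implementation, and only the pickle branch is sketched to stay dependency-free:

```python
import os
import pickle
import tempfile

class FitSketch:
    """Illustrative stand-in for a Fit object (not the real implementation)."""

    def __init__(self, data, xmin):
        self.data = tuple(data)  # immutable, so it can participate in the hash
        self.xmin = xmin

    def __eq__(self, other):
        # Two fits are "the same" if their data and parameters match,
        # regardless of whether they are the same instance.
        return (isinstance(other, FitSketch)
                and self.data == other.data
                and self.xmin == other.xmin)

    def __hash__(self):
        return hash((self.data, self.xmin))

    def save(self, path, format=None):
        # Infer the format from the file extension, as the PR describes;
        # the hdf5 branch is omitted from this sketch.
        if format is None:
            format = "hdf5" if path.endswith((".h5", ".hdf5")) else "pickle"
        if format != "pickle":
            raise NotImplementedError("hdf5 branch omitted from this sketch")
        with open(path, "wb") as f:
            pickle.dump(self, f)

a = FitSketch([1.0, 2.0, 3.0], xmin=1.0)
b = FitSketch([1.0, 2.0, 3.0], xmin=1.0)
assert a == b and hash(a) == hash(b)  # equal content, distinct instances

# Round-trip through a pickle file, format inferred from the ".pkl" extension.
path = os.path.join(tempfile.gettempdir(), "fit_sketch.pkl")
a.save(path)
with open(path, "rb") as f:
    assert pickle.load(f) == a
```

Defining `__hash__` alongside `__eq__` is what lets equal-content fits share a slot in sets and dict keys, which is the property that makes caching by content possible.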