ENH: Implemented performance speedup for binary ReliefF + bug fixes #79
Open
CaptainKanuk wants to merge 6 commits into EpistasisLab:master
Conversation
Force-pushed e6a7ec6 to 05fb82e
Updating README to include new performance changes.
Force-pushed b1e8b5f to 5d323e7
CaptainKanuk commented Dec 29, 2021
# nopython = True will prevent a silent fallback if the function cannot
# be numba compiled for whatever reason.
@jit(nopython=True)
def find_neighbors_binary(
Author
This is the core of the performance improvement for neighbor identification: we modify the neighbor-finding function by adding numba compilation and batch parallelization.
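To illustrate the technique described here (this is a minimal sketch, not the PR's actual code; the signature, including the precomputed distance row and the `self_idx` parameter, is an assumption), a numba-compiled neighbor finder for one instance might look like:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fall back to a no-op decorator so the sketch still runs without numba.
    def jit(*args, **kwargs):
        def wrap(f):
            return f
        return wrap

# nopython=True prevents a silent fallback to object mode if the
# function cannot be numba-compiled.
@jit(nopython=True)
def find_neighbors_binary(dist_row, self_idx, k):
    # dist_row: precomputed distances from one instance to all instances.
    # Returns indices of the k nearest neighbors, skipping the instance itself.
    order = np.argsort(dist_row)
    out = np.empty(k, dtype=np.int64)
    j = 0
    for i in range(order.shape[0]):
        idx = order[i]
        if idx != self_idx:
            out[j] = idx
            j += 1
            if j == k:
                break
    return out
```

Because the loop and `np.argsort` are both supported in nopython mode, the whole function compiles to machine code; calling it once per instance is then cheap enough to parallelize over batches of instances.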
CaptainKanuk commented Dec 29, 2021
# interestingly, this is sometimes faster without numba
def compute_score_binary(
Author
This is the core of the performance improvements for score computation - it expresses the computation as a sequence of numpy matrix operations.
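A sketch of what "a sequence of numpy matrix operations" can mean for a binary ReliefF score (this is an illustrative reimplementation under assumed inputs, not the PR's code; `hit_idx`/`miss_idx` as precomputed (n, k) neighbor-index arrays are an assumption):

```python
import numpy as np

def compute_score_binary(X, hit_idx, miss_idx):
    # X: (n_samples, n_features) binary matrix.
    # hit_idx / miss_idx: (n_samples, k) indices of same-class / other-class
    # nearest neighbors for each instance (assumed precomputed).
    hits = X[hit_idx]        # (n, k, f) feature values of same-class neighbors
    misses = X[miss_idx]     # (n, k, f) feature values of other-class neighbors
    inst = X[:, None, :]     # broadcast each instance across its k neighbors

    # For binary features, "diff" is just inequality; sum over all
    # (instance, neighbor) pairs in one vectorized pass per feature.
    diff_hits = (hits != inst).sum(axis=(0, 1))
    diff_miss = (misses != inst).sum(axis=(0, 1))

    n, k = hit_idx.shape
    # ReliefF weight: reward features that differ between classes,
    # penalize features that differ within a class.
    return (diff_miss - diff_hits) / (n * k)
```

The per-pair Python loop of a naive implementation disappears entirely; the whole score is a handful of fancy-indexing, broadcasting, and reduction calls, which is why this can be fast even without numba.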
Context
TL;DR: I implemented significant performance improvements for ReliefF on binary and discrete data.
For a GAMETES-generated binary-class discrete data file with 3200 rows and 400 attributes, the statistics are (roughly a 4.7x speedup):
- ReliefF w/ 10 neighbors, before: 59.05 seconds (mean) | 0.73 seconds (std) (n=4 runs)
- ReliefF w/ 10 neighbors, after: 12.68 seconds (mean) | 0.65 seconds (std) (n=4 runs)
Process
I set out to see whether I could speed up the implementation of these algorithms. I had a few hypotheses:
My general plan of action was as follows:
Changes in This PR
For benchmarking I wrote a simple tool that takes some of the common test cases and runs timeit repeatedly. I drop the results of the first run to avoid measuring compilation overhead, which can be quite high for numba when we're using small testing datasets. The tool prints results and also dumps them to a csv file in the same directory. To run this tool:

python performance_tests.py

Additionally, you can run a parameter sweep to estimate how performance varies across different parameter values with:

python performance_tests.py sweep

I also wrote a simple shell script to automate the generation of profiling graphs using Graphviz. It is available at:

./run_performance_benchmark.sh

and creates a performance csv for selected test cases, a python cProfile data file, and a png performance graph. Using these tools I confirmed that the bulk of the runtime was in finding relevant neighbors for the relief algorithm and in scoring. This is unsurprising, as those functions are called roughly O(num_rows^2) times and were not performance optimized.
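The warm-up-discarding timing loop described above can be sketched as follows (a minimal version, assuming a hypothetical `benchmark` helper name; the PR's actual tool also handles test-case selection and csv output):

```python
import timeit
import statistics

def benchmark(fn, n_runs=4):
    # Time fn() n_runs + 1 times with timeit, then discard the first
    # measurement: for numba-compiled code it includes JIT compilation
    # overhead, which would dominate on small test datasets.
    times = [timeit.timeit(fn, number=1) for _ in range(n_runs + 1)]
    times = times[1:]  # drop the warm-up / compilation run
    return statistics.mean(times), statistics.stdev(times)
```

This mirrors the n=4 mean/std figures reported in the Context section: one discarded warm-up run followed by four measured runs.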
Additional Notes & Results
In the interests of time I focused on a proof of concept for ReliefF with binary features and discrete data. The main scoring functions in scoring_utils.py were indeed very complex, with many switch statements and non-performant operations. There is an elegance to the centralization, but I think dedicated scoring functions are not only clearer to a reader but also allow for dedicated optimization. My approach was to create a dedicated compute_binary_score function that uses optimized numpy operations. I also parallelized the selection of neighbors and optimized the neighbor-selection function with numba. This required refactoring some code, but I aimed to keep the contracts the same so that the rest of the codebase would not be affected.
Lastly, all tests were failing when I checked out the repo. I determined this was due to Imputer being deprecated and removed from sklearn, so I updated the code to the new SimpleImputer. After that change only 5 tests failed, but looking at recent repo history I don't believe all tests had been passing previously. The remaining failures seem to stem from a more general bug in how the algorithms treat data with a mixture of continuous and discrete attributes; I did not attempt to fix that problem.
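For reference, the sklearn migration mentioned above follows this shape (a generic sketch of the API change, with illustrative data; it is not the PR's exact diff):

```python
import numpy as np

# Before (sklearn.preprocessing.Imputer, removed in scikit-learn 0.22):
#   from sklearn.preprocessing import Imputer
#   imputer = Imputer(missing_values='NaN', strategy='median')

# After: SimpleImputer lives in sklearn.impute and takes np.nan directly.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 6.0]])
X_imputed = imputer.fit_transform(X)  # NaN in column 1 -> median of [4, 6] = 5.0
```

The fit/transform contract is unchanged, so the swap is mostly a rename plus the `missing_values='NaN'` string becoming the `np.nan` sentinel.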