ENH: Implemented performance speedup for binary ReliefF + bug fixes #79
Open
CaptainKanuk wants to merge 6 commits into EpistasisLab:master
Conversation
Force-pushed e6a7ec6 to 05fb82e
Updating README to include new performance changes.
Force-pushed b1e8b5f to 5d323e7
CaptainKanuk commented Dec 29, 2021
# nopython = True will prevent a silent fallback if the function cannot
# be numba compiled for whatever reason.
@jit(nopython=True)
def find_neighbors_binary(
Author
This is the core of the performance improvement for neighbor identification: we modify the neighbor-finding function by adding numba compilation and batch parallelization.
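To illustrate the technique described here (this is a minimal sketch, not the PR's actual code; the signature, including the precomputed distance row and the `self_idx` parameter, is an assumption), a numba-compiled neighbor finder for one instance might look like:

```python
import numpy as np

try:
    from numba import jit
except ImportError:
    # Fall back to a no-op decorator so the sketch still runs without numba.
    def jit(*args, **kwargs):
        def wrap(f):
            return f
        return wrap

# nopython=True prevents a silent fallback to object mode if the
# function cannot be numba-compiled.
@jit(nopython=True)
def find_neighbors_binary(dist_row, self_idx, k):
    # dist_row: precomputed distances from one instance to all instances.
    # Returns indices of the k nearest neighbors, skipping the instance itself.
    order = np.argsort(dist_row)
    out = np.empty(k, dtype=np.int64)
    j = 0
    for i in range(order.shape[0]):
        idx = order[i]
        if idx != self_idx:
            out[j] = idx
            j += 1
            if j == k:
                break
    return out
```

Because the loop and `np.argsort` are both supported in nopython mode, the whole function compiles to machine code; calling it once per instance is then cheap enough to parallelize over batches of instances.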
CaptainKanuk commented Dec 29, 2021
# interestingly, this is sometimes faster without numba
def compute_score_binary(
Author
This is the core of the performance improvements for score computation - it expresses the computation as a sequence of numpy matrix operations.
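A sketch of what "a sequence of numpy matrix operations" can mean for a binary ReliefF score (this is an illustrative reimplementation under assumed inputs, not the PR's code; `hit_idx`/`miss_idx` as precomputed (n, k) neighbor-index arrays are an assumption):

```python
import numpy as np

def compute_score_binary(X, hit_idx, miss_idx):
    # X: (n_samples, n_features) binary matrix.
    # hit_idx / miss_idx: (n_samples, k) indices of same-class / other-class
    # nearest neighbors for each instance (assumed precomputed).
    hits = X[hit_idx]        # (n, k, f) feature values of same-class neighbors
    misses = X[miss_idx]     # (n, k, f) feature values of other-class neighbors
    inst = X[:, None, :]     # broadcast each instance across its k neighbors

    # For binary features, "diff" is just inequality; sum over all
    # (instance, neighbor) pairs in one vectorized pass per feature.
    diff_hits = (hits != inst).sum(axis=(0, 1))
    diff_miss = (misses != inst).sum(axis=(0, 1))

    n, k = hit_idx.shape
    # ReliefF weight: reward features that differ between classes,
    # penalize features that differ within a class.
    return (diff_miss - diff_hits) / (n * k)
```

The per-pair Python loop of a naive implementation disappears entirely; the whole score is a handful of fancy-indexing, broadcasting, and reduction calls, which is why this can be fast even without numba.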
Context
TL;DR: I implemented significant performance improvements for ReliefF on binary and discrete data.
For a GAMETES-generated binary-class discrete data file with 3200 rows and 400 attributes, the statistics are (roughly a 4.7x speedup):
- ReliefF w/ 10 neighbors, before: 59.05 seconds (mean) | 0.73 seconds (std) (n=4 runs)
- ReliefF w/ 10 neighbors, after: 12.68 seconds (mean) | 0.65 seconds (std) (n=4 runs)
Process
I set out to see whether I could speed up the implementation of these algorithms. I had a few hypotheses:
My general plan of action was as follows:
Changes in This PR
For benchmarking I wrote a simple tool that takes some of the common test cases and runs timeit repeatedly. I drop the results of the first run to avoid measuring compilation overhead, which can be quite high for numba when we're using small testing datasets. The tool prints results and also dumps them to a csv file in the same directory. To run this tool:

python performance_tests.py

Additionally, you can run a parameter sweep to estimate how performance varies across different parameter values with:

python performance_tests.py sweep

I also wrote a simple shell script to automate the generation of profiling graphs using Graphviz. It is available at:

./run_performance_benchmark.sh

and creates a performance csv for selected test cases, a python cProfile data file, and a png performance graph. Using these tools I confirmed that the bulk of the runtime was in finding relevant neighbors for the relief algorithm and in scoring. This is unsurprising, as those functions are called roughly O(num_rows^2) times and were not performance optimized.
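The warm-up-discarding timing loop described above can be sketched as follows (a minimal version, assuming a hypothetical `benchmark` helper name; the PR's actual tool also handles test-case selection and csv output):

```python
import timeit
import statistics

def benchmark(fn, n_runs=4):
    # Time fn() n_runs + 1 times with timeit, then discard the first
    # measurement: for numba-compiled code it includes JIT compilation
    # overhead, which would dominate on small test datasets.
    times = [timeit.timeit(fn, number=1) for _ in range(n_runs + 1)]
    times = times[1:]  # drop the warm-up / compilation run
    return statistics.mean(times), statistics.stdev(times)
```

This mirrors the n=4 mean/std figures reported in the Context section: one discarded warm-up run followed by four measured runs.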
Additional Notes & Results
In the interests of time I focused on a proof of concept for ReliefF with binary features and discrete data. The main scoring functions in scoring_utils.py were indeed very complex, with many switch statements and non-performant operations. There is an elegance to the centralization, but I think dedicated scoring functions are not only clearer to a reader but also allow for dedicated optimization. My approach was to create a dedicated compute_binary_score function that uses optimized numpy operations. I also parallelized the selection of neighbors and optimized the neighbor-selection function with numba. This required refactoring some code, but I aimed to keep the contracts the same so that the rest of the codebase would not be affected.
Lastly, all tests were failing when I checked out the repo. I determined this was due to Imputer being deprecated and removed from sklearn, so I updated the code to the new SimpleImputer. After that change only 5 tests failed, but looking at recent repo history I don't believe all tests had been passing previously. The remaining failures seem to stem from a more general bug in how the algorithms treat data with a mixture of continuous and discrete attributes; I did not attempt to fix that problem.
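For reference, the sklearn migration mentioned above follows this shape (a generic sketch of the API change, with illustrative data; it is not the PR's exact diff):

```python
import numpy as np

# Before (sklearn.preprocessing.Imputer, removed in scikit-learn 0.22):
#   from sklearn.preprocessing import Imputer
#   imputer = Imputer(missing_values='NaN', strategy='median')

# After: SimpleImputer lives in sklearn.impute and takes np.nan directly.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan, strategy='median')
X = np.array([[1.0, np.nan],
              [3.0, 4.0],
              [5.0, 6.0]])
X_imputed = imputer.fit_transform(X)  # NaN in column 1 -> median of [4, 6] = 5.0
```

The fit/transform contract is unchanged, so the swap is mostly a rename plus the `missing_values='NaN'` string becoming the `np.nan` sentinel.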