Feature/dm 191 refactor load processed pileup #21
Conversation
…l need to apply to export functions, and need to check that performance is unimpacted.
…ause it will complicate parallelization and is not clearly useful.
…cess_pileup_row and renamed/configured a few variables to support this.
…hunk generation, currently only for load_processed.pileup_vectors_from_bedmethyl. pileup_counts_from_bedmethyl is a simpler case. Passes pre-existing tests, including when chunk_size < region size, but further infrastructure is required to assess whether regions_5to3prime still works right (the logic gets a bit complicated). Speed with a single core and a small number of small regions is slightly slower due to the overhead necessary to support parallelization. Need to test that we get a speedup with multiple cores on long regions or many regions.
…ead_vectors_from_hdf5 test targets to pass tests; other test targets are the same for now.
…d generate_targets.py to properly handle copying over new kwargs from cases.py when re-running some but not all targets: if all targets are re-run, the old test_matrix.pkl is discarded. If only some are re-run, we need the old test_matrix.pkl, but we still need to copy over the kwargs for all cases that won't be re-run, as well as the old results, which are then overridden for re-run targets.
…rocessed-tests-baseline Feature/dm 191 refactor load processed tests baseline
…ss_pileup_row. Passing all tests.
…ns_5to3prime logic within the pileup_vectors_from_bedmethyl function, which is the only one that actually needs it. Runs faster, still passes all tests.
…y, re-ran targets for pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl.
…rocessed-tests-baseline DM-192 Added chunk_size for load_processed parallelization to cases.p…
…. Added progress bars and quiet toggles for pileup loaders. Set default chunk size to 1 MB after whole-chromosome speed testing. Added option to parallelize within rather than between regions with regions_to_list.
…e to avoid intermittent crashes. Adjust tabix fetch to avoid negative start values. Tweak wording on progress bars.
… to all pileup plotters. A few other small tweaks.
… and macOS ostensibly have 4 cores. This should not change any outputs because the extract test is hardcoded to use 1 core, and that is the only one with any stochasticity in parallelization.
…to avoid ever sending the cores arg twice, and added a comment to explain the reason why extract is tested unparallelized, unlike other functions.
…ment for extract. Updated test_matrix.pickle by re-running generate_targets. No changes to pileup or extract files, but should pass all tests.
…rocessed-tests-baseline New test_matrix and cases to cover 1, 2, 3, 4, and None for cores
thekugelmeister
left a comment
Anecdotal testing comments:
- A figure that took me 30 minutes to produce this morning took me 15 minutes this afternoon, on a not-very-optimized setup. My computer might be screaming in pain, but I can adjust that. I'm theoretically quite content on the whole with that result, although I'm curious if you've seen better speedups on Savio.
- Other plots have had amazing speedups since. Still not sure what that was about.
General comments from review:
- I feel like we should add clarification of what the options to cores can be, for the user-facing methods. I understand that they are all the same (None means "all available", etc.), and it's slightly redundant to say it everywhere, but I think it would be useful for the end users.
- Progress bars
  - Almost every time I talk to a new package user, they say, "it's nice it has progress bars" 🤣 That being said, we don't need them to stick around after the fact. I might actually start arguing that all of our progress bars should be set to leave=False or leave=None (see the sketch after this list).
  - I think the message on the bar should be changed to something more generic like "loading data". I think it will suffice and look cleaner.
- The fact that it prints "modkit found with expected version 0.2.4" for every core is mildly annoying. Now that I think about it, is it necessary to print that at all? Can we just have it print if none or the wrong version is found?
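A minimal sketch of the suggested progress-bar behavior, assuming tqdm is what drives the bars; the loop, function name, and quiet flag here are illustrative, not the actual loader code:

```python
import time
from tqdm import tqdm

def load_regions(regions, quiet=False):
    """Illustrative per-region loop with a transient, generic progress bar."""
    results = []
    # leave=False clears the bar once iteration finishes; desc stays generic.
    for region in tqdm(regions, desc="Loading data", leave=False, disable=quiet):
        time.sleep(0.1)  # stand-in for the real per-region loading work
        results.append(region)
    return results

load_regions([f"chr1:{i}-{i + 1000}" for i in range(50)])
```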
dimelo/load_processed.py
Outdated
    window_size: int | None = None,
    quiet: bool = True,
    cores: int | None = None,
    parallelize_within_regions: bool = False,
Why is this option here? Do you have any use cases in mind where this would be useful?
Clarified use case in docstring. It is for small numbers of large regions, e.g. comparing chromosomes.
I loosely understand the goal of the use case, but I find the wording difficult to parse and I'm not sure I fully follow the logic.
If I'm understanding correctly, the goal is to specify whether to split up large regions into smaller chunks. If you have small regions, you can just be embarrassingly parallel across regions. If you have larger ones, you want to split those regions up. Is that high-level summary accurate?
If this understanding is true, then is the current configuration that False means don't split regions and True means split regions?
If all of this is true, I suggest maybe a rename of this argument to split_large_regions or something like that? What do you think?
I only care because this is user facing and I find the current name/docs a little arcane. If I'm understanding correctly, this is an important setting to include, so let's make it clear to the user why they should care.
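For concreteness, a rough sketch of the chunking behavior being discussed; the function name and coordinate handling are illustrative rather than the actual dimelo implementation (the 1 Mb default matches the chunk size mentioned in the commits):

```python
def split_region_into_chunks(chrom, start, end, chunk_size=1_000_000):
    """Return (chrom, chunk_start, chunk_end) tuples covering [start, end).

    Small regions come back as a single chunk; large regions are split so
    each chunk can go to a separate worker and the results can be stitched
    back together in coordinate order.
    """
    chunks = []
    for chunk_start in range(start, end, chunk_size):
        chunk_end = min(chunk_start + chunk_size, end)
        chunks.append((chrom, chunk_start, chunk_end))
    return chunks

# A 3.5 Mb region with a 1 Mb chunk size becomes four chunks:
print(split_region_into_chunks("chr1", 1_000_000, 4_500_000))
```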
dimelo/load_processed.py
Outdated
    return results


def process_region(region_string, function_handle, **kwargs):
I'm probably misunderstanding, but why is this a separate method? Why can't it be part of the partial above? If it needs to exist, I think it needs a better name; this one is very generic. Can you describe more fully what it is intended to do?
I think the executor.map call needs a positional input rather than a keyword (e.g. regions), but then the ones that take keywords don't seem to be happy to return things in the same order as they were sent (necessary for the outputs list to map to the input regions), so I am using this wrapper. Renamed it for clarity to apply_loader_function_to_region. Happy to use a different multiprocessing approach that can directly handle kwargs if you know of one.
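For context, a condensed sketch of this wrapper pattern using concurrent.futures; apply_loader_function_to_region mirrors the name above, but the loader and surrounding code are simplified stand-ins, not the actual dimelo functions:

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial

def apply_loader_function_to_region(region_string, function_handle, **kwargs):
    """Positional-argument shim: lets executor.map feed regions one at a time
    while the loader-specific kwargs are bound ahead of time."""
    return function_handle(region_string, **kwargs)

def load_counts_for_region(region_string, bedmethyl_file=None, motif=None):
    """Stand-in for a real per-region loader such as pileup_counts_from_bedmethyl."""
    return (region_string, bedmethyl_file, motif)

def run_parallel(regions, cores=None, **loader_kwargs):
    worker = partial(
        apply_loader_function_to_region,
        function_handle=load_counts_for_region,
        **loader_kwargs,
    )
    with ProcessPoolExecutor(max_workers=cores) as executor:
        # executor.map returns results in the same order as the input regions.
        return list(executor.map(worker, regions))

if __name__ == "__main__":
    print(run_parallel(["chr1:0-1000", "chr2:0-1000"], cores=2,
                       bedmethyl_file="pileup.bed.gz"))
```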
elif len(pileup_basemod.split(",")) == 3:
    pileup_modname, pileup_motif, pileup_mod_coord = (
        pileup_basemod.split(",")
keep_basemod, genomic_coord, modified_in_row, valid_in_row = (
Very much appreciate the simplification!
Performance test results

These are the speedups I see on Savio. The second graph also shows the results for fewer cores, so it should indicate what to expect from e.g. an 8-core MacBook as well.

High level feedback response

I got rid of the successfully-found-modkit printout, leaving the error outputs. Adjusted the progress bars to disappear when done for load_processed (not for parse_bam, though we can do that in the future). Changed the description to just "Loading data."

Granular feedback response

I made changes to fix most of the requests. I decided not to consolidate the SharedMemory management because there are subtle differences in how the variables are created between the counts and vectors methods (I added comments describing what they are doing), and the dereferencing code is already straightforward.

Other changes

I added the chromosome check to both the vectors and counts loaders. I had it in previous versions of these pileup loaders but removed it for simplicity, only to realize/remember that we get errors if we don't make sure the chromosome is present in the tabix index before fetching.
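A minimal sketch of that kind of guard, assuming pysam's TabixFile interface backs the tabix fetch; the function name and the clamping of negative starts (mentioned in an earlier commit) are illustrative:

```python
import pysam

def fetch_pileup_rows(bedmethyl_path, chrom, start, end):
    """Fetch tabix-indexed bedmethyl rows, skipping chromosomes absent from
    the index and clamping the start coordinate to zero."""
    tabix = pysam.TabixFile(bedmethyl_path)
    try:
        if chrom not in tabix.contigs:
            # Fetching a chromosome missing from the index raises an error,
            # so return an empty result instead.
            return []
        return list(tabix.fetch(chrom, max(0, start), end))
    finally:
        tabix.close()
```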
…ub.com/streetslab/dimelo_v2 into feature/DM-191-refactor-load-processed
thekugelmeister
left a comment
Looks almost good! I requested some additional clarification in one place for a user-facing argument, but everything else I think looks fine.


Description
The core of this PR is a set of adjustments to load_processed for pileup vectors, pileup counts, and regions_to_list, with the goal of utilizing multiple cores to speed up processing and reduce memory overhead. Similar changes are possible conceptually for the single read loaders, but given that we may adjust single read loading to a new file type with new modkit versions and given that I have mostly had performance bottlenecks in data loading with pileups, I opted to keep that as-is. Most changes are directly related to this goal; some minor things were incidentally updated because I was overhauling that part of the code anyway.
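As an illustration of the memory-overhead angle (not the actual dimelo code), here is a minimal sketch of how per-chunk results can be accumulated into one shared pileup vector with multiprocessing.shared_memory instead of each worker returning a full-length array; all names and the chunk layout are hypothetical, and chunks are assumed non-overlapping:

```python
import numpy as np
from multiprocessing import Pool
from multiprocessing.shared_memory import SharedMemory

def add_chunk_counts(args):
    """Worker: attach to the shared array and add this chunk's counts in place."""
    shm_name, length, chunk_start, chunk_counts = args
    shm = SharedMemory(name=shm_name)
    try:
        vector = np.ndarray((length,), dtype=np.int64, buffer=shm.buf)
        vector[chunk_start:chunk_start + len(chunk_counts)] += chunk_counts
        del vector  # release the view before detaching from the shared block
    finally:
        shm.close()

def accumulate_counts(region_length, chunks, cores=2):
    """Accumulate non-overlapping per-chunk count arrays into one shared vector."""
    shm = SharedMemory(create=True, size=region_length * 8)  # 8 bytes per int64
    try:
        vector = np.ndarray((region_length,), dtype=np.int64, buffer=shm.buf)
        vector[:] = 0
        with Pool(cores) as pool:
            pool.map(
                add_chunk_counts,
                [(shm.name, region_length, start, counts) for start, counts in chunks],
            )
        result = vector.copy()
        del vector  # drop the view so the shared buffer can be released cleanly
        return result
    finally:
        shm.close()
        shm.unlink()

if __name__ == "__main__":
    chunks = [(0, np.ones(5, dtype=np.int64)), (5, 2 * np.ones(5, dtype=np.int64))]
    print(accumulate_counts(10, chunks))
```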
Here is an overview of the changes by file:
- utils.py
- load_processed.py
- export.py
- parse_bam.py
- plot_depth_histogram.py / plot_depth_profile.py / plot_enrichment.py / plot_enrichment_profile.py
- run_modkit.py
- cases.py
- test data bed files
- dimelo_test.py
- generate_targets.py
Things to Focus On
The core logic does appear to work identically to the previous implementation, as shown by my test infrastructure upgrades on a separate branch without the refactor. Any improvements to the clarity of the code would be welcome, because it has become in some ways less readable. We want to make sure comments are sufficient everywhere and that everything flows nicely. Also, is it right to show progress bars by default? Maybe not. Open to other interface adjustments too.