
Feature/dm 191 refactor load processed pileup #21

Merged
thekugelmeister merged 29 commits into main from feature/DM-191-refactor-load-processed
May 31, 2025

Conversation

@OberonDixon OberonDixon commented Feb 21, 2025

Description

The core of this PR is a set of adjustments to load_processed for pileup vectors, pileup counts, and regions_to_list, with the goal of using multiple cores to speed up processing and reduce memory overhead. Similar changes are conceptually possible for the single read loaders, but I opted to keep those as-is: we may move single read loading to a new file type with newer modkit versions, and my performance bottlenecks have mostly been in pileup data loading. Most changes are directly related to this goal; some minor things were updated incidentally because I was overhauling that part of the code anyway.

Here is an overview of the changes by file:

utils.py

  1. Added process_chunks_from_regions_dict, which breaks a regions_dict (regions keyed by chromosome) into a dict of region chunks, with subregions of the specified size. This supports parallelization later by providing a shared helper for splitting work into chunks (see the sketch below).
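
A minimal sketch of the chunking idea, assuming regions are stored as (start, end) tuples per chromosome; the real signature and return structure may differ:

    def process_chunks_from_regions_dict(regions_dict, chunk_size):
        """Split each chromosome's regions into subregions of at most chunk_size bp."""
        chunks_dict = {}
        for chromosome, regions in regions_dict.items():
            chunks = []
            for start, end in regions:
                for chunk_start in range(start, end, chunk_size):
                    chunks.append((chunk_start, min(chunk_start + chunk_size, end)))
            chunks_dict[chromosome] = chunks
        return chunks_dict

    # e.g. {"chr1": [(0, 2_500_000)]} with chunk_size=1_000_000 becomes
    # {"chr1": [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 2_500_000)]}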

load_processed.py

  1. The whole file was reorganized into sections: Load wrappers (which let you call other loaders for many regions simultaneously), Pileup loaders, and Single read loaders. Comments were added to function descriptions indicating which are User-facing as opposed to internal helpers.
  2. regions_to_list, the currently implemented load wrapper function (used for read depth histograms now, but applicable to much more later), was updated to parallelize either across the different load calls OR within each load call, depending on whether parallelize_within_regions is True. The latter is useful when analyzing a few large regions rather than many small ones. Logic was also added to handle progress bars.
  3. The process_region function, called by regions_to_list, was moved down to follow the organizational paradigm of top-level loader first, helpers after.
  4. pileup_counts_from_bedmethyl was updated to enable non-windowed loading, which is convenient for certain analysis tasks (NOT essential for parallelization)
  5. pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl were both adjusted to reference the same shared row parsing logic, now inside process_pileup_row.
  6. pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl got progress bars, which can be turned off with quiet=True.
  7. pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl both had their more specific logic moved into respective process_chunk functions, which are what the parallel process executor runs for each chunk. Experimentation showed that a chunk size of 1 million bp is fastest in essentially every case: for small regions that is just the whole region, while for whole-chromosome processing it gives good resource utilization, a relatively low memory footprint, and fast overall speed.
  8. The output values for pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl are handled with multiprocessing shared_memory.SharedMemory objects, which let all of the cores share a block of memory that I map to numpy arrays (other mappings, such as to plain ints, made it hard to keep edits tied to the shared memory). A lock, also shared, prevents conflicting write attempts. There is some nuance in handling these objects properly: calling .close() on the instantiated reference in the child process function seems crucial, and .close() followed by .unlink() in the manager function is absolutely crucial to avoid catastrophic failure. See the sketch after this list.
  9. In the single read loader section, comments were added where descriptions of functionality were missing.
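
A minimal, self-contained sketch of the shared-memory pattern from item 8; the names, shapes, and per-chunk work are illustrative stand-ins, and the real loaders create and fill their arrays differently:

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    from multiprocessing import Manager, shared_memory

    import numpy as np


    def _process_chunk(chunk, shm_name, length, lock):
        """Worker: attach to the shared array, accumulate this chunk's counts, detach."""
        start, end = chunk
        shm = shared_memory.SharedMemory(name=shm_name)
        counts = np.ndarray((length,), dtype=np.int64, buffer=shm.buf)
        with lock:  # shared lock so concurrent writes cannot collide
            counts[start:end] += 1  # stand-in for tallying parsed bedmethyl rows
        # Close (never unlink) the child's handle; some Python versions may still
        # emit a resource_tracker warning for segments attached in workers.
        shm.close()


    def count_over_chunks(length, chunks, cores=None):
        shm = shared_memory.SharedMemory(
            create=True, size=length * np.dtype(np.int64).itemsize
        )
        counts = np.ndarray((length,), dtype=np.int64, buffer=shm.buf)
        counts[:] = 0
        try:
            with Manager() as manager:
                # A manager-backed lock proxy is picklable, so it can be sent to workers.
                lock = manager.Lock()
                worker = partial(
                    _process_chunk, shm_name=shm.name, length=length, lock=lock
                )
                with ProcessPoolExecutor(max_workers=cores) as executor:
                    list(executor.map(worker, chunks))
            return counts.copy()  # copy out before the shared block is released
        finally:
            shm.close()
            shm.unlink()  # close then unlink in the parent to free the memory


    if __name__ == "__main__":
        print(count_over_chunks(10, [(0, 5), (5, 10)], cores=2))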

export.py

  1. Updated the bigwig exporter to reference the new load_processed.process_pileup_row function, sharing logic with the load_processed pileup handlers. This cleans up quite a lot, it turns out.

parse_bam.py

  1. ruff demanded tiny format changes so I relented.

plot_depth_histogram.py / plot_depth_profile.py / plot_enrichment.py / plot_enrichment_profile.py

  1. Added passthrough parameter for keeping pbars quiet

run_modkit.py

  1. ruff demanded tiny format changes so I relented.

cases.py

  1. Added test coverage for different single_strand, regions_5to3prime, chunk_size, and cores values. These were all important to test before and after the refactor because I completely rewrote the core strand handling logic, and the cores parameter now controls parallelization that we implement ourselves rather than delegating to modkit.

test data bed files

  1. Added strand info per the actual BED spec, i.e. in the correct column, so the tests actually test something (see the example below).
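
For reference, strand sits in the sixth column of the standard six-column BED layout (tab-separated); the rows below are made-up examples rather than the actual test data:

    chr1    1000    2000    regionA    0    +
    chr1    5000    6000    regionB    0    -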

dimelo_test.py

  1. Explained decision to override cores test value for extract specifically
  2. Some formatting changes demanded by ruff

generate_targets.py

  1. New logic to handle test case management when adjusting some tests to increase coverage while keeping other test results, including generation timestamps, identical, so their time-of-validity stays easy to trace.
  2. Added logic to remove the cores parameter passed to extract at generation time, for the reason explained in dimelo_test.py: modkit extract output ordering is non-deterministic when parallelized (at least as of 0.2.4).

Things to Focus On

The core logic does seem to work identically to the previous implementation, as shown by my test infrastructure upgrades on a separate branch without the refactor. Any improvements to clarity would certainly be welcome, because the code has become in some ways less readable. We want to make sure comments are sufficient everywhere and that it all flows nicely. Also, is it right to show progress bars by default? Maybe not. Open to other interface adjustments too.

Oberon Dixon-Luinenburg and others added 23 commits January 8, 2025 09:18
…l need to apply to export functions, and need to check that performance is unimpacted.
…ause it will complicate parallelization and is not clearly useful.
…cess_pileup_row and renamed/configured a few variables to support this.
…hunk generation, currently only for load_processed.pileup_vectors_from_bedmethyl. pileup_counts_from_bedmethyl is a simpler case. Passes pre-existing tests including when chunk_size<region size, but further infrastructure is required to assess whether regions_5to3prime still works right (logic gets a bit complicated). Speed with a single core and a small number of small regions is slightly slower due to overhead necessary to support parallelization. Need to test that we get a speedup with multi-core long regions or many regions.
…ead_vectors_from_hdf5 test targets to pass tests; other test targets are the same for now.
…d generate_targets.py to properly handle copying over new kwargs from cases.py when re-running some but not all targets: if all targets are re-run, then the old test_matrix.pkl is discarded. If only some are re-run, we need the old test_matrix.pkl, but still need to copy over all the kwargs for all cases that won't be re-run and still need to copy over old results, which will be overridden for re-run targets.
…rocessed-tests-baseline

Feature/dm 191 refactor load processed tests baseline
…ns_5to3prime logic within the pileup_vectors_from_bedmethyl function, which is the only one that actually needs it. Runs faster, still passes all tests.
…y, re-ran targets for pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl.
…rocessed-tests-baseline

DM-192 Added chunk_size for load_processes parallelization to cases.p…
…. Added progress bars and quiet toggles for pileup loaders. Set default chunk size to 1 MB after whole-chromosome speed testing. Added option to parallelize within rather than between regions with regions_to_list.
…e to avoid intermittent crashes. Adjust tabix fetch to avoid negative start values. Tweak wording on progress bars.
… to all pileup plotters. A few other small tweaks.
… and Macos ostensibly have 4 cores. This should not change any outputs because for extract the test is hardcoded to use 1 core, and that is the only one with any stochasticity in parallelization.
…to avoid ever sending the cores arg twice, and added a comment to explain the reason why extract is tested unparallelized, unlike other functions.
…ment for extract. Updated test_matrix.pickle by re-running generate_targets. No changes to pileup or extract files, but should pass all tests.
…rocessed-tests-baseline

New test_matrix and cases to cover 1, 2, 3, 4, and None for cores
@OberonDixon changed the title from "Feature/dm 191 refactor load processed" to "Feature/dm 191 refactor load processed pileup" on Mar 13, 2025

@thekugelmeister left a comment

Anecdotal testing comments:

  • A figure that took me 30 minutes to produce this morning took 15 minutes this afternoon, on a not very optimized setup. My computer might be screaming in pain, but I can adjust that. I'm theoretically quite content with that result on the whole, although I'm curious whether you've seen better speedups on Savio.
  • Other plots have had amazing speedups since. Still not sure what that was about.

General comments from review:

  • I feel like we should clarify what the options for cores can be, for the user-facing methods. I understand that they are all the same (None means "all available", etc.), and it's slightly redundant to say it everywhere, but I think it would be useful for end users.
  • Progress bars
    • Almost every time I talk to a new package user, they say, "it's nice it has progress bars" 🤣 That being said, we don't need them to stick around after the fact. I might actually start arguing that all of our progress bars should be set to leave=False or leave=None (see the small example after this list).
    • I think the message on the bar should be changed to something more generic like "loading data". I think it will suffice and look cleaner.
  • The fact that it prints "modkit found with expected version 0.2.4" for every core is mildly annoying. Now that I think about it, is it necessary to print that at all? Can we just have it print if none or the wrong version is found?
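
A small illustration of the leave suggestion, assuming the bars use tqdm; the loop below is a stand-in, not the actual loader code:

    from time import sleep

    from tqdm import tqdm

    # leave=False clears the bar from the terminal once the loop finishes
    for chunk in tqdm(range(20), desc="Loading data", leave=False):
        sleep(0.05)  # stand-in for per-chunk work

As a note, leave=None keeps a bar only when it sits at the outermost position, so nested bars clean themselves up.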

    window_size: int | None = None,
    quiet: bool = True,
    cores: int | None = None,
    parallelize_within_regions: bool = False,

@thekugelmeister

Why is this option here? Do you have any use cases in mind where this would be useful?

@OberonDixon (Author)

Clarified use case in docstring. It is for small numbers of large regions, e.g. comparing chromosomes.

@thekugelmeister

I loosely understand the goal of the use case, but I find the wording difficult to parse and I'm not sure I fully follow the logic.

If I'm understanding correctly, the goal is to specify whether to split up large regions into smaller chunks. If you have small regions, you can just be embarrassingly parallel across regions. If you have larger ones, you want to split those regions up. Is that high level accurate?

If this understanding is true, then is the current configuration that False --> don't split regions and True means to split regions?

If all of this is true, I suggest maybe a rename of this argument to split_large_regions or something like that? What do you think?

I only care because this is user facing and I find the current name/docs a little arcane. If I'm understanding correctly, this is an important setting to include, so let's make it clear to the user why they should care.

        return results


    def process_region(region_string, function_handle, **kwargs):

@thekugelmeister

I'm probably misunderstanding, but why is this a separate method? Why can't it be part of the partial above? If it needs to exist, I think it needs a better name; this one is very generic. Can you describe more fully what it is intended to do?

@OberonDixon (Author), Mar 23, 2025

I think the executor.map call needs a positional input (e.g. the region) rather than a keyword, but the approaches that do take keywords don't seem to return results in the same order they were submitted (which is necessary for the outputs list to map onto the input regions), so I am using this wrapper. Renamed it for clarity to apply_loader_function_to_region. Happy to use a different multiprocessing approach that can directly handle kwargs if you know of one. (A rough sketch of the pattern is below.)
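
A minimal sketch of the wrapper pattern described above; regions_to_list_sketch and its parameters are illustrative stand-ins rather than the actual dimelo API:

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial


    def apply_loader_function_to_region(region_string, function_handle, **kwargs):
        """Thin wrapper so executor.map can pass each region positionally."""
        return function_handle(region_string, **kwargs)


    def regions_to_list_sketch(regions, loader, cores=None, **loader_kwargs):
        # partial pins the loader and its kwargs; executor.map feeds each region
        # positionally and yields results in the same order as the input regions.
        worker = partial(
            apply_loader_function_to_region,
            function_handle=loader,
            **loader_kwargs,
        )
        with ProcessPoolExecutor(max_workers=cores) as executor:
            return list(executor.map(worker, regions))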

    elif len(pileup_basemod.split(",")) == 3:
    pileup_modname, pileup_motif, pileup_mod_coord = (
    pileup_basemod.split(",")
    keep_basemod, genomic_coord, modified_in_row, valid_in_row = (

@thekugelmeister

Very much appreciate the simplification!

@OberonDixon (Author)

Performance test results

These are the speedups I see on Savio. The second graph also shows results for fewer cores, so it should indicate what to expect from e.g. an 8-core MacBook.

[speedup graphs]

High level feedback response

I got rid of the successfully-found-modkit printout, leaving the error outputs. Adjusted the progress bars to disappear when done for load_processed (not for parse_bam, though we can do that in the future). Changed the description to just "Loading data."

Granular feedback response

I made changes to fix most of the requests. I decided not to consolidate the SharedMemory management because there are subtle differences in how the variables are created between the counts and vectors methods (I added comments describing what they are doing), and the dereferencing code is already straightforward.

Other changes

I added the check below to both the vectors and counts loaders. I had it in previous versions of these pileup loaders but removed it for simplicity, only to realize/remember that we get errors if we don't make sure the chromosome is present in the tabix index before fetching.

        # tabix throws an error if the contig is not present
        # by the current design, this should be silent
        if chromosome in source_tabix.contigs:
            for row in source_tabix.fetch(
                chromosome, max(subregion_start, 0), subregion_end
            ):
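
For context, a minimal sketch of the setup around that snippet, assuming pysam provides the tabix access (the filename and variable name are placeholders):

    import pysam

    # A bgzipped, tabix-indexed bedmethyl file. .contigs lists the chromosomes in
    # the index, so the membership check above avoids an error when fetching a
    # contig the file does not contain.
    source_tabix = pysam.TabixFile("pileup.bed.gz")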

@thekugelmeister left a comment

Looks almost good! I requested some additional clarification in one place for a user-facing argument, but everything else I think looks fine.

@thekugelmeister merged commit ffce0f6 into main on May 31, 2025
4 checks passed
@thekugelmeister deleted the feature/DM-191-refactor-load-processed branch on May 31, 2025 at 16:37