
Feature/dm 191 refactor load processed pileup #21

Merged
thekugelmeister merged 29 commits into main from feature/DM-191-refactor-load-processed
May 31, 2025

Conversation

@OberonDixon OberonDixon commented Feb 21, 2025

Description

The core of this PR is a set of adjustments to load_processed for pileup vectors, pileup counts, and regions_to_list, with the goal of using multiple cores to speed up processing and reduce memory overhead. Similar changes are conceptually possible for the single read loaders, but I opted to keep those as-is: we may move single read loading to a new file type with newer modkit versions, and my performance bottlenecks have mostly been in pileup data loading. Most changes are directly related to this goal; some minor things were updated incidentally because I was overhauling that part of the code anyway.

Here is an overview of the changes by file:

utils.py

  1. Added process_chunks_from_regions_dict, which breaks a regions_dict (regions keyed by chromosome) into a dict of region chunks, with subregions of the specified size. This supports parallelization later by providing a shared helper for splitting work into chunks (see the sketch below).
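
A minimal sketch of the chunking idea, assuming regions are stored as (start, end) tuples per chromosome; the real signature and return structure may differ:

    def process_chunks_from_regions_dict(regions_dict, chunk_size):
        """Split each chromosome's regions into subregions of at most chunk_size bp."""
        chunks_dict = {}
        for chromosome, regions in regions_dict.items():
            chunks = []
            for start, end in regions:
                for chunk_start in range(start, end, chunk_size):
                    chunks.append((chunk_start, min(chunk_start + chunk_size, end)))
            chunks_dict[chromosome] = chunks
        return chunks_dict

    # e.g. {"chr1": [(0, 2_500_000)]} with chunk_size=1_000_000 becomes
    # {"chr1": [(0, 1_000_000), (1_000_000, 2_000_000), (2_000_000, 2_500_000)]}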

load_processed.py

  1. The whole file was reorganized into sections: Load wrappers (which let you call other loaders for many regions simultaneously), Pileup loaders, and Single read loaders. Comments were added to function descriptions indicating which are User-facing as opposed to internal helpers.
  2. regions_to_list, the currently implemented load wrapper function (used for read depth histograms now, but applicable to much more later), was updated to parallelize either across the different load calls OR within each load call, depending on whether parallelize_within_regions is True. The latter is useful when analyzing a few large regions rather than many small ones. Logic was also added to handle progress bars.
  3. The process_region function, called by regions_to_list, was moved down to follow the organizational paradigm of top-level loader first, helpers after.
  4. pileup_counts_from_bedmethyl was updated to enable non-windowed loading, which is convenient for certain analysis tasks (NOT essential for parallelization)
  5. pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl were both adjusted to reference the same shared row parsing logic, now inside process_pileup_row.
  6. pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl got progress bars, which can be turned off with quiet=True.
  7. pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl both had their more specific logic moved into respective process_chunk functions, which are what the parallel process executor runs for each chunk. Experimentation showed that a chunk size of 1 million bp is fastest in essentially every case: for small regions that is just the whole region, while for whole-chromosome processing it gives good resource utilization, a relatively low memory footprint, and fast overall speed.
  8. The output values for pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl are handled with multiprocessing shared_memory.SharedMemory objects, which let all of the cores share a block of memory that I map to numpy arrays (other mappings, such as to plain ints, made it hard to keep edits tied to the shared memory). A lock, also shared, prevents conflicting write attempts. There is some nuance in handling these objects properly: calling .close() on the instantiated reference in the child process function seems crucial, and .close() followed by .unlink() in the manager function is absolutely crucial to avoid catastrophic failure. See the sketch after this list.
  9. In the single read loader section, comments were added where descriptions of functionality were missing.
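
A minimal, self-contained sketch of the shared-memory pattern from item 8; the names, shapes, and per-chunk work are illustrative stand-ins, and the real loaders create and fill their arrays differently:

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial
    from multiprocessing import Manager, shared_memory

    import numpy as np


    def _process_chunk(chunk, shm_name, length, lock):
        """Worker: attach to the shared array, accumulate this chunk's counts, detach."""
        start, end = chunk
        shm = shared_memory.SharedMemory(name=shm_name)
        counts = np.ndarray((length,), dtype=np.int64, buffer=shm.buf)
        with lock:  # shared lock so concurrent writes cannot collide
            counts[start:end] += 1  # stand-in for tallying parsed bedmethyl rows
        # Close (never unlink) the child's handle; some Python versions may still
        # emit a resource_tracker warning for segments attached in workers.
        shm.close()


    def count_over_chunks(length, chunks, cores=None):
        shm = shared_memory.SharedMemory(
            create=True, size=length * np.dtype(np.int64).itemsize
        )
        counts = np.ndarray((length,), dtype=np.int64, buffer=shm.buf)
        counts[:] = 0
        try:
            with Manager() as manager:
                # A manager-backed lock proxy is picklable, so it can be sent to workers.
                lock = manager.Lock()
                worker = partial(
                    _process_chunk, shm_name=shm.name, length=length, lock=lock
                )
                with ProcessPoolExecutor(max_workers=cores) as executor:
                    list(executor.map(worker, chunks))
            return counts.copy()  # copy out before the shared block is released
        finally:
            shm.close()
            shm.unlink()  # close then unlink in the parent to free the memory


    if __name__ == "__main__":
        print(count_over_chunks(10, [(0, 5), (5, 10)], cores=2))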

export.py

  1. Updated the bigwig exporter to reference the new load_processed.process_pileup_row function, sharing logic with the load_processed pileup handlers. This cleans up quite a lot, it turns out.

parse_bam.py

  1. ruff demanded tiny format changes so I relented.

plot_depth_histogram.py / plot_depth_profile.py / plot_enrichment.py / plot_enrichment_profile.py

  1. Added passthrough parameter for keeping pbars quiet

run_modkit.py

  1. ruff demanded tiny format changes so I relented.

cases.py

  1. Added test coverage for different single_strand, regions_5to3prime, chunk_size, and cores values. These were all important to test before and after the refactor because I completely rewrote the core strand handling logic, and the cores parameter now controls parallelization that we implement ourselves rather than delegating to modkit.

test data bed files

  1. Added strand info per the actual BED spec, i.e. in the correct column, so the tests actually test something (see the example below).
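
For reference, strand sits in the sixth column of the standard six-column BED layout (tab-separated); the rows below are made-up examples rather than the actual test data:

    chr1    1000    2000    regionA    0    +
    chr1    5000    6000    regionB    0    -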

dimelo_test.py

  1. Explained decision to override cores test value for extract specifically
  2. Some formatting changes demanded by ruff

generate_targets.py

  1. New logic to handle test case management when adjusting some tests to increase coverage while keeping other test results, including generation timestamps, identical, so their time-of-validity stays easy to trace.
  2. Added logic to remove the cores parameter passed to extract at generation time, for the reason explained in dimelo_test.py: modkit extract output ordering is non-deterministic when parallelized (at least as of 0.2.4).

Things to Focus On

The core logic does seem to work identically to the previous implementation, as shown by my test infrastructure upgrades on a separate branch without the refactor. Any improvements to clarity would certainly be welcome, because the code has become in some ways less readable. We want to make sure comments are sufficient everywhere and that it all flows nicely. Also, is it right to show progress bars by default? Maybe not. Open to other interface adjustments too.

Oberon Dixon-Luinenburg and others added 23 commits January 8, 2025 09:18
…l need to apply to export functions, and need to check that performance is unimpacted.
…ause it will complicate parallelization and is not clearly useful.
…cess_pileup_row and renamed/configured a few variables to support this.
…hunk generation, currently only for load_processed.pileup_vectors_from_bedmethyl. pileup_counts_from_bedmethyl is a simpler case. Passes pre-existing tests including when chunk_size<region size, but further infrastructure is required to assess whether regions_5to3prime still works right (logic gets a bit complicated). Speed with a single core and a small number of small regions is slightly slower due to overhead necessary to support parallelization. Need to test that we get a speedup with multi-core long regions or many regions.
…ead_vectors_from_hdf5 test targets to pass tests; other test targets are the same for now.
…d generate_targets.py to properly handle copying over new kwargs from cases.py when re-running some but not all targets: if all targets are re-run, then the old test_matrix.pkl is discarded. If only some are re-run, we need the old test_matrix.pkl, but still need to copy over all the kwargs for all cases that won't be re-run and still need to copy over old results, which will be overridden for re-run targets.
…rocessed-tests-baseline

Feature/dm 191 refactor load processed tests baseline
…ns_5to3prime logic within the pileup_vectors_from_bedmethyl function, which is the only one that actually needs it. Runs faster, still passes all tests.
…y, re-ran targets for pileup_counts_from_bedmethyl and pileup_vectors_from_bedmethyl.
…rocessed-tests-baseline

DM-192 Added chunk_size for load_processes parallelization to cases.p…
…. Added progress bars and quiet toggles for pileup loaders. Set default chunk size to 1 MB after whole-chromosome speed testing. Added option to parallelize within rather than between regions with regions_to_list.
…e to avoid intermittent crashes. Adjust tabix fetch to avoid negative start values. Tweak wording on progress bars.
… to all pileup plotters. A few other small tweaks.
… and Macos ostensibly have 4 cores. This should not change any outputs because for extract the test is hardcoded to use 1 core, and that is the only one with any stochasticity in parallelization.
…to avoid ever sending the cores arg twice, and added a comment to explain the reason why extract is tested unparallelized, unlike other functions.
…ment for extract. Updated test_matrix.pickle by re-running generate_targets. No changes to pileup or extract files, but should pass all tests.
…rocessed-tests-baseline

New test_matrix and cases to cover 1, 2, 3, 4, and None for cores
@OberonDixon changed the title from "Feature/dm 191 refactor load processed" to "Feature/dm 191 refactor load processed pileup" on Mar 13, 2025

@thekugelmeister left a comment

Anecdotal testing comments:

  • A figure that took me 30 minutes to produce this morning took 15 minutes this afternoon, on a not very optimized setup. My computer might be screaming in pain, but I can adjust that. I'm theoretically quite content with that result on the whole, although I'm curious whether you've seen better speedups on Savio.
  • Other plots have had amazing speedups since. Still not sure what that was about.

General comments from review:

  • I feel like we should clarify what the options for cores can be, for the user-facing methods. I understand that they are all the same (None means "all available", etc.), and it's slightly redundant to say it everywhere, but I think it would be useful for end users.
  • Progress bars
    • Almost every time I talk to a new package user, they say, "it's nice it has progress bars" 🤣 That being said, we don't need them to stick around after the fact. I might actually start arguing that all of our progress bars should be set to leave=False or leave=None (see the small example after this list).
    • I think the message on the bar should be changed to something more generic like "loading data". I think it will suffice and look cleaner.
  • The fact that it prints "modkit found with expected version 0.2.4" for every core is mildly annoying. Now that I think about it, is it necessary to print that at all? Can we just have it print if none or the wrong version is found?
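
A small illustration of the leave suggestion, assuming the bars use tqdm; the loop below is a stand-in, not the actual loader code:

    from time import sleep

    from tqdm import tqdm

    # leave=False clears the bar from the terminal once the loop finishes
    for chunk in tqdm(range(20), desc="Loading data", leave=False):
        sleep(0.05)  # stand-in for per-chunk work

As a note, leave=None keeps a bar only when it sits at the outermost position, so nested bars clean themselves up.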

    window_size: int | None = None,
    quiet: bool = True,
    cores: int | None = None,
    parallelize_within_regions: bool = False,

@thekugelmeister

Why is this option here? Do you have any use cases in mind where this would be useful?

@OberonDixon (Author)

Clarified use case in docstring. It is for small numbers of large regions, e.g. comparing chromosomes.

@thekugelmeister

I loosely understand the goal of the use case, but I find the wording difficult to parse and I'm not sure I fully follow the logic.

If I'm understanding correctly, the goal is to specify whether to split up large regions into smaller chunks. If you have small regions, you can just be embarrassingly parallel across regions. If you have larger ones, you want to split those regions up. Is that high level accurate?

If this understanding is true, then is the current configuration that False --> don't split regions and True means to split regions?

If all of this is true, I suggest maybe a rename of this argument to split_large_regions or something like that? What do you think?

I only care because this is user facing and I find the current name/docs a little arcane. If I'm understanding correctly, this is an important setting to include, so let's make it clear to the user why they should care.

        return results


    def process_region(region_string, function_handle, **kwargs):

@thekugelmeister

I'm probably misunderstanding, but why is this a separate method? Why can't it be part of the partial above? If it needs to exist, I think it needs a better name; this one is very generic. Can you describe more fully what it is intended to do?

@OberonDixon (Author), Mar 23, 2025

I think the executor.map call needs a positional input (e.g. the region) rather than a keyword, but the approaches that do take keywords don't seem to return results in the same order they were submitted (which is necessary for the outputs list to map onto the input regions), so I am using this wrapper. Renamed it for clarity to apply_loader_function_to_region. Happy to use a different multiprocessing approach that can directly handle kwargs if you know of one. (A rough sketch of the pattern is below.)
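
A minimal sketch of the wrapper pattern described above; regions_to_list_sketch and its parameters are illustrative stand-ins rather than the actual dimelo API:

    from concurrent.futures import ProcessPoolExecutor
    from functools import partial


    def apply_loader_function_to_region(region_string, function_handle, **kwargs):
        """Thin wrapper so executor.map can pass each region positionally."""
        return function_handle(region_string, **kwargs)


    def regions_to_list_sketch(regions, loader, cores=None, **loader_kwargs):
        # partial pins the loader and its kwargs; executor.map feeds each region
        # positionally and yields results in the same order as the input regions.
        worker = partial(
            apply_loader_function_to_region,
            function_handle=loader,
            **loader_kwargs,
        )
        with ProcessPoolExecutor(max_workers=cores) as executor:
            return list(executor.map(worker, regions))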

    elif len(pileup_basemod.split(",")) == 3:
    pileup_modname, pileup_motif, pileup_mod_coord = (
    pileup_basemod.split(",")
    keep_basemod, genomic_coord, modified_in_row, valid_in_row = (

@thekugelmeister

Very much appreciate the simplification!

@OberonDixon (Author)

Performance test results

These are the speedups I see on Savio. The second graph also shows results for fewer cores, so it should indicate what to expect from e.g. an 8-core MacBook.

[speedup graphs]

High level feedback response

I got rid of the successfully-found-modkit printout, leaving the error outputs. Adjusted the progress bars to disappear when done for load_processed (not for parse_bam, though we can do that in the future). Changed the description to just "Loading data."

Granular feedback response

I made changes to fix most of the requests. I decided not to consolidate the SharedMemory management because there are subtle differences in how the variables are created between the counts and vectors methods (I added comments describing what they are doing), and the dereferencing code is already straightforward.

Other changes

I added the check below to both the vectors and counts loaders. I had it in previous versions of these pileup loaders but removed it for simplicity, only to realize/remember that we get errors if we don't make sure the chromosome is present in the tabix index before fetching.

        # tabix throws an error if the contig is not present
        # by the current design, this should be silent
        if chromosome in source_tabix.contigs:
            for row in source_tabix.fetch(
                chromosome, max(subregion_start, 0), subregion_end
            ):
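
For context, a minimal sketch of the setup around that snippet, assuming pysam provides the tabix access (the filename and variable name are placeholders):

    import pysam

    # A bgzipped, tabix-indexed bedmethyl file. .contigs lists the chromosomes in
    # the index, so the membership check above avoids an error when fetching a
    # contig the file does not contain.
    source_tabix = pysam.TabixFile("pileup.bed.gz")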

@thekugelmeister left a comment

Looks almost good! I requested some additional clarification in one place for a user-facing argument, but everything else I think looks fine.

@thekugelmeister merged commit ffce0f6 into main on May 31, 2025
4 checks passed
@thekugelmeister deleted the feature/DM-191-refactor-load-processed branch on May 31, 2025 at 16:37