Feature/dm 215,dm 113,dm 90 genome browser exports, test phase 2, read depth qc#15
Conversation
…sed functions to cover future plans to collapse redundant code
…t bigwig code. Added checkpoints to gitignore.
…cases. Added error handling for utils.parse_region_string as this is called directly by plot_read_browser rather than routing through utils.add_region_to_dict. This may want to be consolidated later by re-ordering the logic in add_region_to_dict, tbd.
…g. Adjustments to test routine and error throwing for plot_read_browser. Re-generated all test targets, after verifying that dimelo passes both old and new targets.
…st target values should be unchanged.
…management. Confirmed that generation and test run successfully on macos.
…functions in load_processed (load region-by-region) and utils (hist plotter).
… read depth histogram to __init__.
…processed.regions_to_list, and thus a speedup for plot_depth_histogram functions. Added a useful little cores_available utility for use by other functions.
…ed core functions.
thekugelmeister
left a comment
There was a problem hiding this comment.
I stopped this review halfway through after spending an hour on the read depth plotting. It appears to me that there are 2 or 3 different things being combined into one PR, and I started losing the thread of what has been done here. In general, PRs should be constructed in more compartmentalized ways.
That being said, I covered a lot in the plotting sections, which is a start, at least.
Can you add comments to the PR describing the changes you made to testing, so that I have some context for what to review when I return to it?
dimelo/export.py
Outdated
| # note: the tqdm progress bar slows things down by about 33%, which was deemed better at the time of writing this than | ||
| # 90 seconds without any status updates | ||
| rows_count, last_row = list( | ||
| tail( |
There was a problem hiding this comment.
I understand part of this comment, but am concerned or confused about others.
- hard agree that progress bars are good
tqdmis very performant. Usually, if adding a progress bar is making things slow, that means that something else is going wrong in the calculation of progress- fetching the last item in an iterator is supposed to be slow
- unless I misunderstand something, constructing a deque just to get the last item seems wasteful?
- TL;DR: naively, it seems like the call to
tailshould be what makes it slow? But I might need to put more thought into this.
dimelo/export.py
Outdated
| output_file_path = ( | ||
| bigwig_file | ||
| if bigwig_file is not None | ||
| else Path(bedmethyl_file).parent / "pileup.fractions.bigwig" |
There was a problem hiding this comment.
Path(bedmethyl_file) is constructed multiple times throughout this method; should be declared once and referenced later.
dimelo/export.py
Outdated
| ) | ||
| os.makedirs(output_file_path.parent, exist_ok=True) | ||
|
|
||
| # Because we need to set up the bigwig header for we start writing data to it, we need to pre-index the length of each contig |
dimelo/plot_depth_histogram.py
Outdated
| window_size: half-size of the desired window to plot; how far the window stretches on either side of the center point | ||
| single_strand: True means we only grab counts from reads from the same strand as | ||
| the region of interest, False means we always grab both strands within the regions | ||
| average_within_region: if True, each region will only report a single depth value, averaging across all non-zero depths |
There was a problem hiding this comment.
What happens if it's false? It has to perform some sort of aggregation for each region, right?
dimelo/plot_depth_histogram.py
Outdated
| **kwargs, | ||
| ) -> Axes: | ||
| """ | ||
| Plot depth histograms, overlaying the resulting traces on top of each other. |
There was a problem hiding this comment.
I think there's nuance in this method that could require some more explanation.
If I understand this method correctly, I propose this expansion of the docstring introduction:
Plot a histogram of the read depths observed for each provided region.
If this is correct, I'm not sure I understand the options regarding averaging, and what value is actually reported for each region. I think that needs to be documented more clearly as well.
Also might need some clarification about the single-stranded-ness, and the difference between true read depth and what is reported here.
If this is all explained somewhere else already, a reference for where to look for details would suffice.
dimelo/plot_depth_histogram.py
Outdated
| axes = utils.hist_plot( | ||
| value_vectors=depth_vectors, | ||
| value_names=sample_names, | ||
| x_label="per strand read\ndepth in region" |
There was a problem hiding this comment.
I know this is pythonic and sensible enough, but I found it weird to look at now that ruff has reformatted it. Can the x_label be assigned above so that it's easier to read?
| **kwargs, | ||
| ) -> Axes: | ||
| """ | ||
| Plot depth profiles, overlaying the resulting traces on top of each other. |
There was a problem hiding this comment.
I think a lot of the same commentary from the histogram methods applies to this file.
| region=region, window_size=None | ||
| ) | ||
| except ValueError as err: | ||
| raise ValueError( |
There was a problem hiding this comment.
Is this just tidying up error reporting, or is this solving a problem you encountered?
There was a problem hiding this comment.
Some of the test cases deliberately trip this and so I wanted to handle it in a way where they could check that the error message is right
dimelo/export.py
Outdated
| you can specify the regions at parsing time, rather than re-implementing the subset handling logic here. | ||
|
|
||
| Args: | ||
| bedmethyl_file: Path to the input bedmethyl file |
dimelo/test/cases.py
Outdated
| "title": "megalodon_peaks_190", | ||
| }, | ||
| # outputs dict function:values | ||
| {}, # populated in subsequent cells |
There was a problem hiding this comment.
Change "cells" to "generate_targets.py"
dimelo/plot_depth_histogram.py
Outdated
| regions_list: list[str | Path | list[str | Path]], | ||
| motifs: list[str], | ||
| sample_names: list[str], | ||
| window_size: int, |
dimelo/utils.py
Outdated
| Takes arbitrarily many counts vectors and plots on same histogram. | ||
|
|
||
| Args: | ||
| value_vectors: parallel with each entry in vectors |
There was a problem hiding this comment.
Replace vectors with value_names?
dimelo/utils.py
Outdated
| x=x_label, | ||
| hue=y_label, | ||
| multiple="dodge", | ||
| **{**kwargs, **({"bins": bins} if bins is not None else {})}, |
There was a problem hiding this comment.
I understand what this is doing, but looking at it makes me cringe 😅 python is dumb
I think it's clearer if the bins entry in the dictionary is set in an if statement earlier. Assuming we're keeping the integer_values argument and the defaults, etc., here's how I might organize it for clarity:
if integer_values:
# Warn user that passed bins are being overwritten
if "bins" in kwargs:
print("Warning: bin settings overwritten by defaults")
kwargs["bins"] = np.arange(data_table[x_label].min() - 0.5, data_table[x_label].max() + 1.5, 1)
# plot histogram
ax = sns.histplot(
data=data_table,
x=x_label,
hue=y_label,
multiple="dodge",
**kwargs,
)Note that I'm also warning the user that we might be overwriting their manual settings; that's kind of unrelated but also came up while drafting this code snippet.
dimelo/test/cases.py
Outdated
| test_data_dir = Path("./data") | ||
| output_dir = test_data_dir / "test_targets" | ||
|
|
||
| region = "chr1:114357437-114359753" # 'chr1:9167177-9169177' |
There was a problem hiding this comment.
Can you document where this region came from, even if it's totally random or a historical anecdote? Also, what is the commented out region?
dimelo/test/cases.py
Outdated
|
|
||
| # Paths to input files | ||
| ctcf_bam_file = test_data_dir / "ctcf_demo.sorted.bam" | ||
| # ctcf_guppy_bam_file = test_data_dir / 'winnowmap_guppy_merge_subset.updated.bam' |
There was a problem hiding this comment.
Is this necessary? If so, can you document why it's here?
dimelo/test/dimelo_test.py
Outdated
| outputs section of dimelo/test/generate_test_targets.ipynb. | ||
| """ | ||
|
|
||
| def test_unit__regions_list_list( |
There was a problem hiding this comment.
I can't tell what this is testing from the method name. Can it be clarified, either in name or documentation?
There was a problem hiding this comment.
Also, does this method test for values being correct? On first glance it does not appear to. Based on my rough understanding and the docstring for the class, I would have assumed it would do so.
dimelo/test/dimelo_test.py
Outdated
| Tests file export functionality in export module. | ||
|
|
||
| This test currently simply checks that we can make the appropriate output files without raising errors. | ||
| The values stored in the files are not verified. |
There was a problem hiding this comment.
Is there a reason for this other than not really needing to right now?
…sts, added to tutorial.
Added bigwig export in export module.
Updated test framework to include export and plot_read_browser, as well as a new test case. Fixed some error reporting and some code that was generating warnings.
Revamped test target generator code to be more modular and maintainable.
Created read depth profile and read depth histogram plotting modules, with corresponding supporting functions and test coverage.