Skip to content

Feature/dm 215,dm 113,dm 90 genome browser exports, test phase 2, read depth qc#15

Merged
thekugelmeister merged 31 commits intomainfrom
feature/DM-90-read-depth-qc
Jan 29, 2025
Merged

Feature/dm 215,dm 113,dm 90 genome browser exports, test phase 2, read depth qc#15
thekugelmeister merged 31 commits intomainfrom
feature/DM-90-read-depth-qc

Conversation

@OberonDixon
Copy link

@OberonDixon OberonDixon commented Jan 8, 2025

Added bigwig export in export module.

Updated test framework to include export and plot_read_browser, as well as a new test case. Fixed some error reporting and some code that was generating warnings.

Revamped test target generator code to be more modular and maintainable.

Created read depth profile and read depth histogram plotting modules, with corresponding supporting functions and test coverage.

Oberon Dixon-Luinenburg and others added 19 commits January 5, 2025 14:21
…sed functions to cover future plans to collapse redundant code
…t bigwig code. Added checkpoints to gitignore.
…cases. Added error handling for utils.parse_region_string as this is called directly by plot_read_browser rather than routing through utils.add_region_to_dict. This may want to be consolidated later by re-ordering the logic in add_region_to_dict, tbd.
…g. Adjustments to test routine and error throwing for plot_read_browser. Re-generated all test targets, after verifying that dimelo passes both old and new targets.
…management. Confirmed that generation and test run successfully on macos.
…functions in load_processed (load region-by-region) and utils (hist plotter).
…processed.regions_to_list, and thus a speedup for plot_depth_histogram functions. Added a useful little cores_available utility for use by other functions.
Copy link

@thekugelmeister thekugelmeister left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I stopped this review halfway through after spending an hour on the read depth plotting. It appears to me that there are 2 or 3 different things being combined into one PR, and I started losing the thread of what has been done here. In general, PRs should be constructed in more compartmentalized ways.

That being said, I covered a lot in the plotting sections, which is a start, at least.

Can you add comments to the PR describing the changes you made to testing, so that I have some context for what to review when I return to it?

dimelo/export.py Outdated
# note: the tqdm progress bar slows things down by about 33%, which was deemed better at the time of writing this than
# 90 seconds without any status updates
rows_count, last_row = list(
tail(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand part of this comment, but am concerned or confused about others.

  • hard agree that progress bars are good
  • tqdm is very performant. Usually, if adding a progress bar is making things slow, that means that something else is going wrong in the calculation of progress
  • fetching the last item in an iterator is supposed to be slow
  • unless I misunderstand something, constructing a deque just to get the last item seems wasteful?
  • TL;DR: naively, it seems like the call to tail should be what makes it slow? But I might need to put more thought into this.

dimelo/export.py Outdated
output_file_path = (
bigwig_file
if bigwig_file is not None
else Path(bedmethyl_file).parent / "pileup.fractions.bigwig"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Path(bedmethyl_file) is constructed multiple times throughout this method; should be declared once and referenced later.

dimelo/export.py Outdated
)
os.makedirs(output_file_path.parent, exist_ok=True)

# Because we need to set up the bigwig header for we start writing data to it, we need to pre-index the length of each contig

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "before"

window_size: half-size of the desired window to plot; how far the window stretches on either side of the center point
single_strand: True means we only grab counts from reads from the same strand as
the region of interest, False means we always grab both strands within the regions
average_within_region: if True, each region will only report a single depth value, averaging across all non-zero depths

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if it's false? It has to perform some sort of aggregation for each region, right?

**kwargs,
) -> Axes:
"""
Plot depth histograms, overlaying the resulting traces on top of each other.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there's nuance in this method that could require some more explanation.

If I understand this method correctly, I propose this expansion of the docstring introduction:

Plot a histogram of the read depths observed for each provided region.

If this is correct, I'm not sure I understand the options regarding averaging, and what value is actually reported for each region. I think that needs to be documented more clearly as well.

Also might need some clarification about the single-stranded-ness, and the difference between true read depth and what is reported here.

If this is all explained somewhere else already, a reference for where to look for details would suffice.

axes = utils.hist_plot(
value_vectors=depth_vectors,
value_names=sample_names,
x_label="per strand read\ndepth in region"

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is pythonic and sensible enough, but I found it weird to look at now that ruff has reformatted it. Can the x_label be assigned above so that it's easier to read?

**kwargs,
) -> Axes:
"""
Plot depth profiles, overlaying the resulting traces on top of each other.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a lot of the same commentary from the histogram methods applies to this file.

region=region, window_size=None
)
except ValueError as err:
raise ValueError(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this just tidying up error reporting, or is this solving a problem you encountered?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some of the test cases deliberately trip this and so I wanted to handle it in a way where they could check that the error message is right

dimelo/export.py Outdated
you can specify the regions at parsing time, rather than re-implementing the subset handling logic here.

Args:
bedmethyl_file: Path to the input bedmethyl file

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*tabix-indexed gzipped files

"title": "megalodon_peaks_190",
},
# outputs dict function:values
{}, # populated in subsequent cells

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change "cells" to "generate_targets.py"

regions_list: list[str | Path | list[str | Path]],
motifs: list[str],
sample_names: list[str],
window_size: int,

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oberon says you're wrong

@OberonDixon OberonDixon changed the title Feature/dm 90 read depth qc Feature/dm 215,dm 113,dm 90 genome browser exports, test phase 2, read depth qc Jan 14, 2025
dimelo/utils.py Outdated
Takes arbitrarily many counts vectors and plots on same histogram.

Args:
value_vectors: parallel with each entry in vectors

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace vectors with value_names?

dimelo/utils.py Outdated
x=x_label,
hue=y_label,
multiple="dodge",
**{**kwargs, **({"bins": bins} if bins is not None else {})},

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I understand what this is doing, but looking at it makes me cringe 😅 python is dumb

I think it's clearer if the bins entry in the dictionary is set in an if statement earlier. Assuming we're keeping the integer_values argument and the defaults, etc., here's how I might organize it for clarity:

if integer_values:
    # Warn user that passed bins are being overwritten
    if "bins" in kwargs:
        print("Warning: bin settings overwritten by defaults")
    kwargs["bins"] = np.arange(data_table[x_label].min() - 0.5, data_table[x_label].max() + 1.5, 1)

# plot histogram
ax = sns.histplot(
    data=data_table,
    x=x_label,
    hue=y_label,
    multiple="dodge",
    **kwargs,
)

Note that I'm also warning the user that we might be overwriting their manual settings; that's kind of unrelated but also came up while drafting this code snippet.

test_data_dir = Path("./data")
output_dir = test_data_dir / "test_targets"

region = "chr1:114357437-114359753" # 'chr1:9167177-9169177'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you document where this region came from, even if it's totally random or a historical anecdote? Also, what is the commented out region?


# Paths to input files
ctcf_bam_file = test_data_dir / "ctcf_demo.sorted.bam"
# ctcf_guppy_bam_file = test_data_dir / 'winnowmap_guppy_merge_subset.updated.bam'

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? If so, can you document why it's here?

outputs section of dimelo/test/generate_test_targets.ipynb.
"""

def test_unit__regions_list_list(

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't tell what this is testing from the method name. Can it be clarified, either in name or documentation?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, does this method test for values being correct? On first glance it does not appear to. Based on my rough understanding and the docstring for the class, I would have assumed it would do so.

Tests file export functionality in export module.

This test currently simply checks that we can make the appropriate output files without raising errors.
The values stored in the files are not verified.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason for this other than not really needing to right now?

@thekugelmeister thekugelmeister merged commit 821a2c2 into main Jan 29, 2025
3 of 4 checks passed
@thekugelmeister thekugelmeister deleted the feature/DM-90-read-depth-qc branch January 29, 2025 07:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants