feat: add haplotype.py shared utilities module #4
📝 Walkthrough: Adds a new Hail-based haplotype utilities module and accompanying tests, plus CI Java setup and an OpenJDK version constraint change.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: 2 passed, 1 warning.
Add HailPath type alias and shared helper functions used across multiple Hail-based pipeline tools: to_hashable_items, get_haplo_sequence, variant_distance, and split_haplotypes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 4
Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 39-41: Before indexing sorted_variants, add an explicit
empty-input guard for the input variable variants: check if not variants (or
len(variants) == 0) and either raise a clear exception (e.g., ValueError with a
descriptive message) or return a defined sentinel (e.g., None) depending on the
function's contract; implement this check before calling hl.sorted and only
compute min_variant and max_variant from sorted_variants when non-empty, and
update any callers/tests to expect the chosen behavior.
- Around line 11-125: Add unit tests for each new public function: for
to_hashable_items write a happy-path test converting a dict to a stable sorted
tuple and an error case for non-dict input or values that should still be
preserved; for get_haplo_sequence write a happy-path test that constructs a
small Hail variants array (multiple variants with known loci/alleles and GRCh38
reference) and asserts the produced sequence includes expected
flanking/context/allele concatenation, and an error case for empty/invalid
variants or negative context_size; for variant_distance add a happy-path test
covering simple SNPs and indels (e.g., distances that become zero due to
deletions) and an error case for invalid variant structs or same-locus inputs;
for split_haplotypes add a happy-path test that supplies a table with
multi-variant haplotypes and verifies splits/filters occur correctly given a
window_size and an error case for invalid window_size or malformed ht (e.g.,
missing fields); place tests in a new test_divref_haplotype.py using the project
test harness and include at least one positive and one negative case per
function to satisfy the guideline.
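As one concrete illustration, the `to_hashable_items` portion of such a test file might look like this (a sketch only: the function body is inlined here for self-containment, and the chosen error case assumes a non-dict input simply fails on the missing `.items()` attribute):

```python
from typing import Any


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    # Inlined from haplotype.py for illustration; normally imported from the module.
    return tuple(sorted(d.items()))


def test_to_hashable_items_sorts_keys() -> None:
    # Happy path: items come back sorted by key, independent of insertion order.
    assert to_hashable_items({"b": 2, "a": 1}) == (("a", 1), ("b", 2))
    assert to_hashable_items({}) == ()


def test_to_hashable_items_rejects_non_dict() -> None:
    # Error case: a list has no .items(), so the call fails with AttributeError.
    try:
        to_hashable_items(["not", "a", "dict"])  # type: ignore[arg-type]
    except AttributeError:
        return
    raise AssertionError("expected AttributeError for non-dict input")
```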
- Around line 106-108: The split threshold in the breakpoints computation uses
>= which splits when gap == window_size, but the docstring says to split only
for gaps larger than window_size; in the breakpoints expression (variable
breakpoints referencing ht.variants and variant_distance and window_size) change
the comparison from >= to > so segmentation occurs only for gaps strictly
greater than window_size; verify the lambda remains applied to the intended
indices (ht.variants) after this change.
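The boundary behaviour is easiest to see in plain Python (a sketch of the gap arithmetic only, not the Hail expression; `gap` is a hypothetical stand-in for `variant_distance`):

```python
def gap(pos1: int, ref1: str, pos2: int) -> int:
    # Reference bases between two variants, matching variant_distance's formula.
    return pos2 - pos1 - len(ref1)


window_size = 5
g = gap(10, "A", 16)  # a SNP at 10 and a variant at 16: 16 - 10 - 1 = 5 bases between

split_with_ge = g >= window_size  # True: also splits when the gap equals window_size
split_with_gt = g > window_size   # False: splits only for strictly larger gaps
```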
- Around line 11-21: to_hashable_items currently returns
tuple(sorted(d.items())) which fails when values are unhashable (nested
list/dict/set); update the to_hashable_items function to recursively convert
nested structures into immutable/hashable equivalents (e.g., lists -> tuples,
sets -> frozensets, dicts -> sorted tuples via a helper like _freeze(value) that
handles nested dicts by sorting keys and freezing values) and use that helper
when building the top-level items so the resulting tuple contains only hashable
members; optionally raise TypeError for unsupported types to tighten the
contract.
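A sketch of the suggested recursive helper (assuming list -> tuple, set -> frozenset, and dict -> sorted-tuple conversions; `_freeze` is the reviewer's suggested, hypothetical name):

```python
from typing import Any


def _freeze(value: Any) -> Any:
    """Recursively convert a value into a hashable equivalent."""
    if isinstance(value, dict):
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    if isinstance(value, set):
        return frozenset(_freeze(v) for v in value)
    return value  # assumed hashable already (str, int, float, ...)


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    return tuple(sorted((k, _freeze(v)) for k, v in d.items()))
```

With this in place, nested values no longer break hashing: `to_hashable_items({"a": [1, 2]})` yields `(("a", (1, 2)),)`, which is a valid dict key.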
📒 Files selected for processing (1)
divref/divref/haplotype.py (Outdated)
```python
import hail as hl  # module-header imports, not shown in the diff view
from typing import Any


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    """
    Convert a dictionary to a sorted tuple of items for use as a hashable key.

    Args:
        d: Dictionary to convert.

    Returns:
        Sorted tuple of (key, value) pairs.
    """
    return tuple(sorted(d.items()))


def get_haplo_sequence(context_size: int, variants: Any) -> Any:
    """
    Construct a haplotype sequence string with flanking genomic context.

    Builds a sequence by combining alternate alleles from each variant with
    intervening reference sequence, bounded by context_size flanking bases on
    each side.

    Args:
        context_size: Number of reference bases to include flanking each end.
        variants: Hail array expression of variant structs with locus and alleles fields.

    Returns:
        Hail string expression representing the full haplotype sequence.
    """
    sorted_variants = hl.sorted(variants, key=lambda x: x.locus.position)
    min_variant = sorted_variants[0]
    max_variant = sorted_variants[-1]
    min_pos = min_variant.locus.position
    max_pos = max_variant.locus.position
    max_variant_size = hl.len(max_variant.alleles[0])
    full_context = hl.get_sequence(
        min_variant.locus.contig,
        min_pos,
        before=context_size,
        after=(max_pos - min_pos + max_variant_size + context_size - 1),
        reference_genome="GRCh38",
    )

    # (min_pos - index_translation) equals context_size, mapping locus positions to string indices
    index_translation = min_pos - context_size

    def get_chunk_until_next_variant(i: Any) -> Any:
        v = sorted_variants[i]
        variant_size = hl.len(v.alleles[0])
        reference_buffer_size = hl.if_else(
            i == hl.len(sorted_variants) - 1,
            context_size,
            sorted_variants[i + 1].locus.position - (v.locus.position + variant_size),
        )
        start = v.locus.position - index_translation + variant_size
        return v.alleles[1] + full_context[start : start + reference_buffer_size]

    return full_context[:context_size] + hl.delimit(
        hl.range(hl.len(sorted_variants)).map(get_chunk_until_next_variant),
        "",
    )


def variant_distance(v1: Any, v2: Any) -> Any:
    """
    Calculate the number of reference bases between two variants.

    For example: 1:1:A:T and 1:3:A:T have distance 1 (one base between them).
    1:1:AA:T and 1:3:A:T have distance 0 (deletion closes the gap).

    Args:
        v1: First variant Hail struct with locus and alleles fields.
        v2: Second variant Hail struct with locus and alleles fields.

    Returns:
        Hail int32 expression for the number of reference bases between v1 and v2.
    """
    return v2.locus.position - v1.locus.position - hl.len(v1.alleles[0])


def split_haplotypes(ht: Any, window_size: int) -> Any:
    """
    Split multi-variant haplotypes at gaps larger than window_size bases.

    Haplotypes spanning variants further than window_size bases apart are broken
    into sub-haplotypes at those gaps. Sub-haplotypes with fewer than two variants
    are discarded.

    Args:
        ht: Hail table with variants, haplotype, and gnomad_freqs array fields.
        window_size: Maximum reference bases allowed between adjacent variants in
            a haplotype segment.

    Returns:
        Hail table with haplotypes exploded into sub-haplotypes by window.
    """
    breakpoints = hl.range(1, hl.len(ht.variants)).filter(
        lambda i: (i == 0) | (variant_distance(ht.variants[i - 1], ht.variants[i]) >= window_size)
    )

    def get_range(i: Any) -> Any:
        start_index = hl.if_else(i == 0, 0, breakpoints[i - 1])
        end_index = hl.if_else(i == hl.len(breakpoints), hl.len(ht.variants), breakpoints[i])
        return hl.range(start_index, end_index)

    split_hap_indices = (
        hl.range(0, hl.len(breakpoints) + 1).map(get_range).filter(lambda r: hl.len(r) > 1)
    )
    ht = ht.annotate(haplotype_indices=split_hap_indices)
    ht = ht.explode("haplotype_indices")
    ht = ht.annotate(
        haplotype=ht.haplotype_indices.map(lambda i: ht.haplotype[i]),
        variants=ht.haplotype_indices.map(lambda i: ht.variants[i]),
        gnomad_freqs=ht.haplotype_indices.map(lambda i: ht.gnomad_freqs[i]),
    )
    return ht.drop("haplotype_indices")
```
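The distance convention from the `variant_distance` docstring can be sanity-checked with a plain-Python equivalent (a sketch using dicts in place of Hail structs; `variant_distance_py` is a hypothetical mirror of the Hail expression):

```python
def variant_distance_py(v1: dict, v2: dict) -> int:
    # Position gap minus the length of v1's reference allele.
    return v2["pos"] - v1["pos"] - len(v1["alleles"][0])


# 1:1:A:T vs 1:3:A:T: one reference base (position 2) lies between them.
snp_pair = variant_distance_py(
    {"pos": 1, "alleles": ["A", "T"]}, {"pos": 3, "alleles": ["A", "T"]}
)

# 1:1:AA:T vs 1:3:A:T: the deletion's ref allele covers positions 1-2, closing the gap.
deletion_pair = variant_distance_py(
    {"pos": 1, "alleles": ["AA", "T"]}, {"pos": 3, "alleles": ["A", "T"]}
)
```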
Add required tests for all new public functions before merge.
This PR adds new public functions, but no happy-path + error-case tests are included in the provided changes.
As per coding guidelines, **/*.py: "New public functions require at least one happy-path test + one error case; bug fixes should include regression test".
Raises ValueError before any Hail call when variants is an empty list or tuple. Uses isinstance check to avoid calling bool() on Hail ArrayExpressions, which raise TypeError. Adds tests for both cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…plotypes

Covers the three non-reference-data functions in haplotype.py:
- to_hashable_items: empty dict, single entry, sort order
- variant_distance: adjacent SNPs (distance 0), SNPs with gap, deletion that closes a gap
- split_haplotypes: no split needed, split at large gap, singleton segment discarded

The variant_distance and split_haplotypes tests require a Hail context and request the session-scoped hail_context fixture from conftest.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/conftest.py with a session-scoped hail_context fixture that calls hl.init(quiet=True) once per test session and hl.stop() on exit. Removes the @pytest.mark.skip from test_get_haplo_sequence_edge_cases, which already mocks hl.get_sequence and only needed an active Hail context to run hl.eval(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds actions/setup-java (temurin JDK 11) before the uv test step so that hl.init() in the Hail session fixture can find a JVM. Without this, Hail tests would error on the GitHub Actions runner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 48-54: The current get_haplo_sequence check only rejects empty
Python lists/tuples but allows an empty Hail ArrayExpression to reach
sorted_variants[0]; add an explicit guard for Hail array expressions: detect if
variants is a Hail ArrayExpression (e.g. isinstance(variants,
hl.expr.ArrayExpression)), and in that branch either (a) raise a clear Python
exception (TypeError/ValueError) instructing callers to materialize to a Python
list before calling get_haplo_sequence, or (b) implement a Hail-deferred guard
using hl.len(variants) (or variants.size()) with hl.cond/hl.error to fail early;
update the code paths around sorted_variants, min_variant and max_variant
accordingly and/or document that Hail ArrayExpression inputs are unsupported.
- Around line 119-121: The filter predicate for computing breakpoints contains a
dead check `(i == 0)` because the range starts at 1; remove that vestigial
condition so the lambda only tests distance: update the breakpoints computation
(variable name breakpoints, lambda over ht.variants) to use `lambda i:
variant_distance(ht.variants[i - 1], ht.variants[i]) >= window_size` (keeping
window_size and variant_distance unchanged).
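The intended segmentation logic, with the dead check dropped and the strict `>` comparison the docstring implies, can be sketched in plain Python (dicts stand in for Hail structs; `split_indices` is a hypothetical name):

```python
def split_indices(variants: list[dict], window_size: int) -> list[list[int]]:
    """Split variant indices at gaps larger than window_size; drop singleton segments."""
    breakpoints = [
        i
        for i in range(1, len(variants))  # range starts at 1, so i == 0 can never occur
        if variants[i]["pos"] - variants[i - 1]["pos"] - len(variants[i - 1]["alleles"][0])
        > window_size
    ]
    bounds = [0, *breakpoints, len(variants)]
    segments = [list(range(bounds[j], bounds[j + 1])) for j in range(len(bounds) - 1)]
    return [seg for seg in segments if len(seg) > 1]
```

For example, SNPs at positions 1, 3, 100, 102 with window_size=10 split into index segments [0, 1] and [2, 3], while a lone variant left on one side of a gap is discarded.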
In `@divref/tests/test_haplotype.py`:
- Around line 15-30: Add a happy-path unit test for get_haplo_sequence that
constructs a small valid variants collection (e.g., a Hail variants
array/iterable with 2–3 variants including reference and alternate alleles),
calls get_haplo_sequence with a nonzero context_size (e.g., 1 or 2), and asserts
the returned haplotype sequence contains the expected left/right flanking bases
and concatenated alleles in the correct order; target the test function name
test_get_haplo_sequence_happy_path and reference get_haplo_sequence and
context_size to locate where to add it, and ensure the assertions check both the
flanking context length and that alleles are joined as specified by the function
contract.
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- .github/workflows/python_package.yml
- divref/divref/haplotype.py
- divref/tests/conftest.py
- divref/tests/test_haplotype.py
- pixi.toml
✅ Files skipped from review due to trivial changes (2)
- pixi.toml
- divref/tests/conftest.py
```python
import pytest  # file-header imports, not shown in the diff view

from divref.haplotype import get_haplo_sequence

# ---------------------------------------------------------------------------
# get_haplo_sequence
# ---------------------------------------------------------------------------


def test_get_haplo_sequence_empty_list_raises() -> None:
    """get_haplo_sequence should raise ValueError when given an empty list."""
    with pytest.raises(ValueError, match="at least one variant"):
        get_haplo_sequence(context_size=2, variants=[])


def test_get_haplo_sequence_empty_tuple_raises() -> None:
    """get_haplo_sequence should raise ValueError when given an empty tuple."""
    with pytest.raises(ValueError, match="at least one variant"):
        get_haplo_sequence(context_size=2, variants=())
```
Missing happy-path test for get_haplo_sequence.
Only error cases are tested. Per coding guidelines, new public functions require "at least one happy-path test + one error case." Consider adding a test that constructs a small Hail variants array and asserts the produced sequence includes expected flanking context and allele concatenation.
As per coding guidelines: "New public functions require at least one happy-path test + one error case."
Summary
Adds `divref/haplotype.py` with shared helpers used across multiple Hail-based pipeline tools:

- `HailPath` — type alias for paths accepted by Hail (str, accepting local, `gs://`, `hdfs://`)
- `to_hashable_items` — converts a Hail struct to a hashable tuple for use as a dict key
- `get_haplo_sequence` — builds haplotype sequence strings with flanking reference context
- `variant_distance` — computes the distance between two variants
- `split_haplotypes` — splits multi-variant haplotypes at gaps larger than `window_size` bases

Test plan

- `uv run --directory divref poe check-all` passes

🤖 Generated with Claude Code