
feat: add haplotype.py shared utilities module#4

Merged
ameynert merged 7 commits into main from am_03_haplotype_utils
Apr 13, 2026

Conversation


@ameynert ameynert commented Apr 10, 2026

Summary

  • Adds divref/haplotype.py with shared helpers used across multiple Hail-based pipeline tools:
    • HailPath — type alias for paths accepted by Hail (str, accepting local, gs://, hdfs://)
    • to_hashable_items — converts a Hail struct to a hashable tuple for use as a dict key
    • get_haplo_sequence — builds haplotype sequence strings with flanking reference context
    • variant_distance — computes the distance between two variants
    • split_haplotypes — splits multi-variant haplotypes at gaps larger than window_size bases
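The distance arithmetic behind variant_distance and split_haplotypes can be sketched with a plain-Python analogue. The real helpers operate on Hail struct expressions; the dict shape and function name here are illustrative only:

```python
def variant_distance_py(v1: dict, v2: dict) -> int:
    """Number of reference bases between two variants, where each
    variant is {"pos": 1-based position, "ref": reference allele}."""
    return v2["pos"] - v1["pos"] - len(v1["ref"])


# 1:1:A>T and 1:3:A>T: one reference base sits between them.
print(variant_distance_py({"pos": 1, "ref": "A"}, {"pos": 3, "ref": "A"}))  # 1
# 1:1:AA>T and 1:3:A>T: the two-base deletion closes the gap.
print(variant_distance_py({"pos": 1, "ref": "AA"}, {"pos": 3, "ref": "A"}))  # 0
```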

Test plan

  • `uv run --directory divref poe check-all` passes

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features
    • Haplotype processing utilities for reconstructing sequences with genomic flanking context and segmenting variants by spacing.
  • Bug Fixes / Improvements
    • Improved handling of variant distances and sequence assembly to yield consistent haplotype outputs.
  • Tests
    • Added comprehensive tests covering sequence reconstruction, hashing behavior, distance calculations, and window-based splitting.
  • Chores
    • CI: ensure Java runtime is configured for tests; restrict supported OpenJDK versions for Hail-related workflows.

@ameynert ameynert had a problem deploying to github-actions-snakemake-linting April 10, 2026 22:58 — with GitHub Actions Failure

coderabbitai bot commented Apr 10, 2026

📝 Walkthrough

Adds a new Hail-based haplotype utilities module and accompanying tests, plus CI Java setup and an OpenJDK version constraint change.

Changes

  • Haplotype processing utilities — divref/divref/haplotype.py
    New module introducing the HailPath alias and the functions to_hashable_items(), get_haplo_sequence(), variant_distance(), and split_haplotypes() for deterministic dict hashing, haplotype sequence construction with flanking context, variant-gap calculation, and segmentation by window size.
  • Tests / Hail fixture — divref/tests/conftest.py, divref/tests/test_haplotype.py
    Adds a session-scoped hail_context pytest fixture that initializes and tears down Hail, plus tests verifying empty-input behavior, hashing order, variant-distance scenarios, and haplotype splitting (including singleton-dropping behavior).
  • CI workflow — .github/workflows/python_package.yml
    Adds a step installing and configuring Temurin Java 11 (actions/setup-java) before the test/check commands run.
  • Feature constraints — pixi.toml
    Restricts the feature.hail.dependencies OpenJDK constraint from >=11 to >=11,<17.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

🐰 I stitched the bases, one by one,
Flanks and gaps till the job was done,
Keys sorted neat, segments split fine,
Hail sings softly — code in a line,
Hop, inspect, merge — carrot time! 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 35.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions that are missing them.

✅ Passed checks (2 passed)

  • Description Check — ✅ Passed: check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check — ✅ Passed: the title accurately describes the main change, adding a new haplotype.py module with shared utility functions for the DivRef pipeline.


@ameynert ameynert force-pushed the am_03_haplotype_utils branch from f98f6db to 5245c8e on April 10, 2026 at 23:16
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 10, 2026 23:16 — with GitHub Actions Inactive
Base automatically changed from am_02_claude_md to main April 13, 2026 18:18
Add HailPath type alias and shared helper functions used across
multiple Hail-based pipeline tools: to_hashable_items, get_haplo_sequence,
variant_distance, and split_haplotypes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ameynert ameynert force-pushed the am_03_haplotype_utils branch from 5245c8e to 06e1fab on April 13, 2026 at 18:18
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 13, 2026 18:18 — with GitHub Actions Inactive

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 39-41: Before indexing sorted_variants, add an explicit
empty-input guard for the input variable variants: check if not variants (or
len(variants) == 0) and either raise a clear exception (e.g., ValueError with a
descriptive message) or return a defined sentinel (e.g., None) depending on the
function's contract; implement this check before calling hl.sorted and only
compute min_variant and max_variant from sorted_variants when non-empty, and
update any callers/tests to expect the chosen behavior.
- Around line 11-125: Add unit tests for each new public function: for
to_hashable_items write a happy-path test converting a dict to a stable sorted
tuple and an error case for non-dict input or values that should still be
preserved; for get_haplo_sequence write a happy-path test that constructs a
small Hail variants array (multiple variants with known loci/alleles and GRCh38
reference) and asserts the produced sequence includes expected
flanking/context/allele concatenation, and an error case for empty/invalid
variants or negative context_size; for variant_distance add a happy-path test
covering simple SNPs and indels (e.g., distances that become zero due to
deletions) and an error case for invalid variant structs or same-locus inputs;
for split_haplotypes add a happy-path test that supplies a table with
multi-variant haplotypes and verifies splits/filters occur correctly given a
window_size and an error case for invalid window_size or malformed ht (e.g.,
missing fields); place tests in a new test_divref_haplotype.py using the project
test harness and include at least one positive and one negative case per
function to satisfy the guideline.
- Around line 106-108: The split threshold in the breakpoints computation uses
>= which splits when gap == window_size, but the docstring says to split only
for gaps larger than window_size; in the breakpoints expression (variable
breakpoints referencing ht.variants and variant_distance and window_size) change
the comparison from >= to > so segmentation occurs only for gaps strictly
greater than window_size; verify the lambda remains applied to the intended
indices (ht.variants) after this change.
- Around line 11-21: to_hashable_items currently returns
tuple(sorted(d.items())) which fails when values are unhashable (nested
list/dict/set); update the to_hashable_items function to recursively convert
nested structures into immutable/hashable equivalents (e.g., lists -> tuples,
sets -> frozensets, dicts -> sorted tuples via a helper like _freeze(value) that
handles nested dicts by sorting keys and freezing values) and use that helper
when building the top-level items so the resulting tuple contains only hashable
members; optionally raise TypeError for unsupported types to tighten the
contract.
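The recursive conversion this comment asks for could look roughly like the sketch below. The helper name `_freeze` is the reviewer's suggestion, and this is a sketch of the proposed fix rather than code from the PR:

```python
from typing import Any


def _freeze(value: Any) -> Any:
    """Recursively convert containers into hashable equivalents."""
    if isinstance(value, dict):
        # Sort keys so equal dicts always freeze to the same tuple.
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    if isinstance(value, (set, frozenset)):
        return frozenset(_freeze(v) for v in value)
    return value


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    """Convert a dictionary to a sorted tuple of hashable items."""
    return tuple(sorted((k, _freeze(v)) for k, v in d.items()))
```

With this in place, nested lists, sets, and dicts no longer break `hash()` on the resulting key.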
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c676053d-beb5-4891-86e6-cf394ca9db91

📥 Commits

Reviewing files that changed from the base of the PR and between 8d82aa3 and 06e1fab.

📒 Files selected for processing (1)
  • divref/divref/haplotype.py

Comment on lines +11 to +125
```python
from typing import Any

import hail as hl  # imports appear earlier in the file; shown here for context


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    """
    Convert a dictionary to a sorted tuple of items for use as a hashable key.

    Args:
        d: Dictionary to convert.

    Returns:
        Sorted tuple of (key, value) pairs.
    """
    return tuple(sorted(d.items()))


def get_haplo_sequence(context_size: int, variants: Any) -> Any:
    """
    Construct a haplotype sequence string with flanking genomic context.

    Builds a sequence by combining alternate alleles from each variant with
    intervening reference sequence, bounded by context_size flanking bases on
    each side.

    Args:
        context_size: Number of reference bases to include flanking each end.
        variants: Hail array expression of variant structs with locus and alleles fields.

    Returns:
        Hail string expression representing the full haplotype sequence.
    """
    sorted_variants = hl.sorted(variants, key=lambda x: x.locus.position)
    min_variant = sorted_variants[0]
    max_variant = sorted_variants[-1]
    min_pos = min_variant.locus.position
    max_pos = max_variant.locus.position
    max_variant_size = hl.len(max_variant.alleles[0])
    full_context = hl.get_sequence(
        min_variant.locus.contig,
        min_pos,
        before=context_size,
        after=(max_pos - min_pos + max_variant_size + context_size - 1),
        reference_genome="GRCh38",
    )

    # (min_pos - index_translation) equals context_size, mapping locus positions to string indices
    index_translation = min_pos - context_size

    def get_chunk_until_next_variant(i: Any) -> Any:
        v = sorted_variants[i]
        variant_size = hl.len(v.alleles[0])
        reference_buffer_size = hl.if_else(
            i == hl.len(sorted_variants) - 1,
            context_size,
            sorted_variants[i + 1].locus.position - (v.locus.position + variant_size),
        )
        start = v.locus.position - index_translation + variant_size
        return v.alleles[1] + full_context[start : start + reference_buffer_size]

    return full_context[:context_size] + hl.delimit(
        hl.range(hl.len(sorted_variants)).map(get_chunk_until_next_variant),
        "",
    )


def variant_distance(v1: Any, v2: Any) -> Any:
    """
    Calculate the number of reference bases between two variants.

    For example: 1:1:A:T and 1:3:A:T have distance 1 (one base between them).
    1:1:AA:T and 1:3:A:T have distance 0 (deletion closes the gap).

    Args:
        v1: First variant Hail struct with locus and alleles fields.
        v2: Second variant Hail struct with locus and alleles fields.

    Returns:
        Hail int32 expression for the number of reference bases between v1 and v2.
    """
    return v2.locus.position - v1.locus.position - hl.len(v1.alleles[0])


def split_haplotypes(ht: Any, window_size: int) -> Any:
    """
    Split multi-variant haplotypes at gaps larger than window_size bases.

    Haplotypes spanning variants further than window_size bases apart are broken
    into sub-haplotypes at those gaps. Sub-haplotypes with fewer than two variants
    are discarded.

    Args:
        ht: Hail table with variants, haplotype, and gnomad_freqs array fields.
        window_size: Maximum reference bases allowed between adjacent variants in
            a haplotype segment.

    Returns:
        Hail table with haplotypes exploded into sub-haplotypes by window.
    """
    breakpoints = hl.range(1, hl.len(ht.variants)).filter(
        lambda i: (i == 0) | (variant_distance(ht.variants[i - 1], ht.variants[i]) >= window_size)
    )

    def get_range(i: Any) -> Any:
        start_index = hl.if_else(i == 0, 0, breakpoints[i - 1])
        end_index = hl.if_else(i == hl.len(breakpoints), hl.len(ht.variants), breakpoints[i])
        return hl.range(start_index, end_index)

    split_hap_indices = (
        hl.range(0, hl.len(breakpoints) + 1).map(get_range).filter(lambda r: hl.len(r) > 1)
    )
    ht = ht.annotate(haplotype_indices=split_hap_indices)
    ht = ht.explode("haplotype_indices")
    ht = ht.annotate(
        haplotype=ht.haplotype_indices.map(lambda i: ht.haplotype[i]),
        variants=ht.haplotype_indices.map(lambda i: ht.variants[i]),
        gnomad_freqs=ht.haplotype_indices.map(lambda i: ht.gnomad_freqs[i]),
    )
    return ht.drop("haplotype_indices")
```
⚠️ Potential issue | 🟠 Major

Add required tests for all new public functions before merge.

This PR adds new public functions, but no happy-path + error-case tests are included in the provided changes.

As per coding guidelines, **/*.py: "New public functions require at least one happy-path test + one error case; bug fixes should include regression test".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/divref/haplotype.py` around lines 11 - 125, Add unit tests for each
new public function: for to_hashable_items write a happy-path test converting a
dict to a stable sorted tuple and an error case for non-dict input or values
that should still be preserved; for get_haplo_sequence write a happy-path test
that constructs a small Hail variants array (multiple variants with known
loci/alleles and GRCh38 reference) and asserts the produced sequence includes
expected flanking/context/allele concatenation, and an error case for
empty/invalid variants or negative context_size; for variant_distance add a
happy-path test covering simple SNPs and indels (e.g., distances that become
zero due to deletions) and an error case for invalid variant structs or
same-locus inputs; for split_haplotypes add a happy-path test that supplies a
table with multi-variant haplotypes and verifies splits/filters occur correctly
given a window_size and an error case for invalid window_size or malformed ht
(e.g., missing fields); place tests in a new test_divref_haplotype.py using the
project test harness and include at least one positive and one negative case per
function to satisfy the guideline.

ameynert and others added 6 commits April 13, 2026 14:03
Raises ValueError before any Hail call when variants is an empty list
or tuple. Uses isinstance check to avoid calling bool() on Hail
ArrayExpressions, which raise TypeError. Adds tests for both cases.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
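A minimal sketch of the guard this commit describes, as a standalone function for illustration (the real check sits inside get_haplo_sequence, and this helper name is hypothetical):

```python
def check_variants_nonempty(variants) -> None:
    """Raise early for empty Python sequences.

    The isinstance check matters: calling bool() on a Hail
    ArrayExpression raises TypeError, so only plain lists and
    tuples are tested for emptiness here; other inputs pass through.
    """
    if isinstance(variants, (list, tuple)) and len(variants) == 0:
        raise ValueError("variants must contain at least one variant")
```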
…plotypes

Covers the three non-reference-data functions in haplotype.py:
- to_hashable_items: empty dict, single entry, sort order
- variant_distance: adjacent SNPs (distance 0), SNPs with gap,
  deletion that closes a gap
- split_haplotypes: no split needed, split at large gap, singleton
  segment discarded

The variant_distance and split_haplotypes tests require a Hail context
and request the session-scoped hail_context fixture from conftest.py.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/conftest.py with a session-scoped hail_context fixture that
calls hl.init(quiet=True) once per test session and hl.stop() on exit.
Removes the @pytest.mark.skip from test_get_haplo_sequence_edge_cases,
which already mocks hl.get_sequence and only needed an active Hail
context to run hl.eval().

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds actions/setup-java (temurin JDK 11) before the uv test step so
that hl.init() in the Hail session fixture can find a JVM. Without
this, Hail tests would error on the GitHub Actions runner.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ameynert ameynert temporarily deployed to github-actions-snakemake-linting April 13, 2026 21:57 — with GitHub Actions Inactive

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 48-54: The current get_haplo_sequence check only rejects empty
Python lists/tuples but allows an empty Hail ArrayExpression to reach
sorted_variants[0]; add an explicit guard for Hail array expressions: detect if
variants is a Hail ArrayExpression (e.g. isinstance(variants,
hl.expr.ArrayExpression)), and in that branch either (a) raise a clear Python
exception (TypeError/ValueError) instructing callers to materialize to a Python
list before calling get_haplo_sequence, or (b) implement a Hail-deferred guard
using hl.len(variants) (or variants.size()) with hl.cond/hl.error to fail early;
update the code paths around sorted_variants, min_variant and max_variant
accordingly and/or document that Hail ArrayExpression inputs are unsupported.
- Around line 119-121: The filter predicate for computing breakpoints contains a
dead check `(i == 0)` because the range starts at 1; remove that vestigial
condition so the lambda only tests distance: update the breakpoints computation
(variable name breakpoints, lambda over ht.variants) to use `lambda i:
variant_distance(ht.variants[i - 1], ht.variants[i]) >= window_size` (keeping
window_size and variant_distance unchanged).

In `@divref/tests/test_haplotype.py`:
- Around line 15-30: Add a happy-path unit test for get_haplo_sequence that
constructs a small valid variants collection (e.g., a Hail variants
array/iterable with 2–3 variants including reference and alternate alleles),
calls get_haplo_sequence with a nonzero context_size (e.g., 1 or 2), and asserts
the returned haplotype sequence contains the expected left/right flanking bases
and concatenated alleles in the correct order; target the test function name
test_get_haplo_sequence_happy_path and reference get_haplo_sequence and
context_size to locate where to add it, and ensure the assertions check both the
flanking context length and that alleles are joined as specified by the function
contract.
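Combining this with the earlier review's strict-comparison fix, the breakpoint computation reduces to the plain-Python sketch below; `gaps` holds precomputed distances between adjacent variants, and this mirrors, but is not, the Hail expression:

```python
def breakpoints_py(gaps: list[int], window_size: int) -> list[int]:
    """Indices i (over variants, starting at 1) where a split occurs.

    gaps[i - 1] is the distance between variants i-1 and i. The range
    starts at 1, so no vestigial i == 0 check is needed, and the strict
    > matches the 'larger than window_size' wording in the docstring.
    """
    return [i for i in range(1, len(gaps) + 1) if gaps[i - 1] > window_size]


# A gap exactly equal to window_size no longer triggers a split.
print(breakpoints_py([5, 10, 3], window_size=10))  # []
print(breakpoints_py([5, 11, 3], window_size=10))  # [2]
```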
ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6a7a91e6-278d-450c-9256-e688229efc7d

📥 Commits

Reviewing files that changed from the base of the PR and between 06e1fab and 5887480.

⛔ Files ignored due to path filters (1)
  • pixi.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • .github/workflows/python_package.yml
  • divref/divref/haplotype.py
  • divref/tests/conftest.py
  • divref/tests/test_haplotype.py
  • pixi.toml
✅ Files skipped from review due to trivial changes (2)
  • pixi.toml
  • divref/tests/conftest.py

Comment on lines +15 to +30
```python
import pytest  # imports appear earlier in the file; shown here for context

from divref.haplotype import get_haplo_sequence  # import path assumed from the package layout

# ---------------------------------------------------------------------------
# get_haplo_sequence
# ---------------------------------------------------------------------------


def test_get_haplo_sequence_empty_list_raises() -> None:
    """get_haplo_sequence should raise ValueError when given an empty list."""
    with pytest.raises(ValueError, match="at least one variant"):
        get_haplo_sequence(context_size=2, variants=[])


def test_get_haplo_sequence_empty_tuple_raises() -> None:
    """get_haplo_sequence should raise ValueError when given an empty tuple."""
    with pytest.raises(ValueError, match="at least one variant"):
        get_haplo_sequence(context_size=2, variants=())
```

⚠️ Potential issue | 🟡 Minor

Missing happy-path test for get_haplo_sequence.

Only error cases are tested. Per coding guidelines, new public functions require "at least one happy-path test + one error case." Consider adding a test that constructs a small Hail variants array and asserts the produced sequence includes expected flanking context and allele concatenation.

As per coding guidelines: "New public functions require at least one happy-path test + one error case."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@divref/tests/test_haplotype.py` around lines 15 - 30, Add a happy-path unit
test for get_haplo_sequence that constructs a small valid variants collection
(e.g., a Hail variants array/iterable with 2–3 variants including reference and
alternate alleles), calls get_haplo_sequence with a nonzero context_size (e.g.,
1 or 2), and asserts the returned haplotype sequence contains the expected
left/right flanking bases and concatenated alleles in the correct order; target
the test function name test_get_haplo_sequence_happy_path and reference
get_haplo_sequence and context_size to locate where to add it, and ensure the
assertions check both the flanking context length and that alleles are joined as
specified by the function contract.
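One way to derive the expected string for such a happy-path test is a pure-Python reference model of the assembly. This mirrors, but is not, the Hail implementation; positions are 1-based and the dict shape is illustrative:

```python
def assemble_haplo_py(reference: str, variants: list[dict], context_size: int) -> str:
    """Left flank of context_size bases, then each alt allele followed by
    reference bases up to the next variant, ending with a right flank."""
    variants = sorted(variants, key=lambda v: v["pos"])
    start = variants[0]["pos"] - 1 - context_size
    out = reference[max(start, 0) : variants[0]["pos"] - 1]
    for i, v in enumerate(variants):
        out += v["alt"]
        ref_end = v["pos"] - 1 + len(v["ref"])  # 0-based index just past the ref allele
        if i + 1 < len(variants):
            out += reference[ref_end : variants[i + 1]["pos"] - 1]
        else:
            out += reference[ref_end : ref_end + context_size]
    return out


# G at position 3 -> T, C at position 6 -> A, two flanking bases each side.
print(assemble_haplo_py("ACGTACGTAC",
                        [{"pos": 3, "ref": "G", "alt": "T"},
                         {"pos": 6, "ref": "C", "alt": "A"}],
                        context_size=2))  # ACTTAAGT
```

A happy-path test could then assert the Hail function's `hl.eval` output against this model's result for the same variants.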

@ameynert ameynert merged commit 403ece8 into main Apr 13, 2026
4 checks passed
@ameynert ameynert deleted the am_03_haplotype_utils branch April 13, 2026 23:02