feat: add haplotype.py shared utilities module #4
📝 Walkthrough: Adds a new Hail-based haplotype utilities module and accompanying tests, plus CI Java setup and an OpenJDK version constraint change.

Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks: 2 passed, 1 warning.
Add HailPath type alias and shared helper functions used across multiple Hail-based pipeline tools: to_hashable_items, get_haplo_sequence, variant_distance, and split_haplotypes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 4
Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 39-41: Before indexing sorted_variants, add an explicit
empty-input guard for the input variable variants: check if not variants (or
len(variants) == 0) and either raise a clear exception (e.g., ValueError with a
descriptive message) or return a defined sentinel (e.g., None) depending on the
function's contract; implement this check before calling hl.sorted and only
compute min_variant and max_variant from sorted_variants when non-empty, and
update any callers/tests to expect the chosen behavior.
- Around line 11-125: Add unit tests for each new public function: for
to_hashable_items write a happy-path test converting a dict to a stable sorted
tuple and an error case for non-dict input or values that should still be
preserved; for get_haplo_sequence write a happy-path test that constructs a
small Hail variants array (multiple variants with known loci/alleles and GRCh38
reference) and asserts the produced sequence includes expected
flanking/context/allele concatenation, and an error case for empty/invalid
variants or negative context_size; for variant_distance add a happy-path test
covering simple SNPs and indels (e.g., distances that become zero due to
deletions) and an error case for invalid variant structs or same-locus inputs;
for split_haplotypes add a happy-path test that supplies a table with
multi-variant haplotypes and verifies splits/filters occur correctly given a
window_size and an error case for invalid window_size or malformed ht (e.g.,
missing fields); place tests in a new test_divref_haplotype.py using the project
test harness and include at least one positive and one negative case per
function to satisfy the guideline.
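As one concrete illustration, the `to_hashable_items` portion of such a test file might look like this (a sketch only: the function body is inlined here for self-containment, and the chosen error case assumes a non-dict input simply fails on the missing `.items()` attribute):

```python
from typing import Any


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    # Inlined from haplotype.py for illustration; normally imported from the module.
    return tuple(sorted(d.items()))


def test_to_hashable_items_sorts_keys() -> None:
    # Happy path: items come back sorted by key, independent of insertion order.
    assert to_hashable_items({"b": 2, "a": 1}) == (("a", 1), ("b", 2))
    assert to_hashable_items({}) == ()


def test_to_hashable_items_rejects_non_dict() -> None:
    # Error case: a list has no .items(), so the call fails with AttributeError.
    try:
        to_hashable_items(["not", "a", "dict"])  # type: ignore[arg-type]
    except AttributeError:
        return
    raise AssertionError("expected AttributeError for non-dict input")
```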
- Around line 106-108: The split threshold in the breakpoints computation uses
>= which splits when gap == window_size, but the docstring says to split only
for gaps larger than window_size; in the breakpoints expression (variable
breakpoints referencing ht.variants and variant_distance and window_size) change
the comparison from >= to > so segmentation occurs only for gaps strictly
greater than window_size; verify the lambda remains applied to the intended
indices (ht.variants) after this change.
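The boundary behaviour is easiest to see in plain Python (a sketch of the gap arithmetic only, not the Hail expression; `gap` is a hypothetical stand-in for `variant_distance`):

```python
def gap(pos1: int, ref1: str, pos2: int) -> int:
    # Reference bases between two variants, matching variant_distance's formula.
    return pos2 - pos1 - len(ref1)


window_size = 5
g = gap(10, "A", 16)  # a SNP at 10 and a variant at 16: 16 - 10 - 1 = 5 bases between

split_with_ge = g >= window_size  # True: also splits when the gap equals window_size
split_with_gt = g > window_size   # False: splits only for strictly larger gaps
```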
- Around line 11-21: to_hashable_items currently returns
tuple(sorted(d.items())) which fails when values are unhashable (nested
list/dict/set); update the to_hashable_items function to recursively convert
nested structures into immutable/hashable equivalents (e.g., lists -> tuples,
sets -> frozensets, dicts -> sorted tuples via a helper like _freeze(value) that
handles nested dicts by sorting keys and freezing values) and use that helper
when building the top-level items so the resulting tuple contains only hashable
members; optionally raise TypeError for unsupported types to tighten the
contract.
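A sketch of the suggested recursive helper (assuming list -> tuple, set -> frozenset, and dict -> sorted-tuple conversions; `_freeze` is the reviewer's suggested, hypothetical name):

```python
from typing import Any


def _freeze(value: Any) -> Any:
    """Recursively convert a value into a hashable equivalent."""
    if isinstance(value, dict):
        return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
    if isinstance(value, (list, tuple)):
        return tuple(_freeze(v) for v in value)
    if isinstance(value, set):
        return frozenset(_freeze(v) for v in value)
    return value  # assumed hashable already (str, int, float, ...)


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    return tuple(sorted((k, _freeze(v)) for k, v in d.items()))
```

With this in place, nested values no longer break hashing: `to_hashable_items({"a": [1, 2]})` yields `(("a", (1, 2)),)`, which is a valid dict key.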
📒 Files selected for processing (1)
divref/divref/haplotype.py (Outdated)
```python
import hail as hl  # module-header imports, not shown in the diff view
from typing import Any


def to_hashable_items(d: dict[str, Any]) -> tuple[tuple[str, Any], ...]:
    """
    Convert a dictionary to a sorted tuple of items for use as a hashable key.

    Args:
        d: Dictionary to convert.

    Returns:
        Sorted tuple of (key, value) pairs.
    """
    return tuple(sorted(d.items()))


def get_haplo_sequence(context_size: int, variants: Any) -> Any:
    """
    Construct a haplotype sequence string with flanking genomic context.

    Builds a sequence by combining alternate alleles from each variant with
    intervening reference sequence, bounded by context_size flanking bases on
    each side.

    Args:
        context_size: Number of reference bases to include flanking each end.
        variants: Hail array expression of variant structs with locus and alleles fields.

    Returns:
        Hail string expression representing the full haplotype sequence.
    """
    sorted_variants = hl.sorted(variants, key=lambda x: x.locus.position)
    min_variant = sorted_variants[0]
    max_variant = sorted_variants[-1]
    min_pos = min_variant.locus.position
    max_pos = max_variant.locus.position
    max_variant_size = hl.len(max_variant.alleles[0])
    full_context = hl.get_sequence(
        min_variant.locus.contig,
        min_pos,
        before=context_size,
        after=(max_pos - min_pos + max_variant_size + context_size - 1),
        reference_genome="GRCh38",
    )

    # (min_pos - index_translation) equals context_size, mapping locus positions to string indices
    index_translation = min_pos - context_size

    def get_chunk_until_next_variant(i: Any) -> Any:
        v = sorted_variants[i]
        variant_size = hl.len(v.alleles[0])
        reference_buffer_size = hl.if_else(
            i == hl.len(sorted_variants) - 1,
            context_size,
            sorted_variants[i + 1].locus.position - (v.locus.position + variant_size),
        )
        start = v.locus.position - index_translation + variant_size
        return v.alleles[1] + full_context[start : start + reference_buffer_size]

    return full_context[:context_size] + hl.delimit(
        hl.range(hl.len(sorted_variants)).map(get_chunk_until_next_variant),
        "",
    )


def variant_distance(v1: Any, v2: Any) -> Any:
    """
    Calculate the number of reference bases between two variants.

    For example: 1:1:A:T and 1:3:A:T have distance 1 (one base between them).
    1:1:AA:T and 1:3:A:T have distance 0 (deletion closes the gap).

    Args:
        v1: First variant Hail struct with locus and alleles fields.
        v2: Second variant Hail struct with locus and alleles fields.

    Returns:
        Hail int32 expression for the number of reference bases between v1 and v2.
    """
    return v2.locus.position - v1.locus.position - hl.len(v1.alleles[0])


def split_haplotypes(ht: Any, window_size: int) -> Any:
    """
    Split multi-variant haplotypes at gaps larger than window_size bases.

    Haplotypes spanning variants further than window_size bases apart are broken
    into sub-haplotypes at those gaps. Sub-haplotypes with fewer than two variants
    are discarded.

    Args:
        ht: Hail table with variants, haplotype, and gnomad_freqs array fields.
        window_size: Maximum reference bases allowed between adjacent variants in
            a haplotype segment.

    Returns:
        Hail table with haplotypes exploded into sub-haplotypes by window.
    """
    breakpoints = hl.range(1, hl.len(ht.variants)).filter(
        lambda i: (i == 0) | (variant_distance(ht.variants[i - 1], ht.variants[i]) >= window_size)
    )

    def get_range(i: Any) -> Any:
        start_index = hl.if_else(i == 0, 0, breakpoints[i - 1])
        end_index = hl.if_else(i == hl.len(breakpoints), hl.len(ht.variants), breakpoints[i])
        return hl.range(start_index, end_index)

    split_hap_indices = (
        hl.range(0, hl.len(breakpoints) + 1).map(get_range).filter(lambda r: hl.len(r) > 1)
    )
    ht = ht.annotate(haplotype_indices=split_hap_indices)
    ht = ht.explode("haplotype_indices")
    ht = ht.annotate(
        haplotype=ht.haplotype_indices.map(lambda i: ht.haplotype[i]),
        variants=ht.haplotype_indices.map(lambda i: ht.variants[i]),
        gnomad_freqs=ht.haplotype_indices.map(lambda i: ht.gnomad_freqs[i]),
    )
    return ht.drop("haplotype_indices")
```
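The distance convention from the `variant_distance` docstring can be sanity-checked with a plain-Python equivalent (a sketch using dicts in place of Hail structs; `variant_distance_py` is a hypothetical mirror of the Hail expression):

```python
def variant_distance_py(v1: dict, v2: dict) -> int:
    # Position gap minus the length of v1's reference allele.
    return v2["pos"] - v1["pos"] - len(v1["alleles"][0])


# 1:1:A:T vs 1:3:A:T: one reference base (position 2) lies between them.
snp_pair = variant_distance_py(
    {"pos": 1, "alleles": ["A", "T"]}, {"pos": 3, "alleles": ["A", "T"]}
)

# 1:1:AA:T vs 1:3:A:T: the deletion's ref allele covers positions 1-2, closing the gap.
deletion_pair = variant_distance_py(
    {"pos": 1, "alleles": ["AA", "T"]}, {"pos": 3, "alleles": ["A", "T"]}
)
```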
Add required tests for all new public functions before merge.
This PR adds new public functions, but no happy-path + error-case tests are included in the provided changes.
As per coding guidelines, **/*.py: "New public functions require at least one happy-path test + one error case; bug fixes should include regression test".
Raises ValueError before any Hail call when variants is an empty list or tuple. Uses isinstance check to avoid calling bool() on Hail ArrayExpressions, which raise TypeError. Adds tests for both cases. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…plotypes

Covers the three non-reference-data functions in haplotype.py:
- to_hashable_items: empty dict, single entry, sort order
- variant_distance: adjacent SNPs (distance 0), SNPs with gap, deletion that closes a gap
- split_haplotypes: no split needed, split at large gap, singleton segment discarded

The variant_distance and split_haplotypes tests require a Hail context and request the session-scoped hail_context fixture from conftest.py. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds tests/conftest.py with a session-scoped hail_context fixture that calls hl.init(quiet=True) once per test session and hl.stop() on exit. Removes the @pytest.mark.skip from test_get_haplo_sequence_edge_cases, which already mocks hl.get_sequence and only needed an active Hail context to run hl.eval(). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Adds actions/setup-java (temurin JDK 11) before the uv test step so that hl.init() in the Hail session fixture can find a JVM. Without this, Hail tests would error on the GitHub Actions runner. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Actionable comments posted: 3
Inline comments:
In `@divref/divref/haplotype.py`:
- Around line 48-54: The current get_haplo_sequence check only rejects empty
Python lists/tuples but allows an empty Hail ArrayExpression to reach
sorted_variants[0]; add an explicit guard for Hail array expressions: detect if
variants is a Hail ArrayExpression (e.g. isinstance(variants,
hl.expr.ArrayExpression)), and in that branch either (a) raise a clear Python
exception (TypeError/ValueError) instructing callers to materialize to a Python
list before calling get_haplo_sequence, or (b) implement a Hail-deferred guard
using hl.len(variants) (or variants.size()) with hl.cond/hl.error to fail early;
update the code paths around sorted_variants, min_variant and max_variant
accordingly and/or document that Hail ArrayExpression inputs are unsupported.
- Around line 119-121: The filter predicate for computing breakpoints contains a
dead check `(i == 0)` because the range starts at 1; remove that vestigial
condition so the lambda only tests distance: update the breakpoints computation
(variable name breakpoints, lambda over ht.variants) to use `lambda i:
variant_distance(ht.variants[i - 1], ht.variants[i]) >= window_size` (keeping
window_size and variant_distance unchanged).
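The intended segmentation logic, with the dead check dropped and the strict `>` comparison the docstring implies, can be sketched in plain Python (dicts stand in for Hail structs; `split_indices` is a hypothetical name):

```python
def split_indices(variants: list[dict], window_size: int) -> list[list[int]]:
    """Split variant indices at gaps larger than window_size; drop singleton segments."""
    breakpoints = [
        i
        for i in range(1, len(variants))  # range starts at 1, so i == 0 can never occur
        if variants[i]["pos"] - variants[i - 1]["pos"] - len(variants[i - 1]["alleles"][0])
        > window_size
    ]
    bounds = [0, *breakpoints, len(variants)]
    segments = [list(range(bounds[j], bounds[j + 1])) for j in range(len(bounds) - 1)]
    return [seg for seg in segments if len(seg) > 1]
```

For example, SNPs at positions 1, 3, 100, 102 with window_size=10 split into index segments [0, 1] and [2, 3], while a lone variant left on one side of a gap is discarded.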
In `@divref/tests/test_haplotype.py`:
- Around line 15-30: Add a happy-path unit test for get_haplo_sequence that
constructs a small valid variants collection (e.g., a Hail variants
array/iterable with 2–3 variants including reference and alternate alleles),
calls get_haplo_sequence with a nonzero context_size (e.g., 1 or 2), and asserts
the returned haplotype sequence contains the expected left/right flanking bases
and concatenated alleles in the correct order; target the test function name
test_get_haplo_sequence_happy_path and reference get_haplo_sequence and
context_size to locate where to add it, and ensure the assertions check both the
flanking context length and that alleles are joined as specified by the function
contract.
⛔ Files ignored due to path filters (1)
`pixi.lock` is excluded by `!**/*.lock`
📒 Files selected for processing (5)
- .github/workflows/python_package.yml
- divref/divref/haplotype.py
- divref/tests/conftest.py
- divref/tests/test_haplotype.py
- pixi.toml
✅ Files skipped from review due to trivial changes (2)
- pixi.toml
- divref/tests/conftest.py
```python
import pytest  # file-header imports, not shown in the diff view

from divref.haplotype import get_haplo_sequence

# ---------------------------------------------------------------------------
# get_haplo_sequence
# ---------------------------------------------------------------------------


def test_get_haplo_sequence_empty_list_raises() -> None:
    """get_haplo_sequence should raise ValueError when given an empty list."""
    with pytest.raises(ValueError, match="at least one variant"):
        get_haplo_sequence(context_size=2, variants=[])


def test_get_haplo_sequence_empty_tuple_raises() -> None:
    """get_haplo_sequence should raise ValueError when given an empty tuple."""
    with pytest.raises(ValueError, match="at least one variant"):
        get_haplo_sequence(context_size=2, variants=())
```
Missing happy-path test for get_haplo_sequence.
Only error cases are tested. Per coding guidelines, new public functions require "at least one happy-path test + one error case." Consider adding a test that constructs a small Hail variants array and asserts the produced sequence includes expected flanking context and allele concatenation.
As per coding guidelines: "New public functions require at least one happy-path test + one error case."
Summary
Adds `divref/haplotype.py` with shared helpers used across multiple Hail-based pipeline tools:

- `HailPath` — type alias for paths accepted by Hail (str, accepting local, `gs://`, `hdfs://`)
- `to_hashable_items` — converts a Hail struct to a hashable tuple for use as a dict key
- `get_haplo_sequence` — builds haplotype sequence strings with flanking reference context
- `variant_distance` — computes the distance between two variants
- `split_haplotypes` — splits multi-variant haplotypes at gaps larger than `window_size` bases

Test plan

- `uv run --directory divref poe check-all` passes

🤖 Generated with Claude Code