Implement bedtools correctness test suite#75

Merged
conradbzura merged 6 commits into main from 74-bedtools-integration-tests
Mar 25, 2026

@conradbzura commented Mar 25, 2026

Summary

Add an integration test suite that validates GIQL operator correctness against bedtools (the de facto standard for genomic interval operations) and known expected results. Each test generates controlled genomic interval datasets, executes the equivalent operation in both GIQL (transpiled to SQL, executed via DuckDB) and the reference implementation, then compares results. Tests skip gracefully when bedtools or Python dependencies are not installed.
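The generate / execute / compare flow described in the summary can be sketched without the real dependencies. In this minimal sketch, `sqlite3` stands in for DuckDB, a hand-written SQL join stands in for whatever the GIQL transpiler emits, and a pure-Python function stands in for bedtools. The column names (`chrom`, `start`, `end`) and every helper name are assumptions made for illustration, not the suite's actual code.

```python
import sqlite3

# Hypothetical fixture data: (chrom, start, end), 0-based half-open.
A = [("chr1", 100, 200), ("chr1", 300, 400), ("chr2", 100, 200)]
B = [("chr1", 150, 250), ("chr1", 400, 500), ("chr2", 500, 600)]


def load_intervals(conn, table, rows):
    """Create and fill a table; `table` must be a trusted identifier."""
    conn.execute(
        f'CREATE TABLE {table} (chrom TEXT, "start" INTEGER, "end" INTEGER)'
    )
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", rows)


def sql_intersect(conn):
    """Stand-in for a transpiled `a.interval INTERSECTS b.interval` query."""
    query = """
        SELECT DISTINCT a.chrom, a."start", a."end"
        FROM a JOIN b
          ON a.chrom = b.chrom
         AND a."start" < b."end"
         AND b."start" < a."end"
    """
    return conn.execute(query).fetchall()


def reference_intersect(a_rows, b_rows):
    """Stand-in for the reference tool: A rows that overlap any B row."""
    return [
        a for a in a_rows
        if any(a[0] == b[0] and a[1] < b[2] and b[1] < a[2] for b in b_rows)
    ]


conn = sqlite3.connect(":memory:")
load_intervals(conn, "a", A)
load_intervals(conn, "b", B)
giql_rows = sql_intersect(conn)
ref_rows = reference_intersect(A, B)

# Order-independent comparison: sort both sides before comparing.
assert sorted(giql_rows) == sorted(ref_rows)
```

The second interval in `A` touches the second interval in `B` at position 400 but does not overlap it, so only the first `A` row survives on both sides.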

Closes #74

Proposed changes

Test infrastructure (tests/integration/bedtools/utils/)

  • data_models.py: GenomicInterval dataclass with validation and to_tuple(), ComparisonResult with failure_message() for readable assertion output
  • bedtools_wrapper.py: Thin wrappers around pybedtools for intersect, merge, and closest with strand mode support and bedtool_to_tuples() converter
  • comparison.py: Order-independent row comparison with epsilon tolerance for floats and deterministic None handling
  • duckdb_loader.py: load_intervals() helper to create DuckDB tables with GIQL default column names
  • conftest.py: duckdb_connection fixture (function-scoped), giql_query fixture (load + transpile + execute in one call), pytest.importorskip guards for duckdb/pybedtools, shutil.which guard for the bedtools binary, pytestmark = pytest.mark.integration

Operator test files

  • test_intersect.py — 9 tests: basic overlap, partial, no overlap, adjacent (half-open boundary), multi-chromosome, literal range, literal cross-chromosome, INTERSECTS ANY, INTERSECTS ALL
  • test_contains.py — 5 tests: point containment, range containment, column-to-column, cross-chromosome, CONTAINS ALL
  • test_within.py — 4 tests: basic, narrow range, column-to-column, exact boundary
  • test_merge.py — 6 tests: adjacent, overlapping, separated, multi-chromosome, distance parameter, stranded
  • test_cluster.py — 5 tests: basic (cross-validated against bedtools merge count), separated, multi-chromosome, stranded, distance parameter
  • test_nearest.py — 8 tests: non-overlapping, equidistant candidates, cross-chromosome, boundary (adjacent), k > 1, k > available, max_distance filter, standalone literal reference
  • test_distance.py — 9 tests: non-overlapping, overlapping, adjacent, cross-chromosome (NULL), signed downstream/upstream, stranded with unstranded input (NULL), stranded same-strand, signed + stranded minus-strand (sign flip)
  • test_strand_aware.py — 8 tests: INTERSECTS same/opposite/ignore/mixed strand, NEAREST same-strand, NEAREST opposite-strand (bedtools reference), NEAREST ignore-strand, MERGE strand-specific
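The boundary cases in the list above (adjacent intervals, exact-boundary WITHIN, ANY/ALL set predicates) follow from half-open interval arithmetic. A minimal pure-Python sketch of those predicates, assuming 0-based half-open coordinates; the `Interval` type and function names are hypothetical and are not GIQL's implementation:

```python
from typing import NamedTuple


class Interval(NamedTuple):
    chrom: str
    start: int  # inclusive
    end: int    # exclusive


def intersects(a, b):
    """True when a and b share at least one base on the same chromosome."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def contains(a, b):
    """True when b lies entirely inside a."""
    return a.chrom == b.chrom and a.start <= b.start and b.end <= a.end


def within(a, b):
    """True when a lies entirely inside b."""
    return contains(b, a)


def intersects_any(a, ranges):
    return any(intersects(a, r) for r in ranges)


def intersects_all(a, ranges):
    return all(intersects(a, r) for r in ranges)


left = Interval("chr1", 100, 200)
right = Interval("chr1", 200, 300)
wide = Interval("chr1", 50, 400)

adjacent_overlap = intersects(left, right)       # touching, not overlapping
exact_within = within(left, Interval("chr1", 100, 200))
wide_hits_both = intersects_all(wide, [left, right])
left_hits_either = intersects_any(left, [right, wide])
```

Because the end coordinate is exclusive, `left` and `right` share no base even though they touch at 200, which is why the adjacent case expects an empty result.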

Test cases

| # | Test Suite | Given | When | Then | Coverage Target |
|---|---|---|---|---|---|
| 1 | test_intersect | Two tables with overlapping intervals | a.interval INTERSECTS b.interval | Results match bedtools intersect -u | Basic overlap |
| 2 | test_intersect | Intervals with partial overlaps | INTERSECTS query executed | Partially overlapping intervals returned | Partial overlap |
| 3 | test_intersect | Non-overlapping intervals | INTERSECTS query executed | Empty result | No overlap |
| 4 | test_intersect | Adjacent intervals (half-open touching) | INTERSECTS query executed | Empty result (adjacent != overlapping) | Boundary: adjacent |
| 5 | test_intersect | Intervals on chr1 and chr2 | INTERSECTS query executed | Only same-chromosome matches | Multi-chromosome |
| 6 | test_intersect | Table with intervals on chr1 | interval INTERSECTS 'chr1:150-220' | Only overlapping intervals returned | Literal range |
| 7 | test_intersect | Table with intervals on chr1 and chr2 | interval INTERSECTS 'chr2:150-250' | Only chr2 overlapping intervals returned | Literal cross-chrom |
| 8 | test_intersect | Table with intervals across chroms | interval INTERSECTS ANY(...) | Intervals overlapping any range returned | Set predicate ANY |
| 9 | test_intersect | Table with intervals of varying sizes | interval INTERSECTS ALL(...) | Only intervals overlapping both ranges returned | Set predicate ALL |
| 10 | test_contains | Table with intervals of varying sizes | interval CONTAINS 'chr1:150' | Only intervals containing point 150 returned | Point containment |
| 11 | test_contains | Table with intervals of varying sizes | interval CONTAINS 'chr1:150-250' | Only intervals fully containing range returned | Range containment |
| 12 | test_contains | Two tables with intervals | a.interval CONTAINS b.interval | Only pairs where A fully contains B returned | Column-to-column |
| 13 | test_contains | Table with intervals on multiple chroms | interval CONTAINS 'chr1:150' | Only chr1 intervals considered | Cross-chromosome |
| 14 | test_contains | Table with intervals of varying sizes | interval CONTAINS ALL(...) | Only intervals containing all points returned | Set predicate ALL |
| 15 | test_within | Table with intervals of varying sizes | interval WITHIN 'chr1:100-300' | Only intervals fully within range returned | Basic containment |
| 16 | test_within | Table with intervals of varying sizes | interval WITHIN 'chr1:150-160' | Only intervals small enough to fit returned | Narrow range |
| 17 | test_within | Two tables with intervals | a.interval WITHIN b.interval | Only pairs where A is within B returned | Column-to-column |
| 18 | test_within | Interval matching query range exactly | interval WITHIN 'chr1:100-200' | Exact-match interval returned | Boundary: exact |
| 19 | test_merge | Adjacent intervals (bookended, half-open) | MERGE(interval) | Single merged interval | Adjacent merge |
| 20 | test_merge | Overlapping intervals | MERGE(interval) | Merged into minimal covering interval | Overlapping merge |
| 21 | test_merge | Intervals with gaps | MERGE(interval) | Each interval remains separate | No merge |
| 22 | test_merge | Intervals on chr1 and chr2 | MERGE(interval) | Per-chromosome merging | Multi-chromosome |
| 23 | test_merge | Intervals with 50bp and 150bp gaps | MERGE(interval, 100) | Gaps <= 100bp bridged, > 100bp separate | Distance parameter |
| 24 | test_merge | Overlapping intervals on +/- strands | MERGE(interval, stranded := true) | Per-strand merging matches bedtools | Stranded |
| 25 | test_cluster | Two overlapping groups | CLUSTER(interval) | Shared cluster IDs; count matches merge | Basic clustering |
| 26 | test_cluster | Non-overlapping intervals | CLUSTER(interval) | Each interval gets separate cluster_id | Separated |
| 27 | test_cluster | Intervals on chr1 and chr2 | CLUSTER(interval) | Per-chromosome clustering | Multi-chromosome |
| 28 | test_cluster | Overlapping intervals on +/- strands | CLUSTER(interval, stranded := true) | Per-strand clustering | Stranded |
| 29 | test_cluster | Intervals with 50bp and 150bp gaps | CLUSTER(interval, 100) | Gaps <= 100bp same cluster, > 100bp separate | Distance parameter |
| 30 | test_nearest | Non-overlapping intervals | NEAREST(b, reference := a.interval, k := 1) | Correct nearest neighbor found | Basic nearest |
| 31 | test_nearest | Equidistant candidates | NEAREST(b, reference := a.interval, k := 1) | One of the equidistant returned | Tie-breaking |
| 32 | test_nearest | Intervals on chr1 and chr2 | NEAREST(b, reference := a.interval, k := 1) | Same-chromosome only | Cross-chromosome |
| 33 | test_nearest | Adjacent intervals | NEAREST(b, reference := a.interval, k := 1) | Adjacent interval is nearest | Boundary: adjacent |
| 34 | test_nearest | One query, three database intervals | NEAREST(b, reference := a.interval, k := 3) | All 3 neighbors returned | k > 1 |
| 35 | test_nearest | One query, two database intervals | NEAREST(b, reference := a.interval, k := 5) | Only 2 rows returned | k > available |
| 36 | test_nearest | One near and one far interval | NEAREST(b, ..., max_distance := 50) | Only near interval returned | max_distance filter |
| 37 | test_nearest | Table with intervals | NEAREST(t, reference := 'chr1:350-360', k := 2) | 2 nearest to literal returned | Standalone mode |
| 38 | test_distance | Non-overlapping with 100bp gap | DISTANCE(a.interval, b.interval) | Returns 100 (half-open) | Non-overlapping |
| 39 | test_distance | Overlapping intervals | DISTANCE(a.interval, b.interval) | Returns 0 | Overlapping |
| 40 | test_distance | Adjacent intervals | DISTANCE(a.interval, b.interval) | Returns 0 (half-open) | Adjacent |
| 41 | test_distance | Different chromosomes | DISTANCE(a.interval, b.interval) | Returns NULL | Cross-chromosome |
| 42 | test_distance | B downstream of A on + strand | DISTANCE(..., signed := true) | Returns positive | Signed downstream |
| 43 | test_distance | B upstream of A on + strand | DISTANCE(..., signed := true) | Returns negative | Signed upstream |
| 44 | test_distance | One interval with strand "." | DISTANCE(..., stranded := true) | Returns NULL | Stranded: unstranded |
| 45 | test_distance | Both intervals on + strand | DISTANCE(..., stranded := true) | Returns 100 normally | Stranded: same strand |
| 46 | test_distance | Both intervals on - strand | DISTANCE(..., signed+stranded := true) | Returns negative (sign flip) | Signed + stranded |
| 47 | test_strand_aware | Intervals on + and - strands | INTERSECTS + a.strand = b.strand | Only same-strand overlaps | Strand: same |
| 48 | test_strand_aware | Intervals on + and - strands | INTERSECTS + a.strand != b.strand | Only opposite-strand overlaps | Strand: opposite |
| 49 | test_strand_aware | Intervals with various strands | INTERSECTS without strand filter | All overlaps regardless of strand | Strand: ignore |
| 50 | test_strand_aware | +, -, and "." strands | INTERSECTS same-strand filter | Handles unstranded correctly | Strand: mixed |
| 51 | test_strand_aware | Same and opposite strand candidates | NEAREST(..., stranded := true) | Only same-strand nearest | NEAREST strand: same |
| 52 | test_strand_aware | Same and opposite strand candidates | bedtools closest -S | Opposite-strand nearest (reference) | NEAREST strand: opposite |
| 53 | test_strand_aware | Intervals on different strands | NEAREST(...) unstranded | Closest regardless of strand | NEAREST strand: ignore |
| 54 | test_strand_aware | Overlapping intervals on +/- strands | MERGE(interval, stranded := true) | 2 merged intervals (one per strand) | MERGE strand-specific |
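The DISTANCE expectations in rows 38 to 46 can be read as a small reference function. The sketch below is one interpretation of those rows, not GIQL's implementation: the function name, the tuple layout, and the handling of cases the table does not list (for example stranded mode with opposite strands) are assumptions.

```python
def distance(a, b, signed=False, stranded=False):
    """Gap in bases between two half-open intervals.

    a, b: (chrom, start, end, strand) tuples with 0-based half-open
    coordinates. Returns None where the table expects NULL.
    """
    a_chrom, a_start, a_end, a_strand = a
    b_chrom, b_start, b_end, b_strand = b

    if a_chrom != b_chrom:
        return None  # row 41: different chromosomes
    if stranded and "." in (a_strand, b_strand):
        return None  # row 44: unstranded input in stranded mode

    if b_start >= a_end:
        gap = b_start - a_end      # B lies downstream of A in coordinates
    elif a_start >= b_end:
        gap = -(a_start - b_end)   # B lies upstream of A in coordinates
    else:
        gap = 0                    # overlapping intervals

    if not signed:
        return abs(gap)
    if stranded and a_strand == "-":
        gap = -gap                 # row 46: flip the sign on the minus strand
    return gap


a_plus = ("chr1", 100, 200, "+")
gap_100 = distance(a_plus, ("chr1", 300, 400, "+"))
adjacent = distance(a_plus, ("chr1", 200, 300, "+"))
upstream = distance(("chr1", 300, 400, "+"), a_plus, signed=True)
minus_flip = distance(
    ("chr1", 100, 200, "-"), ("chr1", 300, 400, "-"),
    signed=True, stranded=True,
)
```

With half-open coordinates the gap between [100, 200) and [300, 400) is 300 - 200 = 100, and bookended intervals have a gap of 0, matching rows 38 and 40.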

Implementation plan

    • Create test directory structure (tests/integration/bedtools/ with utils/ subpackage)
    • Implement data models: GenomicInterval, ComparisonResult in utils/data_models.py
    • Implement pybedtools wrappers: intersect, merge, closest in utils/bedtools_wrapper.py
    • Implement result comparison with epsilon tolerance in utils/comparison.py
    • Implement DuckDB table loader in utils/duckdb_loader.py
    • Implement conftest with duckdb_connection, giql_query fixtures, skip guards, integration marker
    • Implement INTERSECTS tests (basic + literal range + ANY/ALL set predicates)
    • Implement CONTAINS tests (point, range, column-to-column, cross-chrom, ALL)
    • Implement WITHIN tests (basic, narrow range, column-to-column, exact boundary)
    • Implement MERGE tests (adjacent, overlapping, separated, multi-chrom, distance, stranded)
    • Implement CLUSTER tests (basic, separated, multi-chrom, stranded, distance)
    • Implement NEAREST tests (basic, ties, cross-chrom, boundary, k>1, k>available, max_distance, standalone)
    • Implement DISTANCE tests (non-overlapping, overlapping, adjacent, cross-chrom, signed, stranded, signed+stranded)
    • Implement strand-aware cross-operator tests
    • Address code review findings (docstring convention, modern types, giql_query fixture, unused models, rename bed_export)
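For readers unfamiliar with the data-model step in the plan above, a validated interval dataclass with a `to_tuple()` method might look roughly like the following. The field set (BED6-style name, score, strand), the validation rules, and the rejection of zero-length intervals are assumptions for illustration; the actual GenomicInterval in utils/data_models.py may differ.

```python
from dataclasses import astuple, dataclass
from typing import Optional


@dataclass(frozen=True)
class GenomicInterval:
    """Hypothetical 0-based half-open interval with optional BED6 fields."""

    chrom: str
    start: int
    end: int
    name: Optional[str] = None
    score: Optional[int] = None
    strand: Optional[str] = None

    def __post_init__(self):
        # Validate eagerly so malformed test data fails at construction.
        if not self.chrom:
            raise ValueError("chrom must be a non-empty string")
        if self.start < 0:
            raise ValueError(f"start must be >= 0, got {self.start}")
        if self.end <= self.start:
            raise ValueError(
                f"end ({self.end}) must be greater than start ({self.start})"
            )
        if self.strand not in (None, "+", "-", "."):
            raise ValueError(f"invalid strand: {self.strand!r}")

    def to_tuple(self):
        """Return the fields as a plain tuple, in declaration order."""
        return astuple(self)


iv = GenomicInterval("chr1", 100, 200, strand="+")
row = iv.to_tuple()
```

Returning plain tuples keeps the comparison step independent of the dataclass, since rows fetched from a database connection are tuples as well.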

Add data models, pybedtools wrapper, result comparison logic, DuckDB
table loader, and pytest fixtures for validating GIQL operator
correctness against bedtools. Tests skip gracefully when bedtools
binary or Python dependencies are not installed.
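The PR description says the skip guards use pytest.importorskip for the Python packages and shutil.which for the bedtools binary. The underlying availability check can be sketched with the standard library alone; the helper name `missing_requirements` is hypothetical and is not part of the suite.

```python
import importlib.util
import shutil


def missing_requirements(modules, binaries):
    """List the top-level modules and executables that cannot be found."""
    missing = []
    for name in modules:
        # find_spec returns None when a top-level module is not importable.
        if importlib.util.find_spec(name) is None:
            missing.append(f"python module {name!r}")
    for name in binaries:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(name) is None:
            missing.append(f"executable {name!r}")
    return missing


# In a conftest.py this result could drive a module-level skip, e.g.:
#
#     missing = missing_requirements(["duckdb", "pybedtools"], ["bedtools"])
#     if missing:
#         pytest.skip("missing: " + ", ".join(missing),
#                     allow_module_level=True)
present = missing_requirements(["json"], [])
absent = missing_requirements(["no_such_module_xyz"], ["no-such-binary-xyz"])
```

Skipping rather than failing keeps the default test run green on machines without bedtools, while the integration marker still lets CI select these tests explicitly.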

References #74

Add integration tests covering INTERSECTS, MERGE, NEAREST, CLUSTER, and
DISTANCE operators. Each test generates controlled genomic intervals,
executes the equivalent operation via GIQL (transpiled to SQL, run on
DuckDB) and bedtools (via pybedtools), then compares results.

Includes strand-aware tests for INTERSECTS and NEAREST with same-strand,
opposite-strand, and ignore-strand modes.

Closes #74

@conradbzura self-assigned this Mar 25, 2026

CONTAINS tests cover point containment, range containment,
column-to-column, cross-chromosome, and the CONTAINS ALL set predicate.
WITHIN tests cover basic, narrow range, column-to-column, and exact
boundary cases.

References #74

DISTANCE: signed downstream/upstream, stranded with unstranded input,
stranded same-strand, signed+stranded minus-strand sign flip.

NEAREST: k>1, k exceeding available count, max_distance filter,
standalone mode with literal reference.

MERGE: distance parameter bridging gaps, stranded GIQL execution.

CLUSTER: distance parameter grouping with gap tolerance.
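The distance parameter for MERGE and CLUSTER can be illustrated with a small reference implementation. This is a sketch under assumptions: intervals are 0-based half-open `(chrom, start, end)` tuples, a gap of at most `distance` bases is bridged, and cluster IDs are numbered globally in sort order. The numbering scheme GIQL actually uses is not stated in this PR.

```python
def merge(intervals, distance=0):
    """Merge intervals on the same chromosome whose gap is <= distance."""
    merged = []
    for chrom, start, end in sorted(intervals):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= distance:
            # Extend the current block; max() handles nested intervals.
            merged[-1] = (chrom, last[1], max(last[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def cluster(intervals, distance=0):
    """Label each interval with the ID of the merged block it belongs to."""
    labeled = []
    cluster_id = 0
    prev_chrom, reach = None, None
    for chrom, start, end in sorted(intervals):
        if chrom != prev_chrom or start - reach > distance:
            cluster_id += 1               # start a new cluster
            prev_chrom, reach = chrom, end
        else:
            reach = max(reach, end)       # grow the current cluster
        labeled.append((chrom, start, end, cluster_id))
    return labeled


# Gaps of 50bp and 150bp, as in the distance-parameter tests.
rows = [("chr1", 100, 200), ("chr1", 250, 350), ("chr1", 500, 600)]
merged_default = merge(rows)
merged_100 = merge(rows, distance=100)
clustered_100 = cluster(rows, distance=100)
```

With `distance=100` the 50bp gap is bridged and the 150bp gap is not, so the merge yields two blocks and the cluster labels take two distinct values, which mirrors the cross-check of cluster count against merge count described for test_cluster.py.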

INTERSECTS: literal range, literal cross-chromosome, ANY and ALL set
predicates.

References #74

@conradbzura marked this pull request as ready for review March 25, 2026 20:34

- Fix test_nearest_opposite_strand: remove unused duckdb_connection
  param, document as bedtools-only reference validation
- Fix test_merge_strand_specific: now executes GIQL MERGE with
  stranded := true and compares against bedtools, assert == 2
- Remove unused SimulatedDataset and IntervalGeneratorConfig
- Align docstrings to GIVEN/WHEN/THEN convention (all caps, no colons)
- Modernize type hints: list[tuple] instead of typing.List[Tuple]
- Rename bed_export.py to duckdb_loader.py
- Rename format param to bed_format to avoid shadowing builtin
- Add pytest.mark.integration marker via pytestmark
- Add giql_query fixture to reduce load/transpile/execute boilerplate
- Fix _sort_key to use sentinel tuple for None values
- Remove pass from BedtoolsError (docstring suffices)
- Add identifier safety comment to load_intervals
- Add field layout comment to bedtool_to_tuples closest format

Ruff auto-formatted these files when they were touched by pre-commit
hooks during the integration test work. Changes are purely cosmetic:
trailing commas, line-length wrapping, import sorting, and a missing
newline at end of file.

@conradbzura merged commit 409640d into main Mar 25, 2026
3 checks passed