Implement bedtools correctness test suite with DataFusion execution engine #74

@conradbzura

Summary

Implement a functional/integration test suite that validates GIQL operator correctness against bedtools. Each test generates controlled genomic interval datasets, executes the equivalent operation in both GIQL (transpiled to SQL and executed via DataFusion) and bedtools (via pybedtools), then compares the results.

Operators to cover

  • INTERSECTS — validate against bedtools intersect, including strand-aware modes (-s, -S)
  • MERGE — validate against bedtools merge, including strand-aware merging
  • NEAREST — validate against bedtools closest, including k-nearest and distance calculations
  • CLUSTER — validate against bedtools cluster
  • DISTANCE — validate against bedtools closest -d distance output
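As a rough illustration of how the strand flags above could map onto pybedtools calls, here is a minimal sketch. The helper names (`to_bed_string`, `strand_kwargs`, `bedtools_intersect`, `bedtools_merge`) and the BED6 row layout are assumptions made for this example, not the project's actual wrapper API. pybedtools is imported lazily so the formatting helpers can be used without it.

```python
from typing import Dict, Iterable, List, Optional, Tuple

# Assumed row layout: (chrom, start, end, name, score, strand), BED-style
# 0-based half-open coordinates.
Bed6 = Tuple[str, int, int, str, int, str]


def to_bed_string(rows: Iterable[Bed6]) -> str:
    """Render rows as tab-separated BED6 text, one feature per line."""
    return "".join("\t".join(str(field) for field in row) + "\n" for row in rows)


def strand_kwargs(strand_mode: Optional[str]) -> Dict[str, bool]:
    """Translate a strand mode into pybedtools keyword arguments.

    None ignores strand, "same" maps to -s, "opposite" maps to -S.
    """
    if strand_mode is None:
        return {}
    if strand_mode == "same":
        return {"s": True}
    if strand_mode == "opposite":
        return {"S": True}
    raise ValueError(f"unknown strand mode: {strand_mode!r}")


def bedtools_intersect(a_rows, b_rows, strand_mode=None) -> List[tuple]:
    """Run `bedtools intersect -wa -wb` and return each output line as a tuple of strings."""
    import pybedtools  # lazy import: requires pybedtools and the bedtools binary

    a = pybedtools.BedTool(to_bed_string(a_rows), from_string=True)
    b = pybedtools.BedTool(to_bed_string(b_rows), from_string=True)
    result = a.intersect(b, wa=True, wb=True, **strand_kwargs(strand_mode))
    return [tuple(feature.fields) for feature in result]


def bedtools_merge(rows, stranded=False) -> List[tuple]:
    """Run `bedtools merge`, sorting first because merge expects sorted input."""
    import pybedtools

    bed = pybedtools.BedTool(to_bed_string(rows), from_string=True).sort()
    kwargs = {"s": True} if stranded else {}
    return [tuple(feature.fields) for feature in bed.merge(**kwargs)]
```

bedtools output fields come back as strings, so a comparison step would need to coerce positions to integers before matching them against DataFusion results.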

Architecture

tests/integration/bedtools/
├── conftest.py                  # Fixtures: DataFusion session, interval generator
├── test_intersect.py            # INTERSECTS correctness tests
├── test_merge.py                # MERGE correctness tests
├── test_nearest.py              # NEAREST correctness tests
├── test_cluster.py              # CLUSTER correctness tests
├── test_distance.py             # DISTANCE correctness tests
├── test_strand_aware.py         # Cross-operator strand-specific tests
└── utils/
    ├── bedtools_wrapper.py      # Pybedtools wrapper for each operation
    ├── comparison.py            # Result comparison with epsilon tolerance
    ├── data_models.py           # GenomicInterval, ComparisonResult, etc.
    ├── datafusion_engine.py     # DataFusion session setup and GIQL execution
    └── interval_generator.py    # Seeded interval generation for reproducibility
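
To make the roles of data_models.py and interval_generator.py more concrete, here is a minimal sketch of a seeded generator. The field names on `GenomicInterval`, the function `generate_intervals`, and its parameters are assumptions for illustration; the issue only names the modules and classes, not their interfaces.

```python
import random
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class GenomicInterval:
    """A single interval in BED-style 0-based, half-open coordinates."""

    chrom: str
    start: int  # inclusive
    end: int    # exclusive
    strand: str = "+"

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError(f"invalid interval: {self.start}-{self.end}")
        if self.strand not in ("+", "-", "."):
            raise ValueError(f"invalid strand: {self.strand!r}")


def generate_intervals(
    seed: int,
    n: int,
    chroms: Sequence[str] = ("chr1", "chr2"),
    chrom_length: int = 10_000,
    max_width: int = 500,
) -> List[GenomicInterval]:
    """Generate n random intervals; the same seed always yields the same list."""
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    intervals = []
    for _ in range(n):
        chrom = rng.choice(chroms)
        width = rng.randint(1, max_width)
        start = rng.randint(0, chrom_length - width)
        strand = rng.choice("+-")
        intervals.append(GenomicInterval(chrom, start, start + width, strand))
    return intervals
```

Using a dedicated `random.Random(seed)` instance rather than the module-level functions keeps each test's data independent of whatever other tests have drawn.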

Test pattern

Each test follows a consistent pattern:

  1. Arrange — Generate intervals using IntervalGenerator with a deterministic seed. Load into both a DataFusion session (as Arrow tables) and pybedtools BedTool objects.
  2. Act (bedtools) — Execute the operation via the pybedtools wrapper.
  3. Act (GIQL) — Transpile the GIQL query and execute it against DataFusion.
  4. Assert — Compare results using order-independent comparison with epsilon tolerance for floats and exact matching for integer positions.
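
The Assert step could be sketched as follows. This is one possible way to implement order-independent comparison, assuming both result sets have already been normalized to tuples with matching column types; the function name `compare_rows` and the default tolerance are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Row = Tuple[object, ...]


def _sort_key(row: Row):
    # Sort on exact-match fields first so tiny float differences
    # cannot change how rows pair up between the two result sets.
    exact = tuple(str(value) for value in row if not isinstance(value, float))
    approx = tuple(value for value in row if isinstance(value, float))
    return (exact, approx)


def _values_equal(expected, actual, abs_tol: float) -> bool:
    # Floats are compared within an absolute tolerance; everything else,
    # including integer positions, must match exactly.
    if isinstance(expected, float) and isinstance(actual, float):
        return math.isclose(expected, actual, rel_tol=0.0, abs_tol=abs_tol)
    return expected == actual


def compare_rows(
    expected: Sequence[Row], actual: Sequence[Row], abs_tol: float = 1e-9
) -> List[str]:
    """Return a list of mismatch descriptions; an empty list means the results agree."""
    if len(expected) != len(actual):
        return [f"row count differs: {len(expected)} != {len(actual)}"]
    problems = []
    pairs = zip(sorted(expected, key=_sort_key), sorted(actual, key=_sort_key))
    for exp, act in pairs:
        same_width = len(exp) == len(act)
        if not same_width or not all(
            _values_equal(e, a, abs_tol) for e, a in zip(exp, act)
        ):
            problems.append(f"{exp!r} != {act!r}")
    return problems
```

Returning mismatch descriptions rather than a bare boolean makes failing assertions easier to read, for example `assert compare_rows(bedtools_rows, giql_rows) == []`. Duplicate rows are handled naturally because both sides are sorted lists rather than sets.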

DataFusion execution

Use datafusion, the Apache DataFusion Python bindings (which exchange data with PyArrow), as the execution engine. GIQL transpiles to SQL; DataFusion executes it. This validates that GIQL's generated SQL is accepted by, and produces correct results on, the engine the project targets for production use. The test engine wrapper registers Arrow tables from the interval generator and executes transpiled GIQL queries via SessionContext.sql().
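
A minimal sketch of such an engine wrapper is shown below. The table names, column names, and the helpers `to_columns` and `run_sql` are assumptions for this example, and `OVERLAP_SQL` is a hand-written stand-in for transpiler output rather than what GIQL actually generates. pyarrow and datafusion are imported lazily inside `run_sql`.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed row layout: (chrom, start, end, strand)
Interval = Tuple[str, int, int, str]


def to_columns(intervals: Sequence[Interval]) -> Dict[str, list]:
    """Pivot interval rows into the column-oriented dict that pyarrow.table accepts."""
    columns: Dict[str, list] = {"chrom": [], "start": [], "end": [], "strand": []}
    for chrom, start, end, strand in intervals:
        columns["chrom"].append(chrom)
        columns["start"].append(start)
        columns["end"].append(end)
        columns["strand"].append(strand)
    return columns


# Stand-in for transpiled GIQL: a half-open overlap join between tables a and b.
# "start" and "end" are quoted so they are always parsed as identifiers.
OVERLAP_SQL = """
SELECT a.chrom, a."start" AS a_start, a."end" AS a_end,
       b."start" AS b_start, b."end" AS b_end
FROM a JOIN b
  ON a.chrom = b.chrom
 AND a."start" < b."end"
 AND b."start" < a."end"
"""


def run_sql(tables: Dict[str, Sequence[Interval]], sql: str) -> List[dict]:
    """Register each interval list as a DataFusion table and run the query."""
    import pyarrow as pa
    from datafusion import SessionContext

    ctx = SessionContext()
    for name, intervals in tables.items():
        table = pa.table(to_columns(intervals))
        # Empty inputs may need separate handling; this sketch assumes
        # every table has at least one row.
        ctx.register_record_batches(name, [table.to_batches()])
    return ctx.sql(sql).to_arrow_table().to_pylist()
```

A call such as `run_sql({"a": a_rows, "b": b_rows}, OVERLAP_SQL)` returns one dict per output row, which can then be converted to tuples for comparison against the bedtools output.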

Dependencies

  • pybedtools — Python wrapper for bedtools CLI
  • bedtools — System dependency (tests skip gracefully if not installed)
  • datafusion — Apache DataFusion Python bindings
  • hypothesis — Property-based test data generation for edge-case discovery
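
One way to implement the graceful skip is to check for the bedtools executable on PATH before any test runs. The helper name `bedtools_skip_reason` is hypothetical, and the commented usage assumes pytest is the test runner, which the conftest.py in the layout above suggests.

```python
import shutil
from typing import Callable, Optional


def bedtools_skip_reason(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return a skip reason if the bedtools binary is missing, else None.

    The lookup function is injectable so the check itself can be unit tested.
    """
    if which("bedtools") is None:
        return "bedtools executable not found on PATH"
    return None


# Possible use at the top of a test module, assuming pytest:
#
#   import pytest
#
#   _reason = bedtools_skip_reason()
#   pytestmark = pytest.mark.skipif(_reason is not None, reason=_reason or "")
```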

Motivation

GIQL transpiles spatial genomic queries into SQL, but the existing unit tests only verify that the generated SQL has the expected structure — they do not verify that the SQL produces correct results on real data. bedtools is the de facto standard for genomic interval operations, making it the ideal oracle for correctness testing. Using DataFusion as the execution engine ensures the suite validates correctness on GIQL's target production engine and catches any SQL dialect incompatibilities early.

Expected outcome

  • Integration test suite under tests/integration/bedtools/ covers the five GIQL operators that are already merged (INTERSECTS, MERGE, NEAREST, CLUSTER, DISTANCE)
  • Tests use DataFusion as the execution engine for GIQL queries
  • Tests skip gracefully when bedtools is not installed
  • Interval generation is seeded and reproducible
  • Strand-aware modes are tested for operators that support them
  • The test suite passes against the current GIQL transpiler output
  • COVERAGE integration tests are deferred to the COVERAGE operator PR (Add binned summary statistic aggregation for genomic intervals — Closes #61 #62)

Labels

enhancement: New feature or request