Implement bedtools correctness test suite with DataFusion execution engine #74
Description
Summary
Implement a functional/integration test suite that validates GIQL operator correctness against bedtools. Each test generates controlled genomic interval datasets, executes the equivalent operation in both GIQL (transpiled to SQL and executed via DataFusion) and bedtools (via pybedtools), then compares the results.
Operators to cover
- INTERSECTS: validate against bedtools intersect, including strand-aware modes (-s, -S)
- MERGE: validate against bedtools merge, including strand-aware merging
- NEAREST: validate against bedtools closest, including k-nearest and distance calculations
- CLUSTER: validate against bedtools cluster
- DISTANCE: validate against the distance output of bedtools closest -d
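To make the strand-aware modes concrete, here is a minimal pure-Python sketch of the overlap predicate that both sides are expected to agree on. It assumes half-open BED coordinates and strands restricted to "+" and "-"; the tuple layout, the function name, and the strand_mode values are hypothetical and only mirror the meaning of bedtools intersect -s (same strand) and -S (opposite strand).

```python
def overlaps(a, b, strand_mode=None):
    """Return True if intervals a and b overlap under the given strand mode.

    a and b are (chrom, start, end, strand) tuples in half-open coordinates.
    strand_mode is None (ignore strand), "same" (like -s), or "opposite" (like -S).
    Assumption: strands are only "+" or "-"; unstranded records are not modelled.
    """
    a_chrom, a_start, a_end, a_strand = a
    b_chrom, b_start, b_end, b_strand = b

    # Half-open intervals overlap when each one starts before the other ends.
    if a_chrom != b_chrom or not (a_start < b_end and b_start < a_end):
        return False
    if strand_mode == "same":
        return a_strand == b_strand
    if strand_mode == "opposite":
        return a_strand != b_strand
    return True
```

A predicate like this can serve as a quick sanity check on hand-built fixtures before either bedtools or DataFusion is involved.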
Architecture
tests/integration/bedtools/
├── conftest.py # Fixtures: DataFusion session, interval generator
├── test_intersect.py # INTERSECTS correctness tests
├── test_merge.py # MERGE correctness tests
├── test_nearest.py # NEAREST correctness tests
├── test_cluster.py # CLUSTER correctness tests
├── test_distance.py # DISTANCE correctness tests
├── test_strand_aware.py # Cross-operator strand-specific tests
└── utils/
    ├── bedtools_wrapper.py # Pybedtools wrapper for each operation
    ├── comparison.py # Result comparison with epsilon tolerance
    ├── data_models.py # GenomicInterval, ComparisonResult, etc.
    ├── datafusion_engine.py # DataFusion session setup and GIQL execution
    └── interval_generator.py # Seeded interval generation for reproducibility
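As a rough illustration of what seeded generation in interval_generator.py could look like, the sketch below uses only the standard library. The class names GenomicInterval and IntervalGenerator come from the layout above; the fields, the generate method, and its parameters are assumptions made for the example, not the planned interface.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class GenomicInterval:
    chrom: str
    start: int   # 0-based, inclusive (BED convention)
    end: int     # exclusive
    strand: str  # "+" or "-"


class IntervalGenerator:
    """Generates pseudo-random intervals from a private, seeded RNG."""

    def __init__(self, seed: int) -> None:
        # A dedicated Random instance keeps runs reproducible and
        # independent of the global random state.
        self._rng = random.Random(seed)

    def generate(self, n, chroms=("chr1", "chr2"), max_pos=10_000, max_len=500):
        intervals = []
        for _ in range(n):
            start = self._rng.randrange(0, max_pos)
            length = self._rng.randint(1, max_len)  # at least 1 bp long
            intervals.append(
                GenomicInterval(
                    chrom=self._rng.choice(chroms),
                    start=start,
                    end=start + length,
                    strand=self._rng.choice("+-"),
                )
            )
        return intervals
```

Two generators built with the same seed return identical interval lists, which is what lets a failing comparison be replayed exactly.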
Test pattern
Each test follows a consistent pattern:
- Arrange: Generate intervals using IntervalGenerator with a deterministic seed. Load them into both a DataFusion session (as Arrow tables) and pybedtools BedTool objects.
- Act (bedtools): Execute the operation via the pybedtools wrapper.
- Act (GIQL): Transpile the GIQL query and execute it against DataFusion.
- Assert: Compare results using order-independent comparison with epsilon tolerance for floats and exact matching for integer positions.
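The Assert step could be sketched along these lines. The function name rows_match and the tolerance defaults are hypothetical, and the sketch assumes both result sets use the same column order and that rows can be told apart by their non-float fields, so that tiny float differences do not change the sort order.

```python
import math


def rows_match(expected, actual, rel_tol=1e-9, abs_tol=1e-9):
    """Compare two lists of row tuples, ignoring row order.

    Floats are compared with a tolerance; every other value (chromosome
    names, integer positions, strands) must match exactly.
    """
    if len(expected) != len(actual):
        return False
    # Sorting both sides makes the comparison independent of result order.
    for exp_row, act_row in zip(sorted(expected), sorted(actual)):
        if len(exp_row) != len(act_row):
            return False
        for exp_val, act_val in zip(exp_row, act_row):
            if isinstance(exp_val, float) or isinstance(act_val, float):
                if not math.isclose(exp_val, act_val, rel_tol=rel_tol, abs_tol=abs_tol):
                    return False
            elif exp_val != act_val:
                return False
    return True
```

For example, a bedtools result and a GIQL result that differ only in row order and in the last bits of a float column compare as equal, while a one-base difference in an end position does not.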
DataFusion execution
Use datafusion (PyArrow-based Python bindings) as the execution engine. GIQL transpiles to SQL; DataFusion executes it. This validates that GIQL's generated SQL is portable and correct on the engine the project targets for production use. The test engine wrapper registers Arrow tables from the interval generator and executes transpiled GIQL queries via SessionContext.sql().
Dependencies
- pybedtools: Python wrapper for the bedtools CLI
- bedtools: System dependency (tests skip gracefully if not installed)
- datafusion: Apache DataFusion Python bindings
- hypothesis: Property-based test data generation for edge-case discovery
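One way the graceful skip could be detected is a PATH lookup for the bedtools binary, sketched below with the standard library only. The helper name bedtools_available and the reason string are hypothetical.

```python
import shutil

SKIP_REASON = "bedtools binary not found on PATH"


def bedtools_available() -> bool:
    """Return True if a bedtools executable can be found on PATH."""
    return shutil.which("bedtools") is not None
```

In conftest.py this could back a marker such as pytest.mark.skipif(not bedtools_available(), reason=SKIP_REASON), applied to every test module in the suite.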
Motivation
GIQL transpiles spatial genomic queries into SQL, but the existing unit tests only verify that the generated SQL has the expected structure — they do not verify that the SQL produces correct results on real data. bedtools is the de facto standard for genomic interval operations, making it the ideal oracle for correctness testing. Using DataFusion as the execution engine ensures the suite validates correctness on GIQL's target production engine and catches any SQL dialect incompatibilities early.
Expected outcome
- Integration test suite under tests/integration/bedtools/ covers the five merged GIQL operators (INTERSECTS, MERGE, NEAREST, CLUSTER, DISTANCE)
- Tests use DataFusion as the execution engine for GIQL queries
- Tests skip gracefully when bedtools is not installed
- Interval generation is seeded and reproducible
- Strand-aware modes are tested for operators that support them
- The test suite passes against the current GIQL transpiler output
- COVERAGE integration tests are deferred to the COVERAGE operator PR (Add binned summary statistic aggregation for genomic intervals — Closes #61 #62)