Draft
Commits
17 commits
347d193
feat: Transpile INTERSECTS joins to binned equi-joins
conradbzura Mar 31, 2026
8465472
build: Add datafusion to dev dependencies
conradbzura Mar 31, 2026
1c9cc08
feat: Extend binned join rewrite to implicit cross-joins
conradbzura Mar 31, 2026
d7b4a81
test: Add binned join unit and DataFusion correctness tests
conradbzura Mar 31, 2026
04bd0c0
fix: Place overlap filter in ON and support multi-join queries
conradbzura Mar 31, 2026
13a4375
test: Add multi-join and bin_size validation tests
conradbzura Mar 31, 2026
20140ee
fix: Limit binned join to explicit JOINs and skip DataFusion tests
conradbzura Mar 31, 2026
7394f65
build: Add datafusion to pixi dependencies for CI
conradbzura Mar 31, 2026
b31b6f7
feat: Use key-only bridge CTEs to eliminate __giql_bin column leak
conradbzura Mar 31, 2026
8ead19f
perf: Use 1-join full-CTE path when SELECT has no wildcards
conradbzura Mar 31, 2026
3a17063
style: Remove structural comment dividers
conradbzura Apr 1, 2026
8986276
test: Add regression tests for three binned join bugs
conradbzura Apr 1, 2026
0530c5d
fix: Preserve outer join semantics and extra ON conditions
conradbzura Apr 1, 2026
4c3526c
docs: Document INTERSECTS binned join deduplication behavior
conradbzura Apr 1, 2026
a702a6e
refactor: Address review findings on IntersectsBinnedJoinTransformer
conradbzura Apr 1, 2026
109fceb
style: Remove structural comment from transpile.py
conradbzura Apr 1, 2026
5b4a94a
test: Add property-based bedtools correctness tests for INTERSECTS
conradbzura Apr 1, 2026
18 changes: 18 additions & 0 deletions docs/dialect/spatial-operators.rst
@@ -99,6 +99,24 @@ Find all variants, with gene information where available:
   FROM variants v
   LEFT JOIN genes g ON v.interval INTERSECTS g.interval

Deduplication Behavior
~~~~~~~~~~~~~~~~~~~~~~

Column-to-column ``INTERSECTS`` joins use a binned equi-join strategy internally: each interval is assigned to one or more fixed-width bins, and the join is performed on ``(chrom, bin)`` pairs. Because an interval that spans a bin boundary belongs to more than one bin, the same pair of overlapping rows can be matched once for every bin the two intervals share, producing duplicate result rows. GIQL adds ``SELECT DISTINCT`` automatically to remove these duplicates.
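
The duplication described above can be reproduced outside GIQL. The following is a minimal pure-Python sketch, not the transpiler's actual implementation: the helper names ``bins_for`` and ``binned_intersect`` are hypothetical, intervals are assumed to be half-open ``[start, end)``, and the bin size of 10,000 is assumed to mirror ``DEFAULT_BIN_SIZE``.

```python
from collections import defaultdict

BIN_SIZE = 10_000  # assumption: mirrors DEFAULT_BIN_SIZE from this change


def bins_for(start, end, bin_size=BIN_SIZE):
    """Return every bin index touched by the half-open interval [start, end)."""
    return range(start // bin_size, (end - 1) // bin_size + 1)


def binned_intersect(left, right, bin_size=BIN_SIZE):
    """Join (chrom, start, end) tuples through shared (chrom, bin) keys.

    Returns the raw matches, which may contain duplicates, and the
    deduplicated set (the role SELECT DISTINCT plays in the docs above).
    """
    # Index the right-hand table by every (chrom, bin) key it occupies.
    index = defaultdict(list)
    for chrom, start, end in right:
        for b in bins_for(start, end, bin_size):
            index[(chrom, b)].append((chrom, start, end))

    raw = []
    for chrom, start, end in left:
        for b in bins_for(start, end, bin_size):
            for r in index[(chrom, b)]:
                # The bin key only finds candidates; the overlap test still filters them.
                if start < r[2] and r[1] < end:
                    raw.append(((chrom, start, end), r))
    return raw, set(raw)


variants = [("chr1", 9_500, 10_500)]  # spans bins 0 and 1
genes = [("chr1", 9_000, 11_000)]     # spans bins 0 and 1
raw, deduped = binned_intersect(variants, genes)
print(len(raw), len(deduped))  # prints "2 1"
```

The one overlapping pair is found twice, once in each shared bin, and deduplication reduces it to a single row.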

This deduplication is usually transparent, but it has one observable side effect: ``DISTINCT`` operates on the entire set of selected columns, so rows that are genuinely identical across every selected column will also be collapsed into one. This matters when a table contains duplicate source records with no distinguishing column.

To prevent unintended deduplication, include any column that makes rows distinguishable — such as a primary key, name, or score — in the ``SELECT`` list:

.. code-block:: sql

   -- score distinguishes otherwise-identical rows
   SELECT v.chrom, v.start, v.end, v.score, g.name
   FROM variants v
   INNER JOIN genes g ON v.interval INTERSECTS g.interval

If all columns are identical across two source rows (including any unique identifier), those rows represent the same logical record and collapsing them is correct behavior.
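
The effect of the ``SELECT`` list on deduplication can be checked with plain SQL. The sketch below uses Python's standard-library ``sqlite3`` rather than GIQL, and the ``hits`` table with its column names is a hypothetical stand-in for already-joined output. It illustrates ordinary ``DISTINCT`` semantics only, not the exact SQL that GIQL generates.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical table standing in for joined variant/gene rows.
con.execute(
    "CREATE TABLE hits (chrom TEXT, start_pos INT, end_pos INT, score REAL, gene TEXT)"
)
con.executemany(
    "INSERT INTO hits VALUES (?, ?, ?, ?, ?)",
    [
        # Two distinct source records that differ only in score.
        ("chr1", 100, 200, 0.9, "BRCA2"),
        ("chr1", 100, 200, 0.4, "BRCA2"),
    ],
)

# Without score, the two rows are identical in every selected column and collapse.
without_score = con.execute(
    "SELECT DISTINCT chrom, start_pos, end_pos, gene FROM hits"
).fetchall()

# With score selected, the rows stay distinguishable and both survive.
with_score = con.execute(
    "SELECT DISTINCT chrom, start_pos, end_pos, score, gene FROM hits"
).fetchall()

print(len(without_score), len(with_score))  # prints "1 2"
```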

Related Operators
~~~~~~~~~~~~~~~~~

2 changes: 2 additions & 0 deletions pyproject.toml
@@ -43,6 +43,7 @@ dev = [
"pytest-cov>=4.0.0",
"pytest>=7.0.0",
"ruff>=0.1.0",
"datafusion>=52.3.0",
]
docs = [
"sphinx>=7.0",
@@ -82,6 +83,7 @@ bedtools = ">=2.31.0"
pybedtools = ">=0.9.0"
pytest = ">=7.0.0"
pytest-cov = ">=4.0.0"
datafusion = ">=43.0.0"
duckdb = ">=1.4.0"
pandas = ">=2.0.0"
sqlglot = ">=20.0.0,<30"
2 changes: 2 additions & 0 deletions src/giql/__init__.py
@@ -3,13 +3,15 @@
A SQL dialect for genomic range queries.
"""

from giql.constants import DEFAULT_BIN_SIZE
from giql.table import Table
from giql.transpile import transpile

__version__ = "0.1.0"


__all__ = [
"DEFAULT_BIN_SIZE",
"Table",
"transpile",
]
3 changes: 3 additions & 0 deletions src/giql/constants.py
@@ -9,3 +9,6 @@
DEFAULT_END_COL = "end"
DEFAULT_STRAND_COL = "strand"
DEFAULT_GENOMIC_COL = "interval"

# Default bin size for INTERSECTS binned equi-join optimization
DEFAULT_BIN_SIZE = 10_000