Skip to content

Conversation

@Kontinuation
Copy link
Member

This is an intermediate step that integrates multiple pieces together. The overall behavior is unchanged when memory limit of DataFusion is not set (which is the default case). The collected bounding box samples are unused for now, the performance overhead of sampling boxes is negligible according to our tests. Committing this won't drive the project into an unstable or unreleasable state.

The next step will be adding a partitioned spatial index provider and integrate spatial partitioner into the main spatial join workflow, but will effectively only work on one single partition for now. This will also be a non-breaking change.

@Kontinuation Kontinuation marked this pull request as ready for review January 23, 2026 09:47
@Kontinuation Kontinuation force-pushed the pr-integrate-bbox-sampler-and-spiller branch from b54ba58 to 17a9261 Compare January 23, 2026 09:48
Copy link
Member

@paleolimbot paleolimbot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you!

filter.as_ref(),
converted_from_hash_join,
)?;
let seed = fastrand::u64(0..0xFFFF);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to do this here, but should this be configurable in some way or logged so we can reproduce failures?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added an option sedona.spatial_join.debug.random_seed for setting the random seed; also the seed is printed in debug log.

geos = { workspace = true }
float_next_after = { workspace = true }
fastrand = { workspace = true }
log = { workspace = true }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are log and env_logger doing the same thing or do we need both of them?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need both. log is the logging facade, env_logger is the logging implementation we actually use.

Reference: https://docs.rs/log/latest/log/#available-logging-implementations

@Kontinuation Kontinuation force-pushed the pr-integrate-bbox-sampler-and-spiller branch from 5bc89e1 to 52ec01e Compare January 26, 2026 08:16
@Kontinuation Kontinuation merged commit e31d554 into apache:main Jan 27, 2026
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants