277: chore: Refactor spatial join code for easier implementing new features #7

Closed
martin-augment wants to merge 5 commits into main from pr-277-2025-11-07-13-28-22

@martin-augment
Owner

@martin-augment martin-augment commented Nov 7, 2025

277: To review by AI


Note

Major spatial join refactor: split into cohesive modules, add centralized index build pipeline, rework streaming and utils, and tweak APIs/metrics; minor Python doc example update.

  • Rust: Spatial Join refactor
    • Modularization:
      • Split index.rs into index/spatial_index.rs, index/spatial_index_builder.rs, index/build_side_collector.rs, and index/knn_adapter.rs.
      • Introduce evaluated_batch (with in-memory streams) and move join helpers to utils/join_utils.rs; add utils/{once_fut,init_once_array,concurrent_reservation}.rs.
    • Centralized index build:
      • Add build_index.rs to orchestrate build-side collection (with metrics/memory reservations) and index construction; returns error when spilling (external builder not implemented).
    • API/behavior changes:
      • EvaluatedGeometryArray.rects now Vec<Option<Rect<f32>>>; propagate through collectors, builder, index, and stream.
      • KNN query path uses accessor/cache module; tie-breaker handling retained within refactored code.
      • Metrics streamlined/renamed (e.g., build input collection, output_rows), and tracked refiner mem/exec mode.
    • Execution/streaming:
      • SpatialJoinExec updated to use new build_index and utils::join_utils; cleans up once-future handling.
      • SpatialJoinStream/iterators operate on EvaluatedBatch; outer-join unmatched production preserved.
  • Python (minor):
    • Example now assigns cur.execute(...) to a throwaway variable to avoid unused result.

Written by Cursor Bugbot for commit 8c3535f. This will update automatically on new commits. Configure here.

@coderabbitai

coderabbitai bot commented Nov 7, 2025

Walkthrough

This pull request refactors Sedona's spatial join implementation with new modular infrastructure for stream processing, index building, and query execution. Changes introduce evaluated batch streaming, build-side collection, spatial indexing with KNN support, and reorganize utilities into specialized modules.

Changes

| Cohort | File(s) | Summary |
|---|---|---|
| Python documentation | python/sedonadb/python/sedonadb/dbapi.py | Updated example to show assigning unused return value to dummy variable with no functional changes |
| Evaluated batch streaming | rust/sedona-spatial-join/src/evaluated_batch.rs, rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream.rs, rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream/in_mem.rs | Introduced EvaluatedBatch struct wrapping record batches and evaluated geometries; added EvaluatedBatchStream trait and SendableEvaluatedBatchStream type alias; implemented in-memory batch streaming with InMemoryEvaluatedBatchStream |
| Build-side collection | rust/sedona-spatial-join/src/index/build_side_collector.rs | Added BuildPartition, BuildSideBatchesCollector, and CollectBuildSideMetrics types to asynchronously collect and evaluate build-side batches with memory management and statistics tracking |
| Build index orchestration | rust/sedona-spatial-join/src/build_index.rs | Introduced async build_index function to orchestrate memory management, batch collection, and spatial index construction with error handling for external streams |
| Spatial index core | rust/sedona-spatial-join/src/index/spatial_index.rs, rust/sedona-spatial-join/src/index/spatial_index_builder.rs | Added SpatialIndex with R-tree-based indexing, KNN querying, and probe-side refinement; added SpatialIndexBuilder to accumulate batches and construct indexes with memory tracking |
| KNN support | rust/sedona-spatial-join/src/index/knn_adapter.rs | Introduced KnnComponents for distance metrics and geometry caching; added SedonaKnnAdapter geometry accessor with lock-free caching via OnceCell |
| Execution and streaming | rust/sedona-spatial-join/src/exec.rs, rust/sedona-spatial-join/src/stream.rs | Updated build_index call sites with new signature including Arc<TaskContext> parameter; refactored SpatialJoinBatchIterator to use EvaluatedBatch wrapper for unified batch and geometry handling |
| Index module reorganization | rust/sedona-spatial-join/src/index.rs | Added submodule declarations for build_side_collector, spatial_index, spatial_index_builder, and knn_adapter; reduced IndexQueryResult visibility to pub(crate) and added QueryResultMetrics struct |
| Operand evaluator updates | rust/sedona-spatial-join/src/operand_evaluator.rs | Changed EvaluatedGeometryArray.rects from index-based pairs to optional per-geometry rectangles; removed wkb_array field with updated memory safety comments and null/empty geometry handling |
| Utility module reorganization | rust/sedona-spatial-join/src/utils.rs, rust/sedona-spatial-join/src/utils/join_utils.rs | Restructured utils.rs into submodules: concurrent_reservation, init_once_array, join_utils, and once_fut; moved join utility functions including index filtering, batch construction, and partitioning logic into join_utils |
| Library visibility changes | rust/sedona-spatial-join/src/lib.rs | Made index and stream modules private; removed public concurrent_reservation, init_once_array, and once_fut modules; added new modules build_index and evaluated_batch; added public re-exports for register_spatial_join_optimizer and sedona option types |
| Import path updates | rust/sedona-spatial-join/src/refine/geos.rs, rust/sedona-spatial-join/src/refine/tg.rs | Updated InitOnceArray import from crate::init_once_array to crate::utils::init_once_array; introduced centralized memory estimator for prepared geometries in geos.rs |

@claude

claude bot commented Nov 7, 2025

Code Review: Spatial Join Refactoring (PR #7)

This is a substantial refactoring that improves code organization and maintainability. Overall, the changes are well-structured and follow good software engineering practices. Here's my detailed analysis:

✅ Strengths

1. Excellent Modularization

The refactoring successfully breaks down the monolithic index.rs (1932 lines) into cohesive, single-responsibility modules:

  • spatial_index.rs - Core index structure and query operations
  • spatial_index_builder.rs - Builder pattern for index construction
  • build_side_collector.rs - Build-side collection logic with metrics
  • knn_adapter.rs - KNN query functionality
  • evaluated_batch/ - Batch evaluation abstractions with in-memory streams

This significantly improves code navigability and makes future modifications easier.

2. Improved Abstractions

  • EvaluatedBatch: Clean abstraction that encapsulates both the record batch and evaluated geometry array
  • EvaluatedBatchStream trait: Well-designed trait for streaming evaluated batches with is_external() method to distinguish in-memory vs. external streams
  • Centralized build_index() function: Provides a clear entry point for index construction
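
The stream abstraction described above can be sketched with the standard library alone. This is a minimal illustration, assuming a synchronous `Iterator`-based stand-in rather than the async stream the crate uses; `Batch`, `BatchSource`, `InMemorySource`, and `can_build_in_memory` are hypothetical names, and only the `is_external()` method name comes from the review.

```rust
/// Hypothetical stand-in for an evaluated batch: only a row count is kept.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub num_rows: usize,
}

/// Simplified, synchronous analogue of a batch stream that reports whether
/// its data lives in memory or in external (spilled) storage.
pub trait BatchSource: Iterator<Item = Batch> {
    fn is_external(&self) -> bool;
}

/// In-memory source backed by a vector of already-evaluated batches.
pub struct InMemorySource {
    batches: std::vec::IntoIter<Batch>,
}

impl InMemorySource {
    pub fn new(batches: Vec<Batch>) -> Self {
        Self { batches: batches.into_iter() }
    }
}

impl Iterator for InMemorySource {
    type Item = Batch;

    fn next(&mut self) -> Option<Batch> {
        self.batches.next()
    }
}

impl BatchSource for InMemorySource {
    fn is_external(&self) -> bool {
        false
    }
}

/// An in-memory index can only be built when no source is external.
pub fn can_build_in_memory(sources: &[Box<dyn BatchSource>]) -> bool {
    sources.iter().all(|s| !s.is_external())
}
```

A caller would check `can_build_in_memory` before choosing the in-memory index builder and report an error (or pick an external builder, once one exists) otherwise.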

3. Better Separation of Concerns

  • Build-side collection is now separate from index building
  • Memory tracking is properly scoped with dedicated MemoryConsumer instances per partition
  • Metrics collection is well-organized with dedicated metric structs

4. Type Safety Improvements

The change from Vec<(usize, Rect<f32>)> to Vec<Option<Rect<f32>>> for rects is more semantically correct, as it properly represents geometries without valid bounding rectangles.
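
To make the difference between the two layouts concrete, here is a small sketch under stated assumptions: `Rect` is a hypothetical minimal struct standing in for `Rect<f32>`, and `to_dense` and `rect_at` are illustrative helpers that do not appear in the PR.

```rust
/// Hypothetical minimal rectangle type standing in for `Rect<f32>`.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Rect {
    pub min: (f32, f32),
    pub max: (f32, f32),
}

/// Converts the old sparse layout, (row index, rect) pairs where rows
/// without a rect are simply absent, into one `Option<Rect>` per row.
/// Assumes every row index in `sparse` is smaller than `num_rows`.
pub fn to_dense(sparse: &[(usize, Rect)], num_rows: usize) -> Vec<Option<Rect>> {
    let mut dense = vec![None; num_rows];
    for &(row, rect) in sparse {
        dense[row] = Some(rect);
    }
    dense
}

/// With the dense layout a row lookup is a direct index; `None` marks a row
/// with no valid bounding rectangle (or a row outside the batch).
pub fn rect_at(dense: &[Option<Rect>], row: usize) -> Option<Rect> {
    dense.get(row).copied().flatten()
}
```

In the dense layout the vector length always equals the row count, so a consumer can pair `rects[i]` with the i-th geometry without a separate index lookup.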

⚠️ Issues & Concerns

1. Critical: Potential Memory Overestimation (rust/sedona-spatial-join/src/evaluated_batch.rs:37-42)

pub fn in_mem_size(&self) -> usize {
    // NOTE: sometimes `geom_array` will reuse the memory of `batch`, especially when
    // the expression for evaluating the geometry is a simple column reference. In this case,
    // the in_mem_size will be overestimated. It is a conservative estimation so there's no risk
    // of running out of memory because of underestimation.
    self.batch.get_array_memory_size() + self.geom_array.in_mem_size()
}

Issue: The conservative estimation could lead to premature spilling when memory is not actually exhausted, potentially degrading performance. This same pattern exists in the old IndexedBatch but is now in a more prominent position.

Recommendation: Consider tracking whether the geometry array shares memory with the batch and adjusting the calculation accordingly, or at least document this trade-off more prominently in the module-level docs.
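
One way to picture the recommendation is to count each underlying allocation once. The sketch below is only an illustration of that idea, assuming reference-counted byte vectors as a stand-in for Arrow buffers; `Buffer`, `deduplicated_size`, and `conservative_size` are hypothetical names, not part of the crate.

```rust
use std::collections::HashSet;
use std::sync::Arc;

/// Hypothetical stand-in for an Arrow buffer: a reference-counted byte vector.
pub type Buffer = Arc<Vec<u8>>;

/// Conservative estimate: counts every reference, shared or not.
pub fn conservative_size(buffers: &[Buffer]) -> usize {
    buffers.iter().map(|b| b.len()).sum()
}

/// Counts each distinct allocation only once. Two `Arc`s pointing at the
/// same allocation are treated as shared memory.
pub fn deduplicated_size(buffers: &[Buffer]) -> usize {
    let mut seen = HashSet::new();
    buffers
        .iter()
        // `insert` returns false when the pointer was already recorded.
        .filter(|b| seen.insert(Arc::as_ptr(*b) as usize))
        .map(|b| b.len())
        .sum()
}
```

The conservative version can never underestimate, which is the property the original comment relies on; the deduplicated version trades that simplicity for a tighter bound when the geometry column is a plain column reference.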

2. Error Handling: External Builder Not Implemented (rust/sedona-spatial-join/src/build_index.rs:86-88)

} else {
    Err(DataFusionError::ResourcesExhausted("Memory limit exceeded while collecting indexed data. External spatial index builder is not yet implemented.".to_string()))
}

Issue: This creates a hard failure when memory limits are exceeded. While marked as TODO, this could be problematic in production with large datasets.

Recommendation:

  • Add a GitHub issue reference in the error message
  • Consider logging a warning earlier in the process when approaching memory limits
  • Document this limitation in the module/function docs
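
The first recommendation could look roughly like the following. This is a sketch, not the crate's code: `BuildError` is a simplified stand-in for `DataFusionError`, `finish_build` is a hypothetical function, and `TRACKING_ISSUE` is a placeholder because no tracking issue is named in this review.

```rust
use std::fmt;

/// Hypothetical placeholder; the real tracking issue is not named in this review.
pub const TRACKING_ISSUE: &str = "<tracking-issue-url>";

/// Simplified stand-in for `DataFusionError::ResourcesExhausted`.
#[derive(Debug, PartialEq)]
pub enum BuildError {
    ResourcesExhausted(String),
}

impl fmt::Display for BuildError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            BuildError::ResourcesExhausted(msg) => write!(f, "Resources exhausted: {}", msg),
        }
    }
}

impl std::error::Error for BuildError {}

/// Succeeds when all build-side data stayed in memory; otherwise reports the
/// unimplemented external builder together with where it is tracked.
pub fn finish_build(any_external: bool) -> Result<(), BuildError> {
    if !any_external {
        return Ok(());
    }
    Err(BuildError::ResourcesExhausted(format!(
        "Memory limit exceeded while collecting indexed data. \
         External spatial index builder is not yet implemented (see {}).",
        TRACKING_ISSUE
    )))
}
```

Keeping the reference in a single constant means the message, the module docs, and any log line can point at the same place once a real issue exists.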

3. Missing Validation in build_index() (rust/sedona-spatial-join/src/build_index.rs:36-44)

The function doesn't validate that build_streams is non-empty before proceeding with expensive setup operations.

Recommendation:

if build_streams.is_empty() {
    return Ok(SpatialIndex::empty(...));
}

4. Metrics Naming Inconsistency (rust/sedona-spatial-join/src/index/build_side_collector.rs:74-79)

Metric names mix terse (build_input_batches) and verbose (build_input_total_size_bytes) styles. The old build_input_rows metric now lives in CollectBuildSideMetrics, while build_mem_used is in SpatialJoinBuildMetrics.

Recommendation: Ensure consistency in metric naming across the refactored modules. Consider a naming convention document.

5. Type Change Propagation (Pervasive)

The change from Vec<(usize, Rect<f32>)> to Vec<Option<Rect<f32>>> is correct but requires careful handling throughout:

  • Need to verify all index accesses properly handle the Option case
  • The rects() method in EvaluatedBatch returns the full vector, but callers need to handle None values

Recommendation: Ensure comprehensive testing of edge cases with null geometries.

🔍 Minor Issues & Suggestions

6. Documentation Completeness

The new build_index() function has good inline documentation, but could benefit from:

  • Examples of when it returns ResourcesExhausted error
  • Description of the partition-based collection strategy
  • Memory reservation strategy explanation

7. Python Test Change (python/sedonadb/python/sedonadb/dbapi.py:41)

-        ...     cur.execute("SELECT 1 as one")
+        ...     _ = cur.execute("SELECT 1 as one")

Issue: This is a reasonable fix for the lint warning that follows the adbc_driver_manager.dbapi update, but make sure it doesn't mask a return value that should actually be checked.

8. Module Organization

Consider adding a module-level doc comment to rust/sedona-spatial-join/src/index.rs that explains the overall architecture of the index subsystem and how the modules work together.

🧪 Test Coverage

Concern: The PR doesn't include new test files. While this is a refactoring that shouldn't change behavior:

  • Are there existing integration tests that cover the refactored code paths?
  • Consider adding unit tests for the new abstractions (EvaluatedBatch, BuildSideBatchesCollector)
  • Test edge cases: empty streams, memory exhaustion scenarios, null geometry handling

Recommendation: Run the full test suite and include test results in the PR description.

🚀 Performance Considerations

Positive:

  • The refactoring maintains the same algorithmic complexity
  • Better modularity should make future optimizations easier
  • Memory tracking is more granular

Watch out for:

  • The memory overestimation issue mentioned above
  • Ensure the additional abstraction layers don't introduce performance overhead (though they shouldn't based on the code structure)

🔒 Security Considerations

No significant security concerns identified. The code:

  • Properly uses Result types for error handling
  • Has appropriate memory bounds checking
  • Uses safe Rust patterns throughout

📋 Summary

This is a high-quality refactoring that significantly improves code maintainability and sets up a better foundation for future features (like external spatial index builder). The main concerns are:

  1. Must address: The external builder not implemented error (document and track)
  2. Should address: Memory overestimation issue
  3. Good to have: Additional test coverage for new abstractions
  4. Nice to have: Better documentation and metric naming consistency

Recommendation: ✅ Approve with minor changes

The refactoring is sound and the identified issues are manageable. The code quality improvements significantly outweigh the minor concerns.


Review conducted with attention to code quality, best practices, potential bugs, performance, security, and test coverage per repository standards.


@augmentcode augmentcode bot left a comment


Review completed. 2 suggestions posted.
Comment augment review to trigger a new review.

@@ -524,9 +518,10 @@ impl SpatialJoinBatchIterator {
// Process probe rows incrementally until we have enough results or finish
let initial_size = self.build_batch_positions.len();



Consider asserting the invariant that rects.len() == wkbs.len() before the loop (e.g., a debug_assert_eq!) to catch any mismatch early and prevent potential out-of-bounds access when indexing rects[self.current_probe_idx].
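
A minimal sketch of the suggested assertion, assuming a simplified probe loop: `count_indexable_rows` and the tuple-based rectangle type are hypothetical, and only the `rects`/`wkbs` pairing and the `debug_assert_eq!` idea come from the comment above.

```rust
/// Hypothetical simplified probe loop: counts probe rows that have both a
/// geometry and a bounding rectangle. `rects` and `wkbs` must be parallel.
pub fn count_indexable_rows(
    rects: &[Option<(f32, f32, f32, f32)>],
    wkbs: &[Option<Vec<u8>>],
) -> usize {
    // Catch a length mismatch in debug builds before any indexing happens.
    debug_assert_eq!(rects.len(), wkbs.len(), "rects and wkbs must be parallel");

    let mut count = 0;
    for (probe_idx, wkb) in wkbs.iter().enumerate() {
        // Without the invariant above, this index could be out of bounds.
        if wkb.is_some() && rects[probe_idx].is_some() {
            count += 1;
        }
    }
    count
}
```

`debug_assert_eq!` is compiled out of release builds by default, so the check costs nothing in production while still failing loudly in tests.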


sedona_options.spatial_join,
join_type,
probe_threads_count,
Arc::clone(memory_pool),


In the else branch below, since external spilling isn't implemented, consider enhancing the error with a TODO/issue reference to aid operators and future implementation tracking (e.g., include an issue ID in the message).



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
rust/sedona-spatial-join/src/operand_evaluator.rs (1)

243-253: Don't clear rects; it breaks downstream indexing.

When the scalar distance evaluates to NULL, the new code empties result.rects. Later consumers (e.g. SpatialJoinBatchIterator::next_batch in stream.rs) index rects by probe row and now hit an out-of-bounds panic whenever distance is NULL, even though geometries still exist. Keep the vector length aligned with the geometry count—set each entry to None instead of clearing or leave it unchanged. For example:

-            ColumnarValue::Scalar(ScalarValue::Float64(None)) => {
-                // Distance expression evaluates to NULL, the resulting distance should be NULL as well.
-                result.rects.clear();
-            }
+            ColumnarValue::Scalar(ScalarValue::Float64(None)) => {
+                // Distance expression evaluates to NULL; distances are unknown but geometry count stays the same.
+                result.rects.iter_mut().for_each(|rect_opt| *rect_opt = None);
+            }
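
The length-preserving behaviour behind this suggestion can be shown in isolation. The following is a sketch with hypothetical names (`Rect` as a plain tuple, `invalidate_rects`, `rect_for_row`); it mirrors the `iter_mut` line from the diff above but is not the crate's code.

```rust
/// Hypothetical stand-in for a bounding rectangle: (min_x, min_y, max_x, max_y).
pub type Rect = (f32, f32, f32, f32);

/// Marks every rectangle as unknown while keeping one slot per geometry, so
/// `rects[row]` stays valid for every row.
pub fn invalidate_rects(rects: &mut [Option<Rect>]) {
    rects.iter_mut().for_each(|rect_opt| *rect_opt = None);
}

/// Row-indexed lookup as a downstream consumer might perform it.
/// Returns `None` for rows without a usable rectangle; panics if `rects`
/// is shorter than the number of rows, as it would be after `clear()`.
pub fn rect_for_row(rects: &[Option<Rect>], row: usize) -> Option<Rect> {
    rects[row]
}
```

After `invalidate_rects` the vector still has one entry per geometry, so a consumer sees `None` for each row instead of an out-of-bounds panic.
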
🧹 Nitpick comments (1)
python/sedonadb/python/sedonadb/dbapi.py (1)

41-41: Minor documentation change with unclear relevance to PR objectives.

The addition of _ = to explicitly ignore the return value of cur.execute() is valid Python convention but not strictly necessary here. The PR title mentions "Refactor spatial join code for easier implementing new features," yet this file only contains a trivial documentation update to a DBAPI example.

Consider whether this change is intended for this PR, or if it was included accidentally.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6864d64 and 8c3535f.

📒 Files selected for processing (18)
  • python/sedonadb/python/sedonadb/dbapi.py (1 hunks)
  • rust/sedona-spatial-join/src/build_index.rs (1 hunks)
  • rust/sedona-spatial-join/src/evaluated_batch.rs (1 hunks)
  • rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream.rs (1 hunks)
  • rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream/in_mem.rs (1 hunks)
  • rust/sedona-spatial-join/src/exec.rs (4 hunks)
  • rust/sedona-spatial-join/src/index.rs (1 hunks)
  • rust/sedona-spatial-join/src/index/build_side_collector.rs (1 hunks)
  • rust/sedona-spatial-join/src/index/knn_adapter.rs (1 hunks)
  • rust/sedona-spatial-join/src/index/spatial_index.rs (1 hunks)
  • rust/sedona-spatial-join/src/index/spatial_index_builder.rs (1 hunks)
  • rust/sedona-spatial-join/src/lib.rs (1 hunks)
  • rust/sedona-spatial-join/src/operand_evaluator.rs (4 hunks)
  • rust/sedona-spatial-join/src/refine/geos.rs (3 hunks)
  • rust/sedona-spatial-join/src/refine/tg.rs (1 hunks)
  • rust/sedona-spatial-join/src/stream.rs (10 hunks)
  • rust/sedona-spatial-join/src/utils.rs (1 hunks)
  • rust/sedona-spatial-join/src/utils/join_utils.rs (1 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-10-28T20:11:56.888Z
Learnt from: martin-augment
Repo: martin-augment/sedona-db PR: 5
File: rust/sedona-functions/src/st_start_point.rs:115-123
Timestamp: 2025-10-28T20:11:56.888Z
Learning: In Rust Arrow projects, `BinaryBuilder` (which is a type alias for `GenericByteBuilder<GenericBinaryType<i32>>`) implements the `std::io::Write` trait. The correct pattern for building WKB binary data is: (1) write bytes to the builder using the Write trait methods, (2) call `builder.append_value([])` to finalize the current value. The empty slice signals completion of the buffered value, not replacement. This pattern is used throughout sedona-functions.

Applied to files:

  • rust/sedona-spatial-join/src/operand_evaluator.rs
📚 Learning: 2025-10-27T06:50:54.305Z
Learnt from: martin-augment
Repo: martin-augment/sedona-db PR: 2
File: c/sedona-geos/src/st_buffer.rs:243-252
Timestamp: 2025-10-27T06:50:54.305Z
Learning: In the ST_Buffer implementation in c/sedona-geos/src/st_buffer.rs, the `parse_side()` function should return an enum (e.g., BufferSide with variants Both, Left, Right) instead of a boolean, to properly distinguish between left and right side buffer operations. The distance sign should be negated for right-side buffers since GEOS uses positive distance for left and negative for right.

Applied to files:

  • rust/sedona-spatial-join/src/refine/geos.rs
🧬 Code graph analysis (15)
rust/sedona-spatial-join/src/build_index.rs (4)
rust/sedona-spatial-join/src/exec.rs (5)
  • metrics (417-419)
  • join_type (199-201)
  • session_config (432-435)
  • session_config (516-519)
  • session_config (708-711)
rust/sedona-spatial-join/src/operand_evaluator.rs (4)
  • create_operand_evaluator (74-89)
  • new (182-187)
  • new (198-203)
  • new (380-382)
rust/sedona-spatial-join/src/index/build_side_collector.rs (2)
  • new (72-81)
  • new (85-87)
rust/sedona-spatial-join/src/index/spatial_index_builder.rs (2)
  • new (92-97)
  • new (102-126)
rust/sedona-spatial-join/src/index/build_side_collector.rs (2)
rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream/in_mem.rs (1)
  • new (33-37)
rust/sedona-spatial-join/src/evaluated_batch.rs (1)
  • in_mem_size (37-43)
rust/sedona-spatial-join/src/operand_evaluator.rs (1)
rust/sedona-spatial-join/src/evaluated_batch.rs (1)
  • wkb (49-52)
rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream.rs (1)
rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream/in_mem.rs (1)
  • is_external (41-43)
rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream/in_mem.rs (1)
rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream.rs (1)
  • is_external (29-29)
rust/sedona-spatial-join/src/index/spatial_index_builder.rs (7)
rust/sedona-spatial-join/src/operand_evaluator.rs (8)
  • array (255-255)
  • array (290-290)
  • create_operand_evaluator (74-89)
  • new (182-187)
  • new (198-203)
  • new (380-382)
  • in_mem_size (157-171)
  • try_new (108-146)
rust/sedona-spatial-join/src/index/spatial_index.rs (5)
  • RTreeBuilder (119-119)
  • refine (393-443)
  • schema (136-138)
  • empty (101-134)
  • visited_left_side (463-465)
rust/sedona-spatial-join/src/refine.rs (2)
  • refine (56-60)
  • create_refiner (112-142)
rust/sedona-spatial-join/src/utils/join_utils.rs (1)
  • need_produce_result_in_final (44-53)
rust/sedona-spatial-join/src/index/knn_adapter.rs (2)
  • new (40-62)
  • new (89-99)
rust/sedona-spatial-join/src/evaluated_batch.rs (2)
  • in_mem_size (37-43)
  • rects (54-56)
rust/sedona-spatial-join/src/utils/concurrent_reservation.rs (1)
  • try_new (36-44)
rust/sedona-spatial-join/src/lib.rs (6)
rust/sedona-spatial-join/src/build_index.rs (1)
  • build_index (36-89)
rust/sedona-spatial-join/src/index/spatial_index.rs (1)
  • refine (393-443)
rust/sedona-spatial-join/src/refine/geos.rs (1)
  • refine (330-346)
rust/sedona-spatial-join/src/refine/tg.rs (1)
  • refine (248-268)
rust/sedona-spatial-join/src/refine.rs (1)
  • refine (56-60)
rust/sedona-spatial-join/src/refine/geo.rs (1)
  • refine (154-175)
rust/sedona-spatial-join/src/index/knn_adapter.rs (3)
rust/sedona-spatial-join/src/evaluated_batch.rs (1)
  • wkb (49-52)
rust/sedona-geo/src/to_geo.rs (1)
  • item_to_geometry (58-69)
rust/sedona-spatial-join/src/index/spatial_index_builder.rs (4)
  • new (92-97)
  • new (102-126)
  • indexed_batches (148-152)
  • indexed_batches (243-247)
rust/sedona-spatial-join/src/index/spatial_index.rs (7)
rust/sedona-spatial-join/src/evaluated_batch.rs (2)
  • distance (58-60)
  • wkb (49-52)
rust/sedona-geo/src/to_geo.rs (1)
  • item_to_geometry (58-69)
rust/sedona-spatial-join/src/refine.rs (3)
  • refine (56-60)
  • need_more_probe_stats (81-81)
  • merge_probe_stats (87-87)
rust/sedona-spatial-join/src/refine/geo.rs (4)
  • refine (154-175)
  • new (78-109)
  • need_more_probe_stats (189-191)
  • merge_probe_stats (193-197)
rust/sedona-spatial-join/src/utils/concurrent_reservation.rs (1)
  • try_new (36-44)
rust/sedona-spatial-join/src/index/knn_adapter.rs (2)
  • new (40-62)
  • new (89-99)
rust/sedona-spatial-join/src/index/spatial_index_builder.rs (4)
  • new (92-97)
  • new (102-126)
  • indexed_batches (148-152)
  • indexed_batches (243-247)
rust/sedona-spatial-join/src/evaluated_batch.rs (1)
rust/sedona-spatial-join/src/operand_evaluator.rs (1)
  • wkbs (149-155)
rust/sedona-spatial-join/src/exec.rs (2)
rust/sedona-spatial-join/src/build_index.rs (1)
  • build_index (36-89)
rust/sedona-spatial-join/src/utils/join_utils.rs (2)
  • asymmetric_join_output_partitioning (431-452)
  • boundedness_from_children (457-487)
rust/sedona-spatial-join/src/stream.rs (3)
rust/sedona-spatial-join/src/operand_evaluator.rs (6)
  • create_operand_evaluator (74-89)
  • distance_value_at (283-306)
  • new (182-187)
  • new (198-203)
  • new (380-382)
  • wkbs (149-155)
rust/sedona-spatial-join/src/utils/join_utils.rs (2)
  • apply_join_filter_to_indices (95-128)
  • build_batch_from_indices (132-185)
rust/sedona-spatial-join/src/evaluated_batch.rs (2)
  • rects (54-56)
  • distance (58-60)
rust/sedona-spatial-join/src/utils/join_utils.rs (2)
rust/sedona-spatial-join/src/exec.rs (5)
  • join_type (199-201)
  • filter (57-57)
  • try_new_with_options (157-196)
  • try_new (145-154)
  • children (393-395)
rust/sedona-spatial-join/src/stream.rs (5)
  • schema (442-444)
  • new (87-119)
  • new (146-160)
  • new (494-508)
  • new (849-870)
rust/sedona-spatial-join/src/index.rs (1)
rust/sedona-spatial-join/src/evaluated_batch.rs (1)
  • wkb (49-52)
rust/sedona-spatial-join/src/refine/geos.rs (1)
rust/sedona-spatial-join/src/evaluated_batch.rs (1)
  • wkb (49-52)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: codex
  • GitHub Check: claude-review
🔇 Additional comments (1)
rust/sedona-spatial-join/src/evaluated_batch/evaluated_batch_stream/in_mem.rs (1)

33-55: Nice lightweight stream adapter.

Compact synchronous iterator bridge keeps the evaluated batch stream path simple and avoids unnecessary wakeups.

Comment on lines 551 to +560
let probe_thread_count = self.right.output_partitioning().partition_count();

Ok(build_index(
Arc::clone(&context),
build_side.schema(),
build_streams,
self.on.clone(),
sedona_options.spatial_join.clone(),
build_metrics,
Arc::clone(context.memory_pool()),
self.join_type,
probe_thread_count,
self.metrics.clone(),


⚠️ Potential issue | 🔴 Critical

Fix probe partition count after KNN swap

Line 552 computes probe_thread_count from self.right, but in KNN mode probe_plan can be the left side (when probe_side == JoinSide::Left). In that case the left plan may expose more partitions than the right plan, and downstream components index into probe-partition metrics using the actual probe partition ID. Passing the smaller right-side partition count therefore risks out-of-bounds panics or silently under-provisioned probe metrics when the optimizer decides to probe from the left. Please derive the count from probe_plan instead.

Apply this diff:

-                    let probe_thread_count = self.right.output_partitioning().partition_count();
+                    let probe_thread_count = probe_plan.output_partitioning().partition_count();
🤖 Prompt for AI Agents
In rust/sedona-spatial-join/src/exec.rs around lines 551 to 560,
probe_thread_count is computed from self.right which is incorrect in KNN mode
when probe_plan may be the left side; change the code to derive the partition
count from probe_plan (the actual probe side) instead of self.right so
probe_thread_count = probe_plan.output_partitioning().partition_count(), and
pass that value into build_index to avoid out-of-bounds or under-provisioned
probe metrics.
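
The concern above can be illustrated with a small self-contained sketch. `Plan`, `probe_partition_count`, and `new_probe_metrics` are hypothetical simplifications (a real plan exposes its count through `output_partitioning().partition_count()`), and the metrics are reduced to one counter per partition.

```rust
/// Which input of the join acts as the probe side.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum JoinSide {
    Left,
    Right,
}

/// Hypothetical stand-in for an execution plan: only its partition count matters here.
pub struct Plan {
    pub partition_count: usize,
}

/// Picks the partition count of whichever plan is actually probed.
pub fn probe_partition_count(left: &Plan, right: &Plan, probe_side: JoinSide) -> usize {
    match probe_side {
        JoinSide::Left => left.partition_count,
        JoinSide::Right => right.partition_count,
    }
}

/// Allocates one metrics slot per probe partition (a plain counter here).
pub fn new_probe_metrics(left: &Plan, right: &Plan, probe_side: JoinSide) -> Vec<u64> {
    vec![0; probe_partition_count(left, right, probe_side)]
}
```

With a left plan of 8 partitions and a right plan of 2, sizing the metrics from the right side while probing the left would leave partition IDs 2 through 7 without a slot; deriving the count from the probe side avoids that.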

Comment on lines +117 to +127
.expression()
.evaluate(&intermediate_batch)?
.into_array(intermediate_batch.num_rows())?;
let mask = as_boolean_array(&filter_result)?;

let left_filtered = compute::filter(&build_indices, mask)?;
let right_filtered = compute::filter(&probe_indices, mask)?;
Ok((
downcast_array(left_filtered.as_ref()),
downcast_array(right_filtered.as_ref()),
))


⚠️ Potential issue | 🔴 Critical

Return owned filtered arrays

compute::filter gives you an ArrayRef, and downcast_array only hands back a borrowed view. Returning those borrows from the function doesn’t compile (and would dangle if it did). Clone the downcast arrays before returning so you hand back owned PrimitiveArrays.

     let left_filtered = compute::filter(&build_indices, mask)?;
     let right_filtered = compute::filter(&probe_indices, mask)?;
     Ok((
-        downcast_array(left_filtered.as_ref()),
-        downcast_array(right_filtered.as_ref()),
-    ))
+        downcast_array::<UInt64Array>(left_filtered.as_ref()).clone(),
+        downcast_array::<UInt32Array>(right_filtered.as_ref()).clone(),
+    ))
🤖 Prompt for AI Agents
In rust/sedona-spatial-join/src/utils/join_utils.rs around lines 117 to 127, the
code uses compute::filter to get ArrayRef then calls downcast_array which
returns a borrowed view; you must return owned PrimitiveArray instances instead
of borrows. Fix by cloning or converting the downcasted arrays to owned arrays
before returning (e.g., call .clone() or .to_owned() on the downcast result) so
the function returns owned PrimitiveArray values rather than borrowed
references.

@github-actions

github-actions bot commented Nov 7, 2025

Findings

  • blocking – rust/sedona-spatial-join/src/index/spatial_index_builder.rs:300: each BuildPartition batch already grows the builder’s reservation inside add_batch, but after freeing the partition reservation you call self.reservation.try_grow(mem_bytes) again. That second grow double-counts the same memory, so a join that fit before can now hit the pool limit and fail with ResourcesExhausted. Please drop that extra grow (or actually transfer the reservation instead of duplicating it).
  • blocking – rust/sedona-functions/src/lib.rs:21 and rust/sedona-functions/src/st_dump.rs (deleted): this refactor removes the ST_Dump UDF altogether, and its tests/benchmarks in python/sedonadb/tests/functions/test_functions.py:634 vanish too. Unless the function was intentionally retired elsewhere, this is a backwards-incompatible regression that breaks users who call ST_Dump.
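
The double-counting described in the first finding can be reproduced with a toy pool. This is a simplified sketch, assuming a single fixed-limit pool and ignoring the separate partition reservation; `Pool`, `add_batch_once`, and `add_batch_double_counted` are hypothetical names and do not reflect the actual reservation API.

```rust
/// Hypothetical fixed-size memory pool tracking reserved bytes.
pub struct Pool {
    pub limit: usize,
    pub used: usize,
}

impl Pool {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: 0 }
    }

    /// Reserves `bytes`, failing if the pool limit would be exceeded.
    pub fn try_grow(&mut self, bytes: usize) -> Result<(), String> {
        if self.used + bytes > self.limit {
            return Err(format!(
                "cannot reserve {} bytes: {} of {} already in use",
                bytes, self.used, self.limit
            ));
        }
        self.used += bytes;
        Ok(())
    }
}

/// Accounts for a batch exactly once.
pub fn add_batch_once(pool: &mut Pool, bytes: usize) -> Result<(), String> {
    pool.try_grow(bytes)
}

/// Accounts for a batch when it is added, then grows again for the same bytes.
pub fn add_batch_double_counted(pool: &mut Pool, bytes: usize) -> Result<(), String> {
    pool.try_grow(bytes)?; // grown when the batch is added
    pool.try_grow(bytes) // grown a second time for the same memory
}
```

With a 100-byte limit, a 60-byte batch fits when counted once but fails when the same bytes are reserved twice, which is the kind of spurious ResourcesExhausted error the finding warns about.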

@martin-augment
Owner Author

The changes in this PR are too complex for me and I cannot say whether the suggestions are good or not.
