
Optimize REPL read |> count() to use metadata for Parquet/ORC #34

Merged
aisrael merged 2 commits into main from fix/repl-count-use-metadata on Mar 15, 2026


@aisrael aisrael commented Mar 15, 2026

When the REPL pipeline is exactly read(path) |> count(), it is now optimized to a single count(path) stage. For Parquet and ORC this uses file metadata only (no row data is read); for Avro and CSV it still streams batches.
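
The split between the two counting paths can be illustrated with a minimal sketch. The names `CountStrategy` and `count_strategy_for` are hypothetical, and inferring the format from the file extension is an assumption; this PR description does not say how the format is detected.

```rust
use std::path::Path;

/// How a row count can be obtained for a file (hypothetical type, not from the PR).
#[derive(Debug, PartialEq)]
enum CountStrategy {
    /// The row count comes from file metadata; no row data is read.
    Metadata,
    /// The row count requires streaming record batches.
    StreamBatches,
}

/// Picks a strategy from the file extension.
/// Assumption: the format is inferred from the extension, case-insensitively.
fn count_strategy_for(path: &str) -> Option<CountStrategy> {
    let ext = Path::new(path).extension()?.to_str()?.to_ascii_lowercase();
    match ext.as_str() {
        "parquet" | "orc" => Some(CountStrategy::Metadata),
        "avro" | "csv" => Some(CountStrategy::StreamBatches),
        // Unknown or unsupported formats are left to the caller.
        _ => None,
    }
}
```

Under these assumptions, `count_strategy_for("events.parquet")` yields `Some(CountStrategy::Metadata)`, while `count_strategy_for("events.csv")` yields `Some(CountStrategy::StreamBatches)`.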

Key changes

  • optimize_read_then_count() in plan_pipeline_with_state: replaces [Read { path }, Count { path: None }] with [Count { path: Some(path) }].
  • read(path) |> select(...) |> count() is unchanged (still three stages); only the two-stage case is optimized.
  • Tests: test_plan_pipeline_count_no_auto_print updated to expect the optimized pipeline; test_plan_pipeline_read_select_count_not_optimized added to ensure a pipeline with select between read and count is not optimized.
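
The rewrite described in the key changes can be sketched roughly as follows. This is not the code from the PR: the `Stage` enum is a simplified, hypothetical stand-in for the planner's real stage type, the `Select` fields are assumed, and the function operates on a plain `Vec` rather than inside plan_pipeline_with_state.

```rust
/// Simplified pipeline stage (hypothetical; the real type likely has more variants and fields).
#[derive(Debug, Clone, PartialEq)]
enum Stage {
    Read { path: String },
    Select { columns: Vec<String> },
    Count { path: Option<String> },
}

/// Collapses exactly `[Read { path }, Count { path: None }]` into
/// `[Count { path: Some(path) }]`. Any other pipeline is returned unchanged.
fn optimize_read_then_count(stages: Vec<Stage>) -> Vec<Stage> {
    // The slice pattern matches only a two-stage pipeline, so a stage
    // between read and count (such as select) prevents the rewrite.
    if let [Stage::Read { path }, Stage::Count { path: None }] = stages.as_slice() {
        return vec![Stage::Count { path: Some(path.clone()) }];
    }
    stages
}
```

Matching on the whole slice, rather than scanning for an adjacent Read/Count pair, keeps the optimization limited to the two-stage case, which mirrors the behavior the two tests check.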

Made with Cursor

@aisrael aisrael force-pushed the fix/repl-count-use-metadata branch from 71d3b22 to 2e1a053 March 15, 2026 03:43
@aisrael aisrael merged commit 55de46c into main Mar 15, 2026
4 checks passed