feat: fix schema mismatch between native and python #28

SemyonSinchenko · 2025-07-17T05:01:35Z

Add basic tests that assert that functionality works at least.

SemyonSinchenko · 2025-07-17T05:02:42Z

SemyonSinchenko · 2025-07-17T05:09:14Z

@zhuqi-lucas unfortunately I don't think anyone will review it fast enough (as usual here), so I can actually bypass review, merge, and release it. Sorry, it was my fault (tbh, I didn't expect Arrow to be SO strict about schemas that it cannot merge non-null values into a nullable column; like, why not? non-null < nullable for me). Of course I should have added more tests after adding joins, but I didn't. Now the schemas are synced between lib.rs and the Python CLI wrapper, and it should work. At least I hope so. If not, I will try to fix it ASAP, just ping me.
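To illustrate the "non-null < nullable" point, here is a minimal pure-Python sketch of the two comparison rules. It is not Arrow's or pyarrow's API: the `Field` class and both helper functions are hypothetical, the field names are borrowed from the SMALL join schema, and which side declared the columns nullable is an assumption made only for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field (not a pyarrow type)."""
    name: str
    dtype: str
    nullable: bool


def strictly_equal(a, b):
    """Strict rule: names, types, and nullability must all match exactly."""
    return a == b


def widenable(src, dst):
    """Relaxed rule: src fits dst when names and types match and
    dst is at least as permissive about nulls as src."""
    if len(src) != len(dst):
        return False
    return all(
        s.name == d.name and s.dtype == d.dtype and (d.nullable or not s.nullable)
        for s, d in zip(src, dst)
    )


# Assumption: one side declares the columns non-null, the other nullable.
native = [
    Field("id1", "int64", False),
    Field("id4", "string", False),
    Field("v2", "double", False),
]
wrapper = [Field(f.name, f.dtype, True) for f in native]

print(strictly_equal(native, wrapper))  # False: nullability differs
print(widenable(native, wrapper))       # True: non-null data fits a nullable column
print(widenable(wrapper, native))       # False: nullable data may contain nulls
```

Under the strict rule the only fix is to declare identical schemas on both sides, which is what syncing lib.rs and the Python wrapper does.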

zhuqi-lucas · 2025-07-17T05:19:19Z

Thank you @SemyonSinchenko for quick response and fix!

And I agree, Arrow should support merging non-null values into a nullable column; it may also be a bug on the pyarrow side.

SemyonSinchenko · 2025-07-17T05:22:55Z

I will merge it and make a release, ping me if it doesn't work!

zhuqi-lucas · 2025-07-17T05:27:14Z

Thank you @SemyonSinchenko, I will try it after the release.

zhuqi-lucas · 2025-07-17T05:32:02Z

It works:

./bench.sh data h2o_small_join_parquet
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: h2o_small_join_parquet
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Found Python version 3.13, which is suitable.
Using Python command: /opt/homebrew/bin/python3
Installing falsa...
Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=PARQUET
10 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e1_0.parquet

10000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e4_0.parquet

10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e7_NA.parquet

An SMALL data schema is the following:
id1: int64 not null
id4: string not null
v2: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

An MEDIUM data schema is the following:
id1: int64 not null
id2: int64 not null
id4: string not null
id5: string not null
v2: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

An BIG data schema is the following:
id1: int64 not null
id2: int64 not null
id3: int64 not null
id4: string not null
id5: string not null
id6: string not null
v2: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:02

An LSH data schema is the following:
id1: int64 not null
id2: int64 not null
id3: int64 not null
id4: string not null
id5: string not null
id6: string not null
v1: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Fix schema mismatch between native and python

9ce137e

Add basic tests that assert that functionality works at least.

SemyonSinchenko self-assigned this Jul 17, 2025

SemyonSinchenko requested a review from MrPowers July 17, 2025 05:01

SemyonSinchenko linked an issue Jul 17, 2025 that may be closed by this pull request

Generate parquet dataset failed for join #27

Closed

SemyonSinchenko merged commit e9fdd04 into mrpowers-io:main Jul 17, 2025
39 checks passed

zhuqi-lucas mentioned this pull request Jul 17, 2025

benchmark: Add parquet h2o support apache/datafusion#16804

Merged

SemyonSinchenko deleted the 27-join-parquet-bug branch July 17, 2025 08:26
