@SemyonSinchenko
Collaborator

Add basic tests that assert that functionality works at least.

@SemyonSinchenko SemyonSinchenko self-assigned this Jul 17, 2025
@SemyonSinchenko SemyonSinchenko requested a review from MrPowers July 17, 2025 05:01
@SemyonSinchenko
Collaborator Author

@mrpowers-wb cc

@SemyonSinchenko SemyonSinchenko linked an issue Jul 17, 2025 that may be closed by this pull request
@SemyonSinchenko
Collaborator Author

@zhuqi-lucas unfortunately I don't think anyone will review it fast enough (as usual here), so I can actually bypass review, merge, and release it. Sorry, it was my fault (tbh, I didn't expect Arrow to be SO strict about schemas that it cannot merge non-null values into a nullable column; like, why not? Non-null < nullable to me). Of course I should have added more tests after adding joins, but I didn't. Now the schemas are synced between lib.rs and the Python CLI wrapper, and it should work. At least I hope so. If not, I will try to fix it ASAP, just ping me.

@zhuqi-lucas

zhuqi-lucas commented Jul 17, 2025


Thank you @SemyonSinchenko for the quick response and fix!

And I agree, Arrow should support merging non-null values into a nullable column; it may also be a bug on the pyarrow side.
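
For illustration only (this is not code from the PR or from falsa): a minimal, standard-library sketch of the kind of strict schema check discussed above, where a difference in nullability alone counts as a mismatch. It parses the textual form seen in the log output in this thread (`id1: int64 not null`). The names `Field`, `parse_schema`, and `schema_mismatches` are hypothetical, and the "batch" schema with a nullable `id4` is an invented example, not the actual schema that caused the failure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    # Hypothetical record type for this sketch; not a pyarrow class.
    name: str
    dtype: str
    nullable: bool


def parse_schema(text: str) -> list[Field]:
    """Parse lines such as 'id1: int64 not null' into Field records."""
    fields = []
    for line in text.strip().splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, _, rest = line.strip().partition(":")
        rest = rest.strip()
        # A field is treated as nullable unless the line ends with "not null".
        nullable = not rest.endswith("not null")
        dtype = rest.removesuffix("not null").strip()
        fields.append(Field(name.strip(), dtype, nullable))
    return fields


def schema_mismatches(expected: list[Field], actual: list[Field]) -> list[str]:
    """Return readable differences; an empty list means an exact match."""
    problems = []
    if len(expected) != len(actual):
        problems.append(f"field count differs: {len(expected)} != {len(actual)}")
    for exp, act in zip(expected, actual):
        # Strict comparison: name, type, and nullability must all agree.
        if exp != act:
            problems.append(f"{exp.name}: expected {exp}, got {act}")
    return problems


# Schema printed for the SMALL dataset in this thread.
writer_schema = parse_schema("""
id1: int64 not null
id4: string not null
v2: double not null
""")

# Assumed, illustrative schema in which id4 is nullable.
batch_schema = parse_schema("""
id1: int64 not null
id4: string
v2: double not null
""")

for problem in schema_mismatches(writer_schema, batch_schema):
    print(problem)
```

A check along these lines, run in a basic test against both schema definitions, could flag a nullability drift before any data is written.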

@SemyonSinchenko
Collaborator Author

I will merge it and make a release, ping me if it doesn't work!

@SemyonSinchenko SemyonSinchenko merged commit e9fdd04 into mrpowers-io:main Jul 17, 2025
39 checks passed
@zhuqi-lucas


Thank you @SemyonSinchenko, I will try it after the release.

@zhuqi-lucas

It works:

./bench.sh data h2o_small_join_parquet
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: h2o_small_join_parquet
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Found Python version 3.13, which is suitable.
Using Python command: /opt/homebrew/bin/python3
Installing falsa...
Generating h2o test data in /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o with size=SMALL and format=PARQUET
10 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e1_0.parquet

10000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e4_0.parquet

10000000 rows will be saved into: /Users/zhuqi/arrow-datafusion/benchmarks/data/h2o/J1_1e7_1e7_NA.parquet

An SMALL data schema is the following:
id1: int64 not null
id4: string not null
v2: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

An MEDIUM data schema is the following:
id1: int64 not null
id2: int64 not null
id4: string not null
id5: string not null
v2: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

An BIG data schema is the following:
id1: int64 not null
id2: int64 not null
id3: int64 not null
id4: string not null
id5: string not null
id6: string not null
v2: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:02

An LSH data schema is the following:
id1: int64 not null
id2: int64 not null
id3: int64 not null
id4: string not null
id5: string not null
id6: string not null
v1: double not null

An output format is PARQUET

Batch mode is supported.
In case of memory problems you can try to reduce a batch_size.


Working... ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100% 0:00:00

Development

Successfully merging this pull request may close these issues.

Generate parquet dataset failed for join

2 participants