
Commit 4cf2221

Merge branch 'apache:main' into copy_to_preserve_ordering
2 parents 9432ce8 + c4b9995 commit 4cf2221

55 files changed

Lines changed: 2534 additions & 524 deletions


.github/workflows/rust.yml

Lines changed: 11 additions & 19 deletions
@@ -218,6 +218,8 @@ jobs:
         run: cargo check --profile ci --no-default-features -p datafusion --features=string_expressions
       - name: Check datafusion (unicode_expressions)
         run: cargo check --profile ci --no-default-features -p datafusion --features=unicode_expressions
+      - name: Check parquet encryption (parquet_encryption)
+        run: cargo check --profile ci --no-default-features -p datafusion --features=parquet_encryption
 
       # Check datafusion-functions crate features
       #
@@ -312,18 +314,6 @@ jobs:
           fetch-depth: 1
       - name: Setup Rust toolchain
         run: rustup toolchain install stable
-      - name: Setup Minio - S3-compatible storage
-        run: |
-          docker run -d --name minio-container \
-            -p 9000:9000 \
-            -e MINIO_ROOT_USER=TEST-DataFusionLogin -e MINIO_ROOT_PASSWORD=TEST-DataFusionPassword \
-            -v $(pwd)/datafusion/core/tests/data:/source quay.io/minio/minio \
-            server /data
-          docker exec minio-container /bin/sh -c "\
-            mc ready local
-            mc alias set localminio http://localhost:9000 TEST-DataFusionLogin TEST-DataFusionPassword && \
-            mc mb localminio/data && \
-            mc cp -r /source/* localminio/data"
       - name: Run tests (excluding doctests)
         env:
           RUST_BACKTRACE: 1
@@ -335,9 +325,6 @@ jobs:
         run: cargo test --profile ci -p datafusion-cli --lib --tests --bins
       - name: Verify Working Directory Clean
         run: git diff --exit-code
-      - name: Minio Output
-        if: ${{ !cancelled() }}
-        run: docker logs minio-container
 
 
   linux-test-example:
@@ -769,10 +756,15 @@ jobs:
   # `rust-version` key of `Cargo.toml`.
   #
   # To reproduce:
-  # 1. Install the version of Rust that is failing. Example:
-  #    rustup install 1.80.1
-  # 2. Run the command that failed with that version. Example:
-  #    cargo +1.80.1 check -p datafusion
+  # 1. Install the version of Rust that is failing.
+  # 2. Run the command that failed with that version.
+  #
+  # Example:
+  #   # MSRV looks like "1.80.0" and is specified in Cargo.toml. We can read the value with the following command:
+  #   msrv="$(cargo metadata --format-version=1 | jq '.packages[] | select( .name == "datafusion" ) | .rust_version' -r)"
+  #   echo "MSRV: ${msrv}"
+  #   rustup install "${msrv}"
+  #   cargo "+${msrv}" check
   #
   # To resolve, either:
   # 1. Change your code to use older Rust features,
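The workflow comment above reads the MSRV with `cargo metadata` and `jq`. As an illustration of the same extraction that runs without a Rust toolchain, here is a hedged sketch that parses a mock `Cargo.toml` with `sed` (the file contents below are hypothetical, not taken from the repository):

```shell
# Sketch: extract an MSRV string like the workflow comment does, but from a
# mock Cargo.toml using sed so no cargo/jq is required.
cargo_toml="$(mktemp)"
cat > "${cargo_toml}" <<'EOF'
[package]
name = "datafusion"
rust-version = "1.80.0"
EOF

# Pull the value of the `rust-version` key.
msrv="$(sed -n 's/^rust-version *= *"\(.*\)"/\1/p' "${cargo_toml}")"
echo "MSRV: ${msrv}"   # prints: MSRV: 1.80.0
rm -f "${cargo_toml}"
```

In the real workflow the value would then feed `rustup install "${msrv}"` and `cargo "+${msrv}" check`, as the comment shows.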

Cargo.lock

Lines changed: 19 additions & 36 deletions
Some generated files are not rendered by default.

Cargo.toml

Lines changed: 3 additions & 0 deletions
@@ -150,6 +150,7 @@ env_logger = "0.11"
 futures = "0.3"
 half = { version = "2.6.0", default-features = false }
 hashbrown = { version = "0.14.5", features = ["raw"] }
+hex = { version = "0.4.3" }
 indexmap = "2.10.0"
 itertools = "0.14"
 log = "^0.4"
@@ -173,6 +174,8 @@ rstest = "0.25.0"
 serde_json = "1"
 sqlparser = { version = "0.55.0", default-features = false, features = ["std", "visitor"] }
 tempfile = "3"
+testcontainers = { version = "0.24", features = ["default"] }
+testcontainers-modules = { version = "0.12" }
 tokio = { version = "1.46", features = ["macros", "rt", "sync"] }
 url = "2.5.4"

README.md

Lines changed: 2 additions & 0 deletions
@@ -120,6 +120,7 @@ Default features:
 - `datetime_expressions`: date and time functions such as `to_timestamp`
 - `encoding_expressions`: `encode` and `decode` functions
 - `parquet`: support for reading the [Apache Parquet] format
+- `parquet_encryption`: support for using [Parquet Modular Encryption]
 - `regex_expressions`: regular expression functions, such as `regexp_match`
 - `unicode_expressions`: Include unicode aware functions such as `character_length`
 - `unparser`: enables support to reverse LogicalPlans back into SQL
@@ -134,6 +135,7 @@ Optional features:
 
 [apache avro]: https://avro.apache.org/
 [apache parquet]: https://parquet.apache.org/
+[parquet modular encryption]: https://parquet.apache.org/docs/file-format/data-pages/encryption/
 
 ## DataFusion API Evolution and Deprecation Guidelines

benchmarks/bench.sh

Lines changed: 20 additions & 2 deletions
@@ -95,7 +95,8 @@ external_aggr: External aggregation benchmark on TPC-H dataset (SF=1)
 
 # ClickBench Benchmarks
 clickbench_1: ClickBench queries against a single parquet file
-clickbench_partitioned: ClickBench queries against a partitioned (100 files) parquet
+clickbench_partitioned: ClickBench queries against partitioned (100 files) parquet
+clickbench_pushdown: ClickBench queries against partitioned (100 files) parquet w/ filter_pushdown enabled
 clickbench_extended: ClickBench \"inspired\" queries against a single parquet (DataFusion specific)
 
 # H2O.ai Benchmarks (Group By, Join, Window)
@@ -207,6 +208,9 @@ main() {
         clickbench_partitioned)
             data_clickbench_partitioned
             ;;
+        clickbench_pushdown)
+            data_clickbench_partitioned # same data as clickbench_partitioned
+            ;;
         clickbench_extended)
             data_clickbench_1
             ;;
@@ -303,6 +307,7 @@ main() {
         run_cancellation
         run_clickbench_1
         run_clickbench_partitioned
+        run_clickbench_pushdown
         run_clickbench_extended
         run_h2o "SMALL" "PARQUET" "groupby"
         run_h2o "MEDIUM" "PARQUET" "groupby"
@@ -340,6 +345,9 @@ main() {
         clickbench_partitioned)
             run_clickbench_partitioned
             ;;
+        clickbench_pushdown)
+            run_clickbench_pushdown
+            ;;
         clickbench_extended)
             run_clickbench_extended
             ;;
@@ -572,14 +580,24 @@ run_clickbench_1() {
     debug_run $CARGO_COMMAND --bin dfbench -- clickbench --iterations 5 --path "${DATA_DIR}/hits.parquet" --queries-path "${SCRIPT_DIR}/queries/clickbench/queries" -o "${RESULTS_FILE}" ${QUERY_ARG}
 }
 
-# Runs the clickbench benchmark with the partitioned parquet files
+# Runs the clickbench benchmark with the partitioned parquet dataset (100 files)
 run_clickbench_partitioned() {
     RESULTS_FILE="${RESULTS_DIR}/clickbench_partitioned.json"
     echo "RESULTS_FILE: ${RESULTS_FILE}"
     echo "Running clickbench (partitioned, 100 files) benchmark..."
     debug_run $CARGO_COMMAND --bin dfbench -- clickbench --iterations 5 --path "${DATA_DIR}/hits_partitioned" --queries-path "${SCRIPT_DIR}/queries/clickbench/queries" -o "${RESULTS_FILE}" ${QUERY_ARG}
 }
 
+# Runs the clickbench benchmark with the partitioned parquet files and filter_pushdown enabled
+run_clickbench_pushdown() {
+    RESULTS_FILE="${RESULTS_DIR}/clickbench_pushdown.json"
+    echo "RESULTS_FILE: ${RESULTS_FILE}"
+    echo "Running clickbench (partitioned, 100 files) benchmark with pushdown_filters=true, reorder_filters=true..."
+    debug_run $CARGO_COMMAND --bin dfbench -- clickbench --pushdown --iterations 5 --path "${DATA_DIR}/hits_partitioned" --queries-path "${SCRIPT_DIR}/queries/clickbench/queries" -o "${RESULTS_FILE}" ${QUERY_ARG}
+}
+
 # Runs the clickbench "extended" benchmark with a single large parquet file
 run_clickbench_extended() {
     RESULTS_FILE="${RESULTS_DIR}/clickbench_extended.json"
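The new `clickbench_pushdown` subcommand slots into the script's existing `case` dispatch, reusing the partitioned dataset but calling its own run function. A minimal standalone sketch of that pattern (the function bodies here are stand-ins, not the real data/run helpers):

```shell
# Sketch of bench.sh's dispatch pattern for the new benchmark name.
# The real script calls data_clickbench_partitioned and run_clickbench_pushdown;
# these echo bodies are stand-ins.
data_clickbench_partitioned() { echo "data: hits_partitioned"; }
run_clickbench_pushdown()     { echo "run: clickbench --pushdown"; }

benchmark="clickbench_pushdown"
case "${benchmark}" in
    clickbench_pushdown)
        data_clickbench_partitioned # same data as clickbench_partitioned
        run_clickbench_pushdown
        ;;
    *)
        echo "unknown benchmark: ${benchmark}" >&2
        ;;
esac
```

Sharing the data step while forking only the run step keeps the two benchmarks directly comparable on identical input files.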

benchmarks/src/clickbench.rs

Lines changed: 15 additions & 1 deletion
@@ -29,7 +29,7 @@ use datafusion_common::exec_datafusion_err;
 use datafusion_common::instant::Instant;
 use structopt::StructOpt;
 
-/// Run the clickbench benchmark
+/// Driver program to run the ClickBench benchmark
 ///
 /// The ClickBench[1] benchmarks are widely cited in the industry and
 /// focus on grouping / aggregation / filtering. This runner uses the
@@ -44,6 +44,14 @@ pub struct RunOpt {
     #[structopt(short, long)]
     query: Option<usize>,
 
+    /// If specified, enables Parquet Filter Pushdown.
+    ///
+    /// Specifically, it enables:
+    /// * `pushdown_filters = true`
+    /// * `reorder_filters = true`
+    #[structopt(long = "pushdown")]
+    pushdown: bool,
+
     /// Common options
     #[structopt(flatten)]
     common: CommonOpt,
@@ -122,6 +130,12 @@ impl RunOpt {
             // The hits_partitioned dataset specifies string columns
             // as binary due to how it was written. Force it to strings
             parquet_options.binary_as_string = true;
+
+            // Turn on Parquet filter pushdown if requested
+            if self.pushdown {
+                parquet_options.pushdown_filters = true;
+                parquet_options.reorder_filters = true;
+            }
         }
 
         let rt_builder = self.common.runtime_env_builder()?;
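The two options the `--pushdown` flag toggles also exist as DataFusion session settings under the `datafusion.execution.parquet` configuration namespace. Assuming those standard key names, the equivalent SQL `SET` statements (as one might issue in `datafusion-cli`) can be generated like so; this is illustrative only, not part of the benchmark runner:

```shell
# Emit the SQL SET statements equivalent to what --pushdown enables,
# using DataFusion's configuration key naming (illustrative sketch).
for opt in pushdown_filters reorder_filters; do
    echo "SET datafusion.execution.parquet.${opt} = true;"
done
```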

datafusion-cli/CONTRIBUTING.md

Lines changed: 14 additions & 35 deletions
@@ -29,47 +29,26 @@ cargo test
 
 ## Running Storage Integration Tests
 
-By default, storage integration tests are not run. To run them you will need to set `TEST_STORAGE_INTEGRATION=1` and
-then provide the necessary configuration for that object store.
+By default, storage integration tests are not run. These tests use the `testcontainers` crate to start up a local MinIO server using Docker on port 9000.
 
-For some of the tests, [snapshots](https://datafusion.apache.org/contributor-guide/testing.html#snapshot-testing) are used.
-
-### AWS
-
-To test the S3 integration against [Minio](https://github.com/minio/minio)
-
-First start up a container with Minio and load test files.
+To run them you will need to set `TEST_STORAGE_INTEGRATION`:
 
 ```shell
-docker run -d \
-  --name datafusion-test-minio \
-  -p 9000:9000 \
-  -e MINIO_ROOT_USER=TEST-DataFusionLogin \
-  -e MINIO_ROOT_PASSWORD=TEST-DataFusionPassword \
-  -v $(pwd)/../datafusion/core/tests/data:/source \
-  quay.io/minio/minio server /data
-
-docker exec datafusion-test-minio /bin/sh -c "\
-  mc ready local
-  mc alias set localminio http://localhost:9000 TEST-DataFusionLogin TEST-DataFusionPassword && \
-  mc mb localminio/data && \
-  mc cp -r /source/* localminio/data"
+TEST_STORAGE_INTEGRATION=1 cargo test
 ```
 
-Setup environment
+For some of the tests, [snapshots](https://datafusion.apache.org/contributor-guide/testing.html#snapshot-testing) are used.
 
-```shell
-export TEST_STORAGE_INTEGRATION=1
-export AWS_ACCESS_KEY_ID=TEST-DataFusionLogin
-export AWS_SECRET_ACCESS_KEY=TEST-DataFusionPassword
-export AWS_ENDPOINT=http://127.0.0.1:9000
-export AWS_ALLOW_HTTP=true
-```
+### AWS
 
-Note that `AWS_ENDPOINT` is set without slash at the end.
+S3 integration is tested against [Minio](https://github.com/minio/minio) with [TestContainers](https://github.com/testcontainers/testcontainers-rs).
+This requires Docker to be running on your machine and port 9000 to be free.
 
-Run tests
+If you see an error mentioning "failed to load IMDS session token", such as
 
-```shell
-cargo test
-```
+> ---- object_storage::tests::s3_object_store_builder_resolves_region_when_none_provided stdout ----
+> Error: ObjectStore(Generic { store: "S3", source: "Error getting credentials from provider: an error occurred while loading credentials: failed to load IMDS session token" })
+
+you may need to disable fetching S3 credentials from the environment by setting `AWS_EC2_METADATA_DISABLED`, for example:
+
+> $ AWS_EC2_METADATA_DISABLED=true TEST_STORAGE_INTEGRATION=1 cargo test
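The `TEST_STORAGE_INTEGRATION` variable acts as a simple opt-in gate. A minimal sketch of that gating behavior (stand-in echo bodies; the real tests launch MinIO via testcontainers):

```shell
# Sketch of the TEST_STORAGE_INTEGRATION opt-in gate described above.
maybe_run_integration() {
    if [ -z "${TEST_STORAGE_INTEGRATION:-}" ]; then
        echo "storage integration tests skipped"
    else
        echo "storage integration tests enabled"
    fi
}

unset TEST_STORAGE_INTEGRATION
maybe_run_integration                                      # skipped by default
out="$(TEST_STORAGE_INTEGRATION=1 maybe_run_integration)"  # opted in for one call
echo "${out}"
```

Gating on an environment variable keeps `cargo test` fast and Docker-free by default while letting CI and contributors opt in explicitly.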

datafusion-cli/Cargo.toml

Lines changed: 2 additions & 0 deletions
@@ -72,3 +72,5 @@ insta = { workspace = true }
 insta-cmd = "0.6.0"
 predicates = "3.0"
 rstest = { workspace = true }
+testcontainers = { workspace = true }
+testcontainers-modules = { workspace = true, features = ["minio"] }
