Commit 0a8b2e9

Merge branch 'apache:main' into main
2 parents ddb9375 + aaaa6a6

30 files changed: 1693 additions & 1884 deletions

.github/workflows/codeql.yml

Lines changed: 55 additions & 0 deletions
@@ -0,0 +1,55 @@
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+
+name: "CodeQL"
+
+on:
+  push:
+    branches: [ "main" ]
+  pull_request:
+    branches: [ "main" ]
+  schedule:
+    - cron: '16 4 * * 1'
+
+permissions:
+  contents: read
+
+jobs:
+  analyze:
+    name: Analyze Actions
+    runs-on: ubuntu-latest
+    permissions:
+      contents: read
+      security-events: write
+      packages: read
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6
+        with:
+          persist-credentials: false
+
+      - name: Initialize CodeQL
+        uses: github/codeql-action/init@c793b717bc78562f491db7b0e93a3a178b099162 # v4
+        with:
+          languages: actions
+
+      - name: Perform CodeQL Analysis
+        uses: github/codeql-action/analyze@c793b717bc78562f491db7b0e93a3a178b099162 # v4
+        with:
+          category: "/language:actions"

.github/workflows/pr_build_linux.yml

Lines changed: 5 additions & 5 deletions
@@ -69,7 +69,7 @@ jobs:
   build-native:
     needs: lint
     name: Build Native Library
-    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a,cpu=8,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
+    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a+m7a+c8a,cpu=8,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
     container:
       image: amd64/rust
     steps:
@@ -122,7 +122,7 @@ jobs:
   linux-test-rust:
     needs: lint
     name: ubuntu-latest/rust-test
-    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
+    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a+m7a+c8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
     container:
       image: amd64/rust
     steps:
@@ -277,7 +277,7 @@
           org.apache.spark.sql.CometToPrettyStringSuite
       fail-fast: false
    name: ${{ matrix.profile.name }}/${{ matrix.profile.scan_impl }} [${{ matrix.suite.name }}]
-    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
+    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a+m7a+c8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
     container:
       image: amd64/rust
     env:
@@ -325,7 +325,7 @@ jobs:
   verify-benchmark-results-tpch:
     needs: build-native
     name: Verify TPC-H Results
-    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
+    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a+m7a+c8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
     container:
       image: amd64/rust
     steps:
@@ -379,7 +379,7 @@ jobs:
   verify-benchmark-results-tpcds:
     needs: build-native
     name: Verify TPC-DS Results (${{ matrix.join }})
-    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
+    runs-on: ${{ github.repository_owner == 'apache' && format('runs-on={0},family=m8a+m7a+c8a,cpu=16,image=ubuntu24-full-x64,extras=s3-cache,disk=large,tag=datafusion-comet', github.run_id) || 'ubuntu-latest' }}
     container:
       image: amd64/rust
     strategy:

benchmarks/tpc/README.md

Lines changed: 100 additions & 11 deletions
@@ -38,17 +38,21 @@ All benchmarks are run via `run.py`:
 python3 run.py --engine <engine> --benchmark <tpch|tpcds> [options]
 ```
 
-| Option         | Description                                               |
-| -------------- | --------------------------------------------------------- |
-| `--engine`     | Engine name (matches a TOML file in `engines/`)           |
-| `--benchmark`  | `tpch` or `tpcds`                                         |
-| `--iterations` | Number of iterations (default: 1)                         |
-| `--output`     | Output directory (default: `.`)                           |
-| `--query`      | Run a single query number                                 |
-| `--no-restart` | Skip Spark master/worker restart                          |
-| `--dry-run`    | Print the spark-submit command without executing          |
-| `--jfr`        | Enable Java Flight Recorder profiling                     |
-| `--jfr-dir`    | Directory for JFR output files (default: `/results/jfr`)  |
+| Option                    | Description                                                                       |
+| ------------------------- | --------------------------------------------------------------------------------- |
+| `--engine`                | Engine name (matches a TOML file in `engines/`)                                   |
+| `--benchmark`             | `tpch` or `tpcds`                                                                 |
+| `--iterations`            | Number of iterations (default: 1)                                                 |
+| `--output`                | Output directory (default: `.`)                                                   |
+| `--query`                 | Run a single query number                                                         |
+| `--no-restart`            | Skip Spark master/worker restart                                                  |
+| `--dry-run`               | Print the spark-submit command without executing                                  |
+| `--jfr`                   | Enable Java Flight Recorder profiling                                             |
+| `--jfr-dir`               | Directory for JFR output files (default: `/results/jfr`)                          |
+| `--async-profiler`        | Enable async-profiler (profiles Java + native code)                               |
+| `--async-profiler-dir`    | Directory for async-profiler output (default: `/results/async-profiler`)          |
+| `--async-profiler-event`  | Event type: `cpu`, `wall`, `alloc`, `lock`, etc. (default: `cpu`)                 |
+| `--async-profiler-format` | Output format: `flamegraph`, `jfr`, `collapsed`, `text` (default: `flamegraph`)   |
 
 Available engines: `spark`, `comet`, `comet-iceberg`, `gluten`
 
@@ -392,3 +396,88 @@
 
 Open the `.jfr` files with [JDK Mission Control](https://jdk.java.net/jmc/),
 IntelliJ IDEA's profiler, or `jfr` CLI tool (`jfr summary driver.jfr`).
+
+## async-profiler Profiling
+
+Use the `--async-profiler` flag to capture profiles with
+[async-profiler](https://github.com/async-profiler/async-profiler). Unlike JFR,
+async-profiler can profile **both Java and native (Rust/C++) code** in the same
+flame graph, making it especially useful for profiling Comet workloads.
+
+### Prerequisites
+
+async-profiler must be installed on every node where the driver or executors run.
+Set `ASYNC_PROFILER_HOME` to the installation directory:
+
+```shell
+# Download and extract (Linux x64 example)
+wget https://github.com/async-profiler/async-profiler/releases/download/v3.0/async-profiler-3.0-linux-x64.tar.gz
+tar xzf async-profiler-3.0-linux-x64.tar.gz -C /opt/async-profiler --strip-components=1
+export ASYNC_PROFILER_HOME=/opt/async-profiler
+```
+
+On Linux, `perf_event_paranoid` must be set to allow profiling:
+
+```shell
+sudo sysctl kernel.perf_event_paranoid=1  # or 0 / -1 for full access
+sudo sysctl kernel.kptr_restrict=0        # optional: enable kernel symbols
+```
+
+### Basic usage
+
+```shell
+python3 run.py --engine comet --benchmark tpch --async-profiler
+```
+
+This produces HTML flame graphs in `/results/async-profiler/` by default
+(`driver.html` and `executor.html`).
+
+### Choosing events and output format
+
+```shell
+# Wall-clock profiling (includes time spent waiting/sleeping)
+python3 run.py --engine comet --benchmark tpch \
+    --async-profiler --async-profiler-event wall
+
+# Allocation profiling with JFR output
+python3 run.py --engine comet --benchmark tpch \
+    --async-profiler --async-profiler-event alloc --async-profiler-format jfr
+
+# Lock contention profiling
+python3 run.py --engine comet --benchmark tpch \
+    --async-profiler --async-profiler-event lock
+```
+
+| Event   | Description                                          |
+| ------- | ---------------------------------------------------- |
+| `cpu`   | On-CPU time (default). Shows where CPU cycles go.    |
+| `wall`  | Wall-clock time. Includes threads that are blocked.  |
+| `alloc` | Heap allocation profiling.                           |
+| `lock`  | Lock contention profiling.                           |
+
+| Format       | Extension | Description                               |
+| ------------ | --------- | ----------------------------------------- |
+| `flamegraph` | `.html`   | Interactive HTML flame graph (default).   |
+| `jfr`        | `.jfr`    | JFR format, viewable in JMC or IntelliJ.  |
+| `collapsed`  | `.txt`    | Collapsed stacks for FlameGraph scripts.  |
+| `text`       | `.txt`    | Flat text summary of hot methods.         |
+
+### Docker usage
+
+The Docker image includes async-profiler pre-installed at
+`/opt/async-profiler`. The `ASYNC_PROFILER_HOME` environment variable is
+already set in the compose files, so no extra configuration is needed:
+
+```shell
+docker compose -f benchmarks/tpc/infra/docker/docker-compose.yml \
+    run --rm bench \
+    python3 /opt/benchmarks/run.py \
+    --engine comet --benchmark tpch --output /results --no-restart --async-profiler
+```
+
+Output files are collected in `$RESULTS_DIR/async-profiler/` on the host.
+
+**Note:** On Linux, the Docker container needs `--privileged` or
+`SYS_PTRACE` capability and `perf_event_paranoid <= 1` on the host for
+`cpu`/`wall` events. Allocation (`alloc`) and lock (`lock`) events work
+without special privileges.

benchmarks/tpc/infra/docker/Dockerfile

Lines changed: 14 additions & 1 deletion
@@ -29,10 +29,23 @@ RUN apt-get update \
     && apt-get install -y --no-install-recommends \
     openjdk-8-jdk-headless \
     openjdk-17-jdk-headless \
-    python3 python3-pip procps \
+    python3 python3-pip procps wget \
     && apt-get clean \
     && rm -rf /var/lib/apt/lists/*
 
+# Install async-profiler for profiling Java + native (Rust/C++) code.
+ARG ASYNC_PROFILER_VERSION=3.0
+RUN ARCH=$(uname -m) && \
+    if [ "$ARCH" = "x86_64" ]; then AP_ARCH="linux-x64"; \
+    elif [ "$ARCH" = "aarch64" ]; then AP_ARCH="linux-aarch64"; \
+    else echo "Unsupported architecture: $ARCH" && exit 1; fi && \
+    wget -q "https://github.com/async-profiler/async-profiler/releases/download/v${ASYNC_PROFILER_VERSION}/async-profiler-${ASYNC_PROFILER_VERSION}-${AP_ARCH}.tar.gz" \
+      -O /tmp/async-profiler.tar.gz && \
+    mkdir -p /opt/async-profiler && \
+    tar xzf /tmp/async-profiler.tar.gz -C /opt/async-profiler --strip-components=1 && \
+    rm /tmp/async-profiler.tar.gz
+ENV ASYNC_PROFILER_HOME=/opt/async-profiler
+
 # Default to Java 17 (override with JAVA_HOME at runtime for Gluten).
 # Detect architecture (amd64 or arm64) so the image works on both Linux and macOS.
 ARG TARGETARCH

benchmarks/tpc/infra/docker/docker-compose-laptop.yml

Lines changed: 2 additions & 0 deletions
@@ -30,6 +30,7 @@
 # ICEBERG_JAR         - Host path to Iceberg Spark runtime JAR
 # BENCH_JAVA_HOME     - Java home inside container (default: /usr/lib/jvm/java-17-openjdk)
 #                       Set to /usr/lib/jvm/java-8-openjdk for Gluten
+# ASYNC_PROFILER_HOME - async-profiler install path (default: /opt/async-profiler)
 
 x-volumes: &volumes
   - ${DATA_DIR:-/tmp/tpc-data}:/data:ro
@@ -95,5 +96,6 @@ services:
       - TPCH_DATA=/data
       - TPCDS_DATA=/data
       - SPARK_EVENT_LOG_DIR=/results/spark-events
+      - ASYNC_PROFILER_HOME=/opt/async-profiler
     mem_limit: 4g
     memswap_limit: 4g

benchmarks/tpc/infra/docker/docker-compose.yml

Lines changed: 2 additions & 0 deletions
@@ -33,6 +33,7 @@
 # BENCH_MEM_LIMIT     - Hard memory limit for the bench runner (default: 10g)
 # BENCH_JAVA_HOME     - Java home inside container (default: /usr/lib/jvm/java-17-openjdk)
 #                       Set to /usr/lib/jvm/java-8-openjdk for Gluten
+# ASYNC_PROFILER_HOME - async-profiler install path (default: /opt/async-profiler)
 
 x-volumes: &volumes
   - ${DATA_DIR:-/tmp/tpc-data}:/data:ro
@@ -109,6 +110,7 @@ services:
       - TPCH_DATA=/data
       - TPCDS_DATA=/data
       - SPARK_EVENT_LOG_DIR=/results/spark-events
+      - ASYNC_PROFILER_HOME=/opt/async-profiler
     mem_limit: ${BENCH_MEM_LIMIT:-10g}
     memswap_limit: ${BENCH_MEM_LIMIT:-10g}
 

benchmarks/tpc/run.py

Lines changed: 59 additions & 3 deletions
@@ -279,6 +279,38 @@ def build_spark_submit_cmd(config, benchmark, args):
             existing = conf.get(spark_key, "")
             conf[spark_key] = f"{existing} {jfr_opts}".strip()
 
+    # async-profiler: attach as a Java agent via -agentpath
+    if args.async_profiler:
+        ap_home = os.environ.get("ASYNC_PROFILER_HOME", "")
+        if not ap_home:
+            print(
+                "Error: ASYNC_PROFILER_HOME is not set. "
+                "Set it to the async-profiler installation directory.",
+                file=sys.stderr,
+            )
+            sys.exit(1)
+        lib_ext = "dylib" if sys.platform == "darwin" else "so"
+        ap_lib = os.path.join(ap_home, "lib", f"libasyncProfiler.{lib_ext}")
+        ap_dir = args.async_profiler_dir
+        ap_event = args.async_profiler_event
+        ap_fmt = args.async_profiler_format
+        ext = {"flamegraph": "html", "jfr": "jfr", "collapsed": "txt", "text": "txt"}[ap_fmt]
+
+        driver_ap = (
+            f"-agentpath:{ap_lib}=start,event={ap_event},"
+            f"{ap_fmt},file={ap_dir}/driver.{ext}"
+        )
+        executor_ap = (
+            f"-agentpath:{ap_lib}=start,event={ap_event},"
+            f"{ap_fmt},file={ap_dir}/executor.{ext}"
+        )
+        for spark_key, ap_opts in [
+            ("spark.driver.extraJavaOptions", driver_ap),
+            ("spark.executor.extraJavaOptions", executor_ap),
+        ]:
+            existing = conf.get(spark_key, "")
+            conf[spark_key] = f"{existing} {ap_opts}".strip()
+
     for key, val in sorted(conf.items()):
         cmd += ["--conf", f"{key}={val}"]
 
@@ -385,6 +417,27 @@
         default="/results/jfr",
         help="Directory for JFR output files (default: /results/jfr)",
     )
+    parser.add_argument(
+        "--async-profiler",
+        action="store_true",
+        help="Enable async-profiler for driver and executors (profiles Java + native code)",
+    )
+    parser.add_argument(
+        "--async-profiler-dir",
+        default="/results/async-profiler",
+        help="Directory for async-profiler output files (default: /results/async-profiler)",
+    )
+    parser.add_argument(
+        "--async-profiler-event",
+        default="cpu",
+        help="async-profiler event type: cpu, wall, alloc, lock, etc. (default: cpu)",
+    )
+    parser.add_argument(
+        "--async-profiler-format",
+        default="flamegraph",
+        choices=["flamegraph", "jfr", "collapsed", "text"],
+        help="async-profiler output format (default: flamegraph)",
+    )
     args = parser.parse_args()
 
     config = load_engine_config(args.engine)
@@ -401,9 +454,12 @@
     if not args.no_restart and not args.dry_run:
         restart_spark()
 
-    # Create JFR output directory if profiling is enabled
-    if args.jfr:
-        os.makedirs(args.jfr_dir, exist_ok=True)
+    # Create profiling output directories (skip for dry-run)
+    if not args.dry_run:
+        if args.jfr:
+            os.makedirs(args.jfr_dir, exist_ok=True)
+        if args.async_profiler:
+            os.makedirs(args.async_profiler_dir, exist_ok=True)
 
     cmd = build_spark_submit_cmd(config, args.benchmark, args)
 
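The `run.py` change above assembles `-agentpath` JVM options from the CLI flags and appends them to `spark.driver.extraJavaOptions` / `spark.executor.extraJavaOptions`. A self-contained sketch of that string assembly (hypothetical helper name, mirroring the diff's logic):

```python
import os


def agentpath_opt(ap_home, role, event="cpu", fmt="flamegraph",
                  out_dir="/results/async-profiler", platform="linux"):
    """Build the -agentpath option string that attaches async-profiler
    to a JVM at startup (sketch of the logic added in run.py)."""
    lib_ext = "dylib" if platform == "darwin" else "so"
    ap_lib = os.path.join(ap_home, "lib", f"libasyncProfiler.{lib_ext}")
    # Map output format to file extension, as in the diff.
    ext = {"flamegraph": "html", "jfr": "jfr", "collapsed": "txt", "text": "txt"}[fmt]
    return f"-agentpath:{ap_lib}=start,event={event},{fmt},file={out_dir}/{role}.{ext}"


print(agentpath_opt("/opt/async-profiler", "driver"))
# -agentpath:/opt/async-profiler/lib/libasyncProfiler.so=start,event=cpu,flamegraph,file=/results/async-profiler/driver.html
```

Using `start` in the agent arguments means profiling begins as soon as the JVM launches, which is why no separate `asprof` attach step is needed in the benchmark runner.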
docs/source/contributor-guide/parquet_scans.md

Lines changed: 4 additions & 4 deletions
@@ -49,10 +49,10 @@ The following features are not supported by either scan implementation, and Come
 
 The following shared limitation may produce incorrect results without falling back to Spark:
 
-- No support for datetime rebasing detection or the `spark.comet.exceptionOnDatetimeRebase` configuration. When
-  reading Parquet files containing dates or timestamps written before Spark 3.0 (which used a hybrid
-  Julian/Gregorian calendar), dates/timestamps will be read as if they were written using the Proleptic Gregorian
-  calendar. This may produce incorrect results for dates before October 15, 1582.
+- No support for datetime rebasing. When reading Parquet files containing dates or timestamps written before
+  Spark 3.0 (which used a hybrid Julian/Gregorian calendar), dates/timestamps will be read as if they were
+  written using the Proleptic Gregorian calendar. This may produce incorrect results for dates before
+  October 15, 1582.
 
 The `native_datafusion` scan has some additional limitations, mostly related to Parquet metadata. All of these
 cause Comet to fall back to Spark.
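The rebasing caveat in the doc change above stems from the calendar switchover: Julian 1582-10-04 was immediately followed by Gregorian 1582-10-15, so a day count written under hybrid-calendar semantics and read back as proleptic Gregorian lands on a different nominal date for anything earlier. A rough illustration using the standard Julian-day-number formulas (not Comet or Spark code):

```python
def jdn_julian(y, m, d):
    """Julian day number of a Julian-calendar date (integer arithmetic)."""
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - 32083


def jdn_gregorian(y, m, d):
    """Julian day number of a (proleptic) Gregorian-calendar date."""
    a = (14 - m) // 12
    yy = y + 4800 - a
    mm = m + 12 * a - 3
    return d + (153 * mm + 2) // 5 + 365 * yy + yy // 4 - yy // 100 + yy // 400 - 32045


# The calendars meet at the 1582 switchover: Julian Oct 4 is the day
# before Gregorian Oct 15.
assert jdn_gregorian(1582, 10, 15) - jdn_julian(1582, 10, 4) == 1

# For an ancient date the two calendars disagree by several days, e.g.
# the nominal date 1000-01-01 differs by 5 days between them. This is
# the kind of shift a reader sees when a legacy hybrid-calendar value
# is interpreted as proleptic Gregorian without rebasing.
print(jdn_julian(1000, 1, 1) - jdn_gregorian(1000, 1, 1))
```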

docs/source/index.md

Lines changed: 1 addition & 1 deletion
@@ -86,7 +86,7 @@ To get started with Apache DataFusion Comet, follow the
 [DataFusion Slack and Discord channels](https://datafusion.apache.org/contributor-guide/communication.html) to connect
 with other users, ask questions, and share your experiences with Comet.
 
-Follow [Apache DataFusion Comet Overview](https://datafusion.apache.org/comet/user-guide/overview.html) to get more detailed information
+Follow [Apache DataFusion Comet Overview](https://datafusion.apache.org/comet/about/index.html) to get more detailed information
 
 ## Contributing
 