build(cudf): Simplify cuDF build configuration #11407
bdice wants to merge 5 commits into apache:main
Conversation
     ls -l /usr/local/
-    source /opt/rh/gcc-toolset-12/enable
+    source /opt/rh/gcc-toolset-14/enable
We do need GCC 14, but we could remove the extra steps above from #11275 that change the CUDA version if you wish. This PR should make it work with the CUDA 12 version that already exists in the container. I know there were quite a few workarounds to reduce the disk space to make room for CUDA 13.1 -- we could revert that too.
If you'd like me to help revert those changes and minimize the build scripts, I can do that. Let me know your thoughts.
This line enables GCC 14, though I don't know why we cannot source /opt/rh/gcc-toolset-14/enable directly. Have you tried whether the Dockerfile works? https://github.com/apache/incubator-gluten/blob/main/dev/docker/cudf/Dockerfile. I hit a curl version issue with it before.
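As background on how the two `enable` scripts interact: each one prepends its toolset's bin directory to `PATH`, so the most recently sourced toolset shadows the other. A minimal simulation of that ordering (paths illustrative only, not run against a real gcc-toolset install):

```shell
# Simulate sourcing two gcc-toolset enable scripts in order: each prepends
# its bin directory, so the last-sourced toolset wins PATH lookup.
PATH="/opt/rh/gcc-toolset-12/root/usr/bin:$PATH"
PATH="/opt/rh/gcc-toolset-14/root/usr/bin:$PATH"
echo "first PATH entry: $(echo "$PATH" | cut -d: -f1)"
```

This is why sourcing the gcc-toolset-14 script after the gcc-toolset-12 one still selects GCC 14.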
It's OK for me to use CUDA 13.1. I have resolved all the version mismatch issues, but I hit a new issue with the newest Velox; I will try to fix it:
26/01/13 10:19:08 ERROR Executor: Exception in task 7.0 in stage 40.0 (TID 12116)
org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: (1 vs. 0) Leaf child memory pool cudf-expr-precompile already exists in __sys_root__
Retriable: False
Expression: children_.count(name) == 0
Function: addLeafChild
File: /opt/gluten/ep/build-velox/build/velox_ep/velox/common/memory/MemoryPool.cpp
Line: 331
Stack trace:
Some of these changes duplicate #11386; it's OK with me to merge either of them.
@@ -31,7 +31,7 @@ WORKDIR /opt/gluten
RUN rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; \
Have you tried creating a Docker image from this Dockerfile? I hit a curl version issue before; please help verify whether this PR resolves it.
It cannot run successfully:
692.8 -- [CURL] Enabled SSL backends: OpenSSL
692.8 -- Setting DuckDB source to AUTO
692.8 -- [DuckDB] Using SYSTEM DuckDB
692.8 -- Using ccache: /usr/bin/ccache
692.8 -- The CUDA compiler identification is unknown
692.8 -- Configuring incomplete, errors occurred!
692.8 make[1]: Leaving directory '/opt/gluten/ep/build-velox/build/velox_ep'
Dockerfile:31
--------------------
30 | WORKDIR /opt/gluten
31 | >>> RUN rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; \
32 | >>> dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; \
33 | >>> dnf autoremove -y && dnf clean all; \
34 | >>> source /opt/rh/gcc-toolset-14/enable; \
35 | >>> bash ./dev/buildbundle-veloxbe.sh --run_setup_script=OFF --build_arrow=ON --spark_version=3.4 --build_tests=ON --build_benchmarks=ON --enable_gpu=ON && rm -rf /opt/gluten
36 |
--------------------
ERROR: failed to solve: process "/bin/sh -c rm -rf /opt/rh/gcc-toolset-12 && ln -s /opt/rh/gcc-toolset-14 /opt/rh/gcc-toolset-12; dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; dnf autoremove -y && dnf clean all; source /opt/rh/gcc-toolset-14/enable; bash ./dev/buildbundle-veloxbe.sh --run_setup_script=OFF --build_arrow=ON --spark_version=3.4 --build_tests=ON --build_benchmarks=ON --enable_gpu=ON && rm -rf /opt/gluten" did not complete successfully: exit code: 2
This pipeline failed even though the final result returns a success flag: https://github.com/apache/incubator-gluten/actions/runs/20962505327/job/60252099418?pr=11407
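One plausible cause of a failed step reporting success (both here and in the Dockerfile's `RUN` command above): chaining commands with `;` discards the left command's exit status, so only the final command decides the overall exit code. A minimal illustration:

```shell
# `;` discards the left command's exit status; `&&` propagates it.
sh -c 'false; true'
echo "chained with ';':  exit=$?"   # exit=0, the failure is hidden
sh -c 'false && true'
echo "chained with '&&': exit=$?"   # exit=1, the failure propagates
```

Whether this is what the CI job actually does is an assumption; it would explain a green check despite the `dnf`/`source` steps failing mid-layer.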
     dnf remove -y cuda-toolkit-12* && dnf install -y cuda-toolkit-13-1; \
     dnf autoremove -y && dnf clean all; \
-    source /opt/rh/gcc-toolset-12/enable; \
+    source /opt/rh/gcc-toolset-14/enable; \
Is it because we should not source GCC 14?
CMake Error at CMakeLists.txt:476 (enable_language):
The CMAKE_CUDA_COMPILER:
/usr/local/cuda-12.8/bin/nvcc
is not a full path to an existing compiler tool.
Tell CMake where to find the compiler by setting either the environment
variable "CUDACXX" or the CMake cache entry CMAKE_CUDA_COMPILER to the full
path to the compiler, or to the compiler name if it is in the PATH.
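When CMake picks up a stale `cuda-12.8` path like this, one common fix is pointing it at the installed toolkit explicitly via `CUDACXX` or `CMAKE_CUDA_COMPILER`. The cuda-13.1 path below is an assumption based on the `dnf install -y cuda-toolkit-13-1` step; adjust to what is actually on disk:

```shell
# Assumed install location for cuda-toolkit-13-1; verify with `ls /usr/local/`.
export CUDACXX=/usr/local/cuda-13.1/bin/nvcc
echo "CUDA compiler set to: $CUDACXX"
# Then either let CMake read CUDACXX from the environment, or pass the
# cache entry explicitly:
#   cmake -S . -B build -DCMAKE_CUDA_COMPILER="$CUDACXX"
```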
Please help update.
Do you know why the CI succeeds but the build failed? #11407 (comment) @PHILO-HE
Thanks for your fix. PR #11386 is ahead of yours and has been merged, and its CI passed. If you think using cudf_DIR is more reasonable, please fix the CI and verify that the build with the Dockerfile succeeds.
@jinchengchenghh, sorry for missing this comment. Because …
Thanks for your explanation. I will try the standard GitHub Actions container field with apache/gluten:centos-9-jdk8-cudf. @PHILO-HE
Thanks @jinchengchenghh for #11386. I'll close this; I think you got most of the important parts there.
What changes are proposed in this pull request?
This is a follow-up to my comments on #11275.
This changeset should make it simpler to build with cuDF support.
find_package(cudf)

How was this patch tested?
I built this locally in a container.
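For reference, the `find_package(cudf)` route mentioned above typically only needs the package's CMake config directory to be discoverable. A minimal sketch of consuming cuDF this way (the `cudf_DIR` path, target name, and `main.cpp` are hypothetical):

```cmake
cmake_minimum_required(VERSION 3.26)
project(gluten_cudf_example LANGUAGES CXX)

# Hypothetical install prefix: point cudf_DIR at the directory containing
# cudf-config.cmake, or pass -Dcudf_DIR=... on the cmake command line.
set(cudf_DIR "/usr/local/lib/cmake/cudf" CACHE PATH "cudf package config dir")

find_package(cudf REQUIRED)

add_executable(my_target main.cpp)
# cudf's package config exports the cudf::cudf imported target.
target_link_libraries(my_target PRIVATE cudf::cudf)
```

Setting `cudf_DIR` (or an equivalent `CMAKE_PREFIX_PATH` entry) is what lets the build avoid hard-coding toolkit and library paths in the build scripts.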