Description
Reproducing the behavior
Problem:
When attempting to run a parallelizable Bend program (e.g., parallel_sum.bend) using the CUDA interpreter (bend run-cu), the command consistently fails with the error: "Failed to launch kernels. Error code: invalid argument. Errors: HVM output had no result (An error likely occurred)".
Steps to Reproduce:
- System Setup:
    - Windows 11 Host with NVIDIA GeForce GTX 1660 Ti.
    - WSL2 running Ubuntu 22.04.
    - NVIDIA Driver Version on Windows: 552.44 (CUDA Version: 12.4).
    - CUDA Toolkit 12.4.1 installed in WSL2 via NVIDIA's official local `.deb` method.
    - `hvm` and `bend-lang` installed via `cargo install`, and subsequently uninstalled, reinstalled, and rebuilt after `cargo clean` multiple times to ensure linking against the correct CUDA 12.4.
- Prepare `parallel_sum.bend`:
    - Create a file named `parallel_sum.bend` with the following content (summing 1 to 10 for basic correctness verification):

            def Sum(start, target):
              if start == target:
                return start
              else:
                half = (start + target) / 2
                left = Sum(start, half)
                right = Sum(half + 1, target)
                return left + right

            def main():
              return Sum(1, 10)

- Run the command in WSL2 terminal (from the directory containing `parallel_sum.bend`):

        bend run-cu parallel_sum.bend -s

Expected Behavior:
The program should execute on the GPU, calculate the sum correctly (Result: 55), and display high MIPS and very low execution time, without any kernel launch errors.
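As a cross-check of the expected value, the same divide-and-conquer recursion can be sketched in Python. This is only an illustration and not part of Bend or HVM: the name `sum_range` is hypothetical, and `//` is used on the assumption that Bend's `/` performs integer division on these unsigned numbers.

```python
def sum_range(start: int, target: int) -> int:
    """Python mirror of the Bend `Sum` function: split the range in half and add both sides."""
    if start == target:
        return start
    # Integer division, assumed to match Bend's `/` on unsigned integers.
    half = (start + target) // 2
    left = sum_range(start, half)
    right = sum_range(half + 1, target)
    return left + right


if __name__ == "__main__":
    print(sum_range(1, 10))  # prints 55
```

The two recursive calls are independent of each other, which is what makes the original program a candidate for parallel execution.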
Actual Behavior:
The command consistently outputs:
Failed to launch kernels. Error code: invalid argument.
Errors:
HVM output had no result (An error likely occurred)
System Settings
* Operating System (Host): Windows 11
* WSL2 Distribution: Ubuntu 22.04 LTS
* GPU: NVIDIA GeForce GTX 1660 Ti
* NVIDIA Windows Driver Version (from `nvidia-smi` on Windows):

      NVIDIA-SMI 552.44 Driver Version: 552.44 CUDA Version: 12.4

* CUDA Toolkit Version (from `nvcc --version` in WSL2):

      nvcc: NVIDIA (R) Cuda compiler driver
      Copyright (c) 2005-2024 NVIDIA Corporation
      Built on Thu_Mar_28_02:18:24_PDT_2024
      Cuda compilation tools, release 12.4, V12.4.131
      Build cuda_12.4.r12.4/compiler.34097967_0

* Bend Version (from `bend --version`): bend-lang 0.2.38
* HVM Version (from `hvm --version`): hvm 2.0.22
* GCC Version (from `gcc --version`):

      gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
      Copyright (C) 2021 Free Software Foundation, Inc.
      This is free software; see the source for copying conditions. There is NO
      warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

* `PATH` environment variable in WSL2:
`/usr/local/cuda-12.4/bin`
* `LD_LIBRARY_PATH` environment variable in WSL2:
`/usr/local/cuda-12.4/lib64`
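
For anyone comparing their own environment against the settings listed here, the version commands can be gathered in one pass. The following is a minimal sketch, not an official tool: the helper name `collect_versions` and the `VERSION_COMMANDS` table are hypothetical, and the only commands it runs are the ones already quoted in this list.

```python
import subprocess

# Commands taken from the settings list; any of them may be absent on a given machine.
VERSION_COMMANDS = {
    "nvidia-smi": ["nvidia-smi"],
    "nvcc": ["nvcc", "--version"],
    "bend": ["bend", "--version"],
    "hvm": ["hvm", "--version"],
    "gcc": ["gcc", "--version"],
}


def collect_versions(commands=VERSION_COMMANDS):
    """Run each command and return its output, or a short note if it cannot be run."""
    report = {}
    for name, argv in commands.items():
        try:
            done = subprocess.run(argv, capture_output=True, text=True, timeout=30)
            # Some tools print version information on stderr, so fall back to it.
            report[name] = done.stdout.strip() or done.stderr.strip()
        except (OSError, subprocess.TimeoutExpired) as exc:
            report[name] = f"unavailable: {exc}"
    return report


if __name__ == "__main__":
    for name, output in collect_versions().items():
        print(f"== {name} ==\n{output}\n")
```

A missing tool is reported as `unavailable` instead of raising, so the script still produces a complete report on a partially configured system.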
### Additional context
* **Successful `deviceQuery`:** NVIDIA's own `deviceQuery` sample (from CUDA Samples v12.4) compiles and runs successfully within the same WSL2 environment, returning `Result = PASS`. This indicates that the basic CUDA installation and GPU access via WSL2 are functional and are unlikely to be the direct cause of the issue.
* Path to `deviceQuery` for reference: `/usr/local/cuda-12.4/samples/1_Utilities/deviceQuery`
* **Correct CPU Interpreter Results:**
* `bend run-rs parallel_sum.bend -s` runs to completion and returns `Result: 55` (though more slowly).
* `bend run-c parallel_sum.bend -s` runs to completion and returns `Result: 55` (faster than `run-rs`).
* This confirms that Bend's core logic and CPU interpreters work correctly for this small sum, ruling out a general parsing or arithmetic error for this specific program. The issue is isolated to the `run-cu` backend.
* **Troubleshooting Steps Taken:**
* Attempted multiple uninstalls and reinstalls of CUDA Toolkit 12.4.1 (deb local) following NVIDIA's official instructions.
* Performed aggressive `cargo clean`, `rm -rf ~/.cargo/registry`, `rm -rf ~/.cargo/git` before reinstalling `hvm` and `bend-lang` to ensure clean builds against CUDA 12.4.
* Confirmed CUDA `PATH` and `LD_LIBRARY_PATH` variables are correctly set and pointing to `cuda-12.4`.
* Created a symbolic link from `/usr/local/cuda` to `/usr/local/cuda-12.4` to match typical `Makefile` expectations.
* Ensured WSL2 GPU passthrough is active (`nvidia-smi` works inside WSL2).
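
The path and symlink checks from the troubleshooting list can be repeated with a short script. This is a sketch under stated assumptions: `check_cuda_env` is a hypothetical helper, and the default paths are simply the ones quoted in this report (`/usr/local/cuda-12.4` and the `/usr/local/cuda` symlink). It only confirms that the environment matches the description and does not exercise the GPU.

```python
import os


def check_cuda_env(env, cuda_root="/usr/local/cuda-12.4", link="/usr/local/cuda"):
    """Return named boolean checks for the CUDA paths described in this report."""
    # Strip trailing slashes so "/x/bin/" and "/x/bin" compare equal.
    path_entries = [p.rstrip("/") for p in env.get("PATH", "").split(os.pathsep)]
    lib_entries = [p.rstrip("/") for p in env.get("LD_LIBRARY_PATH", "").split(os.pathsep)]
    return {
        "PATH has cuda bin": os.path.join(cuda_root, "bin") in path_entries,
        "LD_LIBRARY_PATH has lib64": os.path.join(cuda_root, "lib64") in lib_entries,
        # realpath follows symlinks, so this is True when `link` points at `cuda_root`.
        "symlink resolves to cuda root": os.path.realpath(link) == os.path.realpath(cuda_root),
    }


if __name__ == "__main__":
    for name, ok in check_cuda_env(os.environ).items():
        print(f"{'OK  ' if ok else 'FAIL'} {name}")
```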
This issue seems specific to `bend run-cu`'s interaction with the CUDA runtime in this WSL2 environment, despite a seemingly healthy underlying CUDA installation. Any guidance or potential debug flags would be greatly appreciated.