`bend run-cu` "Failed to launch kernels: invalid argument" on WSL2 (Ubuntu 22.04, GTX 1660 Ti, CUDA 12.4) #752

### Reproducing the behavior

**Problem:**
When attempting to run a parallelizable Bend program (e.g., `parallel_sum.bend`) using the CUDA interpreter (`bend run-cu`), the command consistently fails with the error: "Failed to launch kernels. Error code: invalid argument. Errors: HVM output had no result (An error likely occurred)".

**Steps to Reproduce:**

1.  **System Setup:**
    * Windows 11 Host with NVIDIA GeForce GTX 1660 Ti.
    * WSL2 running Ubuntu 22.04.
    * NVIDIA Driver Version on Windows: 552.44 (CUDA Version: 12.4).
    * CUDA Toolkit 12.4.1 installed in WSL2 via NVIDIA's official local `.deb` method.
    * `hvm` and `bend-lang` installed via `cargo install`, then uninstalled, cleaned with `cargo clean`, and reinstalled several times to make sure they are built against CUDA 12.4.

2.  **Prepare `parallel_sum.bend`:**
    * Create a file named `parallel_sum.bend` with the following content (summing 1 to 10 for basic correctness verification):
        ```bendlang
        def Sum(start, target):
          if start == target:
            return start
          else:
            half = (start + target) / 2
            left = Sum(start, half)
            right = Sum(half + 1, target)
            return left + right

        def main():
          return Sum(1, 10)
        ```

3.  **Run the command in WSL2 terminal (from the directory containing `parallel_sum.bend`):**
    ```bash
    bend run-cu parallel_sum.bend -s
    ```

**Expected Behavior:**
The program should execute on the GPU, print `Result: 55`, and report high MIPS and a short execution time, without any kernel launch errors.
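
As a cross-check of the expected value that does not depend on Bend or CUDA, the same recursion can be mirrored in plain shell arithmetic. This is only an illustrative sketch: `sum_range` is a hypothetical helper name, and it assumes `/` truncates to an integer, which is consistent with the `55` reported by `run-rs` and `run-c`.

```shell
#!/usr/bin/env bash
# Mirror of Sum(start, target) from parallel_sum.bend using integer arithmetic.
# sum_range is a hypothetical name chosen for this example.
sum_range() {
  local start=$1 target=$2
  if [ "$start" -eq "$target" ]; then
    echo "$start"
    return
  fi
  # Split the range in half, as the Bend program does.
  local half=$(( (start + target) / 2 ))
  local left right
  left=$(sum_range "$start" "$half")
  right=$(sum_range "$((half + 1))" "$target")
  echo $(( left + right ))
}

sum_range 1 10   # prints 55
```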

**Actual Behavior:**
The command consistently outputs:

    Failed to launch kernels. Error code: invalid argument.
    Errors:
    HVM output had no result (An error likely occurred)
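
To capture this failure next to the working CPU backends in one run, a loop along the following lines could be used. It is a sketch that relies only on the `bend` subcommands and the `-s` flag already used in this report; `run_backend` and the `BEND_BIN` override are hypothetical names introduced for the example, and whether `bend` exits non-zero on the kernel launch failure has not been confirmed here.

```shell
#!/usr/bin/env bash
# Sketch: run one Bend program on every backend and report each exit status.
# BEND_BIN is a hypothetical override (defaults to `bend`) added for this example.
run_backend() {
  local backend="$1" file="$2"
  local bin="${BEND_BIN:-bend}"
  if ! command -v "$bin" >/dev/null 2>&1; then
    echo "$backend: skipped ($bin not found)"
    return 0
  fi
  # Send the program's own output to stderr so stdout carries only the summary line.
  "$bin" "$backend" "$file" -s >&2
  echo "$backend: exit $?"
}

for backend in run-rs run-c run-cu; do
  run_backend "$backend" parallel_sum.bend
done
```

Redirecting stderr to a file (for example `2> bend-backends.log`) keeps the full output of all three runs for attaching to the issue.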

### System Settings

* Operating System (Host): Windows 11
* WSL2 Distribution: Ubuntu 22.04 LTS
* GPU: NVIDIA GeForce GTX 1660 Ti
* NVIDIA Windows Driver Version (from `nvidia-smi` on Windows):

        NVIDIA-SMI 552.44          Driver Version: 552.44          CUDA Version: 12.4

* CUDA Toolkit Version (from `nvcc --version` in WSL2):

        nvcc: NVIDIA (R) Cuda compiler driver
        Copyright (c) 2005-2024 NVIDIA Corporation
        Built on Thu_Mar_28_02:18:24_PDT_2024
        Cuda compilation tools, release 12.4, V12.4.131
        Build cuda_12.4.r12.4/compiler.34097967_0

* Bend Version (from `bend --version`): bend-lang 0.2.38
* HVM Version (from `hvm --version`): hvm 2.0.22
* GCC Version (from `gcc --version`):

        gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
        Copyright (C) 2021 Free Software Foundation, Inc.
        This is free software; see the source for copying conditions.  There is NO
        warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

* `PATH` environment variable in WSL2: `/usr/local/cuda-12.4/bin`
* `LD_LIBRARY_PATH` environment variable in WSL2: `/usr/local/cuda-12.4/lib64`

### Additional context

* **Successful `deviceQuery`:** Crucially, NVIDIA's own `deviceQuery` sample (from CUDA Samples v12.4) compiles and runs successfully within the same WSL2 environment, returning `Result = PASS`. This indicates the fundamental CUDA installation and GPU access via WSL2 is functional and not the direct cause of the issue.
    * Path to `deviceQuery` for reference: `/usr/local/cuda-12.4/samples/1_Utilities/deviceQuery`
* **Correct CPU Interpreter Results:**
    * `bend run-rs parallel_sum.bend -s` returns `Result: 55` and executes (though slower).
    * `bend run-c parallel_sum.bend -s` returns `Result: 55` and executes (faster than `run-rs`).
    * This confirms that Bend's core logic and CPU interpreters handle this small sum correctly, which rules out a general parsing or arithmetic error for this program. The issue is isolated to the `run-cu` backend.
* **Troubleshooting Steps Taken:**
    * Attempted multiple uninstalls and reinstalls of CUDA Toolkit 12.4.1 (deb local) following NVIDIA's official instructions.
    * Performed aggressive `cargo clean`, `rm -rf ~/.cargo/registry`, `rm -rf ~/.cargo/git` before reinstalling `hvm` and `bend-lang` to ensure clean builds against CUDA 12.4.
    * Confirmed CUDA `PATH` and `LD_LIBRARY_PATH` variables are correctly set and pointing to `cuda-12.4`.
    * Created a symbolic link from `/usr/local/cuda` to `/usr/local/cuda-12.4` to match typical `Makefile` expectations.
    * Ensured WSL2 GPU passthrough is active (`nvidia-smi` works inside WSL2).

This issue seems specific to `bend run-cu`'s interaction with the CUDA runtime in this WSL2 environment, despite a seemingly healthy underlying CUDA installation. Any guidance or potential debug flags would be greatly appreciated.
