
Bring up CUDA-enabled colgrep on Windows#36

Open
cepera-ang wants to merge 4 commits into lightonai:main from cepera-ang:windows-cuda-colgrep-bringup
Conversation

@cepera-ang

As discussed in #34, this PR is an attempt to make sure that the CUDA version of colgrep runs end to end on the GPU without any unexpected fallbacks.

  1. Updates cudarc to 0.19.3 and fixes calls to the new API.
  2. Forces GPU usage if built with the CUDA feature (ignoring small batches, etc.).
  3. Assumes cuDNN is available and configured correctly (env variables set).
  4. Releases and reloads the model every 1,000 iterations to conserve VRAM (otherwise it goes OOM on 8 GB with the 17M model).

Requires lightonai/fastkmeans-rs#2 to land first (and will need an update to a new version after that; I used a vendored version locally for testing).
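A minimal sketch of the reload-every-N-iterations pattern from point 4, assuming hypothetical names (`Encoder`, `encode_all`) rather than the actual colgrep/next-plaid-onnx API:

```rust
// Hypothetical stand-in for the real encoder session; the actual
// colgrep/next-plaid-onnx types differ.
struct Encoder;

impl Encoder {
    fn load() -> Encoder {
        Encoder
    }

    fn encode(&self, batch: &[String]) -> usize {
        batch.len()
    }
}

const RELOAD_EVERY: usize = 1000;

// Re-create the session every RELOAD_EVERY batches; dropping the old
// session releases its VRAM before the replacement is loaded.
fn encode_all(batches: &[Vec<String>]) -> usize {
    let mut encoder = Encoder::load();
    let mut total = 0;
    for (i, batch) in batches.iter().enumerate() {
        if i > 0 && i % RELOAD_EVERY == 0 {
            encoder = Encoder::load();
        }
        total += encoder.encode(batch);
    }
    total
}
```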

@cepera-ang
Author

I added force_gpu/force_cpu environment variables plus CLI options. I also fixed a bug where the ONNX runtime check only looked for the Linux versions of the libraries and therefore always re-downloaded the runtime libraries.

Below is Codex's attempt to find and document all the toggles (env vars, CLI options, build options, availability of files/libs, internal logic, etc.) that change CPU/GPU selection and usage, for all the components (colgrep, next-plaid, and their dependencies).

Main Branch Toggles

| Scope | Toggle | Type | Where | Effect in origin/main |
| --- | --- | --- | --- | --- |
| colgrep build | `--features cuda` | Cargo feature | `colgrep/Cargo.toml` | Enables CUDA in both next-plaid-onnx and next-plaid. |
| colgrep build | `--features coreml` / `tensorrt` / `directml` | Cargo feature | `colgrep/Cargo.toml` | Enables alternate ONNX execution providers. |
| colgrep build | `--features accelerate` / `openblas` | Cargo feature | `colgrep/Cargo.toml` | Enables CPU-side BLAS acceleration in next-plaid. |
| next-plaid-onnx build | `cuda`, `tensorrt`, `coreml`, `directml` | Cargo feature | `next-plaid-onnx/Cargo.toml` | Compiles the corresponding ORT execution providers. |
| next-plaid build | `cuda` | Cargo feature | `next-plaid/Cargo.toml` | Enables cudarc and `fastkmeans-rs/cuda`. |
| next-plaid CUDA link mode | `cuda-version-from-build-system`, `cublas`, `dynamic-linking` | cudarc features | `next-plaid/Cargo.toml` | CUDA toolkit version comes from the environment/build system; the runtime links dynamically. |
| ONNX runtime location | `ORT_DYLIB_PATH` | Env var | `colgrep/src/onnx_runtime.rs`, `next-plaid-onnx/src/lib.rs` | If set and the file exists, uses that ORT library instead of downloading/finding one. |
| GPU bypass | `COLGREP_FORCE_CPU=1` | Env var | `colgrep/src/index/mod.rs`, `colgrep/src/onnx_runtime.rs` | Avoids CUDA init in colgrep paths; small workloads set this automatically. |
| ONNX GPU bypass | `NEXT_PLAID_FORCE_CPU=1` | Env var | `next-plaid-onnx/src/lib.rs` | Makes ONNX provider selection skip GPU providers. |
| GPU visibility | `CUDA_VISIBLE_DEVICES` | Env var | `next-plaid-onnx/src/lib.rs` | Empty / `-1` hides GPUs; checked before CUDA EP init. |
| Linux cuDNN discovery | `CUDA_HOME`, `CUDA_PATH`, `CUDNN_PATH`, `CUDNN_HOME`, `CONDA_PREFIX`, `LD_LIBRARY_PATH` | Env vars / library presence | `colgrep/src/onnx_runtime.rs` | Used to find cuDNN and re-exec with an updated `LD_LIBRARY_PATH` on Linux. |
| Heuristic: indexing encode | `SMALL_BATCH_CPU_THRESHOLD = 300` | Internal logic | `colgrep/src/index/mod.rs` | Small indexing jobs force CPU to avoid GPU init overhead. |
| Heuristic: search encode | single-query search always CPU/CoreML | Internal logic | `colgrep/src/index/mod.rs` | Query encoding avoids CUDA by design. |
| Runtime defaults | `get_default_batch_size`, `get_default_parallel_sessions` | Internal logic | `colgrep/src/config.rs` | GPU defaults only when the CUDA feature is on and cuDNN is available; otherwise CPU defaults. |
| next-plaid indexing/search | `IndexConfig.force_cpu`, `UpdateConfig.force_cpu`, `ComputeKmeansConfig.force_cpu` | API/internal flags | `next-plaid/src/index.rs`, `next-plaid/src/update.rs`, `next-plaid/src/kmeans.rs` | CPU is chosen explicitly for codec/kmeans paths when the caller sets `force_cpu`. |
| next-plaid search heuristic | `CUDA_COLBERT_MIN_SIZE` | Internal logic | `next-plaid/src/search.rs` | Uses CUDA scoring only for large query-doc matrices. |
| CUDA availability | actual presence of ORT CUDA libs, cuDNN, the CUDA EP, and a visible GPU | Runtime availability | `colgrep/src/onnx_runtime.rs`, `next-plaid-onnx/src/lib.rs` | Default behavior falls back to CPU if unavailable. |
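As a rough sketch of the precedence these toggles imply, assuming illustrative names (`combine`, `resolve_accel` are not the actual functions in `colgrep/src/acceleration.rs`):

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum Accel {
    ForceCpu,
    ForceGpu,
    Auto,
}

// Pure combination logic: an explicit CPU/GPU conflict is an error,
// a single override wins, and otherwise we fall back to auto-selection.
fn combine(cpu: bool, gpu: bool) -> Result<Accel, String> {
    match (cpu, gpu) {
        (true, true) => Err("FORCE_CPU and FORCE_GPU are mutually exclusive".to_string()),
        (true, false) => Ok(Accel::ForceCpu),
        (false, true) => Ok(Accel::ForceGpu),
        (false, false) => Ok(Accel::Auto),
    }
}

// Accept both the generic and the tool-specific variable names.
fn resolve_accel() -> Result<Accel, String> {
    let on = |key: &str| env::var(key).map(|v| v == "1").unwrap_or(false);
    let cpu = on("FORCE_CPU") || on("COLGREP_FORCE_CPU");
    let gpu = on("FORCE_GPU") || on("COLGREP_FORCE_GPU");
    combine(cpu, gpu)
}
```

Keeping the conflict check in a pure function makes it easy to unit-test without touching the process environment.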

Current Branch Changes

| Area | Main behavior | Current behavior | Change |
| --- | --- | --- | --- |
| Explicit GPU forcing | none | `FORCE_GPU`, `COLGREP_FORCE_GPU`, `NEXT_PLAID_FORCE_GPU`; `--force-gpu` | Added strict GPU mode in `colgrep/src/acceleration.rs`, `colgrep/src/cli.rs`, `colgrep/src/main.rs`. |
| Explicit CPU forcing | `COLGREP_FORCE_CPU`, `NEXT_PLAID_FORCE_CPU` only | also accepts generic `FORCE_CPU`; `--force-cpu` | Broadened names and added CLI surface. |
| Default indexing behavior | heuristic CPU-for-small / GPU-for-large | restored to match main | `colgrep/src/index/mod.rs` now uses Auto again instead of unconditional strict CUDA. |
| Search query encoding | CPU/CoreML only | still CPU/CoreML in auto; CUDA only in force-GPU mode | Preserves the old default behavior while allowing a strict override. |
| ONNX auto provider | Auto tries GPU providers then CPU | same, but `FORCE_GPU` now routes Auto directly to CUDA | `next-plaid-onnx/src/lib.rs`. |
| Strict failure below ONNX | many CUDA paths fell back silently | some now panic/fail under `FORCE_GPU` | `next-plaid/src/codec.rs`, `next-plaid/src/index.rs`, `next-plaid/src/search.rs`. |
| Runtime defaults | based on cuDNN availability only | based on acceleration mode first, then cuDNN | `colgrep/src/config.rs`. |
| cuDNN check on Windows | branch already had Windows return true | unchanged from the current branch baseline | Not a new change in this pass. |
| next-plaid CUDA build deps | cudarc 0.12, no nvrtc | cudarc 0.19.3, adds nvrtc | From the earlier CUDA bring-up; still a relevant build-time difference in `next-plaid/Cargo.toml`. |

Likely Misses / Caveats

| Item | Status | Note |
| --- | --- | --- |
| Docs/help text for new flags/env vars | missed | I did not add README/help prose for `FORCE_GPU` / `--force-gpu` / `--force-cpu`. |
| next-plaid CPU forcing via env for all internal paths | partially missed | We added `next-plaid::is_force_cpu()`, but most lower-level next-plaid code still keys CPU-vs-GPU primarily off `config.force_cpu`, not the env var. In colgrep auto/force paths this is mostly okay for ONNX, but pure next-plaid callers will not get full env-driven CPU forcing unless they also pass `force_cpu`. |
| next-plaid strict GPU coverage | mostly covered, not exhaustively audited | Codec/index/search strict paths were updated, and k-means already errors if CUDA init/train fails. I did not audit every future CUDA helper for fallback semantics. |
| Conflict handling docs | missed | `FORCE_CPU` + `FORCE_GPU` now errors in `colgrep/src/acceleration.rs`, but that behavior is undocumented. |
| Tests for CLI flags end-to-end | missed | Added only small env-mode unit tests; no CLI integration tests yet. |
| Non-colgrep API consumers | changed behavior only partly | next-plaid-onnx now honors the generic env vars directly, but next-plaid still mainly expects `force_cpu` in config structs. |

Bottom Line

  • origin/main had one real runtime override (force CPU), plus several heuristics and availability checks that biased toward CPU for quick/small work.
  • The current branch restores those defaults and adds a new strict GPU mode.
  • The biggest thing still not fully unified is FORCE_CPU inside raw next-plaid internals; colgrep is mostly covered, but standalone next-plaid callers are not.

@raphaelsty
Collaborator

Thanks for the MR, I'll review it carefully this week :)

@cepera-ang
Author

Clippy fails with the other feature sets enabled; I didn't test my changes with them. (Also, it's interesting that CI runs Clippy only with the specific "openblas" feature rather than all of them or some different one.)

I thought it would also be useful to test the CUDA path on Linux. I have WSL available, and it seems that even with these changes it still doesn't always use the GPU. I'll look into that now as well.

@raphaelsty
Collaborator

I merged your fastkmeans-rs pull request and released version 0.1.8 of fastkmeans including your changes :) @cepera-ang

@raphaelsty added the enhancement (New feature or request) label on Mar 17, 2026
@raphaelsty
Collaborator

As soon as the MR behaves exactly as you expect on Windows, let me know. I'll do the testing with Linux and macOS; don't bother with Clippy. I'll clone your MR and set you as co-author.

@cepera-ang force-pushed the windows-cuda-colgrep-bringup branch from d578b75 to cad0e19 on March 18, 2026 at 08:09
@cepera-ang
Author

Ok, so the current version seems to work fine for me.

--force-cpu / --force-gpu flags now correctly force CPU or GPU execution across the full stack, including underlying next-plaid paths.
In the default case, the previous auto-selection logic is preserved.

While investigating performance, I also found that larger batch sizes were often making encoding slower even when they still fit in memory. The main reason is padding inefficiency: as batch size grows, the probability of mixing in a long document increases, and the whole batch then runs at the speed of the longest item. Sorting inputs by text length before encoding fixes that and gives a noticeable speedup.
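The length-sorted batching idea could be sketched roughly like this (a hypothetical helper, not the actual colgrep code):

```rust
// Sort indices by text length before chunking so that each batch
// contains similarly sized items and padding waste stays small.
// Each text keeps its original index so the caller can map the
// resulting embeddings back to input order afterwards.
fn length_sorted_batches(texts: &[&str], batch_size: usize) -> Vec<Vec<(usize, String)>> {
    let mut order: Vec<usize> = (0..texts.len()).collect();
    order.sort_by_key(|&i| texts[i].len());
    order
        .chunks(batch_size)
        .map(|chunk| {
            chunk
                .iter()
                .map(|&i| (i, texts[i].to_string()))
                .collect()
        })
        .collect()
}
```

Carrying the original index alongside each text makes it cheap to scatter the embeddings back into input order after encoding.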

I also found that model initialization was using .with_parallel(...) even in effectively single-session cases, which caused ONNX Runtime to run with a single thread regardless of whether execution was on CPU or GPU.

The above changes already make the CPU version much faster on my PC (20-core Intel(R) Core(TM) i9-13900H); for example, it builds the initial index for this repo in about a minute (vs. 30 seconds on a mobile RTX 4060).

Also, cudarc is bumped to the latest version, with the corresponding code changes (similar to fastkmeans).

Additionally, this includes a small Windows-specific fix for ONNX Runtime caching: the current logic could fail to detect already-downloaded GPU DLLs and re-download them unnecessarily on each run.
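The Windows caching fix could look roughly like the sketch below; the names (`ort_library_name`, `cached_ort`) are illustrative, not the actual functions in colgrep's ONNX runtime module:

```rust
use std::path::{Path, PathBuf};

// Pick the ONNX Runtime library filename for the current platform
// instead of hard-coding the Linux name.
fn ort_library_name() -> &'static str {
    if cfg!(target_os = "windows") {
        "onnxruntime.dll"
    } else if cfg!(target_os = "macos") {
        "libonnxruntime.dylib"
    } else {
        "libonnxruntime.so"
    }
}

// Report a cache hit only when the platform-specific file is present,
// so Windows does not re-download DLLs that are already on disk.
fn cached_ort(cache_dir: &Path) -> Option<PathBuf> {
    let candidate = cache_dir.join(ort_library_name());
    candidate.exists().then_some(candidate)
}
```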

Take a look and let me know what you think. All these `#[cfg]`/env flags, and at least three different modules independently accessing CUDA (accelerators) in different ways, feel a little shaky, but I have no idea yet how to make it cleaner.

@raphaelsty
Collaborator

Amazing @cepera-ang! Then I will create another pull request based on yours, add you as co-author, and merge your update. I might tweak a thing or two, but I'll keep what makes colgrep work fine on Windows, with careful testing for the other OSes.
