
Bring up CUDA-enabled colgrep on Windows#36

Open
cepera-ang wants to merge 4 commits into lightonai:main from cepera-ang:windows-cuda-colgrep-bringup
Conversation

@cepera-ang

As discussed in #34, this PR is an attempt to make sure that the CUDA version of colgrep runs end to end on the GPU without any unexpected fallbacks.

  1. Updates cudarc to 0.19.3 and fixes calls to the new API.
  2. Forces GPU usage if built with the CUDA feature (ignoring small batches, etc.).
  3. Assumes cuDNN is available and configured correctly (env variables set).
  4. Releases and reloads the model every 1,000 iterations to conserve VRAM (otherwise it goes OOM on 8 GB with the 17M model).

Requires lightonai/fastkmeans-rs#2 to land first (and will need an update to a new version after that; I used a vendored version locally for testing).
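A minimal sketch of the reload-every-N-iterations pattern from point 4, assuming hypothetical names (`Encoder`, `encode_all`) rather than the actual colgrep/next-plaid-onnx API:

```rust
// Hypothetical stand-in for the real encoder session; the actual
// colgrep/next-plaid-onnx types differ.
struct Encoder;

impl Encoder {
    fn load() -> Encoder {
        Encoder
    }

    fn encode(&self, batch: &[String]) -> usize {
        batch.len()
    }
}

const RELOAD_EVERY: usize = 1000;

// Re-create the session every RELOAD_EVERY batches; dropping the old
// session releases its VRAM before the replacement is loaded.
fn encode_all(batches: &[Vec<String>]) -> usize {
    let mut encoder = Encoder::load();
    let mut total = 0;
    for (i, batch) in batches.iter().enumerate() {
        if i > 0 && i % RELOAD_EVERY == 0 {
            encoder = Encoder::load();
        }
        total += encoder.encode(batch);
    }
    total
}
```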

@cepera-ang
Author

I added force_gpu/force_cpu environment variables plus CLI options. I also fixed a bug where the ONNX runtime check only looked for the Linux versions of the libraries and therefore always re-downloaded the runtime libraries.

Below is Codex's attempt to find and document all the toggles (env vars, CLI options, build options, availability of files/libs, internal logic, etc.) that change CPU/GPU selection and usage, for all the components (colgrep, next-plaid, and their dependencies).

Main Branch Toggles

| Scope | Toggle | Type | Where | Effect in origin/main |
| --- | --- | --- | --- | --- |
| colgrep build | `--features cuda` | Cargo feature | `colgrep/Cargo.toml` | Enables CUDA in both next-plaid-onnx and next-plaid. |
| colgrep build | `--features coreml` / `tensorrt` / `directml` | Cargo feature | `colgrep/Cargo.toml` | Enables alternate ONNX execution providers. |
| colgrep build | `--features accelerate` / `openblas` | Cargo feature | `colgrep/Cargo.toml` | Enables CPU-side BLAS acceleration in next-plaid. |
| next-plaid-onnx build | `cuda`, `tensorrt`, `coreml`, `directml` | Cargo feature | `next-plaid-onnx/Cargo.toml` | Compiles the corresponding ORT execution providers. |
| next-plaid build | `cuda` | Cargo feature | `next-plaid/Cargo.toml` | Enables cudarc and `fastkmeans-rs/cuda`. |
| next-plaid CUDA link mode | `cuda-version-from-build-system`, `cublas`, `dynamic-linking` | cudarc features | `next-plaid/Cargo.toml` | CUDA toolkit version comes from the environment/build system; the runtime links dynamically. |
| ONNX runtime location | `ORT_DYLIB_PATH` | Env var | `colgrep/src/onnx_runtime.rs`, `next-plaid-onnx/src/lib.rs` | If set and the file exists, uses that ORT library instead of downloading/finding one. |
| GPU bypass | `COLGREP_FORCE_CPU=1` | Env var | `colgrep/src/index/mod.rs`, `colgrep/src/onnx_runtime.rs` | Avoids CUDA init in colgrep paths; small workloads set this automatically. |
| ONNX GPU bypass | `NEXT_PLAID_FORCE_CPU=1` | Env var | `next-plaid-onnx/src/lib.rs` | Makes ONNX provider selection skip GPU providers. |
| GPU visibility | `CUDA_VISIBLE_DEVICES` | Env var | `next-plaid-onnx/src/lib.rs` | Empty / `-1` hides GPUs; checked before CUDA EP init. |
| Linux cuDNN discovery | `CUDA_HOME`, `CUDA_PATH`, `CUDNN_PATH`, `CUDNN_HOME`, `CONDA_PREFIX`, `LD_LIBRARY_PATH` | Env vars / library presence | `colgrep/src/onnx_runtime.rs` | Used to find cuDNN and re-exec with an updated `LD_LIBRARY_PATH` on Linux. |
| Heuristic: indexing encode | `SMALL_BATCH_CPU_THRESHOLD = 300` | Internal logic | `colgrep/src/index/mod.rs` | Small indexing jobs force CPU to avoid GPU init overhead. |
| Heuristic: search encode | single-query search always CPU/CoreML | Internal logic | `colgrep/src/index/mod.rs` | Query encoding avoids CUDA by design. |
| Runtime defaults | `get_default_batch_size`, `get_default_parallel_sessions` | Internal logic | `colgrep/src/config.rs` | GPU defaults only when the CUDA feature is on and cuDNN is available; otherwise CPU defaults. |
| next-plaid indexing/search | `IndexConfig.force_cpu`, `UpdateConfig.force_cpu`, `ComputeKmeansConfig.force_cpu` | API/internal flags | `next-plaid/src/index.rs`, `next-plaid/src/update.rs`, `next-plaid/src/kmeans.rs` | CPU is chosen explicitly for codec/kmeans paths when the caller sets `force_cpu`. |
| next-plaid search heuristic | `CUDA_COLBERT_MIN_SIZE` | Internal logic | `next-plaid/src/search.rs` | Uses CUDA scoring only for large query-doc matrices. |
| CUDA availability | actual presence of ORT CUDA libs, cuDNN, the CUDA EP, and a visible GPU | Runtime availability | `colgrep/src/onnx_runtime.rs`, `next-plaid-onnx/src/lib.rs` | Default behavior falls back to CPU if unavailable. |
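As a rough sketch of the precedence these toggles imply, assuming illustrative names (`combine`, `resolve_accel` are not the actual functions in `colgrep/src/acceleration.rs`):

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum Accel {
    ForceCpu,
    ForceGpu,
    Auto,
}

// Pure combination logic: an explicit CPU/GPU conflict is an error,
// a single override wins, and otherwise we fall back to auto-selection.
fn combine(cpu: bool, gpu: bool) -> Result<Accel, String> {
    match (cpu, gpu) {
        (true, true) => Err("FORCE_CPU and FORCE_GPU are mutually exclusive".to_string()),
        (true, false) => Ok(Accel::ForceCpu),
        (false, true) => Ok(Accel::ForceGpu),
        (false, false) => Ok(Accel::Auto),
    }
}

// Accept both the generic and the tool-specific variable names.
fn resolve_accel() -> Result<Accel, String> {
    let on = |key: &str| env::var(key).map(|v| v == "1").unwrap_or(false);
    let cpu = on("FORCE_CPU") || on("COLGREP_FORCE_CPU");
    let gpu = on("FORCE_GPU") || on("COLGREP_FORCE_GPU");
    combine(cpu, gpu)
}
```

Keeping the conflict check in a pure function makes it easy to unit-test without touching the process environment.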

Current Branch Changes

| Area | Main behavior | Current behavior | Change |
| --- | --- | --- | --- |
| Explicit GPU forcing | none | `FORCE_GPU`, `COLGREP_FORCE_GPU`, `NEXT_PLAID_FORCE_GPU`; `--force-gpu` | Added strict GPU mode in `colgrep/src/acceleration.rs`, `colgrep/src/cli.rs`, `colgrep/src/main.rs`. |
| Explicit CPU forcing | `COLGREP_FORCE_CPU`, `NEXT_PLAID_FORCE_CPU` only | also accepts generic `FORCE_CPU`; `--force-cpu` | Broadened names and added CLI surface. |
| Default indexing behavior | heuristic CPU-for-small / GPU-for-large | restored to match main | `colgrep/src/index/mod.rs` now uses Auto again instead of unconditional strict CUDA. |
| Search query encoding | CPU/CoreML only | still CPU/CoreML in auto; CUDA only in force-GPU mode | Preserves the old default behavior while allowing a strict override. |
| ONNX auto provider | Auto tries GPU providers then CPU | same, but `FORCE_GPU` now routes Auto directly to CUDA | `next-plaid-onnx/src/lib.rs`. |
| Strict failure below ONNX | many CUDA paths fell back silently | some now panic/fail under `FORCE_GPU` | `next-plaid/src/codec.rs`, `next-plaid/src/index.rs`, `next-plaid/src/search.rs`. |
| Runtime defaults | based on cuDNN availability only | based on acceleration mode first, then cuDNN | `colgrep/src/config.rs`. |
| cuDNN check on Windows | branch already had Windows return true | unchanged from the current branch baseline | Not a new change in this pass. |
| next-plaid CUDA build deps | cudarc 0.12, no nvrtc | cudarc 0.19.3, adds nvrtc | From the earlier CUDA bring-up; still a relevant build-time difference in `next-plaid/Cargo.toml`. |

Likely Misses / Caveats

| Item | Status | Note |
| --- | --- | --- |
| Docs/help text for new flags/env vars | missed | I did not add README/help prose for `FORCE_GPU` / `--force-gpu` / `--force-cpu`. |
| next-plaid CPU forcing via env for all internal paths | partially missed | We added `next-plaid::is_force_cpu()`, but most lower-level next-plaid code still keys CPU-vs-GPU primarily off `config.force_cpu`, not the env var. In colgrep auto/force paths this is mostly okay for ONNX, but pure next-plaid callers will not get full env-driven CPU forcing unless they also pass `force_cpu`. |
| next-plaid strict GPU coverage | mostly covered, not exhaustively audited | Codec/index/search strict paths were updated, and k-means already errors if CUDA init/train fails. I did not audit every future CUDA helper for fallback semantics. |
| Conflict handling docs | missed | `FORCE_CPU` + `FORCE_GPU` now errors in `colgrep/src/acceleration.rs`, but that behavior is undocumented. |
| Tests for CLI flags end-to-end | missed | Added only small env-mode unit tests; no CLI integration tests yet. |
| Non-colgrep API consumers | changed behavior only partly | next-plaid-onnx now honors the generic env vars directly, but next-plaid still mainly expects `force_cpu` in config structs. |

Bottom Line

  • origin/main had one real runtime override (force CPU), plus several heuristics and availability checks that biased toward CPU for quick/small work.
  • The current branch restores those defaults and adds a new strict GPU mode.
  • The biggest thing still not fully unified is FORCE_CPU inside raw next-plaid internals; colgrep is mostly covered, but standalone next-plaid callers are not.

@raphaelsty
Collaborator

Thanks for the MR, I'll review it carefully this week :)

@cepera-ang
Author

Clippy fails with the other feature sets enabled; I didn't test my changes with them. (Also, it's interesting that CI runs Clippy only with the specific "openblas" feature rather than all of them or some different one.)

I thought it would also be useful to test the CUDA path on Linux. I have WSL available, and it seems that even with these changes it still doesn't always use the GPU. I'll look into that now as well.

@raphaelsty
Collaborator

I merged your fastkmeans-rs pull request and released version 0.1.8 of fastkmeans including your changes :) @cepera-ang

@raphaelsty added the enhancement (New feature or request) label on Mar 17, 2026
@raphaelsty
Collaborator

As soon as the MR behaves exactly as you expect on Windows, let me know. I'll do the testing with Linux and macOS; don't bother with Clippy. I'll clone your MR and set you as co-author.

@cepera-ang force-pushed the windows-cuda-colgrep-bringup branch from d578b75 to cad0e19 on March 18, 2026 at 08:09
@cepera-ang
Author

Ok, so the current version seems to work fine for me.

--force-cpu / --force-gpu flags now correctly force CPU or GPU execution across the full stack, including underlying next-plaid paths.
In the default case, the previous auto-selection logic is preserved.

While investigating performance, I also found that larger batch sizes were often making encoding slower even when they still fit in memory. The main reason is padding inefficiency: as batch size grows, the probability of mixing in a long document increases, and the whole batch then runs at the speed of the longest item. Sorting inputs by text length before encoding fixes that and gives a noticeable speedup.
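The length-sorted batching idea could be sketched roughly like this (a hypothetical helper, not the actual colgrep code):

```rust
// Sort indices by text length before chunking so that each batch
// contains similarly sized items and padding waste stays small.
// Each text keeps its original index so the caller can map the
// resulting embeddings back to input order afterwards.
fn length_sorted_batches(texts: &[&str], batch_size: usize) -> Vec<Vec<(usize, String)>> {
    let mut order: Vec<usize> = (0..texts.len()).collect();
    order.sort_by_key(|&i| texts[i].len());
    order
        .chunks(batch_size)
        .map(|chunk| {
            chunk
                .iter()
                .map(|&i| (i, texts[i].to_string()))
                .collect()
        })
        .collect()
}
```

Carrying the original index alongside each text makes it cheap to scatter the embeddings back into input order after encoding.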

I also found that model initialization was using .with_parallel(...) even in effectively single-session cases, which caused ONNX Runtime to run with a single thread regardless of whether execution was on CPU or GPU.

The above changes already make the CPU version much faster on my PC (20-core Intel(R) Core(TM) i9-13900H); for example, it builds the initial index for this repo in about a minute (vs. 30 seconds on a mobile RTX 4060).

Also, cudarc is bumped to the latest version, with the corresponding code changes (similar to fastkmeans).

Additionally, this includes a small Windows-specific fix for ONNX Runtime caching: the current logic could fail to detect already-downloaded GPU DLLs and re-download them unnecessarily on each run.
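The Windows caching fix could look roughly like the sketch below; the names (`ort_library_name`, `cached_ort`) are illustrative, not the actual functions in colgrep's ONNX runtime module:

```rust
use std::path::{Path, PathBuf};

// Pick the ONNX Runtime library filename for the current platform
// instead of hard-coding the Linux name.
fn ort_library_name() -> &'static str {
    if cfg!(target_os = "windows") {
        "onnxruntime.dll"
    } else if cfg!(target_os = "macos") {
        "libonnxruntime.dylib"
    } else {
        "libonnxruntime.so"
    }
}

// Report a cache hit only when the platform-specific file is present,
// so Windows does not re-download DLLs that are already on disk.
fn cached_ort(cache_dir: &Path) -> Option<PathBuf> {
    let candidate = cache_dir.join(ort_library_name());
    candidate.exists().then_some(candidate)
}
```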

Take a look and let me know what you think. All these `#[cfg]`/env flags, and at least three different modules independently accessing CUDA (accelerators) in different ways, feel a little shaky, but I have no idea yet how to make it cleaner.

@raphaelsty
Collaborator

Amazing @cepera-ang! Then I will create another pull request based on yours, add you as co-author, and merge your update. I might tweak a thing or two, but I'll keep what makes colgrep work fine on Windows, with careful testing for the other OSes.
