Bring up CUDA-enabled colgrep on Windows #36
cepera-ang wants to merge 4 commits into lightonai:main
Conversation
I added force_gpu/force_cpu environment variables plus CLI options. I also fixed a bug where the ONNX runtime checked only for the Linux versions of the available libraries and therefore always re-downloaded the runtime libraries. Below is Codex's attempt to find and document all the toggles (env vars, CLI options, build options, availability of files/libs, internal logic, etc.) that change CPU/GPU selection and usage, across all the components (colgrep, next-plaid, and their dependencies):

- Main Branch Toggles
- Current Branch Changes
- Likely Misses / Caveats
- Bottom Line
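A minimal sketch of how such a force toggle could resolve the execution device. The env-var names (`COLGREP_FORCE_CPU` / `COLGREP_FORCE_GPU`) and the `resolve_device` helper are assumptions for illustration, not the PR's actual code; the point is that an explicit force must win over (or error out against) runtime detection, so no silent fallback can occur:

```rust
use std::env;

#[derive(Debug, PartialEq)]
enum Device {
    Cpu,
    Gpu,
}

/// Resolve the execution device from the force toggles.
/// An explicit force wins over detection; forcing GPU with no
/// CUDA device present is a hard error rather than a silent fallback.
fn resolve_device(force_cpu: bool, force_gpu: bool, gpu_available: bool) -> Result<Device, String> {
    match (force_cpu, force_gpu) {
        (true, true) => Err("--force-cpu and --force-gpu are mutually exclusive".into()),
        (true, false) => Ok(Device::Cpu),
        (false, true) => {
            if gpu_available {
                Ok(Device::Gpu)
            } else {
                Err("GPU forced but no CUDA device available".into())
            }
        }
        // No force: fall back to whatever runtime detection found.
        (false, false) => Ok(if gpu_available { Device::Gpu } else { Device::Cpu }),
    }
}

fn main() {
    // Hypothetical env-var names; the PR does not spell out the exact keys.
    let force_cpu = env::var("COLGREP_FORCE_CPU").is_ok();
    let force_gpu = env::var("COLGREP_FORCE_GPU").is_ok();
    println!("{:?}", resolve_device(force_cpu, force_gpu, false));
}
```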
Thanks for the MR, I'll review it carefully this week :)
clippy fails with the other feature sets enabled; I didn't test my changes with them (also, it's interesting that the CI runs clippy only against that specific "openblas" feature rather than all of them or some other combination). I thought it would be useful to also test the CUDA path on Linux. I have WSL available, and it seems that even with these changes it still doesn't always use the GPU. Will look at that too now.
I merged your fastkmeans-rs pull request and released version 0.1.8 of fastkmeans including your changes :) @cepera-ang |
As soon as the MR behaves exactly as you expect on Windows, let me know. I'll do the tests on Linux and macOS; don't bother with clippy. I'll clone your MR and set you as co-author.
Force-pushed from d578b75 to cad0e19
OK, so the current version seems to work fine for me. The --force-cpu / --force-gpu flags now correctly force CPU or GPU execution across the full stack, including the underlying next-plaid paths.

While investigating performance, I also found that larger batch sizes were often making encoding slower even when they still fit in memory. The main reason is padding inefficiency: as the batch size grows, the probability of mixing in a long document increases, and the whole batch then runs at the speed of the longest item. Sorting inputs by text length before encoding fixes that and gives a noticeable speedup.

I also found that model initialization was using … The above changes already make the CPU version much faster on my PC (20-core Intel(R) Core(TM) i9-13900H); for example, it inits this repo in ~a minute (vs. 30 seconds on a mobile RTX 4060).

There is also a bump of cudarc to the latest version plus the corresponding changes (similar to fastkmeans). Additionally, this includes a small Windows-specific fix for ONNX Runtime caching: the current logic could fail to detect already-downloaded GPU DLLs and re-download them unnecessarily on each run.

Take a look and let me know what you think. All these …
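The length-sorting idea above can be sketched in a few lines. This is a hypothetical helper, not the PR's actual code: it orders document indices by token length and then chunks them, so each batch pads to the length of similar-sized items instead of one long outlier dragging the whole batch:

```rust
/// Group document indices into batches of similar length so that
/// per-batch padding is bounded by neighbors, not by the longest
/// document in the corpus.
fn length_sorted_batches(lengths: &[usize], batch_size: usize) -> Vec<Vec<usize>> {
    // Sort indices (not the documents themselves) so callers can
    // restore the original order of the resulting embeddings.
    let mut order: Vec<usize> = (0..lengths.len()).collect();
    order.sort_by_key(|&i| lengths[i]);
    order.chunks(batch_size).map(|c| c.to_vec()).collect()
}

fn main() {
    // Two long outliers (850, 900 tokens) among short documents:
    // sorted batching groups them together instead of letting each
    // one slow down a batch of short documents.
    let lengths = [5, 900, 7, 6, 850, 8];
    let batches = length_sorted_batches(&lengths, 2);
    println!("{:?}", batches); // [[0, 3], [2, 5], [4, 1]]
}
```

Because only indices are reordered, the caller can scatter the encoded batches back into the original document order afterwards.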
Amazing @cepera-ang! Then I will create another pull request based on yours, add you as co-author, and merge your update. I might tweak a thing or two, but I'll keep whatever makes colgrep work fine on Windows, with careful testing on the other OSes.
As discussed in #34, this PR is an attempt to make sure that the CUDA version of colgrep runs end to end on the GPU without any unexpected fallbacks.
Requires lightonai/fastkmeans-rs#2 to land first (and will need an update to the new version after that; I used a vendored version locally for testing).