Replies: 1 comment
From my point of view, the lack of speedup is not surprising if each worker is paying the model initialization cost, competing for the same GPU kernels, or serializing around shared device resources. High VRAM usage alone does not imply useful parallel throughput. The stronger optimization path is usually one long-lived loaded model with batching or a request queue, rather than spawning separate compressor instances and hoping the GPU scheduler turns that into parallel wins.
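As a minimal sketch of that queue-based pattern: one long-lived service owns the model and a single worker thread, and callers submit prompts and get futures back instead of spawning their own model copies. The `compress` function below is a trivial stand-in, and `CompressorService` is a hypothetical name, not part of the LLMLingua API; with LLMLingua-2 you would load the model once in `__init__` and call it inside the worker.

```python
from concurrent.futures import ThreadPoolExecutor, Future


def compress(prompt: str) -> str:
    # Stand-in for the real compressor call. With LLMLingua-2 this would
    # invoke a model that was loaded once, outside this function, rather
    # than paying initialization cost per request.
    return prompt[: len(prompt) // 2]


class CompressorService:
    """Hypothetical wrapper: a single long-lived worker owns the GPU,
    and all callers funnel requests through its queue."""

    def __init__(self) -> None:
        # One worker thread serializes all model calls; the executor's
        # internal queue plays the role of the request queue.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, prompt: str) -> Future:
        # Non-blocking: the caller gets a Future and can submit many
        # prompts before collecting results.
        return self._pool.submit(compress, prompt)


service = CompressorService()
futures = [service.submit(p) for p in ["aaaa", "bbbbbb", "cc"]]
results = [f.result() for f in futures]
print(results)  # -> ['aa', 'bbb', 'c']
```

The point is not the thread itself but the ownership model: one loaded model, one queue, no per-worker initialization, so adding callers adds queued work rather than GPU contention.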
-
Hello,
Since in my case LLMLingua-2 only uses 4–6 GB of VRAM, I've tried to optimise it by running LLMLingua-2 in parallel using `concurrent.futures` and `multiprocessing`, on both my local 5090 and g6.xlarge instances.
Everything seems to run normally, but even with plenty of spare VRAM, parallelism doesn't reduce the processing time at all, even though it consumes most of the memory.
Is this expected behavior? Are there any workarounds to run it in parallel effectively?