Replies: 1 comment
From my point of view, the lack of speedup is not surprising if each worker is paying the model initialization cost, competing for the same GPU kernels, or serializing around shared device resources. High VRAM usage alone does not imply useful parallel throughput. The stronger optimization path is usually one long-lived loaded model with batching or a request queue, rather than spawning separate compressor instances and hoping the GPU scheduler turns that into parallel wins.
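As a minimal sketch of that queue-based pattern: one long-lived service owns the model and a single worker thread, and callers submit prompts and get futures back instead of spawning their own model copies. The `compress` function below is a trivial stand-in, and `CompressorService` is a hypothetical name, not part of the LLMLingua API; with LLMLingua-2 you would load the model once in `__init__` and call it inside the worker.

```python
from concurrent.futures import ThreadPoolExecutor, Future


def compress(prompt: str) -> str:
    # Stand-in for the real compressor call. With LLMLingua-2 this would
    # invoke a model that was loaded once, outside this function, rather
    # than paying initialization cost per request.
    return prompt[: len(prompt) // 2]


class CompressorService:
    """Hypothetical wrapper: a single long-lived worker owns the GPU,
    and all callers funnel requests through its queue."""

    def __init__(self) -> None:
        # One worker thread serializes all model calls; the executor's
        # internal queue plays the role of the request queue.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def submit(self, prompt: str) -> Future:
        # Non-blocking: the caller gets a Future and can submit many
        # prompts before collecting results.
        return self._pool.submit(compress, prompt)


service = CompressorService()
futures = [service.submit(p) for p in ["aaaa", "bbbbbb", "cc"]]
results = [f.result() for f in futures]
print(results)  # -> ['aa', 'bbb', 'c']
```

The point is not the thread itself but the ownership model: one loaded model, one queue, no per-worker initialization, so adding callers adds queued work rather than GPU contention.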
-
Hello,
Since in my case LLMLingua-2 only uses 4–6 GB of VRAM, I've tried to optimise it by running LLMLingua-2 in parallel using `concurrent.futures` and `multiprocessing`, on both my local 5090 and g6.xlarge instances.
Everything seems to run normally, but even with plenty of spare VRAM, parallelism doesn't reduce the processing time at all, even though it consumes most of the memory.
Is this expected behavior? Are there any workarounds to run it in parallel effectively?