[wip] Add batch inference pipeline tutorials (image + LLM)#228

Open
kumare3 wants to merge 1 commit into main from inference-pipeline
Conversation

@kumare3 kumare3 commented Mar 23, 2026

This PR adds two tutorials demonstrating the InferencePipeline abstraction from flyte.extras for maximizing GPU utilization during batch inference.

batch_image_pipeline.py — Image Classification with ResNet-50

3-stage pipeline: download images → resize/normalize → GPU inference → decode labels. Optimizations informed by GPU expert review:

  • torch.compile(dynamic=False, mode="reduce-overhead") with multi-size warmup to pre-compile CUDA graphs for all plausible batch sizes
  • pin_memory() + non_blocking H2D transfer to overlap PCIe with compute
  • Top-5 computed on-device — 200x less D2H data vs full 1000-class logits
  • min_batch_size=8 prevents pathological batch-of-1 under low concurrency (T4 throughput drops ~15x at batch=1 vs batch=32 for ResNet-50)
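The min_batch_size guard in the last bullet can be sketched in plain Python. This is an illustrative standalone sketch, not the tutorial's actual API (`should_flush` and the timeout parameter are hypothetical names):

```python
def should_flush(pending: int, min_batch_size: int = 8,
                 waited_s: float = 0.0, max_wait_s: float = 0.05) -> bool:
    """Flush only when enough items have queued, or the wait deadline expires.

    Without a min_batch_size floor, a lone item under low concurrency is
    dispatched as a batch of 1 — which, per the numbers above, runs
    ResNet-50 on a T4 at roughly 1/15th the throughput of batch=32.
    """
    if pending == 0:
        return False
    if pending >= min_batch_size:
        return True  # enough queued work for an efficient GPU batch
    return waited_s >= max_wait_s  # don't starve latency under low load

# A single queued item is held back until the deadline passes...
assert should_flush(pending=1, waited_s=0.0) is False
# ...but a full batch dispatches immediately.
assert should_flush(pending=8) is True
```

The timeout escape hatch matters: a hard floor with no deadline would deadlock whenever fewer than min_batch_size items remain in the stream.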

batch_llm_pipeline.py — LLM Inference with Qwen2.5-0.5B on JSONL

3-stage pipeline: read JSONL → tokenize with token-count cost estimation → model.generate() via token-budgeted batching → decode + substring eval. Optimizations:

  • Token-budgeted DynamicBatcher (TARGET_BATCH_TOKENS=4096) assembles variable-length prompts into GPU-optimal batches
  • tokenizer.pad() for correct left-padding (avoids pad_token_id=0 / BOS corruption bug from manual padding)
  • eos_token_id for early termination — short answers stop generating instead of padding to max_new_tokens
  • Warmup at full batch=16, max_new_tokens=128 to pre-allocate KV cache
  • Includes 100-prompt demo JSONL with substring-match evaluation
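The token-budget rule from the first bullet can be sketched as a greedy packer. This is a simplified synchronous sketch, not the real DynamicBatcher (which is asynchronous and stream-fed); the key point is that a padded batch costs max_len × batch_size tokens of GPU work, so the budget must be checked against the would-be padded size, not the raw token sum:

```python
TARGET_BATCH_TOKENS = 4096

def budget_batches(prompt_lengths: list[int],
                   target_tokens: int = TARGET_BATCH_TOKENS) -> list[list[int]]:
    """Greedily pack prompts so that max(batch) * len(batch) <= target_tokens."""
    batches: list[list[int]] = []
    current: list[int] = []
    for n in prompt_lengths:
        padded = max(current + [n]) * (len(current) + 1)
        if current and padded > target_tokens:
            batches.append(current)  # budget exceeded: flush and start fresh
            current = [n]
        else:
            current.append(n)
    if current:
        batches.append(current)
    return batches

# Short prompts pack densely (40 per batch at length 100); a single long
# prompt would blow the padded budget, so it lands in its own batch.
print(budget_batches([100] * 50 + [3000]))
```

Left-padding to a shared length is what makes the padded-size accounting accurate, which is also why the tokenizer.pad() fix in the second bullet matters.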

Both examples use ReusePolicy so multiple concurrent tasks on the same replica share a single InferencePipeline singleton — the DynamicBatcher sees items from all streams, producing larger GPU batches automatically.
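The replica-sharing behavior described above can be illustrated with a minimal lazy singleton. All names here are stand-ins (the actual tutorials wire this through ReusePolicy and flyte.extras.InferencePipeline); the sketch only shows why concurrent tasks in one process end up feeding the same batcher:

```python
import asyncio

_pipeline = None
_pipeline_lock = asyncio.Lock()

async def get_pipeline():
    """Return one shared pipeline per process (i.e. per container replica).

    Concurrent tasks scheduled onto the same replica all receive the same
    object, so their items flow into one batcher and form larger GPU batches.
    """
    global _pipeline
    async with _pipeline_lock:  # guard lazy init against concurrent first calls
        if _pipeline is None:
            _pipeline = object()  # stand-in for InferencePipeline(...)
    return _pipeline

async def main() -> bool:
    # Two "tasks" running concurrently on the same replica:
    a, b = await asyncio.gather(get_pipeline(), get_pipeline())
    return a is b

shared = asyncio.run(main())
print(shared)  # True: both tasks got the same instance
```

The lock matters only for the first call: without it, two tasks racing through the None check could each construct a pipeline, defeating the batch-merging behavior.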

Depends on: flyteorg/flyte-sdk#826

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@worker.task(cache="auto", retries=2)
async def generate_responses(
    jsonl_file: flyte.io.File,
we can use flyteplugins-jsonl here.

Comment on lines +379 to +389
# Split into chunk files and upload
tasks = []
for i in range(0, len(lines), chunk_size):
    chunk_lines = lines[i : i + chunk_size]
    chunk_id = f"chunk_{i // chunk_size:03d}"

    # Write chunk to temp file and upload
    chunk_path = os.path.join(tempfile.gettempdir(), f"{chunk_id}.jsonl")
    with open(chunk_path, "w") as f:
        f.write("\n".join(chunk_lines) + "\n")
    chunk_file = await flyte.io.File.from_local(chunk_path)
Comment on lines +187 to +191
loop = asyncio.get_running_loop()
tensor = await loop.run_in_executor(
    _io_cpu_pool,
    lambda: _preprocess_transform(Image.open(BytesIO(resp.content)).convert("RGB")),
)
we can set preprocess_executor no?

        for i in range(len(batch))
    ]

    return await loop.run_in_executor(_gpu_pool, _forward)
don't think we need this, right?

    texts = tokenizer.batch_decode(generated, skip_special_tokens=True)
    return texts

return await loop.run_in_executor(_gpu_pool, _generate)
don't think we need this, right?
