[wip] Add batch inference pipeline tutorials (image + LLM) #228
Open
Conversation
Two new tutorials demonstrating the InferencePipeline abstraction from flyte.extras for maximizing GPU utilization during batch inference.

## batch_image_pipeline.py — Image Classification with ResNet-50

3-stage pipeline: download images → resize/normalize → GPU inference → decode labels. Optimizations informed by GPU expert review:

- torch.compile(dynamic=False, mode="reduce-overhead") with multi-size warmup to pre-compile CUDA graphs for all plausible batch sizes
- pin_memory() + non_blocking H2D transfer to overlap PCIe with compute
- Top-5 computed on-device — 200x less D2H data vs full 1000-class logits
- min_batch_size=8 prevents pathological batch-of-1 under low concurrency (T4 throughput drops ~15x at batch=1 vs batch=32 for ResNet-50)

## batch_llm_pipeline.py — LLM Inference with Qwen2.5-0.5B on JSONL

3-stage pipeline: read JSONL → tokenize with token-count cost estimation → model.generate() via token-budgeted batching → decode + substring eval. Optimizations:

- Token-budgeted DynamicBatcher (TARGET_BATCH_TOKENS=4096) assembles variable-length prompts into GPU-optimal batches
- tokenizer.pad() for correct left-padding (avoids pad_token_id=0 / BOS corruption bug from manual padding)
- eos_token_id for early termination — short answers stop generating instead of padding to max_new_tokens
- Warmup at full batch=16, max_new_tokens=128 to pre-allocate KV cache
- Includes 100-prompt demo JSONL with substring-match evaluation

Both examples use ReusePolicy so multiple concurrent tasks on the same replica share a single InferencePipeline singleton — the DynamicBatcher sees items from all streams, producing larger GPU batches automatically.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
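The token-budgeted batching described above can be sketched roughly as follows. This is a minimal sketch only: the actual DynamicBatcher internals are not shown in this PR, so the function name and greedy packing logic below are assumptions. The key idea is that a padded batch costs `len(batch) * longest_prompt` tokens, not the sum of prompt lengths.

```python
# Hypothetical sketch of token-budgeted batch assembly; the real
# DynamicBatcher is async and stream-fed, this is the core packing idea.
TARGET_BATCH_TOKENS = 4096

def budget_batches(prompts, token_counts, target_tokens=TARGET_BATCH_TOKENS):
    """Greedily pack variable-length prompts into batches whose padded
    footprint (batch size * longest prompt in the batch) stays under budget."""
    batches, current, current_max = [], [], 0
    for prompt, n_tokens in zip(prompts, token_counts):
        new_max = max(current_max, n_tokens)
        # Adding this prompt re-pads the whole batch to the new max length.
        if current and (len(current) + 1) * new_max > target_tokens:
            batches.append(current)
            current, current_max = [], 0
            new_max = n_tokens
        current.append(prompt)
        current_max = new_max
    if current:
        batches.append(current)
    return batches
```

One long prompt next to many short ones therefore flushes the batch early, which is why the PR pairs this with min_batch_size to avoid degenerate tiny batches.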
```python
@worker.task(cache="auto", retries=2)
async def generate_responses(
    jsonl_file: flyte.io.File,
```
Collaborator

we can use flyteplugins-jsonl here.
Comment on lines +379 to +389
```python
# Split into chunk files and upload
tasks = []
for i in range(0, len(lines), chunk_size):
    chunk_lines = lines[i : i + chunk_size]
    chunk_id = f"chunk_{i // chunk_size:03d}"

    # Write chunk to temp file and upload
    chunk_path = os.path.join(tempfile.gettempdir(), f"{chunk_id}.jsonl")
    with open(chunk_path, "w") as f:
        f.write("\n".join(chunk_lines) + "\n")
    chunk_file = await flyte.io.File.from_local(chunk_path)
```
Collaborator

https://www.union.ai/docs/v2/selfmanaged/user-guide/task-programming/files-and-directories/#batch-iteration simplifies creating JSONL chunks and https://github.com/flyteorg/flyte-sdk/blob/f2fa58c13e187524a0ed4ba237b9835dceae52c1/plugins/jsonl/examples/jsonl_file.py#L87-L94 can be used to write a batch of chunks to a JSONLFile.
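For reference, the chunking step in the diff can be expressed with only the standard library. This is a generic sketch, not the flyteplugins-jsonl API (whose helpers, as suggested above, would replace this boilerplate entirely); the function name and signature are assumptions.

```python
# Stdlib-only sketch of splitting JSONL lines into numbered chunk files.
import itertools
import tempfile
from pathlib import Path

def write_jsonl_chunks(lines, chunk_size, out_dir=None):
    """Split an iterable of JSONL lines into chunk_size-line files;
    returns the list of chunk file paths."""
    out_dir = Path(out_dir or tempfile.mkdtemp())
    it = iter(lines)
    paths = []
    for idx in itertools.count():
        chunk = list(itertools.islice(it, chunk_size))
        if not chunk:
            break
        path = out_dir / f"chunk_{idx:03d}.jsonl"
        path.write_text("\n".join(chunk) + "\n")
        paths.append(path)
    return paths
```

Each path would then be uploaded with `flyte.io.File.from_local`, as in the diff.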
Comment on lines +187 to +191
```python
loop = asyncio.get_running_loop()
tensor = await loop.run_in_executor(
    _io_cpu_pool,
    lambda: _preprocess_transform(Image.open(BytesIO(resp.content)).convert("RGB")),
)
```
Collaborator

we can set preprocess_executor no?
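The pattern being discussed is offloading CPU-bound preprocessing to a dedicated thread pool so it never blocks the event loop; `preprocess_executor` presumably refers to a pipeline-level option that would do this internally. A minimal sketch of the manual version, with a trivial stand-in for `_preprocess_transform`:

```python
# Sketch of the manual executor-offload pattern from the diff.
# `preprocess` stands in for decode + resize/normalize on the CPU.
import asyncio
from concurrent.futures import ThreadPoolExecutor

_io_cpu_pool = ThreadPoolExecutor(max_workers=4)

def preprocess(raw: bytes) -> int:
    # Stand-in for PIL decode + torchvision transform; returns input size.
    return len(raw)

async def preprocess_async(raw: bytes) -> int:
    loop = asyncio.get_running_loop()
    # Runs in the thread pool, keeping the event loop free for other tasks.
    return await loop.run_in_executor(_io_cpu_pool, preprocess, raw)

async def main():
    return await asyncio.gather(*(preprocess_async(b"x" * n) for n in (1, 2, 3)))
```

A pipeline-level `preprocess_executor` parameter would hide exactly this boilerplate from the tutorial code.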
```python
        for i in range(len(batch))
    ]

    return await loop.run_in_executor(_gpu_pool, _forward)
```
Collaborator

don't think we need this, right?
```python
        texts = tokenizer.batch_decode(generated, skip_special_tokens=True)
        return texts

    return await loop.run_in_executor(_gpu_pool, _generate)
```
Collaborator

don't think we need this, right?
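Background for the question above: the executor hop only buys anything if the wrapped call blocks the event loop. A blocking call (here `time.sleep` as a stand-in for a synchronous `model.generate()`) run inline serializes everything; via `run_in_executor`, concurrent calls overlap. This is a generic illustration, not code from the PR.

```python
# Illustrates when loop.run_in_executor is actually needed: only for
# calls that block the event loop. time.sleep stands in for a blocking
# model.generate(); real async-native code would not need the hop.
import asyncio
import time

def blocking_generate() -> str:
    time.sleep(0.05)  # stand-in for a blocking GPU call
    return "text"

async def with_executor(n: int) -> float:
    loop = asyncio.get_running_loop()
    start = time.perf_counter()
    # n blocking calls overlap on worker threads.
    await asyncio.gather(*(loop.run_in_executor(None, blocking_generate) for _ in range(n)))
    return time.perf_counter() - start

async def inline(n: int) -> float:
    start = time.perf_counter()
    for _ in range(n):
        blocking_generate()  # blocks the loop; nothing else can run
    return time.perf_counter() - start
```

Whether the hop is needed here depends on whether anything else useful could run on the loop while `_generate` holds the GIL-releasing GPU call, which is the reviewer's point.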
Depends on: flyteorg/flyte-sdk#826