perf: Optimize CUDA graph batch size selection and padding #56

louiswang524 · 2025-12-30T03:25:03Z

Improve CUDA graph efficiency by reducing padding waste and using binary search for batch size lookups.

Key improvements:

Fine-grained batch size coverage:
- Small batches (1-7): Capture every size for zero padding waste
- Medium batches (8-32): Step by 4 instead of 8
- Large batches (32+): Keep step of 8 for memory efficiency
Binary search optimization:
- Replace linear search with bisect.bisect_left for O(log n) lookup
- Cleaner code with proper edge case handling

Performance impact:

Reduces average padding waste from 17.3% to 7.3% (9.9% improvement)
Particularly beneficial for common small batch sizes (3, 5, 7, 9, 11)
Trade-off: ~7 additional graphs (~700MB memory for max_bs=160)

Examples:

Batch size 3: 25% waste -> 0% waste (perfect fit)
Batch size 5: 37.5% waste -> 0% waste (perfect fit)
Batch size 9: 43.8% waste -> 25% waste
Batch size 11: 31.2% waste -> 8.3% waste

The memory overhead is acceptable for modern GPUs (>40GB VRAM) and the improved batch packing efficiency results in better GPU utilization.

Improve CUDA graph efficiency by reducing padding waste and using binary search for batch size lookups. Key improvements: 1. Fine-grained batch size coverage: - Small batches (1-7): Capture every size for zero padding waste - Medium batches (8-32): Step by 4 instead of 8 - Large batches (32+): Keep step of 8 for memory efficiency 2. Binary search optimization: - Replace linear search with bisect.bisect_left for O(log n) lookup - Cleaner code with proper edge case handling Performance impact: - Reduces average padding waste from 17.3% to 7.3% (9.9% improvement) - Particularly beneficial for common small batch sizes (3, 5, 7, 9, 11) - Trade-off: ~7 additional graphs (~700MB memory for max_bs=160) Examples: - Batch size 3: 25% waste -> 0% waste (perfect fit) - Batch size 5: 37.5% waste -> 0% waste (perfect fit) - Batch size 9: 43.8% waste -> 25% waste - Batch size 11: 31.2% waste -> 8.3% waste The memory overhead is acceptable for modern GPUs (>40GB VRAM) and the improved batch packing efficiency results in better GPU utilization.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: Optimize CUDA graph batch size selection and padding #56

perf: Optimize CUDA graph batch size selection and padding #56

Uh oh!

louiswang524 commented Dec 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

perf: Optimize CUDA graph batch size selection and padding #56

Are you sure you want to change the base?

perf: Optimize CUDA graph batch size selection and padding #56

Uh oh!

Conversation

louiswang524 commented Dec 30, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant