
Conversation

@ppraneth

@ppraneth ppraneth commented Jan 6, 2026

This PR optimizes the bin-packing logic in slime/utils/data.py used to calculate the minimum number of micro-batches required for training. By replacing a linear search with a Max-Heap-based approach, we reduce the algorithmic complexity from $O(N^2)$ to $O(N \log B)$ where $N$ is the number of samples and $B$ is the number of resulting micro-batches.

The Problem

The current implementation of get_minimum_num_micro_batch_size uses a First-Fit algorithm. For every sequence length in the batch, it performs a linear scan through all existing micro-batches to find the first one with available capacity:

# slime/utils/data.py
for length in total_lengths:
    for i in range(len(batches)): # <--- Linear search: O(B)
        if batches[i] + length <= max_tokens_per_gpu:
            batches[i] += length
            break
    else:
        # No existing micro-batch fits; open a new one
        batches.append(length)

In large-scale RL training, where a rollout batch can reach tens of thousands of samples, the number of bins also grows. This nested loop leads to quadratic time complexity, causing a significant CPU bottleneck that stalls the GPU while the training process prepares the next batch.

The Solution

This PR implements a Best-Fit strategy using a Max-Heap to track the remaining capacity of each micro-batch.

  • Instead of scanning every bin, we only check the bin with the most available space.
  • Finding and updating the candidate bin now takes $O(\log B)$ time per sample rather than $O(B)$.
  • This ensures that the calculation remains efficient even as the number of samples scales into the hundreds of thousands.

Benchmark Results

import time
import heapq
import random

def packing_before(total_lengths, max_tokens_per_gpu):
    batches = []
    for length in total_lengths:
        for i in range(len(batches)):
            if batches[i] + length <= max_tokens_per_gpu:
                batches[i] += length
                break
        else:
            batches.append(length)
    return len(batches)

def packing_after(total_lengths, max_tokens_per_gpu):
    # Best-fit using a Max-Heap
    remaining_capacities = []
    for length in total_lengths:
        if remaining_capacities and (-remaining_capacities[0] >= length):
            most_space = -heapq.heappop(remaining_capacities)
            heapq.heappush(remaining_capacities, -(most_space - length))
        else:
            heapq.heappush(remaining_capacities, -(max_tokens_per_gpu - length))
    return len(remaining_capacities)

def run_benchmark():
    # max_tokens_per_gpu is usually large
    LIMIT = 32768
    # Test sizes: number of sequences in the batch
    test_sizes = [1000, 5000, 10000, 20000, 40000]

    print(f"{'N (Samples)':<12} | {'Before (O(N^2))':<15} | {'After (O(N log B))':<18} | {'Speedup':<10}")
    print("-" * 65)

    for n in test_sizes:
        # Generate random sequence lengths between 128 and 2048
        lengths = [random.randint(128, 2048) for _ in range(n)]

        # Benchmark Before
        start = time.perf_counter()
        count_before = packing_before(lengths, LIMIT)
        time_before = time.perf_counter() - start

        # Benchmark After
        start = time.perf_counter()
        count_after = packing_after(lengths, LIMIT)
        time_after = time.perf_counter() - start

        speedup = time_before / time_after if time_after > 0 else 0
        print(f"{n:<12} | {time_before:>13.5f}s | {time_after:>16.5f}s | {speedup:>8.1f}x")

if __name__ == "__main__":
    run_benchmark()
    

The following results demonstrate that the optimization prevents the quadratic performance degradation seen in the original version:

N (Samples) | Before (O(N^2)) | After (O(N log B)) | Speedup
1,000       | 0.00099s        | 0.00035s           | 2.8x
5,000       | 0.02458s        | 0.00221s           | 11.1x
10,000      | 0.09980s        | 0.00509s           | 19.6x
20,000      | 0.43859s        | 0.00962s           | 45.6x
40,000      | 2.60162s        | 0.03897s           | 66.8x

Impact

  • Eliminates Training Stalls: Reduces CPU-bound batch preparation time from seconds to milliseconds for large rollouts.
  • Improves Scaling: Critical for RL scaling workflows where massive rollout batches are partitioned across many DP ranks.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical performance optimization to the micro-batch calculation process, which is a known CPU bottleneck in large-scale Reinforcement Learning training. By transitioning from a quadratic time complexity First-Fit algorithm to a more efficient O(N log B) Max-Heap-based Best-Fit strategy, the change drastically reduces batch preparation times. This enhancement is crucial for improving training throughput, enabling better scaling of RL workflows, and freeing up CPU resources for other asynchronous tasks.

Highlights

  • Algorithmic Optimization: The micro-batch calculation logic in slime/utils/data.py has been optimized from an O(N^2) complexity to O(N log B) by replacing a linear search (First-Fit) with a Max-Heap-based approach (Best-Fit).
  • Performance Improvement: This change significantly reduces CPU bottlenecks during large-scale RL training, eliminating training stalls and improving overall scaling efficiency, with reported speedups up to 66.8x for 40,000 samples.
  • Implementation Detail: The get_minimum_num_micro_batch_size function now uses Python's heapq module to maintain a max-heap of remaining capacities, ensuring efficient bin selection.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an excellent performance optimization. The pull request description clearly explains the problem with the previous O(N^2) implementation and the benefits of the new O(N log B) heap-based approach. The benchmark results effectively demonstrate the significant speedup. The code change is correct and implements the described logic well. I have a couple of minor suggestions to make the code a bit more concise.

@zhaochenyang20
Collaborator

zhaochenyang20 commented Jan 6, 2026

Great work on identifying this CPU bottleneck!

The transition from $O(N^2)$ to $O(N \log B)$ using a heap is a solid algorithmic improvement for the calculation speed. However, I have a major concern regarding the Bin Packing Heuristic.

You are effectively switching from First-Fit to Largest-Fit, i.e. Worst-Fit (always picking the bin with the most available space). First-Fit tends to pack bins tighter, resulting in fewer total batches, while Worst-Fit tends to distribute load, potentially resulting in more total batches.

Since the function is named get_minimum_num_micro_batch_size, our goal is to minimize the number of batches to reduce GPU kernel overhead. If this PR increases the total number of micro-batches (even if the CPU calculation is faster), it might degrade the actual training throughput.

Please run your benchmark again and compare the output values (len(remaining_capacities) vs len(batches)) between the two implementations. If the new approach generates significantly more batches (e.g., >1%), we may need to switch to a Best-Fit strategy (using a BST/SortedList, or a Min-Heap with segmentation) instead of Worst-Fit, or confirm that the regression is acceptable.

@zhaochenyang20
Collaborator

The best approach is to use Best-Fit: find the bin that offers the tightest fit (i.e., the one with the least available space that is still sufficient).
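For reference, a minimal sketch of this Best-Fit heuristic, assuming a plain sorted list of remaining capacities (names are illustrative, not the PR's code; note that list pop/insort are O(B) per item, so this variant is still O(N * B) overall, a point discussed further down the thread):

import bisect

def packing_best_fit(total_lengths, max_tokens_per_gpu):
    remaining = []  # remaining capacities, kept sorted ascending
    for length in total_lengths:
        # Tightest fit: the smallest remaining capacity that still fits `length`.
        idx = bisect.bisect_left(remaining, length)
        if idx < len(remaining):
            cap = remaining.pop(idx)
            bisect.insort(remaining, cap - length)
        else:
            # No existing bin fits; open a new one.
            bisect.insort(remaining, max_tokens_per_gpu - length)
    return len(remaining)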

@zhaochenyang20
Collaborator

zhaochenyang20 commented Jan 6, 2026

To clarify the algorithmic choices, let's look at the three standard heuristics:

  1. First-Fit (Original Code): Linearly scan from the beginning and pick the first bin that fits.

  2. Largest-Fit (Current PR): Pick the bin with the largest global remaining capacity (using a Max-Heap).

  3. Best-Fit (Ideal): Pick the bin with the smallest remaining capacity that can still accommodate the sequence.

Among these, Option 3 is actually what we want. It packs bins the "tightest," producing the minimum number of total micro-batches. Option 2 (Largest-Fit) tends to distribute load evenly, which often results in more half-empty bins and higher GPU kernel overhead.

@PopSoda2002
Contributor

PopSoda2002 commented Jan 6, 2026

To clarify the algorithmic choices, let's look at the three standard heuristics:

  1. First-Fit (Original Code): Linearly scan from the beginning and pick the first bin that fits.
  2. Largest-Fit (Current PR): Pick the bin with the largest global remaining capacity (using a Max-Heap).
  3. Best-Fit (Ideal): Pick the bin with the smallest remaining capacity that can still accommodate the sequence.

Among these, Option 3 is actually what we want. It packs bins the "tightest," producing the minimum number of total micro-batches. Option 2 (Largest-Fit) tends to distribute load evenly, which often results in more half-empty bins and higher GPU kernel overhead.

I agree with @zhaochenyang20 and we talked about the implementation details:

  1. I think the name should be changed from get_minimum... to something like best_fit..., because, objectively speaking, finding the true minimum number of buckets is an NP-hard problem.
  2. Second, for implementing Best-Fit, an easy and efficient way is to sort first, then use binary search to find the lower bound.

Hope it helps!

return len(self.samples)


def get_minimum_num_micro_batch_size(total_lengths, max_tokens_per_gpu):
Contributor


And I think for this case, writing a new function would be better.

@ppraneth
Author

ppraneth commented Jan 7, 2026

@zhaochenyang20 @PopSoda2002 The major concern regarding increased batch counts is resolved. BFD generates ~2.1% fewer batches than the Previous (Worst-Fit) implementation.
Benchmark code:

import time
import heapq
import random
import bisect
import numpy as np
from typing import List

# First-Fit (O(N^2))
def packing_original(total_lengths: List[int], limit: int):
    bins = []
    for length in total_lengths:
        for i in range(len(bins)):
            if bins[i] + length <= limit:
                bins[i] += length
                break
        else:
            bins.append(length)
    return bins

# Largest-Fit/Worst-Fit (O(N log B))
def packing_worst_fit(total_lengths: List[int], limit: int):
    remaining_capacities = []
    for length in total_lengths:
        if remaining_capacities and (-remaining_capacities[0] >= length):
            most_space = -heapq.heappop(remaining_capacities)
            heapq.heappush(remaining_capacities, -(most_space - length))
        else:
            heapq.heappush(remaining_capacities, -(limit - length))
    # We return [limit - remaining] to get the filled tokens per bin
    return [limit + res for res in remaining_capacities]

# Best-Fit Decreasing (O(N log N))
def packing_best_fit_decreasing(total_lengths: List[int], limit: int):
    sorted_lengths = sorted(total_lengths, reverse=True)
    bin_totals = [] 
    for length in sorted_lengths:
        threshold = limit - length
        idx = bisect.bisect_right(bin_totals, threshold)
        if idx > 0:
            val = bin_totals.pop(idx - 1)
            bisect.insort(bin_totals, val + length)
        else:
            bisect.insort(bin_totals, length)
    return bin_totals

def run_incremental_benchmark():
    LIMIT = 32768
    
    sample_sizes = [5000, 10000, 20000, 30000, 40000]
    
    methods = [
        ("Original (FF)", packing_original),
        ("Previous (WF)", packing_worst_fit),
        ("New (BFD)", packing_best_fit_decreasing)
    ]

    print(f"{'N':<7} | {'Method':<15} | {'Time':<10} | {'Bins':<6} | {'Util %':<8} | {'SD'}")
    print("-" * 65)

    for n in sample_sizes:
        
        lengths = [random.randint(512, 4096) for _ in range(n)]
        
        for name, func in methods:
            start = time.perf_counter()
            bins = func(lengths, LIMIT)
            duration = time.perf_counter() - start
            
            # Metrics
            count = len(bins)
            utils = [(b / LIMIT) * 100 for b in bins]
            avg_util = np.mean(utils)
            std_dev = np.std(utils)
            
            print(f"{n:<7} | {name:<15} | {duration:>8.4f}s | {count:>6} | {avg_util:>7.2f}% | {std_dev:>5.2f}")
        print("-" * 65)

if __name__ == "__main__":
    run_incremental_benchmark()

Benchmark Result:

N     | Method        | Time (s) | Bins | Util %  | SD
10000 | Original (FF) |  0.2007  |  702 | 99.34%  | 2.39
10000 | Previous (WF) |  0.0049  |  714 | 97.67%  | 1.38
10000 | New (BFD)     |  0.0092  |  699 | 99.76%  | 1.72
20000 | Original (FF) |  0.8691  | 1412 | 99.43%  | 0.64
20000 | Previous (WF) |  0.0117  | 1439 | 97.56%  | 1.73
20000 | New (BFD)     |  0.0239  | 1407 | 99.78%  | 1.82
30000 | Original (FF) |  1.9904  | 2115 | 99.39%  | 1.43
30000 | Previous (WF) |  0.0170  | 2151 | 97.72%  | 1.27
30000 | New (BFD)     |  0.0487  | 2106 | 99.81%  | 0.77
40000 | Original (FF) |  3.5739  | 2826 | 99.41%  | 0.81
40000 | Previous (WF) |  0.0232  | 2877 | 97.64%  | 1.93
40000 | New (BFD)     |  0.0771  | 2815 | 99.79%  | 1.75

@ppraneth
Author

ppraneth commented Jan 7, 2026

Could you please check whether the updated code looks OK?
If it is fine, I can change the function name.

@ppraneth ppraneth requested a review from PopSoda2002 January 7, 2026 18:20
@PopSoda2002
Contributor

PopSoda2002 commented Jan 7, 2026

I am worried about this analysis now:

  • First-Fit: the time complexity is O(N * B), right? B is the total number of buckets.
  • Current implementation: when we maintain the sorted array, the time complexity is still O(N * B), not O(N * log B), right? Because inserting each new sequence into the list is O(B).

So the improvement is pretty marginal compared to the initial version @ppraneth cc @zhaochenyang20
And personally, I don't think this part is the bottleneck, right? Did you run an experiment showing how this change makes the whole system run faster, e.g., the one-step time?

@ppraneth
Author

ppraneth commented Jan 8, 2026

@PopSoda2002 You’re technically correct that maintaining a sorted list still results in O(N·B) theoretical complexity because pop and insort are linear operations.

That said, the practical performance improvement is meaningful for a few reasons:

Interpreter overhead: The original implementation performs N×B comparisons in Python-level loops. The revised version moves the search and shifting to bisect and list primitives implemented in C, which significantly reduces Python interpreter overhead.

Benchmark data: In isolation, at N = 40,000 samples, the original version takes ~3.57s while the revised version takes ~0.07s (≈ 50× faster).

At N = 250,000 samples, the original version takes ~159.36s and produces 17,684 bins, while the revised version takes ~2.43s and produces 17,613 bins (≈ 65× faster, 71 fewer bins).

Downstream efficiency: By sorting first (Best-Fit Decreasing), we achieve tighter packing. In the benchmark this reduces the number of micro-batches (2815 vs 2826), which can translate to fewer forward/backward launches and slightly better GPU utilization.

I agree this path is not currently the primary system bottleneck, and I’m not claiming a large end-to-end step-time reduction. The goal here is to reduce Python-level overhead and improve packing efficiency in micro-batch construction, while keeping this logic efficient and scalable as batch sizes grow.

@ppraneth ppraneth changed the title perf: optimize micro-batch calculation from O(N^2) to O(N log B) perf: optimize micro-batch calculation Jan 8, 2026
@ppraneth
Author

ppraneth commented Jan 9, 2026

Time complexity:

  • First-Fit: O(N²) in the average and worst case.
  • Worst-Fit (heap): O(N log N).
  • Best-Fit Decreasing: O(N²) in the worst case, with O(N log N) behavior observed in practice due to sorting and early placement.

Tradeoffs:

  • Worst-Fit: Fastest in time and scales well, but tends to spread items across bins, resulting in more bins and lower packing efficiency.
  • Best-Fit Decreasing: Although it shares the same worst-case complexity as First-Fit, it is significantly faster in practice and achieves much tighter packing, leading to fewer bins and reduced downstream micro-batching and GPU launch overhead.

@PopSoda2002
Contributor

Time complexity:

  • First-Fit: O(N²) in the average and worst case.
  • Worst-Fit (heap): O(N log N).
  • Best-Fit Decreasing: O(N²) in the worst case, with O(N log N) behavior observed in practice due to sorting and early placement.

Tradeoffs:

  • Worst-Fit: Fastest in time and scales well, but tends to spread items across bins, resulting in more bins and lower packing efficiency.
  • Best-Fit Decreasing: Although it shares the same worst-case complexity as First-Fit, it is significantly faster in practice and achieves much tighter packing, leading to fewer bins and reduced downstream micro-batching and GPU launch overhead.

I think maybe you need to recalculate the time complexity for First-Fit: it should be O(N*B), not O(N^2), and we usually will not sample as many as 40,000 sequences. The e2e optimization should be benchmarked.

@ppraneth
Author

ppraneth commented Jan 12, 2026

Time complexity:

  • First-Fit: O(N²) in the average and worst case.
  • Worst-Fit (heap): O(N log N).
  • Best-Fit Decreasing: O(N²) in the worst case, with O(N log N) behavior observed in practice due to sorting and early placement.

Tradeoffs:

  • Worst-Fit: Fastest in time and scales well, but tends to spread items across bins, resulting in more bins and lower packing efficiency.
  • Best-Fit Decreasing: Although it shares the same worst-case complexity as First-Fit, it is significantly faster in practice and achieves much tighter packing, leading to fewer bins and reduced downstream micro-batching and GPU launch overhead.

I think maybe you need to calculate the time complexity again for first fit, it should be O(N*B) not O(N^2), and usually we will not sample like 40000. The e2e optimization should be benchmarked.

@PopSoda2002 I agree that $O(N \cdot B)$ is a better way to describe the complexity of the First-Fit algorithm, as $B$ represents the actual number of bins created. I assumed the theoretical worst-case scenario where $B = N$ (each sequence requires its own bin), which is what leads to the $O(N^2)$ upper bound.

While the revised BFD implementation technically shares this $O(N \cdot B)$ theoretical complexity, it is substantially faster in practice because the search and shifting are handled by C-level bisect and list primitives.
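If the O(N * B) cost of plain-list insertion ever becomes a concern, one possible direction (a sketch only, assuming the third-party sortedcontainers package, which is not a stated dependency here) is to keep the remaining capacities in a SortedList, giving roughly O(log B) per lookup/insert:

from sortedcontainers import SortedList

def packing_best_fit_sortedlist(total_lengths, max_tokens_per_gpu):
    remaining = SortedList()  # remaining capacities, ascending
    for length in total_lengths:
        # Tightest fit: the smallest remaining capacity that still fits `length`.
        idx = remaining.bisect_left(length)
        if idx < len(remaining):
            cap = remaining.pop(idx)
            remaining.add(cap - length)
        else:
            remaining.add(max_tokens_per_gpu - length)
    return len(remaining)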
