
Conversation

@ppraneth

@ppraneth ppraneth commented Jan 6, 2026

This PR optimizes the bin-packing logic in slime/utils/data.py used to calculate the minimum number of micro-batches required for training. By replacing a linear search with a Max-Heap-based approach, we reduce the algorithmic complexity from $O(N^2)$ to $O(N \log B)$ where $N$ is the number of samples and $B$ is the number of resulting micro-batches.

The Problem

The current implementation of get_minimum_num_micro_batch_size uses a First-Fit algorithm. For every sequence length in the batch, it performs a linear scan through all existing micro-batches to find the first one with available capacity:

# slime/utils/data.py
for length in total_lengths:
    for i in range(len(batches)): # <--- Linear search: O(B)
        if batches[i] + length <= max_tokens_per_gpu:
            batches[i] += length
            break
    else:
        # No existing micro-batch fits; open a new one
        batches.append(length)

In large-scale RL training, where a rollout batch can reach tens of thousands of samples, the number of bins also grows. This nested loop leads to quadratic time complexity, causing a significant CPU bottleneck that stalls the GPU while the training process prepares the next batch.

The Solution

This PR implements a Best-Fit strategy using a Max-Heap to track the remaining capacity of each micro-batch.

  • Instead of scanning every bin, we only check the bin with the most available space.
  • Finding and updating the candidate bin now takes $O(\log B)$ time per sample rather than $O(B)$.
  • This ensures that the calculation remains efficient even as the number of samples scales into the hundreds of thousands.

Benchmark Results

import time
import heapq
import random

def packing_before(total_lengths, max_tokens_per_gpu):
    batches = []
    for length in total_lengths:
        for i in range(len(batches)):
            if batches[i] + length <= max_tokens_per_gpu:
                batches[i] += length
                break
        else:
            batches.append(length)
    return len(batches)

def packing_after(total_lengths, max_tokens_per_gpu):
    # Best-fit using a Max-Heap
    remaining_capacities = []
    for length in total_lengths:
        if remaining_capacities and (-remaining_capacities[0] >= length):
            most_space = -heapq.heappop(remaining_capacities)
            heapq.heappush(remaining_capacities, -(most_space - length))
        else:
            heapq.heappush(remaining_capacities, -(max_tokens_per_gpu - length))
    return len(remaining_capacities)

def run_benchmark():
    # max_tokens_per_gpu is usually large
    LIMIT = 32768
    # Test sizes: number of sequences in the batch
    test_sizes = [1000, 5000, 10000, 20000, 40000]

    print(f"{'N (Samples)':<12} | {'Before (O(N^2))':<15} | {'After (O(N log B))':<18} | {'Speedup':<10}")
    print("-" * 65)

    for n in test_sizes:
        # Generate random sequence lengths between 128 and 2048
        lengths = [random.randint(128, 2048) for _ in range(n)]

        # Benchmark Before
        start = time.perf_counter()
        count_before = packing_before(lengths, LIMIT)
        time_before = time.perf_counter() - start

        # Benchmark After
        start = time.perf_counter()
        count_after = packing_after(lengths, LIMIT)
        time_after = time.perf_counter() - start

        speedup = time_before / time_after if time_after > 0 else 0
        print(f"{n:<12} | {time_before:>13.5f}s | {time_after:>16.5f}s | {speedup:>8.1f}x")

if __name__ == "__main__":
    run_benchmark()
    

The following results demonstrate that the optimization prevents the quadratic performance degradation seen in the original version:

N (Samples) | Before (O(N^2)) | After (O(N log B)) | Speedup
1,000       | 0.00099s        | 0.00035s           | 2.8x
5,000       | 0.02458s        | 0.00221s           | 11.1x
10,000      | 0.09980s        | 0.00509s           | 19.6x
20,000      | 0.43859s        | 0.00962s           | 45.6x
40,000      | 2.60162s        | 0.03897s           | 66.8x

Impact

  • Eliminates Training Stalls: Reduces CPU-bound batch preparation time from seconds to milliseconds for large rollouts.
  • Improves Scaling: Critical for RL scaling workflows where massive rollout batches are partitioned across many DP ranks.

@gemini-code-assist
Contributor

Summary of Changes

Hello @ppraneth, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a critical performance optimization to the micro-batch calculation process, which is a known CPU bottleneck in large-scale Reinforcement Learning training. By transitioning from a quadratic time complexity First-Fit algorithm to a more efficient O(N log B) Max-Heap-based Best-Fit strategy, the change drastically reduces batch preparation times. This enhancement is crucial for improving training throughput, enabling better scaling of RL workflows, and freeing up CPU resources for other asynchronous tasks.

Highlights

  • Algorithmic Optimization: The micro-batch calculation logic in slime/utils/data.py has been optimized from an O(N^2) complexity to O(N log B) by replacing a linear search (First-Fit) with a Max-Heap-based approach (Best-Fit).
  • Performance Improvement: This change significantly reduces CPU bottlenecks during large-scale RL training, eliminating training stalls and improving overall scaling efficiency, with reported speedups up to 66.8x for 40,000 samples.
  • Implementation Detail: The get_minimum_num_micro_batch_size function now uses Python's heapq module to maintain a max-heap of remaining capacities, ensuring efficient bin selection.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is an excellent performance optimization. The pull request description clearly explains the problem with the previous O(N^2) implementation and the benefits of the new O(N log B) heap-based approach. The benchmark results effectively demonstrate the significant speedup. The code change is correct and implements the described logic well. I have a couple of minor suggestions to make the code a bit more concise.

@zhaochenyang20
Collaborator

zhaochenyang20 commented Jan 6, 2026

Great work on identifying this CPU bottleneck!

The transition from $O(N^2)$ to $O(N \log B)$ using a heap is a solid algorithmic improvement for the calculation speed. However, I have a major concern regarding the Bin Packing Heuristic.

You are effectively switching from First-Fit to Largest-Fit, i.e. Worst-Fit (always picking the bin with the most available space). First-Fit tends to pack bins tighter, resulting in fewer total batches, while Worst-Fit tends to distribute load, potentially resulting in more total batches.

Since the function is named get_minimum_num_micro_batch_size, our goal is to minimize the number of batches to reduce GPU kernel overhead. If this PR increases the total number of micro-batches (even if the CPU calculation is faster), it might degrade the actual training throughput.

Please run your benchmark again and compare the output values (len(remaining_capacities) vs len(batches)) between the two implementations. If the new approach generates significantly more batches (e.g., >1%), we may need to switch to a Best-Fit strategy (using a BST/SortedList, or a Min-Heap with segmentation) instead of Worst-Fit, or confirm that the regression is acceptable.

@zhaochenyang20
Collaborator

The best approach is to use Best-Fit: find the bin that offers the tightest fit (i.e., the one with the least available space that is still sufficient).
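For reference, a minimal sketch of this Best-Fit heuristic, assuming a plain sorted list of remaining capacities (names are illustrative, not the PR's code; note that list pop/insort are O(B) per item, so this variant is still O(N * B) overall, a point discussed further down the thread):

import bisect

def packing_best_fit(total_lengths, max_tokens_per_gpu):
    remaining = []  # remaining capacities, kept sorted ascending
    for length in total_lengths:
        # Tightest fit: the smallest remaining capacity that still fits `length`.
        idx = bisect.bisect_left(remaining, length)
        if idx < len(remaining):
            cap = remaining.pop(idx)
            bisect.insort(remaining, cap - length)
        else:
            # No existing bin fits; open a new one.
            bisect.insort(remaining, max_tokens_per_gpu - length)
    return len(remaining)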

@zhaochenyang20
Collaborator

zhaochenyang20 commented Jan 6, 2026

To clarify the algorithmic choices, let's look at the three standard heuristics:

  1. First-Fit (Original Code): Linearly scan from the beginning and pick the first bin that fits.

  2. Largest-Fit (Current PR): Pick the bin with the largest global remaining capacity (using a Max-Heap).

  3. Best-Fit (Ideal): Pick the bin with the smallest remaining capacity that can still accommodate the sequence.

Among these, Option 3 is actually what we want. It packs bins the "tightest," producing the minimum number of total micro-batches. Option 2 (Largest-Fit) tends to distribute load evenly, which often results in more half-empty bins and higher GPU kernel overhead.

@PopSoda2002
Contributor

PopSoda2002 commented Jan 6, 2026

To clarify the algorithmic choices, let's look at the three standard heuristics:

  1. First-Fit (Original Code): Linearly scan from the beginning and pick the first bin that fits.
  2. Largest-Fit (Current PR): Pick the bin with the largest global remaining capacity (using a Max-Heap).
  3. Best-Fit (Ideal): Pick the bin with the smallest remaining capacity that can still accommodate the sequence.

Among these, Option 3 is actually what we want. It packs bins the "tightest," producing the minimum number of total micro-batches. Option 2 (Largest-Fit) tends to distribute load evenly, which often results in more half-empty bins and higher GPU kernel overhead.

I agree with @zhaochenyang20 and we talked about the implementation details:

  1. I think the name should be changed from get_minimum... to something like best_fit..., because, objectively speaking, finding the true minimum number of buckets is an NP-hard problem.
  2. Second, for implementing Best-Fit, an easy and efficient way is to sort first, then use binary search to find the lower bound.

Hope it helps!

return len(self.samples)


def get_minimum_num_micro_batch_size(total_lengths, max_tokens_per_gpu):
Contributor


And I think for this case, writing a new function would be better.

@ppraneth
Author

ppraneth commented Jan 7, 2026

@zhaochenyang20 @PopSoda2002 The major concern regarding increased batch counts is resolved. BFD generates ~2.1% fewer batches than the Previous (Worst-Fit) implementation.
Benchmark code:

import time
import heapq
import random
import bisect
import numpy as np
from typing import List

# First-Fit (O(N^2))
def packing_original(total_lengths: List[int], limit: int):
    bins = []
    for length in total_lengths:
        for i in range(len(bins)):
            if bins[i] + length <= limit:
                bins[i] += length
                break
        else:
            bins.append(length)
    return bins

# Largest-Fit/Worst-Fit (O(N log B))
def packing_worst_fit(total_lengths: List[int], limit: int):
    remaining_capacities = []
    for length in total_lengths:
        if remaining_capacities and (-remaining_capacities[0] >= length):
            most_space = -heapq.heappop(remaining_capacities)
            heapq.heappush(remaining_capacities, -(most_space - length))
        else:
            heapq.heappush(remaining_capacities, -(limit - length))
    # We return [limit - remaining] to get the filled tokens per bin
    return [limit + res for res in remaining_capacities]

# Best-Fit Decreasing (O(N log N))
def packing_best_fit_decreasing(total_lengths: List[int], limit: int):
    sorted_lengths = sorted(total_lengths, reverse=True)
    bin_totals = [] 
    for length in sorted_lengths:
        threshold = limit - length
        idx = bisect.bisect_right(bin_totals, threshold)
        if idx > 0:
            val = bin_totals.pop(idx - 1)
            bisect.insort(bin_totals, val + length)
        else:
            bisect.insort(bin_totals, length)
    return bin_totals

def run_incremental_benchmark():
    LIMIT = 32768
    
    sample_sizes = [5000, 10000, 20000, 30000, 40000]
    
    methods = [
        ("Original (FF)", packing_original),
        ("Previous (WF)", packing_worst_fit),
        ("New (BFD)", packing_best_fit_decreasing)
    ]

    print(f"{'N':<7} | {'Method':<15} | {'Time':<10} | {'Bins':<6} | {'Util %':<8} | {'SD'}")
    print("-" * 65)

    for n in sample_sizes:
        
        lengths = [random.randint(512, 4096) for _ in range(n)]
        
        for name, func in methods:
            start = time.perf_counter()
            bins = func(lengths, LIMIT)
            duration = time.perf_counter() - start
            
            # Metrics
            count = len(bins)
            utils = [(b / LIMIT) * 100 for b in bins]
            avg_util = np.mean(utils)
            std_dev = np.std(utils)
            
            print(f"{n:<7} | {name:<15} | {duration:>8.4f}s | {count:>6} | {avg_util:>7.2f}% | {std_dev:>5.2f}")
        print("-" * 65)

if __name__ == "__main__":
    run_incremental_benchmark()

Benchmark Result:

N     | Method        | Time (s) | Bins | Util %  | SD
10000 | Original (FF) |  0.2007  |  702 | 99.34%  | 2.39
10000 | Previous (WF) |  0.0049  |  714 | 97.67%  | 1.38
10000 | New (BFD)     |  0.0092  |  699 | 99.76%  | 1.72
20000 | Original (FF) |  0.8691  | 1412 | 99.43%  | 0.64
20000 | Previous (WF) |  0.0117  | 1439 | 97.56%  | 1.73
20000 | New (BFD)     |  0.0239  | 1407 | 99.78%  | 1.82
30000 | Original (FF) |  1.9904  | 2115 | 99.39%  | 1.43
30000 | Previous (WF) |  0.0170  | 2151 | 97.72%  | 1.27
30000 | New (BFD)     |  0.0487  | 2106 | 99.81%  | 0.77
40000 | Original (FF) |  3.5739  | 2826 | 99.41%  | 0.81
40000 | Previous (WF) |  0.0232  | 2877 | 97.64%  | 1.93
40000 | New (BFD)     |  0.0771  | 2815 | 99.79%  | 1.75

@ppraneth
Author

ppraneth commented Jan 7, 2026

Could you please check whether the updated code looks OK?
If it is fine, I can change the function name.

@ppraneth ppraneth requested a review from PopSoda2002 January 7, 2026 18:20
@PopSoda2002
Contributor

PopSoda2002 commented Jan 7, 2026

I am worried about this analysis now:

  • First-Fit: the time complexity is O(N * B), right? B is the total number of buckets.
  • Current implementation: when we maintain the sorted array, the time complexity is still O(N * B), not O(N * log B), right? Because inserting each new sequence into the list is O(B).

So the improvement is pretty marginal compared to the initial version @ppraneth cc @zhaochenyang20
And personally, I don't think this part is the bottleneck, right? Did you run an experiment showing how this change makes the whole system run faster, e.g., the one-step time?

@ppraneth
Author

ppraneth commented Jan 8, 2026

@PopSoda2002 You’re technically correct that maintaining a sorted list still results in O(N·B) theoretical complexity because pop and insort are linear operations.

That said, the practical performance improvement is meaningful for a few reasons:

Interpreter overhead: The original implementation performs N×B comparisons in Python-level loops. The revised version moves the search and shifting to bisect and list primitives implemented in C, which significantly reduces Python interpreter overhead.

Benchmark data: In isolation, at N = 40,000 samples, the original version takes ~3.57s while the revised version takes ~0.07s (≈ 50× faster).

At N = 250,000 samples, the original version takes ~159.36s and produces 17,684 bins, while the revised version takes ~2.43s and produces 17,613 bins (≈ 65× faster, 71 fewer bins).

Downstream efficiency: By sorting first (Best-Fit Decreasing), we achieve tighter packing. In the benchmark this reduces the number of micro-batches (2815 vs 2826), which can translate to fewer forward/backward launches and slightly better GPU utilization.

I agree this path is not currently the primary system bottleneck, and I’m not claiming a large end-to-end step-time reduction. The goal here is to reduce Python-level overhead and improve packing efficiency in micro-batch construction, while keeping this logic efficient and scalable as batch sizes grow.

@ppraneth ppraneth changed the title perf: optimize micro-batch calculation from O(N^2) to O(N log B) perf: optimize micro-batch calculation Jan 8, 2026
@ppraneth
Author

ppraneth commented Jan 9, 2026

Time complexity:

  • First-Fit: O(N²) in the average and worst case.
  • Worst-Fit (heap): O(N log N).
  • Best-Fit Decreasing: O(N²) in the worst case, with O(N log N) behavior observed in practice due to sorting and early placement.

Tradeoffs:

  • Worst-Fit: Fastest in time and scales well, but tends to spread items across bins, resulting in more bins and lower packing efficiency.
  • Best-Fit Decreasing: Although it shares the same worst-case complexity as First-Fit, it is significantly faster in practice and achieves much tighter packing, leading to fewer bins and reduced downstream micro-batching and GPU launch overhead.

@PopSoda2002
Contributor

Time complexity:

  • First-Fit: O(N²) in the average and worst case.
  • Worst-Fit (heap): O(N log N).
  • Best-Fit Decreasing: O(N²) in the worst case, with O(N log N) behavior observed in practice due to sorting and early placement.

Tradeoffs:

  • Worst-Fit: Fastest in time and scales well, but tends to spread items across bins, resulting in more bins and lower packing efficiency.
  • Best-Fit Decreasing: Although it shares the same worst-case complexity as First-Fit, it is significantly faster in practice and achieves much tighter packing, leading to fewer bins and reduced downstream micro-batching and GPU launch overhead.

I think maybe you need to recalculate the time complexity for First-Fit: it should be O(N*B), not O(N^2), and we usually will not sample as many as 40,000 sequences. The e2e optimization should be benchmarked.

@ppraneth
Author

ppraneth commented Jan 12, 2026

Time complexity:

  • First-Fit: O(N²) in the average and worst case.
  • Worst-Fit (heap): O(N log N).
  • Best-Fit Decreasing: O(N²) in the worst case, with O(N log N) behavior observed in practice due to sorting and early placement.

Tradeoffs:

  • Worst-Fit: Fastest in time and scales well, but tends to spread items across bins, resulting in more bins and lower packing efficiency.
  • Best-Fit Decreasing: Although it shares the same worst-case complexity as First-Fit, it is significantly faster in practice and achieves much tighter packing, leading to fewer bins and reduced downstream micro-batching and GPU launch overhead.

I think maybe you need to calculate the time complexity again for first fit, it should be O(N*B) not O(N^2), and usually we will not sample like 40000. The e2e optimization should be benchmarked.

@PopSoda2002 I agree that $O(N \cdot B)$ is a better way to describe the complexity of the First-Fit algorithm, as $B$ represents the actual number of bins created. I assumed the theoretical worst-case scenario where $B = N$ (each sequence requires its own bin), which is what leads to the $O(N^2)$ upper bound.

While the revised BFD implementation technically shares this $O(N \cdot B)$ theoretical complexity, it is substantially faster in practice because the search and shifting are handled by C-level bisect and list primitives.
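If the O(N * B) cost of plain-list insertion ever becomes a concern, one possible direction (a sketch only, assuming the third-party sortedcontainers package, which is not a stated dependency here) is to keep the remaining capacities in a SortedList, giving roughly O(log B) per lookup/insert:

from sortedcontainers import SortedList

def packing_best_fit_sortedlist(total_lengths, max_tokens_per_gpu):
    remaining = SortedList()  # remaining capacities, ascending
    for length in total_lengths:
        # Tightest fit: the smallest remaining capacity that still fits `length`.
        idx = remaining.bisect_left(length)
        if idx < len(remaining):
            cap = remaining.pop(idx)
            remaining.add(cap - length)
        else:
            remaining.add(max_tokens_per_gpu - length)
    return len(remaining)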
