
Conversation

@Ratish1 (Contributor) commented Jan 2, 2026

No description provided.

@gemini-code-assist (Contributor) commented
Summary of Changes

Hello @Ratish1, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the data loading and processing pipeline by implementing lazy loading for datasets. By integrating the Hugging Face datasets library, the system can now handle larger datasets more efficiently, reducing memory footprint and improving overall performance during dataset initialization, filtering, and sampling. The changes primarily involve refactoring the PromptDataset class to leverage the datasets library's capabilities, including memory-mapped access and multiprocessing for data operations.

Highlights

  • Lazy Loading Implementation: The dataset loading mechanism has been refactored from eager in-memory storage to a lazy loading approach, significantly improving memory efficiency, especially for large datasets.
  • Hugging Face datasets Integration: The project now leverages the Hugging Face datasets library for robust and efficient data handling, including memory-mapped access, streamlined filtering, and optimized shuffling (a short sketch of this pattern follows the list).
  • Multiprocessing for Dataset Operations: A new command-line argument, --dataset-num-proc, has been introduced to allow users to specify the number of processes for dataset initialization and filtering, enhancing performance for these operations.
  • Refactored PromptDataset: The PromptDataset class has undergone a major overhaul to delegate data access, filtering, and shuffling responsibilities to the Hugging Face datasets library, simplifying its internal logic and improving scalability.
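
For reviewers less familiar with the approach described in these highlights, here is a minimal, hypothetical sketch of the lazy, Arrow-backed pattern the datasets library enables. The file name data.jsonl, the filter predicate, and the concrete num_proc and seed values are illustrative assumptions, not taken from this PR; the actual PromptDataset wiring differs.

```python
from datasets import load_dataset

# Hypothetical example: load a JSONL prompt file into an Arrow-backed,
# memory-mapped dataset; rows are materialized lazily on access rather
# than being read eagerly into a Python list.
hf_dataset = load_dataset("json", data_files="data.jsonl", split="train")

# Filter with multiple worker processes; num_proc plays the role that the
# new --dataset-num-proc flag exposes on the command line.
hf_dataset = hf_dataset.filter(lambda row: row.get("prompt") is not None, num_proc=4)

# Shuffling is likewise delegated to the datasets library.
hf_dataset = hf_dataset.shuffle(seed=42)

# Individual rows are fetched lazily from the memory-mapped table.
first_row = hf_dataset[0]
```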
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

| Feature | Command | Description |
| --- | --- | --- |
| Code Review | /gemini review | Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary | Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments. |
| Help | /gemini help | Displays a list of available commands. |

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double-check its output and use code with caution.

@gemini-code-assist (bot) left a comment


Code Review

This pull request refactors the data loading mechanism to use lazy loading with the Hugging Face datasets library, which is a great improvement for memory efficiency when handling large datasets. The changes are well-structured, introducing a new _filter_func for parallel data filtering and updating data access patterns across the codebase.

My review focuses on two main points. First, there is a high-severity performance issue in miles/utils/data.py: the new Dataset class's __getitem__ method doesn't support slicing, which leads to inefficient one-by-one data processing; I've recommended adding slice support to enable batch operations. Second, there is a medium-severity readability improvement in miles/rollout/sglang_rollout.py: a loop can be simplified by iterating directly over the Dataset object. Overall, these are solid changes that will improve the project's scalability.

Comment on lines 205 to 317
```python
def __getitem__(self, idx):
    # The underlying HF dataset handles lazy fetching
    data = self.hf_dataset[idx]

    # Process the data using existing logic
    prompt = _build_messages(data, self.prompt_key, self.multimodal_keys)

    metadata = data.get(self.metadata_key) or {}
    if self.tool_key is not None and self.tool_key in data:
        tools = data[self.tool_key]
        if isinstance(tools, str):
            tools = json.loads(tools)
            # TODO (chenyang): If the JSON parsing is heavy, we might need
            # to use hf_dataset.map() during init to pre-process these
            # fields into a more efficient format (Arrow-native), rather
            # than parsing raw strings on the fly.
        elif isinstance(tools, np.ndarray):
            tools = tools.tolist()
        assert isinstance(tools, list), f"tools must be a list, got {type(tools)} instead"
        metadata["tools"] = tools

    sample = Sample(
        prompt=prompt,
        label=data.get(self.label_key) if self.label_key is not None else None,
        metadata=metadata,
    )

    return sample
```

Severity: high

The current __getitem__ implementation only supports integer indexing. This leads to inefficient data fetching patterns in other parts of the code, such as in rollout_data_source.py, where list comprehensions are used to get multiple items one by one. This can be a significant performance bottleneck, especially with on-the-fly processing like json.loads within this method.

To improve this, you should enhance __getitem__ to also support slice indexing. This will allow for batch data retrieval and processing, which is much more efficient with the Hugging Face datasets library.

A good way to implement this would be to refactor the current logic for processing a single item into a private helper method (e.g., _process_item(self, data)). Then, __getitem__ can handle both int and slice indices, calling the helper method accordingly. For a slice, it would fetch a batch of data from self.hf_dataset, iterate through the returned dictionary of lists to reconstruct each item, and process them.
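
One possible shape for this refactor, assuming the helpers and attributes shown in the snippet above (_build_messages, Sample, self.prompt_key, self.tool_key, and so on); this is a sketch of the suggestion, not the definitive implementation:

```python
def _process_item(self, data):
    # Single-item logic factored out of __getitem__ so it can be reused for batches.
    prompt = _build_messages(data, self.prompt_key, self.multimodal_keys)
    metadata = data.get(self.metadata_key) or {}
    if self.tool_key is not None and self.tool_key in data:
        tools = data[self.tool_key]
        if isinstance(tools, str):
            tools = json.loads(tools)
        elif isinstance(tools, np.ndarray):
            tools = tools.tolist()
        assert isinstance(tools, list), f"tools must be a list, got {type(tools)} instead"
        metadata["tools"] = tools
    return Sample(
        prompt=prompt,
        label=data.get(self.label_key) if self.label_key is not None else None,
        metadata=metadata,
    )

def __getitem__(self, idx):
    if isinstance(idx, slice):
        # A slice on an HF dataset returns a dict mapping column names to lists,
        # so rebuild per-row dicts before processing each one.
        batch = self.hf_dataset[idx]
        columns = list(batch.keys())
        rows = [dict(zip(columns, values)) for values in zip(*(batch[c] for c in columns))]
        return [self._process_item(row) for row in rows]
    return self._process_item(self.hf_dataset[idx])
```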

Comment on lines +500 to +532
```python
for i in range(len(dataset)):
    prompt_sample = dataset[i]
```

Severity: medium

The Dataset object is iterable because it implements __len__ and __getitem__. You can iterate over it directly, which is more Pythonic and readable than using range(len(dataset)). This simplifies the code and improves maintainability.

Suggested change:

```diff
-for i in range(len(dataset)):
-    prompt_sample = dataset[i]
+for prompt_sample in dataset:
```

@ppraneth commented Jan 5, 2026

Hi @Ratish1,
The lazy loading PR I originally made in Slime has been automatically updated in this repo by the Miles bot, so I think this current draft PR can be closed.

@Ratish1 (Contributor, Author) commented Jan 5, 2026

Hi @ppraneth, I have been working on this for a while for miles, so I don't know about that; I have to confirm with the maintainer. My original PR was merged and then reverted, and it was opened long before in the miles repo. #246

@Ratish1 marked this pull request as ready for review on January 5, 2026 at 15:55
@ppraneth commented Jan 5, 2026

@Ratish1 My PR in slime: THUDM/slime#696

@Ratish1 (Contributor, Author) commented Jan 5, 2026

@ppraneth, yes, I saw it, but I need to wait for the maintainers' approval and see first. Thanks.
