feat: Implement lazy loading for dataset #382
base: main
Conversation
Summary of Changes
Hello @Ratish1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the data loading and processing pipeline by implementing lazy loading for datasets, integrating the Hugging Face datasets library.
Code Review
This pull request refactors the data loading mechanism to use lazy loading with the Hugging Face datasets library, which is a great improvement for memory efficiency when handling large datasets. The changes are well-structured, introducing a new _filter_func for parallel data filtering and updating data access patterns across the codebase.
My review focuses on two main points. First, a high-severity performance issue in miles/utils/data.py where the new Dataset class's __getitem__ method doesn't support slicing, leading to inefficient one-by-one data processing. I've recommended adding slice support to enable batch operations. Second, a medium-severity readability improvement in miles/rollout/sglang_rollout.py to simplify a loop by iterating directly over the Dataset object. Overall, these are solid changes that will improve the project's scalability.
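For context, here is a minimal sketch of the lazy-loading and parallel-filtering pattern the review refers to, assuming the dataset is backed by an on-disk Parquet/Arrow file; the file name, the num_proc value, and the _filter_func body are illustrative and not taken from this PR:

```python
from datasets import load_dataset

# load_dataset memory-maps the underlying Arrow data, so rows are only read
# when accessed instead of being materialized in RAM up front.
hf_dataset = load_dataset("parquet", data_files="train.parquet", split="train")

# A hypothetical row-level predicate; the PR's actual _filter_func may differ.
def _filter_func(example):
    return example.get("prompt") is not None

# datasets.Dataset.filter can fan the predicate out over worker processes.
hf_dataset = hf_dataset.filter(_filter_func, num_proc=8)

# Individual rows are fetched lazily, on demand.
first_row = hf_dataset[0]
```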
```python
def __getitem__(self, idx):
    # The underlying HF dataset handles lazy fetching
    data = self.hf_dataset[idx]

    # Process the data using existing logic
    prompt = _build_messages(data, self.prompt_key, self.multimodal_keys)

    metadata = data.get(self.metadata_key) or {}
    if self.tool_key is not None and self.tool_key in data:
        tools = data[self.tool_key]
        if isinstance(tools, str):
            tools = json.loads(tools)
            # TODO (chenyang): If the JSON parsing is heavy, we might need
            # to use hf_dataset.map() during init to pre-process these
            # fields into a more efficient format (Arrow-native), rather
            # than parsing raw strings on the fly.
        elif isinstance(tools, np.ndarray):
            tools = tools.tolist()
        assert isinstance(tools, list), f"tools must be a list, got {type(tools)} instead"
        metadata["tools"] = tools

    sample = Sample(
        prompt=prompt,
        label=data.get(self.label_key) if self.label_key is not None else None,
        metadata=metadata,
    )

    return sample
```
The current __getitem__ implementation only supports integer indexing. This leads to inefficient data fetching patterns in other parts of the code, such as in rollout_data_source.py, where list comprehensions are used to get multiple items one by one. This can be a significant performance bottleneck, especially with on-the-fly processing like json.loads within this method.
To improve this, you should enhance __getitem__ to also support slice indexing. This will allow for batch data retrieval and processing, which is much more efficient with the Hugging Face datasets library.
A good way to implement this would be to refactor the current logic for processing a single item into a private helper method (e.g., _process_item(self, data)). Then, __getitem__ can handle both int and slice indices, calling the helper method accordingly. For a slice, it would fetch a batch of data from self.hf_dataset, iterate through the returned dictionary of lists to reconstruct each item, and process them.
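A minimal sketch of that refactor, reusing the existing single-item logic; the helper name _process_item and the slice handling are illustrative, not code from this PR:

```python
def _process_item(self, data):
    # Single-row logic, unchanged from the current __getitem__.
    prompt = _build_messages(data, self.prompt_key, self.multimodal_keys)

    metadata = data.get(self.metadata_key) or {}
    if self.tool_key is not None and self.tool_key in data:
        tools = data[self.tool_key]
        if isinstance(tools, str):
            tools = json.loads(tools)
        elif isinstance(tools, np.ndarray):
            tools = tools.tolist()
        assert isinstance(tools, list), f"tools must be a list, got {type(tools)} instead"
        metadata["tools"] = tools

    return Sample(
        prompt=prompt,
        label=data.get(self.label_key) if self.label_key is not None else None,
        metadata=metadata,
    )

def __getitem__(self, idx):
    if isinstance(idx, slice):
        # Slicing a Hugging Face dataset returns a dict of column -> list of
        # values; rebuild per-row dicts so the single-item logic can be reused.
        batch = self.hf_dataset[idx]
        columns = list(batch.keys())
        rows = [dict(zip(columns, values)) for values in zip(*batch.values())]
        return [self._process_item(row) for row in rows]
    # Integer index: the underlying HF dataset lazily fetches a single row.
    return self._process_item(self.hf_dataset[idx])
```

With something like this in place, callers such as rollout_data_source.py could request dataset[start:end] in a single call instead of building a list comprehension of one-by-one lookups.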
```python
for i in range(len(dataset)):
    prompt_sample = dataset[i]
```
The Dataset object implements __len__ and __getitem__, so Python's sequence protocol makes it directly iterable. Iterating over it directly is more Pythonic and readable than using range(len(dataset)), which simplifies the code and improves maintainability.
```diff
-for i in range(len(dataset)):
-    prompt_sample = dataset[i]
+for prompt_sample in dataset:
```
Force-pushed from 16e1781 to 61dfee5.
Hi @Ratish1,
@Ratish1 My PR in slime: THUDM/slime#696
@ppraneth, yes, I saw it, but I need to wait for the maintainers' approval first. Thanks
No description provided.