feat: Implement lazy loading for dataset #382
base: main
Conversation
Summary of Changes
Hello @Ratish1, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the data loading and processing pipeline by implementing lazy loading for datasets, integrating the Hugging Face datasets library.
Code Review
This pull request refactors the data loading mechanism to use lazy loading with the Hugging Face datasets library, which is a great improvement for memory efficiency when handling large datasets. The changes are well-structured, introducing a new _filter_func for parallel data filtering and updating data access patterns across the codebase.
My review focuses on two main points. First, a high-severity performance issue in miles/utils/data.py where the new Dataset class's __getitem__ method doesn't support slicing, leading to inefficient one-by-one data processing. I've recommended adding slice support to enable batch operations. Second, a medium-severity readability improvement in miles/rollout/sglang_rollout.py to simplify a loop by iterating directly over the Dataset object. Overall, these are solid changes that will improve the project's scalability.
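For context, here is a minimal sketch of the lazy-loading and parallel-filtering pattern the review refers to, assuming the dataset is backed by an on-disk Parquet/Arrow file; the file name, the num_proc value, and the _filter_func body are illustrative and not taken from this PR:

```python
from datasets import load_dataset

# load_dataset memory-maps the underlying Arrow data, so rows are only read
# when accessed instead of being materialized in RAM up front.
hf_dataset = load_dataset("parquet", data_files="train.parquet", split="train")

# A hypothetical row-level predicate; the PR's actual _filter_func may differ.
def _filter_func(example):
    return example.get("prompt") is not None

# datasets.Dataset.filter can fan the predicate out over worker processes.
hf_dataset = hf_dataset.filter(_filter_func, num_proc=8)

# Individual rows are fetched lazily, on demand.
first_row = hf_dataset[0]
```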
```python
def __getitem__(self, idx):
    # The underlying HF dataset handles lazy fetching
    data = self.hf_dataset[idx]

    # Process the data using existing logic
    prompt = _build_messages(data, self.prompt_key, self.multimodal_keys)

    metadata = data.get(self.metadata_key) or {}
    if self.tool_key is not None and self.tool_key in data:
        tools = data[self.tool_key]
        if isinstance(tools, str):
            tools = json.loads(tools)
            # TODO (chenyang): If the JSON parsing is heavy, we might need
            # to use hf_dataset.map() during init to pre-process these
            # fields into a more efficient format (Arrow-native), rather
            # than parsing raw strings on the fly.
        elif isinstance(tools, np.ndarray):
            tools = tools.tolist()
        assert isinstance(tools, list), f"tools must be a list, got {type(tools)} instead"
        metadata["tools"] = tools

    sample = Sample(
        prompt=prompt,
        label=data.get(self.label_key) if self.label_key is not None else None,
        metadata=metadata,
    )

    return sample
```
The current __getitem__ implementation only supports integer indexing. This leads to inefficient data fetching patterns in other parts of the code, such as in rollout_data_source.py, where list comprehensions are used to get multiple items one by one. This can be a significant performance bottleneck, especially with on-the-fly processing like json.loads within this method.
To improve this, you should enhance __getitem__ to also support slice indexing. This will allow for batch data retrieval and processing, which is much more efficient with the Hugging Face datasets library.
A good way to implement this would be to refactor the current logic for processing a single item into a private helper method (e.g., _process_item(self, data)). Then, __getitem__ can handle both int and slice indices, calling the helper method accordingly. For a slice, it would fetch a batch of data from self.hf_dataset, iterate through the returned dictionary of lists to reconstruct each item, and process them.
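A minimal sketch of that refactor, reusing the existing single-item logic; the helper name _process_item and the slice handling are illustrative, not code from this PR:

```python
def _process_item(self, data):
    # Single-row logic, unchanged from the current __getitem__.
    prompt = _build_messages(data, self.prompt_key, self.multimodal_keys)

    metadata = data.get(self.metadata_key) or {}
    if self.tool_key is not None and self.tool_key in data:
        tools = data[self.tool_key]
        if isinstance(tools, str):
            tools = json.loads(tools)
        elif isinstance(tools, np.ndarray):
            tools = tools.tolist()
        assert isinstance(tools, list), f"tools must be a list, got {type(tools)} instead"
        metadata["tools"] = tools

    return Sample(
        prompt=prompt,
        label=data.get(self.label_key) if self.label_key is not None else None,
        metadata=metadata,
    )

def __getitem__(self, idx):
    if isinstance(idx, slice):
        # Slicing a Hugging Face dataset returns a dict of column -> list of
        # values; rebuild per-row dicts so the single-item logic can be reused.
        batch = self.hf_dataset[idx]
        columns = list(batch.keys())
        rows = [dict(zip(columns, values)) for values in zip(*batch.values())]
        return [self._process_item(row) for row in rows]
    # Integer index: the underlying HF dataset lazily fetches a single row.
    return self._process_item(self.hf_dataset[idx])
```

With something like this in place, callers such as rollout_data_source.py could request dataset[start:end] in a single call instead of building a list comprehension of one-by-one lookups.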
```python
for i in range(len(dataset)):
    prompt_sample = dataset[i]
```
The Dataset object implements __len__ and __getitem__, so Python's sequence protocol makes it directly iterable. Iterating over it directly is more Pythonic and readable than using range(len(dataset)), which simplifies the code and improves maintainability.
```diff
-for i in range(len(dataset)):
-    prompt_sample = dataset[i]
+for prompt_sample in dataset:
```
Force-pushed from 16e1781 to 61dfee5.
Hi @Ratish1,
@Ratish1 My PR in slime: THUDM/slime#696
@ppraneth, yes, I saw it, but I need to wait for the maintainers' approval first. Thanks
No description provided.