feat: Support OAI TITO v1 #502
base: main
Conversation
Summary of Changes
Hello @guapisolo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly overhauls the system's generation and rollout mechanisms. It introduces a new, more modular interface for defining generation functions, moving away from a monolithic approach. This refactor facilitates the integration of advanced features like multi-turn conversations and OpenAI-style tool calling, ensuring token consistency during generation. The changes also extend to the Miles router, enabling it to manage sessions and proxy these new generation types, laying the groundwork for more complex agentic behaviors.
Code Review
This pull request introduces significant refactoring to support naive OpenAI Token-In-Token-Out (TITO) for agentic workflows, particularly for multi-turn conversations with tool calls. The changes are extensive, adding new modules for generation hubs, inference rollouts, and session management, along with numerous tests. A key addition is the MILES_EXPERIMENTAL_ROLLOUT_REFACTOR feature flag to control the new logic. While the overall direction is a great improvement in modularity and functionality, I've identified a critical bug in the multi-turn logic and a few areas for improvement in the new shell script and utility functions.
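For readers unfamiliar with the pattern, here is a minimal sketch of how such a feature flag is commonly wired up. The flag name comes from the PR; reading it as a boolean environment variable and the placeholder functions are assumptions, not the PR's actual code:

```python
import os

# Hedged sketch: gate the new rollout logic behind the experimental flag.
# Treating it as a boolean environment variable is an assumption here.
ROLLOUT_REFACTOR_ENABLED = os.environ.get("MILES_EXPERIMENTAL_ROLLOUT_REFACTOR", "0") == "1"

def generate(prompt: str) -> str:
    if ROLLOUT_REFACTOR_ENABLED:
        return f"[new modular rollout] {prompt}"  # placeholder for the refactored path
    return f"[legacy rollout] {prompt}"           # placeholder for the old path

if __name__ == "__main__":
    print(generate("hello"))
```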
```python
if args.generate_multi_samples:
    sample = deepcopy(input.sample)

output = await post(url, payload)
await update_sample_from_response(args, sample, payload=payload, output=output, update_loss_mask=True)

if args.generate_multi_samples:
    multi_samples.append(deepcopy(sample))
```
There's a critical bug in the logic for generate_multi_samples. When args.generate_multi_samples is true, the sample variable is rebound to a fresh copy of input.sample on line 57. This causes the conversation history, which is maintained in the sample object, to be lost. As a result, subsequent turns in the loop will not have the correct context, as update_sample_with_tool_responses on line 75 will be updating a sample that doesn't contain the full conversation history.
To fix this, you should use a separate variable for the turn-specific sample that gets appended to multi_samples, while the main sample variable continues to hold the running state of the entire conversation.
Suggested change:

```python
# before
if args.generate_multi_samples:
    sample = deepcopy(input.sample)
output = await post(url, payload)
await update_sample_from_response(args, sample, payload=payload, output=output, update_loss_mask=True)
if args.generate_multi_samples:
    multi_samples.append(deepcopy(sample))

# after
output = await post(url, payload)
if args.generate_multi_samples:
    turn_sample = deepcopy(input.sample)
    await update_sample_from_response(args, turn_sample, payload=payload, output=output, update_loss_mask=True)
    multi_samples.append(turn_sample)
await update_sample_from_response(args, sample, payload=payload, output=output, update_loss_mask=True)
```
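To make the intent of the suggestion concrete, here is a minimal self-contained sketch of the corrected loop. `fake_post` and `update_sample` are hypothetical stand-ins for the PR's `post` and `update_sample_from_response`; only the `turn_sample` pattern mirrors the suggestion:

```python
from copy import deepcopy

def fake_post(payload):
    # Stand-in for the async post() in the PR.
    return {"text": f"reply to {payload['last']}"}

def update_sample(sample, output):
    # Stand-in for update_sample_from_response(): appends to the running history.
    sample["history"].append(output["text"])

def run_turns(input_sample, num_turns, generate_multi_samples=True):
    sample = deepcopy(input_sample)  # holds the full conversation across turns
    multi_samples = []
    for _ in range(num_turns):
        payload = {"last": sample["history"][-1]}
        output = fake_post(payload)
        if generate_multi_samples:
            turn_sample = deepcopy(input_sample)  # per-turn copy, as suggested
            update_sample(turn_sample, output)
            multi_samples.append(turn_sample)
        update_sample(sample, output)  # the running state is never rebound
    return sample, multi_samples

full, per_turn = run_turns({"history": ["hi"]}, num_turns=2)
print(len(full["history"]), len(per_turn))  # 3 2: conversation history survives
```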
```bash
sleep 3
pkill -9 ray
pkill -9 python
```
The commands to kill ray and python processes on lines 10-11 are redundant, as they are already executed on lines 7-8. The ray stop --force command should handle stopping Ray processes, and the subsequent pkill commands serve as a fallback. The second set of pkill commands is unnecessary and can be removed to simplify the script.
```diff
-async def post(url, payload, max_retries=60):
+# TODO may generalize the name since it now contains http DELETE/GET etc (with retries and remote-execution)
+async def post(url, payload, max_retries=60, action="post"):
```
The function post now handles more than just POST requests (e.g., GET, DELETE). The function name should be generalized to reflect its broader capability, for instance, to request_with_retry or a similar name that better describes its function of making various types of HTTP requests with retry logic.
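As an illustration of the suggested rename, a hedged sketch of such a generalized helper using aiohttp; the retry policy and the exact signature are assumptions, not the PR's implementation:

```python
import asyncio
import aiohttp

async def request_with_retry(url, payload=None, max_retries=60, action="post"):
    # Sketch of the generalized helper suggested above: dispatch on the HTTP
    # verb name and retry on client errors with a fixed backoff.
    async with aiohttp.ClientSession() as session:
        for attempt in range(max_retries):
            try:
                method = getattr(session, action)  # session.post / .get / .delete ...
                async with method(url, json=payload) as resp:
                    resp.raise_for_status()
                    return await resp.json()
            except aiohttp.ClientError:
                if attempt == max_retries - 1:
                    raise
                await asyncio.sleep(1)  # simple 1s pause between retries

# usage: result = await request_with_retry("http://host/v1/session/123", action="delete")
```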
```diff
+# TODO may rename to `load_object` since it can be used to load things like tool_specs
 def load_function(path):
```
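For context, dotted-path loaders of this kind are usually a thin wrapper over importlib; a sketch of the idea (the PR's real `load_function` may differ):

```python
import importlib

def load_object(path):
    # Resolve a dotted path like "pkg.module.attr" to the attribute itself.
    # Works for functions, tool_specs, or any module-level object, which is
    # what motivates the suggested rename from load_function.
    module_path, _, attr = path.rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, attr)

print(load_object("json.dumps")({"ok": True}))  # {"ok": true}
```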
Force-pushed from 46d7e84 to 62176cd
Force-pushed from d4655d4 to 8efb357
Force-pushed from 8efb357 to 97bc803
Force-pushed from 97bc803 to 5343a51
@yueming-yuan Do you have some time to review this PR? I think it's almost ready. Thanks!
This PR supports several things (sorry, I found it hard to split it into separate PRs; I'll do better in later PRs).

Under the /chat/completions endpoint with the Miles router enabled:

- `miles_router_enable_token_input_for_chat_completions` (text models only). `convert_ids_to_tokens` sometimes has a bug when the output token text includes whitespace.
- Why? The chat template is not append-only: e.g., different models cut the think part at very different timings and in different scenarios. That makes cross-turn token inheritance tightly coupled with the chat template, and a general solution is very hard to find (see the sketch below).
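The "not append-only" point can be checked directly with transformers' `apply_chat_template`. A sketch of that check (the model name is only an example; whether the earlier rendering survives as a prefix depends entirely on that model's template):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")  # example model, an assumption

turn1 = [
    {"role": "user", "content": "hi"},
    {"role": "assistant", "content": "<think>some reasoning</think>hello"},
]
turn2 = turn1 + [{"role": "user", "content": "and then?"}]

r1 = tok.apply_chat_template(turn1, tokenize=False)
r2 = tok.apply_chat_template(turn2, tokenize=False)

# If the template strips the think part of earlier assistant turns, the
# turn-2 rendering is not an extension of turn-1, so previous-turn tokens
# cannot simply be inherited across turns.
print(r2.startswith(r1))
```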
TODO: