Doc: Add Doc for agentic multi-turn #509
base: main
Conversation
Summary of Changes
Hello @guapisolo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request upgrades the OpenAI-compatible chat completion endpoint by adding explicit support for pre-tokenized inputs and enhancing log-probability reporting. It also refactors the Miles Router's session management, improving its modularity and laying the groundwork for advanced multi-turn agentic functionality. The changes are accompanied by new documentation that guides users through the updated endpoint usage and related concepts such as TITO, ensuring clarity and ease of adoption.
Code Review
This pull request introduces significant new capabilities for agentic multi-turn interactions, primarily by enabling input_ids to be passed directly to the chat completion endpoint and refactoring session management. The addition of comprehensive documentation, examples, and E2E tests is commendable and greatly enhances the usability and correctness of the new features.
My main feedback centers on a large block of duplicated code within the sglang patch, which I recommend refactoring to improve long-term maintainability. I've also highlighted a functional limitation regarding the use of stop sequences with input_ids that could cause unexpected behavior. Minor suggestions include improving documentation formatting and removing a redundant parameter in an example file.
Overall, this is a strong contribution. Addressing the code duplication will make the implementation more robust and easier to maintain.
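For reviewers who want a concrete picture of the new capability, here is a rough sketch of a pre-tokenized request against the OpenAI-compatible endpoint. The `input_ids`, `logprobs`, and `top_logprobs` field names are taken from the diff below; the server address, token IDs, and endpoint path are illustrative assumptions, not values from the PR or its docs.

```python
import requests

# Hypothetical local server address; the path follows the OpenAI chat completions convention.
BASE_URL = "http://localhost:30000"

payload = {
    "model": "default",
    # Pre-tokenized prompt: token IDs are sent instead of a "messages" list.
    "input_ids": [151644, 872, 198, 9906, 151645],
    "max_tokens": 64,
    "logprobs": True,     # request per-token log probabilities
    "top_logprobs": 2,    # and the top-2 alternatives per position
}

resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=60)
print(resp.json()["choices"][0])
```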
```diff
+    def _hack_convert_input_ids_to_messages(
+        self,
+        request: ChatCompletionRequest,
+        raw_request: Request = None,
+    ) -> tuple[GenerateReqInput, ChatCompletionRequest]:
+
+        # Notice: currently, if input_ids is provided, the stop token is not used.
+        sampling_params = request.to_sampling_params(
+            model_generation_config=self.default_sampling_params
+        )
+
+        prompt_kwargs = {"input_ids": request.input_ids}
+
+        # Extract custom labels from raw request headers
+        custom_labels = self.extract_custom_labels(raw_request)
+
+        # Resolve LoRA adapter from model parameter or explicit lora_path
+        lora_path = self._resolve_lora_path(request.model, request.lora_path)
+        if lora_path:
+            first_adapter = (
+                lora_path
+                if isinstance(lora_path, str)
+                else next((a for a in lora_path if a), None)
+            )
+            if first_adapter:
+                self._validate_lora_enabled(first_adapter)
+
+        logprob_start_len = (
+            request.logprob_start_len if request.logprob_start_len is not None else -1
+        )
+
+        adapted_request = GenerateReqInput(
+            **prompt_kwargs,
+            sampling_params=sampling_params,
+            return_logprob=request.logprobs,
+            logprob_start_len=logprob_start_len,
+            top_logprobs_num=request.top_logprobs or 0,
+            stream=request.stream,
+            return_text_in_logprobs=True,
+            lora_path=lora_path,
+            bootstrap_host=request.bootstrap_host,
+            bootstrap_port=request.bootstrap_port,
+            bootstrap_room=request.bootstrap_room,
+            data_parallel_rank=request.data_parallel_rank,
+            return_hidden_states=request.return_hidden_states,
+            rid=request.rid,
+            extra_key=self._compute_extra_key(request),
+            require_reasoning=self._get_reasoning_from_request(request),
+            priority=request.priority,
+            custom_labels=custom_labels,
+            custom_logit_processor=request.custom_logit_processor,
+        )
+
+        return adapted_request, request
```
The function _hack_convert_input_ids_to_messages contains a significant amount of code duplicated from _convert_to_internal_request. The function name prefix _hack_ also suggests this is a temporary or non-ideal solution.
To improve maintainability and reduce redundancy, I recommend refactoring this logic. The two paths (one for input_ids and one for messages) can be unified within _convert_to_internal_request. You could determine the prompt_kwargs and other message-related data at the beginning, and then proceed with the common logic for handling sampling parameters, LoRA, and creating the GenerateReqInput.
This would make the code cleaner, easier to understand, and less prone to bugs when one path is updated and the other is forgotten.
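One possible shape for that unified path, sketched under the assumption that it lives in the same serving class as the patched method. `_build_prompt_kwargs_from_messages` and `_build_generate_req_input` are hypothetical helper names standing in for the existing message-handling and request-construction logic; they are not functions from the PR.

```python
def _convert_to_internal_request(
    self,
    request: ChatCompletionRequest,
    raw_request: Request = None,
) -> tuple[GenerateReqInput, ChatCompletionRequest]:
    # Branch only on how the prompt is supplied; everything afterwards is shared.
    if request.input_ids is not None:
        # Pre-tokenized path: bypass chat-template rendering.
        # Note: stop handling still needs to be reconciled with the message path.
        prompt_kwargs = {"input_ids": request.input_ids}
    else:
        # Message path: placeholder for the existing template/tokenization logic.
        prompt_kwargs = self._build_prompt_kwargs_from_messages(request)

    # Shared tail: sampling params, LoRA resolution, GenerateReqInput construction.
    return self._build_generate_req_input(prompt_kwargs, request, raw_request)
```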
```diff
+        raw_request: Request = None,
+    ) -> tuple[GenerateReqInput, ChatCompletionRequest]:
+
+        # Notice: currently, if input_ids is provided, the stop token is not used.
```
The comment notes a significant limitation: stop tokens are not used when input_ids are provided. This could lead to unexpected behavior for users who expect stop sequences to function as they do with message-based inputs. This is a potential correctness issue from a user's perspective.
This limitation should either be addressed to make the feature complete, or it should be very clearly documented in the public-facing API documentation to prevent confusion and bugs in user code.
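Until that is resolved, callers who send input_ids may want a client-side guard. Below is a minimal standalone sketch (not part of the PR) that truncates the returned completion at the first stop sequence:

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut a completion at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]

# e.g. a ReAct-style agent that would normally stop generation at "\nObservation:"
completion = "Thought: call the tool\nAction: search(weather)\nObservation: sunny"
print(truncate_at_stop(completion, ["\nObservation:"]))
# -> "Thought: call the tool\nAction: search(weather)"
```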
```diff
+async def run_agent(base_url: str, prompt: list[dict[str, Any]] | str, request_kwargs: dict[str, Any] = None) -> None:
+    payload = {"model": "default", "messages": prompt, "logprobs": True, **request_kwargs}
```
The request_kwargs passed to this function are generated by build_chat_request_kwargs in agentic_tool_call.py, which already hardcodes "logprobs": True. Setting it again here in the payload is redundant.
To avoid redundancy, you can remove "logprobs": True from this payload definition.
Suggested change:

```diff
-    payload = {"model": "default", "messages": prompt, "logprobs": True, **request_kwargs}
+    payload = {"model": "default", "messages": prompt, **request_kwargs}
```
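Worth noting: because `**request_kwargs` is unpacked after the literal key, it would override `"logprobs"` anyway, so the literal entry adds nothing. A quick standalone illustration of the precedence (not code from the PR):

```python
request_kwargs = {"logprobs": False, "temperature": 0.0}
payload = {"model": "default", "logprobs": True, **request_kwargs}
print(payload["logprobs"])  # False -- later entries from **request_kwargs win
```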
Force-pushed from 5fcb85b to c11ecea.
@zhaochenyang20 Could you help review the docs oai_endpoint.md and gen_endpoint.md? Thanks!
as the title