Refactor: Optimize performance by reducing token usage and speeding up model response time. #89
base: main
Conversation
luuquangvu commented on Jan 24, 2026
- Update all functions to use orjson for better performance and reduce token usage.
- Update the LMDB store to more efficiently manage reusable sessions.
- Update the logic to skip the system instruction when reusing a session to save tokens and speed up model response time.
- Update project dependencies.
They are no longer needed since the underlying library issue has been resolved.
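For readers unfamiliar with `orjson`: it is close to a drop-in replacement for the stdlib `json` module, the main caveat being that `dumps` returns `bytes`. A minimal sketch (illustrative only, not code from this PR):

```python
import orjson

payload = {"role": "user", "content": "ping"}

# orjson.dumps returns bytes rather than str; decode only where a str is required.
raw: bytes = orjson.dumps(payload)
assert orjson.loads(raw) == payload  # loads accepts bytes or str
```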
…to better handle heavy tasks
… client status checks
…probabilities, and token details; adjust response handling accordingly.
…tput_text` validator, rename `created` to `created_at`, and update response handling accordingly.
…roved streaming of response items. Refactor image generation handling for consistency and add compatibility with output content.
…t` and ensure consistent initialization in image output handling.
…anagement
Add dedicated router for /images endpoint and refactor image handling logic for better modularity. Enhance temporary image management with secure naming, token verification, and cleanup functionality.
…l and refactor variable handling
…y` for tools, tool_choice, and streaming settings
…nd update response handling for consistency
…mat for compatibility
- Moved utility functions like `strip_code_fence`, `extract_tool_calls`, and `iter_stream_segments` to a centralized helper module.
- Removed unused and redundant private methods from `chat.py`, including `_strip_code_fence`, `_strip_tagged_blocks`, and `_strip_system_hints`.
- Updated imports and references across modules for consistency.
- Simplified tool call and streaming logic by replacing inline implementations with shared helper functions.
- Replaced unused model placeholder in `config.yaml` with an empty list.
- Added JSON parsing validators for `model_header` and `models` to enhance flexibility and error handling.
- Improved validation to filter out incomplete model configurations.
…N support
- Replaced prefix-based parsing with a root key approach.
- Added JSON parsing to handle list-based model configurations.
- Improved handling of errors and cleanup of environment variables.
…to Python literals
- Added `ast.literal_eval` as a fallback for parsing environment variables when JSON decoding fails.
- Improved error handling and logging for invalid configurations.
- Ensured proper cleanup of environment variables post-parsing.
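That JSON-first, literal-eval-fallback pattern looks roughly like the following minimal sketch (the helper name and the list-only handling are assumptions for illustration, not the PR's actual code):

```python
import ast
import os

import orjson


def parse_env_list(name: str) -> list | None:
    # Hypothetical helper: try JSON first, then fall back to a Python literal.
    raw = os.environ.get(name)
    if raw is None:
        return None
    try:
        return orjson.loads(raw)
    except orjson.JSONDecodeError:
        pass
    try:
        value = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        return None
    return value if isinstance(value, list) else None
```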
- Adjusted `TOOL_CALL_RE` regex pattern for better accuracy.
…nvironment variable setup
…nvironment variables; enhance error logging in config validation
…tring or list structure for enhanced flexibility in automated environments
…s found in either the raw or cleaned history.
… for better Gemini compatibility.
…eeds METADATA_TTL_MINUTES.
…tion from being saved
…ystem instruction when reusing a session to save tokens.
Pull request overview
Refactors request/response handling and LMDB session reuse to reduce token usage and improve runtime performance, primarily by switching to orjson and reusing Gemini sessions more aggressively.
Changes:
- Replace stdlib `json` usage with `orjson` across helpers, config parsing, and chat/response flows; set the FastAPI default response class to `ORJSONResponse` (see the sketch after this list).
- Enhance LMDB hashing/sanitization logic to better support session reuse and consistent conversation lookup.
- Add session-reuse optimizations to skip re-sending heavy system/tool instructions and improve message splitting behavior.
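For context, the `default_response_class` wiring mentioned above is a one-line FastAPI setting; a minimal sketch follows (the route is hypothetical, not from this PR, and `orjson` must be installed):

```python
from fastapi import FastAPI
from fastapi.responses import ORJSONResponse

# Every route now serializes its return value with orjson unless it
# explicitly returns a different Response type.
app = FastAPI(default_response_class=ORJSONResponse)


@app.get("/health")
async def health() -> dict[str, str]:
    return {"status": "ok"}
```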
Reviewed changes
Copilot reviewed 8 out of 9 changed files in this pull request and generated 7 comments.
Summary per file:
| File | Description |
|---|---|
| uv.lock | Bumps dependency lockfile entries (FastAPI, Uvicorn, orjson, ruff, etc.). |
| pyproject.toml | Updates dependency constraints and adds orjson. |
| app/utils/helper.py | Moves tool-call JSON parsing to orjson; changes tool call ID generation. |
| app/utils/config.py | Switches env/config JSON parsing to orjson. |
| app/services/lmdb.py | Refactors message/conversation hashing and assistant-message sanitization for reuse consistency. |
| app/services/client.py | Uses orjson for tool-call argument normalization; minor formatting changes. |
| app/server/chat.py | Adds session reuse optimizations, orjson structured parsing, TTL logic, and revised request splitting. |
| app/models/models.py | Adds tool_call_id and centralizes developer→system role normalization. |
| app/main.py | Sets FastAPI default_response_class=ORJSONResponse. |
```python
# Generate a deterministic ID based on name, arguments, and index to avoid collisions
seed = f"{name}:{arguments}:{index}".encode("utf-8")
```
Copilot AI commented on Jan 25, 2026
extract_tool_calls() now generates deterministic tool call IDs using only name, canonicalized arguments, and the per-block index. Because index restarts at 0 for each fenced block, identical tool calls in different blocks can produce the same call_id, which violates the expectation that ToolCall.id values are unique within a message/conversation and can break tool_call_id mapping for tool responses. Include a globally unique component (e.g., the current len(tool_calls) at append-time, a monotonically increasing counter across the whole text, or incorporate the match start offset) into the hash seed to guarantee uniqueness across all extracted calls.
Suggested change:

```diff
-# Generate a deterministic ID based on name, arguments, and index to avoid collisions
-seed = f"{name}:{arguments}:{index}".encode("utf-8")
+# Generate a deterministic ID based on name, arguments, per-block index, and a
+# globally increasing index (current tool_calls length) to avoid collisions
+global_index = len(tool_calls)
+seed = f"{name}:{arguments}:{index}:{global_index}".encode("utf-8")
```
```python
if text_parts is not None:
    text_content = "".join(text_parts).replace("\r\n", "\n").strip()
    core_data["content"] = text_content if text_content else None
```
Copilot AI commented on Jan 25, 2026
In _hash_message(), when message.content is a list of text items you concatenate them with "".join(text_parts). This makes the hash non-injective: e.g., ["ab","c"] and ["a","bc"] hash identically, which can cause hash collisions and incorrect session reuse / conversation lookup. Preserve boundaries by joining with an unambiguous separator (e.g., "\n") or by hashing a structured list representation (including item order/lengths) instead of raw concatenation.
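One boundary-preserving fix is to hash a serialized form of the list itself rather than the concatenation; a minimal sketch (the function name and normalization are assumptions for illustration, not the PR's `_hash_message` code):

```python
import hashlib

import orjson


def hash_text_parts(text_parts: list[str]) -> str:
    # Serializing the list keeps item boundaries in the byte stream, so
    # ["ab", "c"] and ["a", "bc"] hash differently, unlike "".join(...).
    normalized = [part.replace("\r\n", "\n") for part in text_parts]
    return hashlib.sha256(orjson.dumps(normalized)).hexdigest()
```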