-
Notifications
You must be signed in to change notification settings - Fork 364
[Feature] Implement variable page size support #33
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
Thanks. Actually, when |
ff095b6 to
38dd466
Compare
|
@DarkSharpness I see. Is there any plan to implement it in the future? |
|
@DhiraPT Yes. For future support of MLA models, as popular attention implementation like Currently, I don't have enough bandwidth to handle this, but this is definitely something we must implement in the long term and require much modification in memory allocation logic. |
Changes SummaryThis PR removes the hardcoded page_size=1 restriction, enabling variable page sizes (16, 32, etc.) throughout the KV cache and scheduler layers. The feature propagates from CLI arguments through Engine, Scheduler, and KV Cache components to support more efficient PagedAttention operations. Type: feature Components Affected: KV Cache Management (mha_pool, base, naive_manager, radix_manager), Engine (dummy page calculation, max_seq_len calculation, Context initialization), Scheduler (CacheManager initialization and integrity checks), CLI Arguments (ServerArgs with --page-size flag) Files Changed
Architecture Impact
Risk Areas: Storage shape flattening in mha_pool.py: The change from (num_pages, local_kv_heads, head_dim) to (total_slots, local_kv_heads * head_dim) in _storage_shape could affect kernel behavior and cache performance. The flattening logic needs verification against actual kernel expectations., Dummy page calculation: Changed from self.num_pages to self.num_pages * config.page_size. This must correctly index into the flattened storage., Max sequence length calculation: Now uses num_pages * page_size instead of num_pages. Edge cases with alignment padding need validation., Backward compatibility: No explicit handling for configs that may still expect page_size=1 defaults. Existing code paths relying on the old assertion may fail silently., Cache manager initialization: NaiveCacheManager and RadixCacheManager both accept page_size but don't appear to use it (only stored but not validated or applied). Intent unclear. Suggestions
Full review in progress... | Powered by diffray |
| help="The page size for KV cache.", | ||
| ) | ||
|
|
||
| assert ServerArgs.use_dummy_weight == False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Redundant boolean comparison with == False
Agent: python
Category: quality
Description:
Using '== False' for boolean comparison is redundant and violates PEP 8. Should use 'not' operator instead for cleaner, more Pythonic code.
Suggestion:
Change 'assert ServerArgs.use_dummy_weight == False' to 'assert not ServerArgs.use_dummy_weight'
Confidence: 85%
Rule: py_avoid_redundant_none_comparisons
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
|
|
||
| assert ServerArgs.use_dummy_weight == False | ||
| parser.add_argument( | ||
| "--dummy-weight", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Redundant boolean comparison with == True
Agent: python
Category: quality
Description:
Using '== True' for boolean comparison is redundant and violates PEP 8. Should check truthiness directly instead for cleaner, more Pythonic code.
Suggestion:
Change 'assert ServerArgs.use_pynccl == True' to 'assert ServerArgs.use_pynccl'
Confidence: 85%
Rule: py_avoid_redundant_none_comparisons
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
| assert ServerArgs.use_dummy_weight == False | ||
| parser.add_argument( | ||
| "--dummy-weight", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Dead feature flag assertions checking hardcoded defaults
Agent: refactoring
Category: quality
Description:
Lines 118 and 126 contain assertions that validate hardcoded class defaults. These assertions always evaluate the same way at import time since they check class attributes before argument parsing.
Suggestion:
Remove these assertions or add comments explaining they are intentional guards to catch accidental default changes in the dataclass definition.
Confidence: 70%
Rule: quality_dead_feature_flag
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
| help="The page size for KV cache.", | ||
| ) | ||
|
|
||
| assert ServerArgs.use_dummy_weight == False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Assert used for configuration checking instead of raising exception
Agent: python
Category: quality
Description:
Using assert to verify configuration state will be stripped in production when Python runs with -O flag. This validation would silently disappear.
Suggestion:
Replace 'assert ServerArgs.use_dummy_weight == False' with explicit validation: 'if ServerArgs.use_dummy_weight: raise ValueError("use_dummy_weight must default to False")'
Confidence: 75%
Rule: python_assert_in_production
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
|
|
||
| assert ServerArgs.use_dummy_weight == False | ||
| parser.add_argument( | ||
| "--dummy-weight", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Assert used for configuration checking instead of raising exception
Agent: python
Category: quality
Description:
Using assert to verify configuration state will be stripped in production when Python runs with -O flag.
Suggestion:
Replace 'assert ServerArgs.use_pynccl == True' with explicit validation: 'if not ServerArgs.use_pynccl: raise ValueError("use_pynccl must default to True")'
Confidence: 75%
Rule: python_assert_in_production
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
|
|
||
|
|
||
| def create_cache_manager(device: torch.device, type: str) -> BaseCacheManager: | ||
| def create_cache_manager(device: torch.device, type: str, page_size: int) -> BaseCacheManager: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Parameter name shadows Python built-in 'type'
Agent: python
Category: quality
Description:
The parameter 'type' shadows Python's built-in type() function, which can cause confusion.
Suggestion:
Rename the parameter from 'type' to 'cache_type' to avoid shadowing built-in.
Confidence: 75%
Rule: py_use_type_annotations_for_better_readabil
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
| def __init__(self, device: torch.device, num_pages: int, type: str): | ||
| # TODO: support page_size > 1 | ||
| self._free_slots = torch.arange(num_pages, dtype=torch.int32, device=device) | ||
| def __init__(self, device: torch.device, num_pages: int, type: str, page_size: int): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Parameter name shadows Python built-in 'type'
Agent: python
Category: quality
Description:
The parameter 'type' shadows Python's built-in type() function.
Suggestion:
Rename the parameter from 'type' to 'cache_type' to avoid shadowing Python built-in.
Confidence: 75%
Rule: py_use_type_annotations_for_better_readabil
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
| ) | ||
| self.kv_cache = kv_cache | ||
| self.attn_backend = attn_backend | ||
| assert page_size == 1 | ||
| assert page_size >= 1 | ||
| self.page_size = page_size | ||
|
|
||
| def set_batch(self, batch: Batch): | ||
| assert self._batch is None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟠 HIGH - Public class Context missing docstring
Agent: python
Category: docs
Description:
Class lacks docstring explaining its purpose and fields.
Suggestion:
Add docstring explaining that Context is the global context holding the current batch and sharing inference infrastructure across the system.
Confidence: 70%
Rule: py_docstrings_required_for_public_apis_pep_257_style
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
| node: RadixTreeNode | ||
|
|
||
|
|
||
| class RadixCacheManager(BaseCacheManager): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟠 HIGH - Public class RadixCacheManager missing docstring
Agent: python
Category: docs
Description:
Class lacks docstring explaining its purpose and caching strategy.
Suggestion:
Add docstring explaining that RadixCacheManager implements a radix tree-based cache for efficient prefix matching and sharing across requests.
Confidence: 70%
Rule: py_docstrings_required_for_public_apis_pep_257_style
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
| if len(self._free_slots) + self.manager.size_info.total_size != total_slots: | ||
| raise RuntimeError( | ||
| "CacheManager integrity check failed:" | ||
| f" free_slots({len(self._free_slots)}) +" | ||
| f" total_size({self.manager.size_info.total_size}) != num_pages({self.num_pages})" | ||
| f" total_size({self.manager.size_info.total_size}) != total_slots({total_slots})" | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🟡 MEDIUM - Law of Demeter violation: deep property chain access
Agent: refactoring
Category: quality
Description:
The code accesses self.manager.size_info.total_size, a 3-level property chain. This creates coupling to the manager's internal structure.
Suggestion:
Consider adding a convenience method like get_total_size() to reduce coupling, or document this as an accepted pattern for this NamedTuple structure
Confidence: 60%
Rule: quality_law_of_demeter
Review ID: 685f73e3-967f-44bd-a01e-1f1aec97e9f4
Rate it 👍 or 👎 to improve future reviews | Powered by diffray
Review Summary
Validated 87 issues: 30 kept, 57 filtered Issues Found: 30💬 See 16 individual line comment(s) for details. 📊 12 unique issue type(s) across 30 location(s) 📋 Full issue list (click to expand)🟠 HIGH - Public factory function create_kvcache missing docstring (8 occurrences)Agent: python Category: docs 📍 View all locations
Rule: 🟡 MEDIUM - Redundant boolean comparison with == False (2 occurrences)Agent: python Category: quality 📍 View all locations
Rule: 🟡 MEDIUM - Dead feature flag assertions checking hardcoded defaultsAgent: refactoring Category: quality File: Description: Lines 118 and 126 contain assertions that validate hardcoded class defaults. These assertions always evaluate the same way at import time since they check class attributes before argument parsing. Suggestion: Remove these assertions or add comments explaining they are intentional guards to catch accidental default changes in the dataclass definition. Confidence: 70% Rule: 🟡 MEDIUM - Assert used for configuration checking instead of raising exception (7 occurrences)Agent: python Category: quality 📍 View all locations
Rule: 🟡 MEDIUM - Property name contradicts return type annotationAgent: python Category: quality File: Description: The property 'tokenizer_create_addr' returns a bool but the name suggests it should return a string address like other properties (zmq_frontend_addr, zmq_tokenizer_addr, distributed_addr). Suggestion: Rename the property to 'should_create_tokenizer' or 'create_new_tokenizer' to match the boolean return type, or clarify intent with documentation. Confidence: 80% Rule: 🟡 MEDIUM - Full Tree Traversal During Memory Eviction (2 occurrences)Agent: performance Category: performance 📍 View all locations
Rule: 🟡 MEDIUM - Class-level counter without thread safetyAgent: python Category: quality File: Description: The class attribute 'counter' is incremented at class level (RadixTreeNode.counter += 1) without synchronization, which could cause issues in multi-threaded scenarios. Suggestion: If thread safety is required, use threading.Lock or itertools.count(). Otherwise, add documentation that this class is not thread-safe. Confidence: 65% Rule: 🟡 MEDIUM - Singleton Global Context PatternAgent: architecture Category: quality File: Description: Module-level _GLOBAL_CTX variable with set_global_ctx/get_global_ctx implements singleton pattern using global state, creating implicit dependencies. Suggestion: Consider using dependency injection to pass Context through function parameters where feasible. Confidence: 65% Rule: 🟡 MEDIUM - Parameter name shadows Python built-in 'type' (2 occurrences)Agent: python Category: quality 📍 View all locations
Rule: 🟡 MEDIUM - Magic numbers in alignment function (2 occurrences)Agent: python Category: quality 📍 View all locations
Rule: 🟡 MEDIUM - Law of Demeter violation: deep property chain accessAgent: refactoring Category: quality File: Description: The code accesses self.manager.size_info.total_size, a 3-level property chain. This creates coupling to the manager's internal structure. Suggestion: Consider adding a convenience method like Confidence: 60% Rule: 🔵 LOW - Using typing.Dict, List, Tuple instead of built-in syntax (2 occurrences)Agent: python Category: style 📍 View all locations
Rule: ℹ️ 14 issue(s) outside PR diff (click to expand)
🟠 HIGH - Missing module docstring (3 occurrences)Agent: python Category: docs 📍 View all locations
Rule: 🟠 HIGH - Assert used for input validation instead of raising exception (2 occurrences)Agent: python Category: bug 📍 View all locations
Rule: 🟡 MEDIUM - Property name contradicts return type annotationAgent: python Category: quality File: Description: The property 'tokenizer_create_addr' returns a bool but the name suggests it should return a string address like other properties (zmq_frontend_addr, zmq_tokenizer_addr, distributed_addr). Suggestion: Rename the property to 'should_create_tokenizer' or 'create_new_tokenizer' to match the boolean return type, or clarify intent with documentation. Confidence: 80% Rule: 🟡 MEDIUM - Full Tree Traversal During Memory Eviction (2 occurrences)Agent: performance Category: performance 📍 View all locations
Rule: 🟡 MEDIUM - Class-level counter without thread safetyAgent: python Category: quality File: Description: The class attribute 'counter' is incremented at class level (RadixTreeNode.counter += 1) without synchronization, which could cause issues in multi-threaded scenarios. Suggestion: If thread safety is required, use threading.Lock or itertools.count(). Otherwise, add documentation that this class is not thread-safe. Confidence: 65% Rule: 🟡 MEDIUM - Singleton Global Context PatternAgent: architecture Category: quality File: Description: Module-level _GLOBAL_CTX variable with set_global_ctx/get_global_ctx implements singleton pattern using global state, creating implicit dependencies. Suggestion: Consider using dependency injection to pass Context through function parameters where feasible. Confidence: 65% Rule: 🟡 MEDIUM - Magic numbers in alignment function (2 occurrences)Agent: python Category: quality 📍 View all locations
Rule: 🔵 LOW - Using typing.Dict, List, Tuple instead of built-in syntax (2 occurrences)Agent: python Category: style 📍 View all locations
Rule: Review ID: |
Summary
This PR removes the hardcoded restriction of
page_size=1, allowing the engine to be configured with variable page sizes (e.g., 16, 32). This functionality is propagated through the Engine, Scheduler, and KV Cache layers to support more efficient PagedAttention.Key Changes
--page-sizeargument toServerArgs.mha_pool.py): - Updatedkv_bufferinitialization to usetotal_slots(num_pages * page_size) instead of justnum_pages.dummy_pageandmax_seq_lencalculations to account for the configured page size.assert page_size == 1constraints.CacheManagerand memory managers (Naive/Radix) to accept and respect thepage_sizeparameter during initialization and integrity checks.