
Commit 6e823c9 ("update doc")

1 parent 6cfaac0

File tree: 8 files changed (+14, −12 lines)

cookbook/client/tinker/megatron/server_config.yaml

Lines changed: 2 additions & 1 deletion

@@ -21,6 +21,8 @@ applications:
 route_prefix: /api/v1 # API endpoint prefix (Tinker-compatible)
 import_path: server # Python module to import
 args:
+  server_config:
+    per_token_model_limit: 3 # Maximum number of models (adapters) per token (server-globally enforced)
 
 deployments:
 - name: TinkerCompatServer
@@ -95,7 +97,6 @@ applications:
 rps_limit: 20 # Max requests per second
 tps_limit: 16000 # Max tokens per second
 adapter_config:
-  per_token_adapter_limit: 3 # Max concurrent LoRA adapters
   adapter_timeout: 30 # Seconds before idle adapter unload
   adapter_max_lifetime: 36000 # Maximum lifetime of an adapter in seconds (e.g., 10 hours)
 deployments:
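The rename above also changes scope: the old `per_token_adapter_limit` was a setting under each deployment's `adapter_config`, while `per_token_model_limit` lives in `server_config` and is enforced once for the whole server. A minimal sketch of what server-global counting means (all names here are illustrative, not from the repo):

```python
from collections import defaultdict


class GlobalModelQuota:
    """One model counter per token, shared by every deployment on the server."""

    def __init__(self, limit: int):
        self.limit = limit
        self._models_per_token: dict[str, set[str]] = defaultdict(set)

    def register(self, token: str, model_id: str) -> None:
        models = self._models_per_token[token]
        # Re-registering a known model is free; only new models consume quota.
        if model_id not in models and len(models) >= self.limit:
            raise RuntimeError(
                f'Model limit exceeded: {len(models)}/{self.limit} models')
        models.add(model_id)


quota = GlobalModelQuota(limit=3)
for model in ('m1', 'm2', 'm3'):
    quota.register('token-a', model)
# A fourth distinct model for 'token-a' would raise RuntimeError,
# but a different token still has its own budget of three.
quota.register('token-b', 'm1')
```

Because the counter keys on the token rather than on a deployment, creating adapters through different deployments no longer buys extra quota.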

cookbook/client/tinker/megatron/server_config_7b.yaml

Lines changed: 1 addition & 1 deletion

@@ -22,7 +22,7 @@ applications:
 import_path: server # Python module to import
 args:
   server_config:
-    per_token_adapter_limit: 1 # Maximum number of adapters per token (globally)
+    per_token_model_limit: 1 # Maximum number of models (adapters) per token (server-globally enforced)
     supported_models:
     - Qwen/Qwen2.5-7B-Instruct
 deployments:

cookbook/client/tinker/transformer/server_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -21,7 +21,8 @@ applications:
 route_prefix: /api/v1 # API endpoint prefix (Tinker-compatible)
 import_path: server # Python module to import
 args:
-
+  server_config:
+    per_token_model_limit: 3 # Maximum number of models (adapters) per token (server-globally enforced)
 deployments:
 - name: TinkerCompatServer
 autoscaling_config:
@@ -52,7 +53,6 @@ applications:
 rps_limit: 100 # Max requests per second
 tps_limit: 100000 # Max tokens per second
 adapter_config:
-  per_token_adapter_limit: 30 # Max concurrent LoRA adapters
   adapter_timeout: 1800 # Seconds before idle adapter unload
 deployments:
 - name: ModelManagement

cookbook/client/twinkle/megatron/server_config.yaml

Lines changed: 2 additions & 2 deletions

@@ -21,7 +21,8 @@ applications:
 route_prefix: /server # API endpoint prefix
 import_path: server # Python module to import
 args:
-
+  server_config:
+    per_token_model_limit: 3 # Maximum number of models (adapters) per token (server-globally enforced)
 deployments:
 - name: TwinkleServer
 autoscaling_config:
@@ -50,7 +51,6 @@ applications:
 mesh: [0,1] # Device indices in the mesh
 mesh_dim_names: ['dp'] # Mesh dimension names: 'dp' = data parallel
 adapter_config:
-  per_token_adapter_limit: 30 # Max concurrent LoRA adapters
   adapter_timeout: 1800 # Seconds before idle adapter unload
 deployments:
 - name: ModelManagement

cookbook/client/twinkle/transformer/server_config.yaml

Lines changed: 2 additions & 3 deletions

@@ -21,7 +21,8 @@ applications:
 route_prefix: /server # API endpoint prefix
 import_path: server # Python module to import
 args:
-
+  server_config:
+    per_token_model_limit: 3 # Maximum number of models (adapters) per token (server-globally enforced)
 deployments:
 - name: TwinkleServer
 autoscaling_config:
@@ -40,7 +41,6 @@ applications:
 use_megatron: false # Use HuggingFace Transformers (not Megatron)
 model_id: "ms://Qwen/Qwen2.5-3B-Instruct" # ModelScope model identifier to load
 adapter_config:
-  per_token_adapter_limit: 30 # Max LoRA adapters that can be active simultaneously
   adapter_timeout: 1800 # Seconds before an idle adapter is unloaded
 nproc_per_node: 2 # Number of GPU processes per node
 device_group: # Logical device group for this model
@@ -103,7 +103,6 @@ applications:
 gpu_memory_utilization: 0.4
 max_model_len: 1024
 adapter_config: # Adapter lifecycle management
-  per_token_adapter_limit: 30 # Max LoRA adapters per user
   adapter_timeout: 1800 # Seconds before idle adapter is unloaded
 device_group:
   name: sampler
docs/source_en/Usage Guide/Server and Client/Server.md

Lines changed: 2 additions & 1 deletion

@@ -259,7 +259,6 @@ applications:
 use_megatron: false # Use Transformers backend
 model_id: "ms://Qwen/Qwen2.5-7B-Instruct" # ModelScope model identifier
 adapter_config: # LoRA adapter configuration
-  per_token_adapter_limit: 30 # Maximum number of LoRAs that can be activated simultaneously
   adapter_timeout: 1800 # Idle adapter timeout unload time (seconds)
 nproc_per_node: 2 # Number of GPU processes per node
 device_group: # Logical device group
@@ -354,6 +353,8 @@ applications:
 route_prefix: /api/v1 # Tinker protocol API prefix
 import_path: server
 args:
+  server_config:
+    per_token_model_limit: 30 # Maximum number of models (adapters) per token (server-global)
 deployments:
 - name: TinkerCompatServer
 autoscaling_config:
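Some of the configs in this commit add `server_config` explicitly while the docs imply it can be omitted, so a consumer of the parsed `args:` mapping needs a fallback. A hypothetical helper (the default value and function name here are assumptions, not taken from the repo) might read the new key like this:

```python
DEFAULT_PER_TOKEN_MODEL_LIMIT = 3  # assumed fallback when the key is absent


def resolve_model_limit(app_args: dict) -> int:
    """Read per_token_model_limit from an application's parsed `args:` mapping."""
    server_config = app_args.get('server_config') or {}
    return int(server_config.get('per_token_model_limit',
                                 DEFAULT_PER_TOKEN_MODEL_LIMIT))


print(resolve_model_limit({'server_config': {'per_token_model_limit': 30}}))  # 30
print(resolve_model_limit({}))  # falls back to the assumed default, 3
```

The `or {}` guard also covers the case where `server_config:` is present but left empty, which YAML parses as `None`.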

docs/source_zh/使用指引/服务端和客户端/服务端.md

Lines changed: 2 additions & 1 deletion

@@ -202,7 +202,6 @@ applications:
 use_megatron: false # 使用 Transformers 后端
 model_id: "ms://Qwen/Qwen2.5-7B-Instruct" # ModelScope 模型标识
 adapter_config: # LoRA 适配器配置
-  per_token_adapter_limit: 30 # 同时可激活的最大 LoRA 数量
   adapter_timeout: 1800 # 空闲适配器超时卸载时间(秒)
 nproc_per_node: 2 # 每节点 GPU 进程数
 device_group: # 逻辑设备组
@@ -297,6 +296,8 @@ applications:
 route_prefix: /api/v1 # Tinker 协议 API 前缀
 import_path: server
 args:
+  server_config:
+    per_token_model_limit: 30 # 每个 token 最多可创建的模型(适配器)数量(服务器全局生效)
 deployments:
 - name: TinkerCompatServer
 autoscaling_config:

src/twinkle/server/utils/state/model_manager.py

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ def add(self, model_id: str, record: ModelRecord) -> None:
         token = record.token
         current_ids = self._token_models.get(token, set())
         if len(current_ids) >= self._per_token_model_limit:
-            raise RuntimeError(f'Model limit exceeded for token {token[:8]}...: '
+            raise RuntimeError(f'Model limit exceeded: '
                                f'{len(current_ids)}/{self._per_token_model_limit} models')
         self._token_models.setdefault(token, set()).add(model_id)
         self._store[model_id] = record
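The hunk shows only the guard inside `add`; a runnable reconstruction of the surrounding class (the constructor and the shape of `ModelRecord` are assumptions filled in around the visible lines) behaves like this:

```python
from dataclasses import dataclass


@dataclass
class ModelRecord:
    token: str  # assumed: only this field is visible in the hunk


class ModelManager:
    def __init__(self, per_token_model_limit: int):
        self._per_token_model_limit = per_token_model_limit
        self._token_models: dict[str, set[str]] = {}
        self._store: dict[str, ModelRecord] = {}

    def add(self, model_id: str, record: ModelRecord) -> None:
        token = record.token
        current_ids = self._token_models.get(token, set())
        if len(current_ids) >= self._per_token_model_limit:
            raise RuntimeError(f'Model limit exceeded: '
                               f'{len(current_ids)}/{self._per_token_model_limit} models')
        self._token_models.setdefault(token, set()).add(model_id)
        self._store[model_id] = record
```

The changed line has a side benefit beyond the rename: the message no longer interpolates `token[:8]`, so raw token prefixes stop appearing in error strings and logs.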
