Commit 593f740

Change online model to qwen3.5-27b (#140)
1 parent 01dc476 commit 593f740

File tree

17 files changed: +76 −53 lines changed


.gitignore

Lines changed: 2 additions & 0 deletions

@@ -5,6 +5,8 @@ __pycache__/
 *.py[cod]
 *$py.class
 test.py
+test.sh
+twinkle-web
 # C extensions
 *.so

Dockerfile

Lines changed: 25 additions & 0 deletions

@@ -0,0 +1,25 @@
+FROM modelscope-registry.cn-hangzhou.cr.aliyuncs.com/modelscope-repo/modelscope:ubuntu22.04-cuda12.8.1-py311-torch2.9.1-1.35.0
+
+# Install miniconda with Python 3.12
+RUN curl -O https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
+    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
+    rm Miniconda3-latest-Linux-x86_64.sh
+ENV PATH="/opt/conda/bin:${PATH}"
+RUN conda create -n twinkle python=3.12 -y --override-channels -c conda-forge
+SHELL ["conda", "run", "-n", "twinkle", "/bin/bash", "-c"]
+
+# Clone and install twinkle, checkout to latest v-tag
+RUN git clone https://github.com/modelscope/twinkle.git
+WORKDIR /twinkle
+RUN echo "Available v-tags:" && git tag -l 'v*' --sort=-v:refname && \
+    LATEST_TAG=$(git tag -l 'v*' --sort=-v:refname | head -n 1) && \
+    echo "Checking out: $LATEST_TAG" && \
+    git checkout "$LATEST_TAG"
+
+RUN sh INSTALL_MEGATRON.sh
+
+RUN pip install --no-cache-dir tinker==0.14.0 "ray[serve]" transformers peft accelerate -U
+
+RUN pip install -e . --no-build-isolation
+
+CMD ["bash", "cookbook/client/server/megatron/run.sh"]
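The Dockerfile pins the checkout to the newest release tag via `git tag -l 'v*' --sort=-v:refname | head -n 1`. The same selection logic can be sketched in Python (a hypothetical helper, `latest_v_tag`; assumes plain `vMAJOR.MINOR[.PATCH]` tags, which is what `--sort=-v:refname` version-sorts):

```python
import re

def latest_v_tag(tags):
    # Mimic `git tag -l 'v*' --sort=-v:refname | head -n 1`:
    # keep vX.Y[.Z] tags and pick the highest by numeric version parts.
    v_tags = [t for t in tags if re.fullmatch(r"v\d+(\.\d+)*", t)]
    key = lambda t: tuple(int(p) for p in t[1:].split("."))
    return max(v_tags, key=key, default=None)

print(latest_v_tag(["v0.9.0", "v0.10.1", "v0.10.0", "nightly"]))  # → v0.10.1
```

Note that a numeric sort is essential here: a plain lexicographic sort would rank `v0.9.0` above `v0.10.1`.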

INSTALL_MEGATRON.sh

Lines changed: 2 additions & 0 deletions

@@ -81,6 +81,8 @@ MAX_JOBS=8 \
 FLASH_ATTENTION_FORCE_BUILD=TRUE \
 pip install flash-attn --no-build-isolation --no-cache-dir
 
+pip install flash-linear-attention -U
+
 # Install numpy
 echo ""
 echo "Installing numpy==2.2 and deep_gemm..."

README.md

Lines changed: 3 additions & 3 deletions

@@ -131,7 +131,7 @@ supported on Twinkle✨ framework.
 > For serverless training service accessed via `base_url=https://www.modelscope.cn/twinkle`, it
 > is currently provided via the Tinker-compatible APIs. We will be rolling out services that support
 > both Tinker APIs, as well as the full-fledged Twinkle✨ native APIs. The serverless endpoint is backed
-> by one training base at a time, and currently it is [Qwen3.5-4B](https://modelscope.cn/models/Qwen/Qwen3.5-4B).
+> by one training base at a time, and currently it is [Qwen3.5-27B](https://modelscope.cn/models/Qwen/Qwen3.5-27B).
 
 | Model Type | Model ID on [ModelScope](https://modelscope.cn) | Model Size | Requires | Support Megatron | HF Model ID |
 |---------------------|-----------------------------------------------------------------------------------------------------------------|:---------------------------------------:|----------------------|:----------------:|:---------------------------------------------------------------------------------------------------------:|
@@ -180,7 +180,7 @@ twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_me
 
 def train():
     # to load model from Hugging Face, use 'hf://...'
-    base_model = 'ms://Qwen/Qwen3.5-4B'
+    base_model = 'ms://Qwen/Qwen3.5-27B'
     # 1000 samples
     dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
     # Set template to prepare encoding
@@ -236,7 +236,7 @@ from twinkle.dataset import Dataset, DatasetMeta
 from twinkle.preprocessor import SelfCognitionProcessor
 from twinkle.server.common import input_feature_to_datum
 
-base_model = 'ms://Qwen/Qwen3.5-4B'
+base_model = 'ms://Qwen/Qwen3.5-27B'
 base_url='your-base-url'
 api_key='your-api-key'

README_ZH.md

Lines changed: 3 additions & 3 deletions

@@ -114,7 +114,7 @@ Twinkle✨ supports running the same algorithm interfaces on single-GPU, multi-node torchrun, Ray, and Cl
 As new models are released, we will add support for more models. The table below lists the models currently supported by the Twinkle✨ framework.
 
 >[!Note]
-> The serverless training service accessed via `base_url=https://www.modelscope.cn/twinkle` is currently provided through Tinker-compatible APIs. We will be rolling out services that support both the Tinker APIs and the full Twinkle✨ native APIs. The serverless endpoint is backed by one training base at a time; currently it is [Qwen3.5-4B](https://modelscope.cn/models/Qwen/Qwen3.5-4B)
+> The serverless training service accessed via `base_url=https://www.modelscope.cn/twinkle` is currently provided through Tinker-compatible APIs. We will be rolling out services that support both the Tinker APIs and the full Twinkle✨ native APIs. The serverless endpoint is backed by one training base at a time; currently it is [Qwen3.5-27B](https://modelscope.cn/models/Qwen/Qwen3.5-27B)
 
 | Model Type | Example Model ID | Model Size | Requires | Support Megatron | HF Model ID |
 |---------------------|-----------------------------------------------------------------------------------------------------------------|:---------------------------------------:|----------------------|:----------------:|:---------------------------------------------------------------------------------------------------------:|
@@ -162,7 +162,7 @@ twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_me
 
 def train():
     # to load model from Hugging Face, use 'hf://...'
-    base_model = 'ms://Qwen/Qwen3.5-4B'
+    base_model = 'ms://Qwen/Qwen3.5-27B'
     # 1000 samples
     dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
     # Set template to prepare encoding
@@ -218,7 +218,7 @@ from twinkle.dataset import Dataset, DatasetMeta
 from twinkle.preprocessor import SelfCognitionProcessor
 from twinkle.server.common import input_feature_to_datum
 
-base_model = 'ms://Qwen/Qwen3.5-4B'
+base_model = 'ms://Qwen/Qwen3.5-27B'
 base_url='your-base-url'
 api_key='your-api-key'

cookbook/client/server/megatron/server.py

Lines changed: 1 addition & 1 deletion

@@ -15,7 +15,7 @@
 
 # Resolve the path to server_config.yaml relative to this script's location
 file_dir = os.path.abspath(os.path.dirname(__file__))
-config_path = os.path.join(file_dir, 'server_config_4b.yaml')
+config_path = os.path.join(file_dir, 'server_config.yaml')
 
 # Launch the Twinkle server — this call blocks until the server is shut down
 launch_server(config_path=config_path)

cookbook/client/server/megatron/server_config.yaml

Lines changed: 23 additions & 20 deletions

@@ -36,29 +36,32 @@ applications:
 
   # 3. Sampler Service - Runs inference / sampling using vLLM engine
   # Used for generating text from the model (e.g., evaluating LoRA results).
-  - name: sampler-Qwen3.5-4B
-    route_prefix: /api/v1/sampler/Qwen/Qwen3.5-4B
+  # Config: TP=2 x DP=2 on 4 GPUs, ~27GB weights/GPU, ~37GB for KV cache + LoRA
+  - name: sampler-Qwen3.5-27B
+    route_prefix: /api/v1/sampler/Qwen/Qwen3.5-27B
     import_path: sampler
     args:
-      model_id: "ms://Qwen/Qwen3.5-4B"   # ModelScope model identifier
-      nproc_per_node: 4                  # Number of GPU processes per node
+      model_id: "ms://Qwen/Qwen3.5-27B"  # ModelScope model identifier
+      nproc_per_node: 8                  # Number of GPU processes per node
       sampler_type: vllm                 # Inference engine: 'vllm' (fast) or 'torch' (TorchSampler)
       engine_args:                       # vLLM engine-specific settings
-        max_model_len: 16000             # Maximum sequence length the engine supports
-        gpu_memory_utilization: 0.85     # Fraction of GPU memory to use (0.0-1.0)
+        max_model_len: 32000             # Maximum sequence length the engine supports
+        gpu_memory_utilization: 0.80     # 80% utilization, ~64GB/GPU, leaves buffer for safety
         enable_lora: true                # Allow loading LoRA adapters during inference
         max_loras: 5                     # Max allowed loras working on vLLM at the same time
+        max_lora_rank: 32                # Support up to rank-32 LoRA adapters
      device_group:                       # Logical device group for the sampler
        name: sampler
-        gpus_per_worker: 1
+        gpus_per_worker: 2
        ranks: 4                          # GPU rank indices to use
        device_type: cuda
      device_mesh:
        device_type: cuda
-        dp_size: 4
+        dp_size: 2
+        tp_size: 2                       # 2 TP replicas for multi-tenant throughput
      queue_config:
        rps_limit: 20                     # Max requests per second
-        tps_limit: 16000                  # Max tokens per second
+        tps_limit: 32000                  # Max tokens per second
      deployments:
        - name: SamplerManagement
          autoscaling_config:
@@ -71,29 +74,29 @@ applications:
       env_vars:
         TWINKLE_TRUST_REMOTE_CODE: "0"
 
-  # 2. Model Service (commented out) - Would host the base model for training.
-  # Uncomment and configure if you need a training model worker.
-  - name: models-Qwen3.5-4B
-    route_prefix: /api/v1/model/Qwen/Qwen3.5-4B
+  # 2. Model Service - Hosts the base model for training.
+  # Config: PP=2 x DP=2 on 4 GPUs, ~27GB weights/GPU, comfortable for LoRA training
+  - name: models-Qwen3.5-27B
+    route_prefix: /api/v1/model/Qwen/Qwen3.5-27B
     import_path: model
     args:
-      use_megatron: true                 # Use HuggingFace Transformers backend
-      model_id: "ms://Qwen/Qwen3.5-4B"   # ModelScope model identifier
-      max_length: 16000                  # model max length
+      use_megatron: true                 # Use Megatron-LM backend
+      model_id: "ms://Qwen/Qwen3.5-27B"  # ModelScope model identifier
+      max_length: 32000                  # model max length
       max_loras: 5                       # model max loras
-      nproc_per_node: 4                  # Number of GPU processes per node
+      nproc_per_node: 8                  # Number of GPU processes per node
      device_group:
        name: model
        ranks: 4                          # GPU rank indices
        device_type: cuda
      device_mesh:
        device_type: cuda
-        dp_size: 4
-        ep_size: 2
+        dp_size: 2                       # 2-way data parallel
+        pp_size: 2                       # 2-way pipeline parallel (~27GB/GPU)
 
      queue_config:
        rps_limit: 20                     # Max requests per second
-        tps_limit: 16000                  # Max tokens per second
+        tps_limit: 32000                  # Max tokens per second
      adapter_config:
        adapter_timeout: 30               # Seconds before idle adapter unload
        adapter_max_lifetime: 36000       # Maximum lifetime of an adapter in seconds (e.g., 10 hours)
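The "~27GB weights/GPU" figures in the config comments follow from simple arithmetic: 27B parameters in bf16 (2 bytes each), sharded two ways (TP=2 for the sampler, PP=2 for the trainer). A rough back-of-envelope sketch (`weights_per_gpu_gb` is a hypothetical helper; it ignores KV cache, activations, and optimizer state):

```python
def weights_per_gpu_gb(params_billion, bytes_per_param=2, shards=1):
    # bf16 weights (2 bytes/param) evenly sharded across `shards` GPUs,
    # reported in GB (1 GB = 1e9 bytes)
    return params_billion * bytes_per_param / shards

# 27B params in bf16, split 2 ways (TP=2 or PP=2) -> ~27 GB/GPU,
# matching the "~27GB weights/GPU" comments in the config above
print(weights_per_gpu_gb(27, shards=2))  # → 27.0
```

At `gpu_memory_utilization: 0.80` on an 80GB card, that leaves roughly 64 − 27 ≈ 37GB for KV cache and LoRA adapters, consistent with the sampler comment.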

cookbook/client/tinker/modelscope/sample.py

Lines changed: 2 additions & 2 deletions

@@ -16,7 +16,7 @@
 
 from tinker import ServiceClient
 
-base_model = 'Qwen/Qwen3.5-4B'
+base_model = 'Qwen/Qwen3.5-27B'
 base_url = 'http://www.modelscope.cn/twinkle'
 
 # Step 2: Define the base model and connect to the server
@@ -29,7 +29,7 @@
 # The model_path is a twinkle:// URI pointing to a previously saved LoRA checkpoint.
 # The server will load the base model and apply the LoRA adapter weights.
 sampling_client = service_client.create_sampling_client(
-    model_path='twinkle://xxx-Qwen_Qwen3.5-4B-xxx/weights/twinkle-lora-1',
+    model_path='twinkle://xxx-Qwen_Qwen3.5-27B-xxx/weights/twinkle-lora-1',
     base_model=base_model
 )

cookbook/client/tinker/modelscope/self_cognition.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -23,7 +23,7 @@
2323
from tinker import ServiceClient
2424

2525
# The base model to fine-tune / evaluate
26-
base_model = 'Qwen/Qwen3.5-4B'
26+
base_model = 'Qwen/Qwen3.5-27B'
2727
base_url = 'http://www.modelscope.cn/twinkle'
2828

2929

cookbook/client/tinker/modelscope/short_math_grpo.py

Lines changed: 1 addition & 1 deletion

@@ -38,7 +38,7 @@
 logger = get_logger()
 
 # ========== Configuration ==========
-BASE_MODEL = 'Qwen/Qwen3.5-4B'
+BASE_MODEL = 'Qwen/Qwen3.5-27B'
 NUM_GENERATIONS = 8
 MAX_NEW_TOKENS = 4096
 LEARNING_RATE = 1e-4
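`short_math_grpo.py` samples `NUM_GENERATIONS = 8` completions per prompt and trains with GRPO, whose core idea is group-relative advantages: each completion's reward is normalized against the other completions for the same prompt. A minimal illustration of that formula (not the script's actual implementation):

```python
def grpo_advantages(rewards):
    # Group-relative advantage: a_i = (r_i - mean(group)) / std(group),
    # computed over the generations sampled for one prompt.
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0:          # all rewards equal -> no learning signal for this group
        return [0.0] * n
    return [(r - mean) / std for r in rewards]

# 8 completions per prompt (NUM_GENERATIONS = 8); reward 1.0 if the math
# answer is correct, else 0.0 -- correct samples get positive advantage.
advs = grpo_advantages([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
```

Because the baseline is the group mean rather than a learned value function, no critic model is needed, which is what keeps this recipe lightweight.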
