From f164ef084b6e678612d4a4ed676aeed2f92c8c6d Mon Sep 17 00:00:00 2001 From: "baijin.xh" Date: Mon, 2 Mar 2026 10:21:18 +0800 Subject: [PATCH] update the supported models --- README.md | 54 +- README_ZH.md | 65 +- docs/source_en/Usage Guide/Quick-Start.md | 917 +----------------- ...53\351\200\237\345\274\200\345\247\213.md" | 917 +----------------- 4 files changed, 146 insertions(+), 1807 deletions(-) diff --git a/README.md b/README.md index 92e807c2..8a9944c0 100644 --- a/README.md +++ b/README.md @@ -112,35 +112,31 @@ supported on Twinkle✨ framework. > both Tinker APIs, as well as the full-fledged Twinkle✨ native APIs. The serverless endpoint is backed > by one training base at a time, and currently it is [Qwen3-30B-A3B-Instruct-2507](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Instruct-2507). - -| Model Type | Model ID on [ModelScope](https://modelscope.cn) | Requires | Megatron Support | HF Model ID | -| ------------------- |--------------------------------------------------------------------------------------------------------------------------| -------------------- | ---------------- | ---------------------------------------------------------------------------------------------------------- | -| qwen3 series | [Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base)~32B | transformers>=4.51 | ✅ | [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) | -| qwen3_moe series | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | transformers>=4.51 | ✅ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) | -| | [Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B)~235B | transformers>=4.51 | ✅ | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | -| qwen2 series | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) ~72B | transformers>=4.37 | ✅ | 
[Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | -| | [Qwen/Qwen2.5-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct)~72B | transformers>=4.37 | ✅ | [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | -| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B)~72B | transformers>=4.37 | ✅ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | -| qwen2_moe series | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | transformers>=4.40 | ✅ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) | -| chatglm4 series | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) | -| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) | -| glm_edge series | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) | -| | [ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat) | -| internlm2 series | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) | -| | [Shanghai_AI_Laboratory/internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | -| deepseek_v1 | 
[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat) | transformers>=4.39.4 | ✅ | —— |
-| | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | transformers>=4.39.3 | ✅ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
-| | [deepseek-ai/DeepSeek-V2.5](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2.5) | transformers>=4.39.3 | ✅ | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) |
-| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | transformers>=4.39.3 | ✅ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
-| deepSeek-r1-distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) ~32B | transformers>=4.37 | ✅ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
+| Model Type | Model ID on [ModelScope](https://modelscope.cn) | Model Size | Requires | Megatron Support | HF Model ID |
+| ------------------- | ------------------------------------------------------------ | :-------------------------------------: | -------------------- | :--------------: | :----------------------------------------------------------: |
+| qwen3 series | [Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base) | 0.6B/1.7B/4B/8B/14B | transformers>=4.51 | ✔ | [Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base) |
+| | [Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B) | 0.6B/1.7B/4B/8B/14B/32B | transformers>=4.51 | ✔ | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
+| qwen3_moe series | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | 30B-A3B/30B-A3B-Base/235B-A22B | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) |
+| qwen2 
series | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | +| | [Qwen/Qwen2-1.5B](https://modelscope.cn/models/Qwen/Qwen2-1.5B) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) | +| | [Qwen/Qwen2.5-1.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-1.5B-Instruct) | 0.5B/1.5B/3B/7B/14B/32B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | +| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B) | 0.5B/1.5B/3B/7B/14B/32B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | +| qwen2_moe series | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) | +| | [Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) | +| chatglm3 series | [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) | 6b/6b-base/6b-32k/6b-128k | transformers<4.42 | ✘ | [zai-org/chatglm3-6b](https://huggingface.co/zai-org/chatglm3-6b) | +| chatglm4 series | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | glm-4-9b/glm-4-9b-chat/glm-4-9b-chat-1m | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) | +| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | - | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) | +| glm_edge series | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | 
1.5b-chat/4b-chat | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) |
+| internlm2 series | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | 1_8b/chat-1_8b-sft/base-7b/7b/chat-7b | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) |
+| deepseek_v1 | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | V2/V2-Lite/V2-Chat/V2-Lite-Chat/V2.5 | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
+| | [deepseek-ai/DeepSeek-Prover-V2-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-7B) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-Prover-V2-7B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B) |
+| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
+| deepseek-r1-distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 1.5B/7B/14B/32B | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |
 
 For more detailed model support list 👉 [Quick Start](docs/source_en/Usage%20Guide/Quick-Start.md)
 
 ## Sample Code
 
-Below are some of the capabilities demonstrated in the example code. For a complete introduction to training capabilities,
-please refer to [Quick Start](docs/source_en/Usage%20Guide/Quick-Start.md) and [cookbook](cookbook).
-
 ### Train with Ray
 
 ```python
@@ -160,7 +156,7 @@ twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_me
 def train():
     # to load model from Hugging Face, use 'hf://...'
-    base_model = 'ms://Qwen/Qwen3-4B'
+    base_model = 'ms://Qwen/Qwen2.5-7B-Instruct'
     # 1000 samples
     dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
     # Set template to prepare encoding
@@ -210,20 +206,20 @@ if __name__ == '__main__':
 import os
 from tqdm import tqdm
 from tinker import types
-from twinkle import init_tinker_client
+from twinkle_client import init_tinker_client
 from twinkle.dataloader import DataLoader
 from twinkle.dataset import Dataset, DatasetMeta
 from twinkle.preprocessor import SelfCognitionProcessor
 from twinkle.server.tinker.common import input_feature_to_datum
 
 base_model = 'ms://Qwen/Qwen3-30B-A3B-Instruct-2507'
-base_url='your-base-url'
-api_key='your-api-key'
+base_url='https://www.modelscope.cn/twinkle'
+api_key=os.environ.get('MODELSCOPE_TOKEN')
 
 # Use twinkle dataset to load the data
 dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
 dataset.set_template('Template', model_id=base_model, max_length=256)
-dataset.map(SelfCognitionProcessor('twinkle Model', 'ModelScope Team'), load_from_cache_file=False)
+dataset.map(SelfCognitionProcessor('twinkle Model', 'twinkle Team'), load_from_cache_file=False)
 dataset.encode(batched=True, load_from_cache_file=False)
 
 dataloader = DataLoader(dataset=dataset, batch_size=8)
diff --git a/README_ZH.md b/README_ZH.md
index e2b60dcc..0273fa57 100644
--- a/README_ZH.md
+++ b/README_ZH.md
@@ -67,11 +67,9 @@ pip install -e .
| twinkle 客户端微调 | megatron | [脚本](cookbook/client/twinkle/megatron) | | twinkle 客户端微调 | transformer | [脚本](cookbook/client/twinkle/transformer) | -Twinkle✨支持相同的算法接口运行在单GPU、torchrun多机、Ray、Client等各场景下。其算法过程是外露的,非常便于修改和调试。完整的框架介绍请查看[快速开始](docs/source_zh/使用指引/快速开始.md) - ## 更新日志 -🎉2026-02-13 Twinkle✨ 初始版本发布,支持文本模型的SFT/PT/RL训练。我们还通过兼容Tinker的API,在魔搭社区上提供了无服务器训练功能。 +- 🎉2026-02-13 Twinkle✨ 初始版本发布,包括对文本模型的 SFT/PT/RL 支持以及在 [ModelScope](https://modelscope.cn) 上的无服务器训练能力。 ## ModelScope 的训练服务 @@ -90,37 +88,34 @@ Twinkle✨支持相同的算法接口运行在单GPU、torchrun多机、Ray、Cl 随着新模型的发布,我们将添加对更多模型的支持。下表列出了 Twinkle✨ 框架当前支持的模型。 ->[!Note] -> 通过 `base_url=https://www.modelscope.cn/twinkle` 访问的无服务器训练服务,目前是通过兼容Tinker的API提供的。我们将陆续推出同时支持Tinker API和完整Twinkle✨原生 API的服务。无服务器端点每次由一个训练基座支持,目前使用的是[Qwen3-30B-A3B-Instruct-2507](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Instruct-2507)。 - - -| 模型类型 | [ModelScope](https://modelscope.cn) 上的模型 ID | 要求 | Megatron 支持 | HF 模型 ID | -| ----------------- |--------------------------------------------------------------------------------------------------------------------------| -------------------- | -------------- | ---------------------------------------------------------------------------------------------------------- | -| qwen3 系列 | [Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base)~32B | transformers>=4.51 | ✅ | [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) | -| qwen3_moe 系列 | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | transformers>=4.51 | ✅ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) | -| | [Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B)~235B | transformers>=4.51 | ✅ | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | -| qwen2 系列 | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) ~72B | transformers>=4.37 | ✅ | 
[Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | -| | [Qwen/Qwen2.5-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct)~72B | transformers>=4.37 | ✅ | [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | -| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B)~72B | transformers>=4.37 | ✅ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | -| qwen2_moe 系列 | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | transformers>=4.40 | ✅ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) | -| chatglm4 系列 | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) | -| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) | -| glm_edge 系列 | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) | -| | [ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat) | -| internlm2 系列 | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) | -| | [Shanghai_AI_Laboratory/internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | -| deepseek_v1 | 
[deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat) | transformers>=4.39.4 | ✅ | —— |
-| | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | transformers>=4.39.3 | ✅ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
-| | [deepseek-ai/DeepSeek-V2.5](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2.5) | transformers>=4.39.3 | ✅ | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) |
-| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | transformers>=4.39.3 | ✅ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
-| deepSeek-r1-distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) ~32B | transformers>=4.37 | ✅ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
+>[!NOTE]
+> 对于通过 `base_url=https://www.modelscope.cn/twinkle` 访问的无服务器训练服务,目前一次只支持一个训练基座,当前是 [Qwen3-30B-A3B-Instruct-2507](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Instruct-2507)。
+
+| Model Type | Model ID 举例 | Model Size | Requires | Megatron Support | HF Model ID |
+| ------------------- | ------------------------------------------------------------ | :-------------------------------------: | -------------------- | :--------------: | :----------------------------------------------------------: |
+| qwen3 全系列 | [Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base) | 0.6B/1.7B/4B/8B/14B | transformers>=4.51 | ✔ | [Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base) |
+| | [Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B) | 0.6B/1.7B/4B/8B/14B/32B | transformers>=4.51 | ✔ | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
+| qwen3_moe 全系列 | 
[Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | 30B-A3B/30B-A3B-Base/235B-A22B | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) |
+| qwen2 全系列 | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |
+| | [Qwen/Qwen2-1.5B](https://modelscope.cn/models/Qwen/Qwen2-1.5B) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) |
+| | [Qwen/Qwen2.5-1.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-1.5B-Instruct) | 0.5B/1.5B/3B/7B/14B/32B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
+| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B) | 0.5B/1.5B/3B/7B/14B/32B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
+| qwen2_moe 全系列 | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) |
+| | [Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) |
+| chatglm3 全系列 | [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) | 6b/6b-base/6b-32k/6b-128k | transformers<4.42 | ✘ | [zai-org/chatglm3-6b](https://huggingface.co/zai-org/chatglm3-6b) |
+| chatglm4 全系列 | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | glm-4-9b/glm-4-9b-chat/glm-4-9b-chat-1m | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) |
+| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | - | 
transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) |
+| glm_edge 全系列 | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | 1.5b-chat/4b-chat | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) |
+| internlm2 全系列 | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | 1_8b/chat-1_8b-sft/base-7b/7b/chat-7b | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) |
+| deepseek_v1 | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | V2/V2-Lite/V2-Chat/V2-Lite-Chat/V2.5 | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
+| | [deepseek-ai/DeepSeek-Prover-V2-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-7B) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-Prover-V2-7B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B) |
+| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
+| deepseek-r1-distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 1.5B/7B/14B/32B | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |
 
 更详细的模型支持列表 👉 [快速开始.md](docs/source_zh/使用指引/快速开始.md)
 
 ## 示例代码
 
-下面列出了示例代码的一部分能力。完整的训练能力介绍请参考[快速开始](docs/source_zh/使用指引/快速开始.md)以及[cookbook](cookbook)。
-
 ### 使用 Ray 训练
 
 ```python
@@ -140,7 +135,7 @@ twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_me
 def train():
     # to load model from Hugging Face, use 'hf://...'
-    base_model = 'ms://Qwen/Qwen3-4B'
+    base_model = 'ms://Qwen/Qwen2.5-7B-Instruct'
     # 1000 samples
     dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
     # Set template to prepare encoding
@@ -184,26 +179,26 @@ if __name__ == '__main__':
     train()
 ```
 
-### 使用类 Tinker API实现无服务器式训练
+### 使用类 Tinker API
 
 ```python
 import os
 from tqdm import tqdm
 from tinker import types
-from twinkle import init_tinker_client
+from twinkle_client import init_tinker_client
 from twinkle.dataloader import DataLoader
 from twinkle.dataset import Dataset, DatasetMeta
 from twinkle.preprocessor import SelfCognitionProcessor
 from twinkle.server.tinker.common import input_feature_to_datum
 
 base_model = 'ms://Qwen/Qwen3-30B-A3B-Instruct-2507'
-base_url='your-base-url'
-api_key='your-api-key'
+base_url='https://www.modelscope.cn/twinkle'
+api_key=os.environ.get('MODELSCOPE_TOKEN')
 
 # Use twinkle dataset to load the data
 dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500)))
 dataset.set_template('Template', model_id=base_model, max_length=256)
-dataset.map(SelfCognitionProcessor('twinkle Model', 'ModelScope Team'), load_from_cache_file=False)
+dataset.map(SelfCognitionProcessor('twinkle Model', 'twinkle Team'), load_from_cache_file=False)
 dataset.encode(batched=True, load_from_cache_file=False)
 
 dataloader = DataLoader(dataset=dataset, batch_size=8)
diff --git a/docs/source_en/Usage Guide/Quick-Start.md b/docs/source_en/Usage Guide/Quick-Start.md
index d00a3560..784d1c49 100644
--- a/docs/source_en/Usage Guide/Quick-Start.md
+++ b/docs/source_en/Usage Guide/Quick-Start.md
@@ -4,10 +4,10 @@
 
 A component library for large model training. Based on PyTorch, simpler, more flexible, production-ready.
 
-🧩 Loosely Coupled Architecture · Standardized Interfaces<br>
-🚀 Multiple Runtime Modes · torchrun / Ray / HTTP
-🔌 Multi-Framework Compatible · Transformers / Megatron
-👥 Multi-Tenant Support
+🧩 `Loosely Coupled Architecture` · Standardized Interfaces<br>
+🚀 `Multiple Runtime Modes` · torchrun / Ray / HTTP<br>
+🔌 `Multi-Framework Compatible` · Transformers / Megatron<br>
` +👥 ``Multi-Tenant Support `` · Single Base Model Deployment ## Twinkle Compatibility @@ -28,824 +28,27 @@ Twinkle and [ms-swift](https://github.com/modelscope/ms-swift) are both model tr - If you need other capabilities like inference, deployment, quantization - If you are sensitive to new model training support, Swift guarantees day-0 update capability -## Usage Patterns - -### Using Only Partial Components - -Developers can use only a portion of Twinkle's components, combining them with their own existing code to complete training work. For example, using only Dataset & DataLoader: - -```python -from twinkle.dataset import PackingDataset, DatasetMeta -from twinkle.dataloader import DataLoader -from twinkle.preprocessor import SelfCognitionProcessor - -def train(): - dataset_meta = DatasetMeta( - dataset_id='ms://swift/self-cognition', - ) - - dataset = PackingDataset(dataset_meta) - dataset.map(SelfCognitionProcessor(model_name='Twinkle Model', model_author='ModelScope Community')) - dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B', max_length=512) - dataset.encode() - dataset.pack_dataset() - - dataloader = DataLoader(dataset, batch_size=8) - for data in dataloader: - print(data) - """ - { - "input_ids": [...], - "position_ids": [...], - ... - } - """ - break - -if __name__ == '__main__': - train() -``` -In the code above, we use PackingDataset to load a dataset called `swift/self-cognition`. PackingDataset can be used to bin-pack data, ensuring that each batch has a length similar to the configured maximum length. -In the loop, we simply used print to display the output. In actual use, you can continue writing your custom training code below. - -All of Twinkle's components support being used separately. Please refer to the component list in the sections below. - -### Single GPU - -Twinkle supports running training on a single GPU. 
Here is an example: - -```python -from peft import LoraConfig - -from twinkle import get_device_placement, get_logger -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.model import TransformersModel -from twinkle.preprocessor import SelfCognitionProcessor - -logger = get_logger() - - -def train(): - # 1000 samples - dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000))) - # Set template to prepare encoding - dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B') - # Preprocess the dataset to standard format - dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community')) - # Encode dataset - dataset.encode() - # Global batch size = 8, for GPUs, so 1 sample per GPU - dataloader = DataLoader(dataset=dataset, batch_size=8) - # Use a TransformersModel - model = TransformersModel(model_id='ms://Qwen/Qwen3-4B') - - lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear') - - # Add a lora to model, with name `default` - # Comment this to use full-parameter training - model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2) - # Add Optimizer for lora `default` - model.set_optimizer(optimizer_cls='AdamW', lr=1e-4) - # Add LRScheduler for lora `default` - model.set_lr_scheduler( - scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader)) - logger.info(get_device_placement()) - # Print the training config - logger.info(model.get_train_configs()) - logger.info(f'Total steps: {len(dataloader)}') - for step, batch in enumerate(dataloader): - # Do forward and backward - model.forward_backward(inputs=batch) - # Step - model.clip_grad_and_step() - if step % 20 == 0: - # Print metric - metric = model.calculate_metric(is_training=True) - logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}') - model.save(f'last-checkpoint') - - -if __name__ == '__main__': - train() 
-``` - -In this training code, we constructed a dataset and loaded the Qwen/Qwen3-4B model, used LoRA with the all-linear approach, and completed one training run. In the logs, you can observe the process of loss gradually converging. - -### torchrun - -Twinkle supports running training in torchrun mode. In this scenario, Ray-related dependencies do not need to be installed. - -```python -from peft import LoraConfig - -import twinkle -from twinkle import DeviceMesh, get_device_placement, get_logger -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.model import TransformersModel -from twinkle.preprocessor import SelfCognitionProcessor - -# Construct a device_mesh, fsdp=4, dp=2 -device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2) -# use torchrun mode -twinkle.initialize(mode='local', global_device_mesh=device_mesh) - -logger = get_logger() - - -def train(): - # 1000 samples - dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000))) - # Set template to prepare encoding - dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B') - # Preprocess the dataset to standard format - dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community')) - # Encode dataset - dataset.encode() - # Global batch size = 8, for GPUs, so 1 sample per GPU - dataloader = DataLoader(dataset=dataset, batch_size=8) - # Use a TransformersModel - model = TransformersModel(model_id='ms://Qwen/Qwen3-4B') - - lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear') - - # Add a lora to model, with name `default` - # Comment this to use full-parameter training - model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2) - # Add Optimizer for lora `default` - model.set_optimizer(optimizer_cls='AdamW', lr=1e-4) - # Add LRScheduler for lora `default` - model.set_lr_scheduler( - scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, 
num_training_steps=len(dataloader)) - logger.info(get_device_placement()) - # Print the training config - logger.info(model.get_train_configs()) - logger.info(f'Total steps: {len(dataloader)}') - for step, batch in enumerate(dataloader): - # Do forward and backward - model.forward_backward(inputs=batch) - # Step - model.clip_grad_and_step() - if step % 20 == 0: - # Print metric - metric = model.calculate_metric(is_training=True) - logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}') - model.save(f'last-checkpoint') - - -if __name__ == '__main__': - train() -``` - -In the code above, we constructed a hybrid parallel mode combining FSDP2 and DP, and used 8 GPUs for training. You can see that it is basically the same as the single-GPU training code, except that `DeviceMesh` is used to declare the model layout. - -When running, you need to launch training like this: - -```shell -CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 train.py -``` - -### Ray Training - -[Ray](https://github.com/ray-project/ray) is a commonly used scheduling middleware framework for multi-machine model training and inference scenarios. It provides additional optimizations for multi-model, multi-device execution and resource management, and supports integration with Kubernetes systems for production deployment. These characteristics make it particularly suitable for complex training scenarios such as RL and GKD. 
- -Twinkle supports using Ray for training and sampling, and its code is almost identical to the training API above: - -```python -import os -from typing import List, Tuple, Dict, Any -from peft import LoraConfig -import twinkle -from twinkle import DeviceMesh, DeviceGroup, get_device_placement -from twinkle.advantage import GRPOAdvantage -from twinkle.checkpoint_engine import CheckpointEngineManager -from twinkle.data_format import SamplingParams -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.model.megatron import MegatronModel -from twinkle.metric import CompletionRewardMetric -from twinkle.preprocessor.llm import GSM8KProcessor -from twinkle.processor import InputProcessor -from twinkle.reward import GSM8KAccuracyReward, GSM8KFormatReward -from twinkle.sampler import vLLMSampler -from twinkle.template import Template - -MODEL_ID = os.environ.get('MODEL_ID', 'ms://Qwen/Qwen3-4B') -MODEL_GPUS = int(os.environ.get('MODEL_GPUS', 4)) -SAMPLER_GPUS = int(os.environ.get('SAMPLER_GPUS',4)) -NUM_GPUS = MODEL_GPUS + SAMPLER_GPUS -NUM_GENERATIONS = int(os.environ.get('NUM_GENERATIONS', 8)) -MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 4096)) -LEARNING_RATE = float(os.environ.get('LR', 1e-5)) -MAX_STEPS = int(os.environ.get('MAX_STEPS', 200)) -BATCH_SIZE = int(os.environ.get('BATCH_SIZE', 16)) # global prompt-level, global completion-level batch size = BATCH_SIZE * num_generations * dp_size -MINI_BATCH_SIZE = int(os.environ.get('MINI_BATCH_SIZE', 16)) # global completion-level mini-batch-size -MICRO_BATCH_SIZE = int(os.environ.get('MICRO_BATCH_SIZE', 2)) # per-device-micro-batch-size (completion-level), batch_size in forward_backward -GRADIENT_ACCUMULATION_STEPS = int(os.environ.get('GRADIENT_ACCUMULATION_STEPS', 1)) -ADAPTER_NAME = 'default' - -def create_gsm8k_dataset(): - dataset = Dataset(DatasetMeta('ms://modelscope/gsm8k', subset_name='main', split='train')) - dataset.set_template('Template', 
model_id=MODEL_ID, max_length=2048) - dataset.map(GSM8KProcessor()) - dataset.encode(add_generation_prompt=True) - return dataset - -def compute_rewards( - trajectories: List[Dict[str, Any]], -) -> Tuple[List[float], List[float], List[float]]: - accuracy_reward_fn = GSM8KAccuracyReward() - format_reward_fn = GSM8KFormatReward() - accuracy_rewards = accuracy_reward_fn(trajectories) - format_rewards = format_reward_fn(trajectories) - total_rewards = [a + f for a, f in zip(accuracy_rewards, format_rewards)] - return total_rewards, format_rewards, accuracy_rewards - -def main(): - # Place the sampler and the model in separate groups so they use different GPUs - device_groups = [ - DeviceGroup(name='model', ranks=list(range(MODEL_GPUS)), device_type='GPU'), - DeviceGroup(name='sampler', ranks=list(range(MODEL_GPUS, NUM_GPUS)), device_type='GPU'), - ] - model_mesh = DeviceMesh.from_sizes(world_size=MODEL_GPUS, dp_size=MODEL_GPUS) - sampler_mesh = DeviceMesh.from_sizes(world_size=SAMPLER_GPUS, dp_size=SAMPLER_GPUS) - twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False) - - lora_config = LoraConfig(target_modules='all-linear', r=32, lora_alpha=64, lora_dropout=0.05) - model = MegatronModel(model_id=MODEL_ID, device_mesh=model_mesh, remote_group='model', mixed_precision='bf16') - model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1) - model.set_optimizer('default', lr=LEARNING_RATE) - model.set_lr_scheduler('default', lr_decay_steps=MAX_STEPS, max_lr=LEARNING_RATE) - model.set_loss('GRPOLoss', epsilon=0.2) - model.set_processor(InputProcessor) - model.set_template('Template', model_id=MODEL_ID) - - sampler = vLLMSampler( - model_id=MODEL_ID, - engine_args={ - 'gpu_memory_utilization': 0.8, - 'max_model_len': 4096, - 'max_lora_rank': 32, # same as r in lora_config - 'enable_lora': True, - }, - device_mesh=sampler_mesh, - remote_group='sampler', - ) - sampler.set_template(Template, model_id=MODEL_ID) - ckpt_manager = 
CheckpointEngineManager(model=model, sampler=sampler) - dataloader = DataLoader( - dataset=create_gsm8k_dataset, - batch_size=BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS, - min_batch_size=BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS, - device_mesh=model_mesh, - remote_group='model', - ) - advantage_fn = GRPOAdvantage() - metrics = CompletionRewardMetric() - sampling_params = SamplingParams(max_tokens=MAX_NEW_TOKENS) - optim_step = 0 - print(get_device_placement()) - - for batch in dataloader: - if optim_step >= MAX_STEPS: - break - metrics.reset() - global_prompts = batch if isinstance(batch, list) else [batch] - ckpt_manager.sync_weights(merge_and_sync=False) - sampler.reset_prefix_cache() - sample_response = sampler.sample( - global_prompts*NUM_GENERATIONS, - sampling_params, - num_samples=1, - ) - all_input_data: List[Dict[str, Any]] = [] - all_old_logps: List[List[float]] = [] - all_completion_lengths: List[int] = [] - - for sequence in sample_response.sequences: - all_input_data.append(sequence.new_input_feature) - all_old_logps.append(sequence.logprobs) - all_completion_lengths.append(len(sequence.tokens)) - total_rewards, format_rewards, accuracy_rewards = compute_rewards( - all_input_data - ) - metrics.accumulate( - completion_lengths=all_completion_lengths, - rewards={ - 'total': total_rewards, - 'format': format_rewards, - 'accuracy': accuracy_rewards, - }, - ) - advantages = advantage_fn(total_rewards, num_generations=NUM_GENERATIONS, scale='group').tolist() - # Split completions into mini-batches and run one optim step per mini-batch. 
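- # Batch-size bookkeeping (recaps the env vars defined at the top of this
- # script; the concrete numbers below are just the default values, not requirements):
- #   completions per optimization cycle = BATCH_SIZE * NUM_GENERATIONS, e.g. 16 * 8 = 128
- #   mini-batches per cycle = completions / MINI_BATCH_SIZE, e.g. 128 / 16 = 8
- #   each mini-batch is consumed in per-device micro-batches of MICRO_BATCH_SIZE = 2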
- total_completions = len(all_input_data) - for mb_start in range(0, total_completions, MINI_BATCH_SIZE): - mb_end = min(mb_start + MINI_BATCH_SIZE, total_completions) - mb_inputs = all_input_data[mb_start:mb_end] - mb_old_logps = all_old_logps[mb_start:mb_end] - mb_advantages = advantages[mb_start:mb_end] - - model.forward_backward( - inputs=mb_inputs, - old_logps=mb_old_logps, - advantages=mb_advantages, - micro_batch_size=MICRO_BATCH_SIZE, - ) - model.clip_grad_and_step() - optim_step += 1 - - if optim_step >= MAX_STEPS: - break - log_dict = metrics.calculate() - log_dict.update(model.calculate_metric(is_training=True)) - metrics.reset() - print(f'[Step {optim_step}/{MAX_STEPS}] {log_dict}') - - print(f'Training completed. optim_steps={optim_step}') - model.save('grpo-gsm8k-checkpoint') - -if __name__ == '__main__': - main() -``` - -The code above provides an RL training example. It shows clearly how the data is constructed, how the sampler and model are declared and parameterized, and how the advantage and loss are built. -There is no explicit reference to `ray` anywhere in this process. We only declared Ray mode during initialization: - -```python -twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False) -``` - -Developers can customize the construction and invocation methods of components like models. All Transformers and Megatron model parameters can be passed in when constructing the model. - -All subsequent Ray calls and data distribution are performed implicitly. Running this script requires Ray to be installed beforehand. Then run it like this: - -```shell -python train.py -``` - -### Remote Training - -A major feature of Twinkle is support for multi-tenant mixed training. Specifically, multiple users can use a single base model for LoRA training, which can greatly reduce server-side deployment costs. - -Suppose we start a service using four GPUs.
First, we need to start the Ray cluster: - -```shell -CUDA_VISIBLE_DEVICES=0,1 ray start --head --port=6379 --num-gpus=2 -CUDA_VISIBLE_DEVICES=2,3 ray start --address=127.0.0.1:6379 --num-gpus=2 -CUDA_VISIBLE_DEVICES="" ray start --address=127.0.0.1:6379 --num-gpus=0 -``` - -We started a Ray cluster containing three nodes: -- GPUs 0 and 1 as one node -- GPUs 2 and 3 as one node -- CPU resources as one node - -For production environments, you can start more nodes and deploy more replicas to accommodate larger user volumes. Here we only use four GPUs as an example. - -Next, start the server: -```shell - -cd cookbook/client/twinkle/transformer -python server.py -``` - -The server will start three services: a sampler cluster, a model cluster, and a utility cluster. - -Now you can perform client-side training: -```python -import dotenv -dotenv.load_dotenv('.env') -import re -from twinkle.data_format import Trajectory -from twinkle.reward.base import Reward -import gc -from peft import LoraConfig -from typing import List, Tuple - -from twinkle import get_logger -from twinkle.advantage import GRPOAdvantage -from twinkle.dataset import DatasetMeta -from twinkle.metric import CompletionRewardMetric -from twinkle_client import init_twinkle_client -from twinkle_client.dataloader import DataLoader -from twinkle_client.dataset import Dataset -from twinkle_client.model import MultiLoraTransformersModel -from twinkle_client.sampler import vLLMSampler - -logger = get_logger() - -# ========== Configuration ========== -MODEL_ID = 'ms://Qwen/Qwen3-4B' -NUM_GENERATIONS = 4 -MAX_NEW_TOKENS = 1024 -LEARNING_RATE = 1e-5 -MAX_STEPS = 10 -BATCH_SIZE = 2 -TEMPERATURE = 1.0 -SYNC_INTERVAL = 1 # Save weights for sampler every N steps -GRADIENT_ACCUMULATION_STEPS = 4 - - -def create_countdown_dataset(): - """Create Countdown Game dataset for GRPO training.""" - - dataset = Dataset(dataset_meta=DatasetMeta('ms://zouxuhong/Countdown-Tasks-3to4', data_slice=range(500))) - 
dataset.set_template('Template', model_id=MODEL_ID, max_length=8192) - dataset.map('CountdownProcessor') - dataset.encode(add_generation_prompt=True, batched=True) - return dataset - - -class CountDownAccuracy(Reward): - - @staticmethod - def countdown_accuracy_reward(completion: str, target: int, nums: List[int]) -> float: - """Accuracy reward: checks whether the equation is correct.""" - try: - # Extract the equation between the <answer> tags - match = re.search(r'<answer>(.*?)</answer>', completion) - if match is None: - return 0.0 - equation = match.group(1).strip() - if '=' in equation: - equation = equation.split('=')[0] - used_numbers = [int(n) for n in re.findall(r'\d+', equation)] - if sorted(used_numbers) != sorted(nums): - return 0.0 - if not re.match(r'^[\d+\-*/().\s]+$', equation): - return 0.0 - result = eval(equation, {'__builtins__': None}, {}) - return 1.0 if abs(float(result) - float(target)) < 1e-5 else 0.0 - except Exception: # noqa - return 0.0 - - def __call__(self, trajectories: List[Trajectory], ground_truths: List[Trajectory]): - rewards = [] - for trajectory in trajectories: - messages = trajectory.get('messages', []) - completion = '' - for msg in reversed(messages): - if msg.get('role') == 'assistant': - completion = msg.get('content', '') - break - user_data = trajectory.get('user_data', [{}]) - data = user_data[0] if isinstance(user_data, list) and user_data else {} - target = data.get('target', 0) - nums = data.get('nums', []) - acc_reward = self.countdown_accuracy_reward(completion, target, nums) - rewards.append(acc_reward) - return rewards - - -def compute_rewards(trajectories: List[dict]) -> Tuple[List[float], List[float], List[float]]: - """Compute format and accuracy rewards for the Countdown game.""" - from twinkle.reward import FormatReward - format_rewards = FormatReward()(trajectories, []) - accuracy_rewards = CountDownAccuracy()(trajectories, []) - total_rewards = [a + b for a, b in zip(accuracy_rewards, format_rewards)] - return total_rewards, format_rewards, accuracy_rewards - - -def 
train(): - # Step 1: Initialize the Twinkle client - client = init_twinkle_client( - base_url='http://localhost:8000', - api_key='', - ) - - # Step 2: Prepare dataset and dataloader - dataset = create_countdown_dataset() - dataloader = DataLoader(dataset=dataset, batch_size=BATCH_SIZE) - - # Step 3: Configure the training model - model = MultiLoraTransformersModel(model_id=MODEL_ID) - - lora_config = LoraConfig( - target_modules='all-linear', - r=8, - lora_alpha=32, - lora_dropout=0.05, - ) - model.add_adapter_to_model( - 'default', - lora_config, - gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS, - ) - - # Set GRPO loss (the key difference from SFT training) - model.set_loss('GRPOLoss', epsilon=0.2, beta=0.0) - - # Set optimizer and LR scheduler - model.set_optimizer('AdamW', lr=LEARNING_RATE) - model.set_lr_scheduler( - 'CosineWarmupScheduler', - num_warmup_steps=500, - num_training_steps=MAX_STEPS, - ) - - # Set processor and template for encoding inputs - model.set_processor('InputProcessor') - model.set_template('Template', model_id=MODEL_ID) - - # Step 4: Configure the sampler - sampler = vLLMSampler(model_id=MODEL_ID) - sampler.set_template('Template', model_id=MODEL_ID) - - # Step 5: Setup metrics and advantage function - advantage_fn = GRPOAdvantage() - metrics = CompletionRewardMetric() - - sampling_params = { - 'max_tokens': MAX_NEW_TOKENS, - 'temperature': TEMPERATURE, - 'top_p': 0.95, - } - - # Track the current adapter path for sampling - current_adapter_uri = None - - step = 0 - for batch in dataloader: - if step >= MAX_STEPS: - break - - metrics.reset() - prompts = batch if isinstance(batch, list) else [batch] - - # ========== 1. 
Save weights and update adapter_uri ========== - # Instead of sync_weights, save the model checkpoint and pass - # the resulting path to the sampler as adapter_uri - if step % SYNC_INTERVAL == 0: - logger.info(f'Step {step}: Saving weights for sampler...') - twinkle_path = model.save( - name=f'grpo-sampler-step-{step}', - save_optimizer=False, - ) - current_adapter_uri = twinkle_path - logger.info(f'Step {step}: Saved weights to {current_adapter_uri}') - - # ========== 2. Sample completions ========== - sample_response = sampler.sample( - inputs=prompts, - sampling_params=sampling_params, - adapter_uri=current_adapter_uri, - num_samples=NUM_GENERATIONS, - ) - - input_features = [] - old_logps_list = [] - completion_lengths = [] - - sequences = sample_response.get('sequences', []) - for seq in sequences: - input_features.append(seq.get('new_input_feature', seq)) - old_logps_list.append(seq.get('logprobs', [])) - completion_lengths.append(len(seq.get('tokens', []))) - - if not input_features: - logger.warning(f'Step {step}: No valid samples, skipping') - step += 1 - continue - - # ========== 3. Compute rewards ========== - total_rewards, format_rewards, accuracy_rewards = compute_rewards(input_features) - metrics.accumulate( - None, - None, - completion_lengths=completion_lengths, - rewards={ - 'total': total_rewards, - 'format': format_rewards, - 'accuracy': accuracy_rewards, - }) - - # ========== 4. Compute advantages ========== - advantages = advantage_fn( - total_rewards, - num_generations=NUM_GENERATIONS, - scale='group', - ).tolist() - - frac_zero_std = (1.0 if all(abs(a) < 1e-8 for a in advantages) else 0.0) - if frac_zero_std == 1.0: - logger.info(f'Step {step}: All advantages are zero, skipping training') - step += 1 - continue - - # ========== 5. 
Training step (GRPO) ========== - # forward_backward with GRPO loss: passes advantages and old_logps - # to the server-side GRPOLoss for proper policy optimization - model.forward_backward( - inputs=input_features, - advantages=advantages, - old_logps=old_logps_list, - ) - - # Gradient clipping and optimizer step - model.clip_grad_norm(1.0) - model.step() - model.zero_grad() - model.lr_step() - - gc.collect() - - # ========== 6. Log ========== - log_dict = metrics.calculate() - log_dict.update(model.calculate_metric()) - log_dict['train/frac_reward_zero_std'] = frac_zero_std - logger.info(f'Step {step}: {log_dict}') - step += 1 - - # Save final checkpoint - twinkle_path = model.save(name='grpo-countdown-final', save_optimizer=True) - logger.info(f'Saved final checkpoint: {twinkle_path}') - - -if __name__ == '__main__': - train() -``` - -Multiple developers can use a single base model from this service for parallel training and sampling. Furthermore, the training methods they use are allowed to differ. For example, User A can perform SFT, User B can perform RL, and User C can perform sampling. Similarly, Twinkle also supports Tinker-like APIs for remote training: - ->[!Note] -> One important note: in the current Twinkle implementation, the client-side Twinkle API and Tinker API cannot be used simultaneously on the same server. When you need to provide the Tinker API, you need to start the service under cookbook/client/tinker. -> This issue will be addressed with high priority in upcoming iterations. 
- -```python -from tinker import ServiceClient, types -from tqdm import tqdm -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.preprocessor import SelfCognitionProcessor -from twinkle.server.tinker.common import input_feature_to_datum - -# The base model to fine-tune / evaluate -base_model = 'ms://Qwen/Qwen3-4B' - - -def train(): - # Step 1: Prepare the dataset - - # Load the self-cognition dataset from ModelScope (first 500 examples) - dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500))) - - # Apply the chat template matching the base model (max 256 tokens per sample); - # base_model already carries the ms:// prefix - dataset.set_template('Template', model_id=base_model, max_length=256) - - # Replace placeholder names with custom model/author identity - dataset.map(SelfCognitionProcessor('twinkle model', 'twinkle team'), load_from_cache_file=False) - - # Tokenize and encode the dataset into model-ready input features - dataset.encode(batched=True, load_from_cache_file=False) - - # Wrap the dataset into a DataLoader that yields batches of size 8 - dataloader = DataLoader(dataset=dataset, batch_size=8) - - # Step 2: Initialize the training client - # Connect to the Twinkle server running locally - service_client = ServiceClient(base_url='http://localhost:8000', api_key='your-api-key') - # Create a LoRA training client for the base model (rank=16 for the LoRA adapter) - training_client = service_client.create_lora_training_client(base_model=base_model, rank=16) - - # Step 3: Run the training loop - for epoch in range(3): - print(f'Epoch {epoch}') - for step, batch in tqdm(enumerate(dataloader)): - # Convert each InputFeature into a Datum for the Tinker API - input_datum = [input_feature_to_datum(input_feature) for input_feature in batch] - - # Send data to server: forward + backward pass (computes gradients) - fwdbwd_future = training_client.forward_backward(input_datum, 'cross_entropy') 
- - # Optimizer step: update model weights with Adam - optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4)) - - # Wait for both operations to complete - fwdbwd_future.result() - optim_result = optim_future.result() - print(f'Training Metrics: {optim_result}') - - # Save a checkpoint after each epoch - save_future = training_client.save_state(f'twinkle-lora-{epoch}') - save_result = save_future.result() - print(f'Saved checkpoint to {save_result.path}') - - -if __name__ == '__main__': - train() -``` - -### Using ModelScope Community's TaaS Training Service - -Concurrent with the open-source release of the Twinkle framework, we also provide a hosted Training-as-a-Service (TaaS) offering powered by ModelScope's backend services. Developers can try Twinkle's training API for free through this service. -This service shares the same code as the Tinker API section described above. The only difference is that the endpoint and token must be replaced with the official ModelScope values. For details on how to use the official service, please refer to the detailed description in [Training Service](./Train-as-a-Service.md). - -## Using Hugging Face models - -To use a model hosted on Hugging Face instead of ModelScope, switch the URI prefix from `ms://` to `hf://`: - -```text -ms://Qwen/Qwen3-4B -> hf://Qwen/Qwen3-4B -``` - -## 🛠️ Twinkle✨ Modular Ecosystem - -
| Component | Description |
| ---------------- | ------------------------------------------ |
| Dataset | Data loading and preprocessing |
| Template | Encoding and decoding |
| DataLoader | Data distribution and batching |
| Preprocessor | Data ETL |
| InputProcessor | Task-specific input processing |
| Model | Large models, supports multiple frameworks |
| Sampler | Sampler logic |
| Loss | Loss functions |
| Metric | Training metrics collection |
| Reward | Reward function |
| Advantage | Advantage function |
| CheckpointEngine | Weight synchronization |
| Patch | Patches for model fixes |
| Module | Components, e.g., Optimizer |
| Kernel | Operators |
| Server | Start backend cluster |
| Client | Client code |
| Infra | Isolate ray and torchrun differences |
| Plugin | Use hub components |
| Hub | Interface with HF/MS libraries |
- ## Twinkle's Customizable Components In Twinkle's design, training using torchrun, Ray, and HTTP uses the same API and shares the same components and input/output structures. Therefore, many of its components can be customized by developers to implement new algorithm development. Below is a list of recommended components for customization: -| Component Name | Base Class | Description | -| --------------------- | ------------------------------------------ | -------------------------------------------------------------- | -| Loss | twinkle.loss.Loss | Used to define loss functions for model training | -| Metric | twinkle.metric.Metric | Used to define evaluation systems for model training | -| Optimizer/LRScheduler | Based on PyTorch | Used to define optimizers and LR schedulers for model training | -| Patch | twinkle.patch.Patch | Used to fix issues during model training | -| Preprocessor | twinkle.preprocessor.Preprocessor | Used for data preprocessing (ETL) and returns standard format usable by Template | -| Filter | twinkle.preprocessor.Filter | Used to filter raw data for reasonableness | -| Task Data Processor | twinkle.processor.InputProcessor | Used to convert model inputs to data required by each task and add extra fields | -| Model | twinkle.model.TwinkleModel | The large model itself | -| Sampler | twinkle.sampler.Sampler | Sampler, e.g., vLLM | -| Reward | twinkle.reward.Reward | Used to implement rewards for different RL training | -| Advantage | twinkle.advantage.Advantage | Used to implement advantage estimation for different RL training | -| Template | twinkle.template.Template | Used to process standard inputs and convert them to tokens required by the model | -| Weight Synchronization | twinkle.checkpoint_engine.CheckpointEngine | Used for weight synchronization in RL training | +| Component Name | Base Class | Description | +| ---------------------- | ------------------------------------------ | 
-------------------------------------------------------------------------------- | +| Loss | twinkle.loss.Loss | Used to define loss functions for model training | +| Metric | twinkle.metric.Metric | Used to define evaluation systems for model training | +| Optimizer/LRScheduler | Based on PyTorch | Used to define optimizers and LR schedulers for model training | +| Patch | twinkle.patch.Patch | Used to fix issues during model training | +| Preprocessor | twinkle.preprocessor.Preprocessor | Used for data preprocessing (ETL) and returns standard format usable by Template | +| Filter | twinkle.preprocessor.Filter | Used to filter raw data for reasonableness | +| Task Data Processor | twinkle.processor.InputProcessor | Used to convert model inputs to data required by each task and add extra fields | +| Model | twinkle.model.TwinkleModel | The large model itself | +| Sampler | twinkle.sampler.Sampler | Sampler, e.g., vLLM | +| Reward | twinkle.reward.Reward | Used to implement rewards for different RL training | +| Advantage | twinkle.advantage.Advantage | Used to implement advantage estimation for different RL training | +| Template | twinkle.template.Template | Used to process standard inputs and convert them to tokens required by the model | +| Weight Synchronization | twinkle.checkpoint_engine.CheckpointEngine | Used for weight synchronization in RL training | > Components not listed in the above table, such as Dataset, DataLoader, etc., can also be customized, just follow the base class API design. 
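As an illustration of the customization pattern, here is a minimal sketch of a custom reward. The `Reward` class below is only a stand-in mimicking the `twinkle.reward.Reward` call interface used in the Countdown example above, and `LengthPenaltyReward` is a hypothetical component for illustration, not part of Twinkle:

```python
from typing import Any, Dict, List


class Reward:
    """Stand-in for twinkle.reward.Reward (assumed interface: callable over trajectories)."""

    def __call__(self, trajectories: List[Dict[str, Any]], ground_truths: List[Any]) -> List[float]:
        raise NotImplementedError


class LengthPenaltyReward(Reward):
    """Hypothetical reward: full score within a character budget, decaying beyond it."""

    def __init__(self, max_chars: int = 512):
        self.max_chars = max_chars

    def __call__(self, trajectories: List[Dict[str, Any]], ground_truths: List[Any]) -> List[float]:
        rewards = []
        for trajectory in trajectories:
            # Take the last assistant message as the completion, as in the Countdown example
            completion = ''
            for msg in reversed(trajectory.get('messages', [])):
                if msg.get('role') == 'assistant':
                    completion = msg.get('content', '')
                    break
            over = max(0, len(completion) - self.max_chars)
            # 1.0 inside the budget, decaying linearly to 0.0 at twice the budget
            rewards.append(max(0.0, 1.0 - over / self.max_chars))
        return rewards
```

A class like this can then be combined with other rewards inside a `compute_rewards` helper, exactly as the accuracy and format rewards are combined in the GRPO examples above.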
@@ -874,10 +77,10 @@ DeviceGroup: Define how many resource groups are needed for this training sessio ```python from twinkle.model import TransformersModel -model = TransformersModel(model_id='Qwen/Qwen3-4B', remote_group='default', device_mesh=device_mesh) +model = TransformersModel(model_id='ms://Qwen/Qwen2.5-7B-Instruct', remote_group='default', device_mesh=device_mesh) # Or from twinkle.model import MegatronModel -model = MegatronModel(model_id='Qwen/Qwen3-4B', remote_group='default', device_mesh=device_mesh) +model = MegatronModel(model_id='ms://Qwen/Qwen2.5-7B-Instruct', remote_group='default', device_mesh=device_mesh) ``` DeviceMesh specifies the topology of components like models within the resource group. It can be understood as how to perform parallelization. This affects a series of framework decisions, such as data acquisition, data consumption, data return, etc. @@ -903,7 +106,7 @@ def train(): # 1000 samples dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000))) # Set template to prepare encoding - dataset.set_template('Template', model_id='Qwen/Qwen3-4B') + dataset.set_template('Template', model_id='ms://Qwen/Qwen2.5-7B-Instruct') # Preprocess the dataset to standard format dataset.map(SelfCognitionProcessor('twinkle LLM', 'ModelScope Community')) # Encode dataset @@ -911,7 +114,7 @@ def train(): # Global batch size = 8 across 8 GPUs, so 1 sample per GPU dataloader = DataLoader(dataset=dataset, batch_size=8, min_batch_size=8) # Use a TransformersModel - model = TransformersModel(model_id='Qwen/Qwen3-4B', remote_group='default') + model = TransformersModel(model_id='ms://Qwen/Qwen2.5-7B-Instruct', remote_group='default') lora_config = LoraConfig( r=8, @@ -951,54 +154,26 @@ python3 train.py ## Supported Large Language Models List -| Model Type | Model ID Example | Requires | Support Megatron | HF Model ID | 
-|---------------------|-------------------------------------------------------------------------------------------------------------------------------|----------------------|------------------|---------------------------------------------------------------------------------------------------------------| -| qwen2 series | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) | -| | [Qwen/Qwen2-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-72B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) | -| | [Qwen/Qwen2-1.5B](https://modelscope.cn/models/Qwen/Qwen2-1.5B) | transformers>=4.37 | ✔ | [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) | -| | [Qwen/Qwen2-7B](https://modelscope.cn/models/Qwen/Qwen2-7B) | transformers>=4.37 | ✔ | [Qwen/Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B) | -| | [Qwen/Qwen2-72B](https://modelscope.cn/models/Qwen/Qwen2-72B) | transformers>=4.37 | ✔ | [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B) | -| | [Qwen/Qwen2.5-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) | -| | [Qwen/Qwen2.5-1.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-1.5B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) | -| | [Qwen/Qwen2.5-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-72B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) | -| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) | -| | [Qwen/Qwen2.5-32B](https://modelscope.cn/models/Qwen/Qwen2.5-32B) | 
transformers>=4.37 | ✔ | [Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) | -| qwen2_moe series | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) | -| | [Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B) | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) | -| qwen3 series | [Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base) | transformers>=4.51 | ✔ | [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) | -| | [Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base) | transformers>=4.51 | ✔ | [Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base) | -| | [Qwen/Qwen3-0.6B](https://modelscope.cn/models/Qwen/Qwen3-0.6B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) | -| | [Qwen/Qwen3-1.7B](https://modelscope.cn/models/Qwen/Qwen3-1.7B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) | -| | [Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | -| qwen3_moe series | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) | -| | [Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | -| | [Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | -| chatglm2 series | [ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b) | transformers<4.42 | ✘ | 
[zai-org/chatglm2-6b](https://huggingface.co/zai-org/chatglm2-6b) | -| | [ZhipuAI/chatglm2-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm2-6b-32k) | transformers<4.42 | ✘ | [zai-org/chatglm2-6b-32k](https://huggingface.co/zai-org/chatglm2-6b-32k) | -| chatglm3 series | [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b](https://huggingface.co/zai-org/chatglm3-6b) | -| | [ZhipuAI/chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b-base](https://huggingface.co/zai-org/chatglm3-6b-base) | -| | [ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b-32k](https://huggingface.co/zai-org/chatglm3-6b-32k) | -| | [ZhipuAI/chatglm3-6b-128k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b-128k](https://huggingface.co/zai-org/chatglm3-6b-128k) | -| chatglm4 series | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) | -| | [ZhipuAI/glm-4-9b](https://modelscope.cn/models/ZhipuAI/glm-4-9b) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b](https://huggingface.co/zai-org/glm-4-9b) | -| | [ZhipuAI/glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat-1m](https://huggingface.co/zai-org/glm-4-9b-chat-1m) | -| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) | -| glm_edge series | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) | -| | 
[ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat) | -| internlm2 series | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) | -| | [Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft) | transformers>=4.38 | ✘ | [internlm/internlm2-chat-1_8b-sft](https://huggingface.co/internlm/internlm2-chat-1_8b-sft) | -| | [Shanghai_AI_Laboratory/internlm2-base-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-base-7b](https://huggingface.co/internlm/internlm2-base-7b) | -| | [Shanghai_AI_Laboratory/internlm2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b) | -| | [Shanghai_AI_Laboratory/internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) | -| deepseek_v1 | [deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat) | transformers>=4.39.4 | ✔ | | -| | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) | -| | [deepseek-ai/DeepSeek-V2-Lite-Chat](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite-Chat) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat) | -| | 
[deepseek-ai/DeepSeek-V2](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) | -| | [deepseek-ai/DeepSeek-V2-Chat](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Chat) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat) | -| | [deepseek-ai/DeepSeek-V2.5](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2.5) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) | -| | [deepseek-ai/DeepSeek-Prover-V2-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-7B) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-Prover-V2-7B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B) | -| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) | -| deepseek_r1_distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | -| | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | -| | [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) | -| | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | transformers>=4.37 | ✔ | 
[deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) |
+| Model Type | Model ID Example | Model Size | Requires | Megatron Support | HF Model ID |
+| ------------------- | ------------------------------------------------------------------------------------------------------------ | :-------------------------------------: | -------------------- | :--------------: | :----------------------------------------------------------------------------------------------------: |
+| qwen2 series | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |
+| | [Qwen/Qwen2-1.5B](https://modelscope.cn/models/Qwen/Qwen2-1.5B) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) |
+| | [Qwen/Qwen2.5-1.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-1.5B-Instruct) | 0.5B/1.5B/3B/7B/14B/32B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
+| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B) | 0.5B/1.5B/3B/7B/14B/32B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
+| qwen2_moe series | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) |
+| | [Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) |
+| qwen3 series | [Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base) | 0.6B/1.7B/4B/8B/14B | transformers>=4.51 | ✔ | [Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base) |
+| | 
[Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B) | 0.6B/1.7B/4B/8B/14B/32B | transformers>=4.51 | ✔ | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) | +| qwen3_moe series | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | - | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) | +| | [Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B) | - | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) | +| | [Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B) | - | transformers>=4.51 | ✔ | [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) | +| chatglm2 series | [ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b) | 6b/6b-32k | transformers<4.42 | ✘ | [zai-org/chatglm2-6b](https://huggingface.co/zai-org/chatglm2-6b) | +| chatglm3 series | [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) | 6b/6b-base/6b-32k/6b-128k | transformers<4.42 | ✘ | [zai-org/chatglm3-6b](https://huggingface.co/zai-org/chatglm3-6b) | +| chatglm4 series | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | glm-4-9b/glm-4-9b-chat/glm-4-9b-chat-1m | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) | +| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | - | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) | +| glm_edge series | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | 1.5b-chat/4b-chat | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) | +| internlm2 series | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | 1_8b/chat-1_8b-sft/base-7b/7b/chat-7b/ | 
transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) |
+| deepseek_v1 | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | V2/V2-Lite/V2-Chat/V2-Lite-Chat/V2.5 | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
+| | [deepseek-ai/DeepSeek-Prover-V2-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-7B) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-Prover-V2-7B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B) |
+| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
+| deepseek_r1_distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 1.5B/7B/14B/32B | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |
diff --git "a/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/\345\277\253\351\200\237\345\274\200\345\247\213.md" "b/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/\345\277\253\351\200\237\345\274\200\345\247\213.md"
index fa1da56d..95a5add9 100644
--- "a/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/\345\277\253\351\200\237\345\274\200\345\247\213.md"
+++ "b/docs/source_zh/\344\275\277\347\224\250\346\214\207\345\274\225/\345\277\253\351\200\237\345\274\200\345\247\213.md"
@@ -4,10 +4,10 @@
 大模型训练组件库。基于 PyTorch,更简洁、更灵活、生产就绪。
-🧩 松耦合架构 · 标准化接口<br>
-🚀 多运行模式 · torchrun / Ray / HTTP
-🔌 多框架兼容 · Transformers / Megatron
-👥 多租户支持 · 单基座模型部署
+🧩 `松耦合架构` · 标准化接口<br>
+🚀 `多运行模式` · torchrun / Ray / HTTP<br>
+🔌 `多框架兼容` · Transformers / Megatron<br>
` +👥 ``多租户支持 `` · 单基座模型部署 ## Twinkle 适配性 @@ -28,826 +28,27 @@ Twinkle 和 [ms-swift](https://github.com/modelscope/ms-swift) 都是模型训 - 如果你需要推理、部署、量化等其他能力 - 如果你对新模型的训练支持敏感,Swift 会保证 day-0 的更新能力 -## 使用模式 - -### 仅使用部分组件 - -开发者可以仅使用Twinkle的一部分组件,结合自己的已有代码来完成训练工作。例如,仅使用Dataset&DataLoader: - -```python -from twinkle.dataset import PackingDataset, DatasetMeta -from twinkle.dataloader import DataLoader -from twinkle.preprocessor import SelfCognitionProcessor - -def train(): - dataset_meta = DatasetMeta( - dataset_id='ms://swift/self-cognition', - ) - - dataset = PackingDataset(dataset_meta) - dataset.map(SelfCognitionProcessor(model_name='Twinkle模型', model_author='ModelScope社区')) - dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B', max_length=512) - dataset.encode() - dataset.pack_dataset() - - dataloader = DataLoader(dataset, batch_size=8) - for data in dataloader: - print(data) - """ - { - "input_ids": [...], - "position_ids": [...], - ... - } - """ - break - -if __name__ == '__main__': - train() -``` -上面的代码中,使用PackingDataset加载了一个叫做`swift/self-cognition`的数据集。PackingDataset可以用于将数据进行装箱,保证每个batch的长度都与设置的最大长度相似。 -我们在循环中简单地使用了print打印了输出,在实际使用中,你可以在下面继续编写你的自定义训练代码。 - -Twinkle的所有组件都支持单独拆分使用,可以参考下面章节的组件列表。 - -### 单GPU - -Twinkle支持单GPU运行训练。下面是一个例子: - -```python -from peft import LoraConfig - -from twinkle import get_device_placement, get_logger -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.model import TransformersModel -from twinkle.preprocessor import SelfCognitionProcessor - -logger = get_logger() - - -def train(): - # 1000 samples - dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000))) - # Set template to prepare encoding - dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B') - # Preprocess the dataset to standard format - dataset.map(SelfCognitionProcessor('twinkle大模型', 'ModelScope社区')) - # Encode dataset - dataset.encode() - # Global batch size = 8, 
for GPUs, so 1 sample per GPU - dataloader = DataLoader(dataset=dataset, batch_size=8) - # Use a TransformersModel - model = TransformersModel(model_id='ms://Qwen/Qwen3-4B') - - lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear') - - # Add a lora to model, with name `default` - # Comment this to use full-parameter training - model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2) - # Add Optimizer for lora `default` - model.set_optimizer(optimizer_cls='AdamW', lr=1e-4) - # Add LRScheduler for lora `default` - model.set_lr_scheduler( - scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader)) - logger.info(get_device_placement()) - # Print the training config - logger.info(model.get_train_configs()) - logger.info(f'Total steps: {len(dataloader)}') - for step, batch in enumerate(dataloader): - # Do forward and backward - model.forward_backward(inputs=batch) - # Step - model.clip_grad_and_step() - if step % 20 == 0: - # Print metric - metric = model.calculate_metric(is_training=True) - logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}') - model.save(f'last-checkpoint') - - -if __name__ == '__main__': - train() - -``` - -在这个训练代码中,我们构造了一个数据集并拉起了Qwen/Qwen3-4B模型,使用all-linear方式加载了lora,并完成了一次训练。在日志中,可以看到loss逐步收敛的过程。 - -### torchrun - -Twinkle支持以torchrun模式运行训练。在这种场景下,不需要安装ray相关的依赖。 - -```python -from peft import LoraConfig - -import twinkle -from twinkle import DeviceMesh, get_device_placement, get_logger -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.model import TransformersModel -from twinkle.preprocessor import SelfCognitionProcessor - -# Construct a device_mesh, fsdp=4, dp=2 -device_mesh = DeviceMesh.from_sizes(fsdp_size=4, dp_size=2) -# use torchrun mode -twinkle.initialize(mode='local', global_device_mesh=device_mesh) - -logger = get_logger() - - -def train(): - # 1000 samples - dataset = 
Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000))) - # Set template to prepare encoding - dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B') - # Preprocess the dataset to standard format - dataset.map(SelfCognitionProcessor('twinkle大模型', 'ModelScope社区')) - # Encode dataset - dataset.encode() - # Global batch size = 8, for GPUs, so 1 sample per GPU - dataloader = DataLoader(dataset=dataset, batch_size=8) - # Use a TransformersModel - model = TransformersModel(model_id='ms://Qwen/Qwen3-4B') - - lora_config = LoraConfig(r=8, lora_alpha=32, target_modules='all-linear') - - # Add a lora to model, with name `default` - # Comment this to use full-parameter training - model.add_adapter_to_model('default', lora_config, gradient_accumulation_steps=2) - # Add Optimizer for lora `default` - model.set_optimizer(optimizer_cls='AdamW', lr=1e-4) - # Add LRScheduler for lora `default` - model.set_lr_scheduler( - scheduler_cls='CosineWarmupScheduler', num_warmup_steps=5, num_training_steps=len(dataloader)) - logger.info(get_device_placement()) - # Print the training config - logger.info(model.get_train_configs()) - logger.info(f'Total steps: {len(dataloader)}') - for step, batch in enumerate(dataloader): - # Do forward and backward - model.forward_backward(inputs=batch) - # Step - model.clip_grad_and_step() - if step % 20 == 0: - # Print metric - metric = model.calculate_metric(is_training=True) - logger.info(f'Current is step {step} of {len(dataloader)}, metric: {metric}') - model.save(f'last-checkpoint') - - -if __name__ == '__main__': - train() -``` - -上面的代码中,构造了fsdp2和dp的hybrid并行模式,并使用了八张卡进行训练。可以看到它和单卡训练的代码基本相同,只是使用了`DeviceMesh`来声明模型布局。 - -运行时,需要这样拉起训练: - -```shell -CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 torchrun --nproc_per_node=8 train.py -``` - -### Ray训练 - -[Ray](https://github.com/ray-project/ray)是多机模型训练和推理场景中常用的调度中间件框架。它针对多模型、多设备的执行和资源管理进行了额外优化, -并支持对接kubernetes系统进行生产化。这样的特性使得它尤其适用于RL、GKD等复杂训练场景中。 - 
-Twinkle支持使用ray进行训练和采样,并且它的代码和上面的训练API几乎一致: - -```python -import os -from typing import List, Tuple, Dict, Any -from peft import LoraConfig -import twinkle -from twinkle import DeviceMesh, DeviceGroup, get_device_placement -from twinkle.advantage import GRPOAdvantage -from twinkle.checkpoint_engine import CheckpointEngineManager -from twinkle.data_format import SamplingParams -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.model.megatron import MegatronModel -from twinkle.metric import CompletionRewardMetric -from twinkle.preprocessor.llm import GSM8KProcessor -from twinkle.processor import InputProcessor -from twinkle.reward import GSM8KAccuracyReward, GSM8KFormatReward -from twinkle.sampler import vLLMSampler -from twinkle.template import Template - -MODEL_ID = os.environ.get('MODEL_ID', 'ms://Qwen/Qwen3-4B') -MODEL_GPUS = int(os.environ.get('MODEL_GPUS', 4)) -SAMPLER_GPUS = int(os.environ.get('SAMPLER_GPUS',4)) -NUM_GPUS = MODEL_GPUS + SAMPLER_GPUS -NUM_GENERATIONS = int(os.environ.get('NUM_GENERATIONS', 8)) -MAX_NEW_TOKENS = int(os.environ.get('MAX_NEW_TOKENS', 4096)) -LEARNING_RATE = float(os.environ.get('LR', 1e-5)) -MAX_STEPS = int(os.environ.get('MAX_STEPS', 200)) -BATCH_SIZE = int(os.environ.get('BATCH_SIZE', 16)) # global prompt-level, global completion-level batch size = BATCH_SIZE * num_generations * dp_size -MINI_BATCH_SIZE = int(os.environ.get('MINI_BATCH_SIZE', 16)) # global completion-level mini-batch-size -MICRO_BATCH_SIZE = int(os.environ.get('MICRO_BATCH_SIZE', 2)) # per-device-micro-batch-size (completion-level), batch_size in forward_backward -GRADIENT_ACCUMULATION_STEPS = int(os.environ.get('GRADIENT_ACCUMULATION_STEPS', 1)) -ADAPTER_NAME = 'default' - -def create_gsm8k_dataset(): - dataset = Dataset(DatasetMeta('ms://modelscope/gsm8k', subset_name='main', split='train')) - dataset.set_template('Template', model_id=MODEL_ID, max_length=2048) - dataset.map(GSM8KProcessor()) - 
dataset.encode(add_generation_prompt=True) - return dataset - -def compute_rewards( - trajectories: List[Dict[str, Any]], -) -> Tuple[List[float], List[float], List[float]]: - accuracy_reward_fn = GSM8KAccuracyReward() - format_reward_fn = GSM8KFormatReward() - accuracy_rewards = accuracy_reward_fn(trajectories) - format_rewards = format_reward_fn(trajectories) - total_rewards = [a + f for a, f in zip(accuracy_rewards, format_rewards)] - return total_rewards, format_rewards, accuracy_rewards - -def main(): - # set sampler and model separate to use different gpus - device_groups = [ - DeviceGroup(name='model',ranks=list(range(MODEL_GPUS)),device_type='GPU'), - DeviceGroup(name='sampler',ranks=list(range(MODEL_GPUS, NUM_GPUS)),device_type='GPU'), - ] - model_mesh = DeviceMesh.from_sizes(world_size=MODEL_GPUS, dp_size=MODEL_GPUS) - sampler_mesh = DeviceMesh.from_sizes(world_size=SAMPLER_GPUS, dp_size=SAMPLER_GPUS) - twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False) - - lora_config = LoraConfig(target_modules='all-linear', r=32, lora_alpha=64, lora_dropout=0.05) - model = MegatronModel(model_id=MODEL_ID, device_mesh=model_mesh, remote_group='model', mixed_precision='bf16') - model.add_adapter_to_model(ADAPTER_NAME, lora_config, gradient_accumulation_steps=1) - model.set_optimizer('default', lr=LEARNING_RATE) - model.set_lr_scheduler('default', lr_decay_steps=MAX_STEPS, max_lr=LEARNING_RATE) - model.set_loss('GRPOLoss', epsilon=0.2) - model.set_processor(InputProcessor) - model.set_template('Template', model_id=MODEL_ID) - - sampler = vLLMSampler( - model_id=MODEL_ID, - engine_args={ - 'gpu_memory_utilization': 0.8, - 'max_model_len': 4096, - 'max_lora_rank': 32, # save as lora_config - 'enable_lora': True, - }, - device_mesh=sampler_mesh, - remote_group='sampler', - ) - sampler.set_template(Template, model_id=MODEL_ID) - ckpt_manager = CheckpointEngineManager(model=model, sampler=sampler) - dataloader = DataLoader( - 
dataset=create_gsm8k_dataset, - batch_size=BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS, - min_batch_size=BATCH_SIZE * GRADIENT_ACCUMULATION_STEPS, - device_mesh=model_mesh, - remote_group='model', - ) - advantage_fn = GRPOAdvantage() - metrics = CompletionRewardMetric() - sampling_params = SamplingParams(max_tokens=MAX_NEW_TOKENS) - optim_step = 0 - print(get_device_placement()) - - for batch in dataloader: - if optim_step >= MAX_STEPS: - break - metrics.reset() - global_prompts = batch if isinstance(batch, list) else [batch] - ckpt_manager.sync_weights(merge_and_sync=False) - sampler.reset_prefix_cache() - sample_response = sampler.sample( - global_prompts*NUM_GENERATIONS, - sampling_params, - num_samples=1, - ) - all_input_data: List[Dict[str, Any]] = [] - all_old_logps: List[List[float]] = [] - all_completion_lengths: List[int] = [] - - for sequence in sample_response.sequences: - all_input_data.append(sequence.new_input_feature) - all_old_logps.append(sequence.logprobs) - all_completion_lengths.append(len(sequence.tokens)) - total_rewards, format_rewards, accuracy_rewards = compute_rewards( - all_input_data - ) - metrics.accumulate( - completion_lengths=all_completion_lengths, - rewards={ - 'total': total_rewards, - 'format': format_rewards, - 'accuracy': accuracy_rewards, - }, - ) - advantages = advantage_fn(total_rewards, num_generations=NUM_GENERATIONS, scale='group').tolist() - # Split completions into mini-batches and run one optim step per mini-batch. 
- total_completions = len(all_input_data) - for mb_start in range(0, total_completions, MINI_BATCH_SIZE): - mb_end = min(mb_start + MINI_BATCH_SIZE, total_completions) - mb_inputs = all_input_data[mb_start:mb_end] - mb_old_logps = all_old_logps[mb_start:mb_end] - mb_advantages = advantages[mb_start:mb_end] - - model.forward_backward( - inputs=mb_inputs, - old_logps=mb_old_logps, - advantages=mb_advantages, - micro_batch_size=MICRO_BATCH_SIZE, - ) - model.clip_grad_and_step() - optim_step += 1 - - if optim_step >= MAX_STEPS: - break - log_dict = metrics.calculate() - log_dict.update(model.calculate_metric(is_training=True)) - metrics.reset() - print(f'[Step {optim_step}/{MAX_STEPS}] {log_dict}') - - print(f'Training completed. optim_steps={optim_step}') - model.save('grpo-gsm8k-checkpoint') - -if __name__ == '__main__': - main() -``` - -在上面的代码中,我们给出了一个RL的训练代码。我们可以在代码中清晰看到数据如何构造、sampler/model如何声明和传参,以及advantage和loss的构造过程。 -这个过程没有任何显示引用`ray`的地方。我们仅在初始化时声明了ray模式: - -```python -twinkle.initialize(mode='ray', nproc_per_node=NUM_GPUS, groups=device_groups, lazy_collect=False) -``` - -开发者可以定制模型等组件的构造和调用方式,所有transformers、Megatron的模型参数都可以在构造模型时传入。 - -后面所有的ray调用和数据分发,都是隐式进行的。运行这个脚本需要提前安装好ray。之后这样运行: - -```shell -python train.py -``` - -### 远程训练 - -Twinkle的一大特色是支持多租户用户混合训练。具体来说,多个用户可以使用一个基模进行lora训练,这样可以极大减小服务端部署成本。 - -假设我们使用八卡开启一个服务。首先我们需要启动ray集群: - -```shell -CUDA_VISIBLE_DEVICES=0,1 ray start --head --port=6379 --num-gpus=2 -CUDA_VISIBLE_DEVICES=2,3 ray start --address=127.0.0.1:6379 --num-gpus=2 -CUDA_VISIBLE_DEVICES="" ray start --address=127.0.0.1:6379 --num-gpus=0 -``` - -我们启动了一组包含三个node的ray集群: -- 01两张卡作为一个node -- 23两张卡作为一个node -- cpu资源作为一个node - -如果在生产环境使用,可以启动更多node,并部署更多replica以兼容更大的用户量。在这里我们仅以四卡作为例子。 - -下面,启动server: -```shell - -cd cookbook/client/twinkle/transformer -python server.py -``` - -服务端会启动一个包含了一个sampler集群、一个模型集群、一个工具集群的三个服务。 - -下面可以进行client端训练: -```python -import dotenv -dotenv.load_dotenv('.env') -import re -from twinkle.data_format import Trajectory -from 
twinkle.reward.base import Reward -import gc -from peft import LoraConfig -from typing import List, Tuple - -from twinkle import get_logger -from twinkle.advantage import GRPOAdvantage -from twinkle.dataset import DatasetMeta -from twinkle.metric import CompletionRewardMetric -from twinkle_client import init_twinkle_client -from twinkle_client.dataloader import DataLoader -from twinkle_client.dataset import Dataset -from twinkle_client.model import MultiLoraTransformersModel -from twinkle_client.sampler import vLLMSampler - -logger = get_logger() - -# ========== Configuration ========== -MODEL_ID = 'ms://Qwen/Qwen3-4B' -NUM_GENERATIONS = 4 -MAX_NEW_TOKENS = 1024 -LEARNING_RATE = 1e-5 -MAX_STEPS = 10 -BATCH_SIZE = 2 -TEMPERATURE = 1.0 -SYNC_INTERVAL = 1 # Save weights for sampler every N steps -GRADIENT_ACCUMULATION_STEPS = 4 - - -def create_countdown_dataset(): - """Create Countdown Game dataset for GRPO training.""" - - dataset = Dataset(dataset_meta=DatasetMeta('ms://zouxuhong/Countdown-Tasks-3to4', data_slice=range(500))) - dataset.set_template('Template', model_id=MODEL_ID, max_length=8192) - dataset.map('CountdownProcessor') - dataset.encode(add_generation_prompt=True, batched=True) - return dataset - - -class CountDownAccuracy(Reward): - - @staticmethod - def countdown_accuracy_reward(completion: str, target: int, nums: List[int]) -> float: - """Accuracy reward: checks if equation is correct.""" - try: - match = re.search(r'(.*?)<\/answer>', completion) - if match is None: - return 0.0 - equation = match.group(1).strip() - if '=' in equation: - equation = equation.split('=')[0] - used_numbers = [int(n) for n in re.findall(r'\d+', equation)] - if sorted(used_numbers) != sorted(nums): - return 0.0 - if not re.match(r'^[\d+\-*/().\s]+$', equation): - return 0.0 - result = eval(equation, {'__builtins__': None}, {}) - return 1.0 if abs(float(result) - float(target)) < 1e-5 else 0.0 - except Exception: # noqa - return 0.0 - - def __call__(self, trajectories: 
List[Trajectory], ground_truths: List[Trajectory]): - rewards = [] - for trajectory in trajectories: - messages = trajectory.get('messages', []) - completion = '' - for msg in reversed(messages): - if msg.get('role') == 'assistant': - completion = msg.get('content', '') - break - user_data = trajectory.get('user_data', [{}]) - data = user_data[0] if isinstance(user_data, list) and user_data else {} - target = data.get('target', 0) - nums = data.get('nums', []) - acc_reward = self.countdown_accuracy_reward(completion, target, nums) - rewards.append(acc_reward) - return rewards - - -def compute_rewards(trajectories: List[dict], ) -> Tuple[List[float], List[float], List[float]]: - """Compute format and accuracy rewards for Countdown game.""" - from twinkle.reward import FormatReward - format_rewards = FormatReward()(trajectories, []) - accuracy_rewards = CountDownAccuracy()(trajectories, []) - total_rewards = [a + b for a, b in zip(accuracy_rewards, format_rewards)] - return total_rewards, format_rewards, accuracy_rewards - - -def train(): - # Step 1: Initialize the Twinkle client - client = init_twinkle_client( - base_url='http://localhost:8000', - api_key='', - ) - - # Step 2: Prepare dataset and dataloader - dataset = create_countdown_dataset() - dataloader = DataLoader(dataset=dataset, batch_size=BATCH_SIZE) - - # Step 3: Configure the training model - model = MultiLoraTransformersModel(model_id=MODEL_ID) - - lora_config = LoraConfig( - target_modules='all-linear', - r=8, - lora_alpha=32, - lora_dropout=0.05, - ) - model.add_adapter_to_model( - 'default', - lora_config, - gradient_accumulation_steps=GRADIENT_ACCUMULATION_STEPS, - ) - - # Set GRPO loss (the key difference from SFT training) - model.set_loss('GRPOLoss', epsilon=0.2, beta=0.0) - - # Set optimizer and LR scheduler - model.set_optimizer('AdamW', lr=LEARNING_RATE) - model.set_lr_scheduler( - 'CosineWarmupScheduler', - num_warmup_steps=500, - num_training_steps=MAX_STEPS, - ) - - # Set processor and 
template for encoding inputs - model.set_processor('InputProcessor') - model.set_template('Template', model_id=MODEL_ID) - - # Step 4: Configure the sampler - sampler = vLLMSampler(model_id=MODEL_ID) - sampler.set_template('Template', model_id=MODEL_ID) - - # Step 5: Setup metrics and advantage function - advantage_fn = GRPOAdvantage() - metrics = CompletionRewardMetric() - - sampling_params = { - 'max_tokens': MAX_NEW_TOKENS, - 'temperature': TEMPERATURE, - 'top_p': 0.95, - } - - # Track the current adapter path for sampling - current_adapter_uri = None - - step = 0 - for batch in dataloader: - if step >= MAX_STEPS: - break - - metrics.reset() - prompts = batch if isinstance(batch, list) else [batch] - - # ========== 1. Save weights and update adapter_uri ========== - # Instead of sync_weights, save the model checkpoint and pass - # the resulting path to the sampler as adapter_uri - if step % SYNC_INTERVAL == 0: - logger.info(f'Step {step}: Saving weights for sampler...') - twinkle_path = model.save( - name=f'grpo-sampler-step-{step}', - save_optimizer=False, - ) - current_adapter_uri = twinkle_path - logger.info(f'Step {step}: Saved weights to {current_adapter_uri}') - - # ========== 2. Sample completions ========== - sample_response = sampler.sample( - inputs=prompts, - sampling_params=sampling_params, - adapter_uri=current_adapter_uri, - num_samples=NUM_GENERATIONS, - ) - - input_features = [] - old_logps_list = [] - completion_lengths = [] - - sequences = sample_response.get('sequences', []) - for seq in sequences: - input_features.append(seq.get('new_input_feature', seq)) - old_logps_list.append(seq.get('logprobs', [])) - completion_lengths.append(len(seq.get('tokens', []))) - - if not input_features: - logger.warning(f'Step {step}: No valid samples, skipping') - step += 1 - continue - - # ========== 3. 
Compute rewards ========== - total_rewards, format_rewards, accuracy_rewards = compute_rewards(input_features) - metrics.accumulate( - None, - None, - completion_lengths=completion_lengths, - rewards={ - 'total': total_rewards, - 'format': format_rewards, - 'accuracy': accuracy_rewards, - }) - - # ========== 4. Compute advantages ========== - advantages = advantage_fn( - total_rewards, - num_generations=NUM_GENERATIONS, - scale='group', - ).tolist() - - frac_zero_std = (1.0 if all(abs(a) < 1e-8 for a in advantages) else 0.0) - if frac_zero_std == 1.0: - logger.info(f'Step {step}: All advantages are zero, skipping training') - step += 1 - continue - - # ========== 5. Training step (GRPO) ========== - # forward_backward with GRPO loss: passes advantages and old_logps - # to the server-side GRPOLoss for proper policy optimization - model.forward_backward( - inputs=input_features, - advantages=advantages, - old_logps=old_logps_list, - ) - - # Gradient clipping and optimizer step - model.clip_grad_norm(1.0) - model.step() - model.zero_grad() - model.lr_step() - - gc.collect() - - # ========== 6. 
Log ========== - log_dict = metrics.calculate() - log_dict.update(model.calculate_metric()) - log_dict['train/frac_reward_zero_std'] = frac_zero_std - logger.info(f'Step {step}: {log_dict}') - step += 1 - - # Save final checkpoint - twinkle_path = model.save(name='grpo-countdown-final', save_optimizer=True) - logger.info(f'Saved final checkpoint: {twinkle_path}') - - -if __name__ == '__main__': - train() -``` - -多个开发者可以并行使用这个服务的单个基模并行训练和采样。并且,他们进行的训练方式允许不同。例如,A用户可以进行SFT,B用户可以进行RL,C用户可以进行采样。 同样,Twinkle也支持Tinker-like API进行远端训练: - ->[!Note] -> 需要注意的一点,在当前Twinkle的实现中,客户端的Twinkle API和Tinker API是无法同时在一个服务端使用的。当你需要提供Tinker API时,你需要启动cookbook/client/tinker下的服务。 -> 这个问题会在接下来的迭代高优解决。 - -```python -from tinker import types -from tqdm import tqdm -from tinker import ServiceClient -from twinkle.dataloader import DataLoader -from twinkle.dataset import Dataset, DatasetMeta -from twinkle.preprocessor import SelfCognitionProcessor -from twinkle.server.tinker.common import input_feature_to_datum - -# The base model to fine-tune / evaluate -base_model = 'Qwen/Qwen3-4B' - - -def train(): - # Step 1: Prepare the dataset - - # Load the self-cognition dataset from ModelScope (first 500 examples) - dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(500))) - - # Apply the chat template matching the base model (max 256 tokens per sample) - dataset.set_template('Template', model_id=f'ms://{base_model}', max_length=256) - - # Replace placeholder names with custom model/author identity - dataset.map(SelfCognitionProcessor('twinkle模型', 'twinkle团队'), load_from_cache_file=False) - - # Tokenize and encode the dataset into model-ready input features - dataset.encode(batched=True, load_from_cache_file=False) - - # Wrap the dataset into a DataLoader that yields batches of size 8 - dataloader = DataLoader(dataset=dataset, batch_size=8) - - # Step 2: Initialize the training client - # Connect to the Twinkle server running locally - service_client = 
ServiceClient(base_url='http://localhost:8000', api_key='your-api-key') - # Create a LoRA training client for the base model (rank=16 for the LoRA adapter) - training_client = service_client.create_lora_training_client(base_model=base_model, rank=16) - - # Step 3: Run the training loop - for epoch in range(3): - print(f'Epoch {epoch}') - for step, batch in tqdm(enumerate(dataloader)): - # Convert each InputFeature into a Datum for the Tinker API - input_datum = [input_feature_to_datum(input_feature) for input_feature in batch] - - # Send data to server: forward + backward pass (computes gradients) - fwdbwd_future = training_client.forward_backward(input_datum, 'cross_entropy') - - # Optimizer step: update model weights with Adam - optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4)) - - # Wait for both operations to complete - fwdbwd_future.result() - optim_result = optim_future.result() - print(f'Training Metrics: {optim_result}') - - # Save a checkpoint after each epoch - save_future = training_client.save_state(f'twinkle-lora-{epoch}') - save_result = save_future.result() - print(f'Saved checkpoint to {save_result.path}') - - -if __name__ == '__main__': - train() -``` - -### 使用魔搭社区提供的TaaS化训练服务 - -在 Twinkle 框架开源的同时,我们依托ModelScope的后台服务,也提供了托管的模型训练服务(Training as a Service),开发者可以通过这一服务, 免费体验Twinkle的训练API。 -该服务和上面叙述的Tinker API部分代码是相同的,唯一不同的是Endpoint和Token需要使用魔搭官方的对应信息。关于如何使用官方服务,请查看[训练服务](./训练服务.md)的详细描述。 - -## 使用Hugging Face的模型 - -切换前缀即可。 - -```text -ms://Qwen/Qwen3-4B -> hf://Qwen/Qwen3-4B -``` - -## 🛠️ Twinkle✨ 模块化生态系统 - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
-

Dataset
数据加载和预处理

-
-

Template
编码和解码

-
-

DataLoader
数据分发和批处理

-
-

Preprocessor
数据 ETL

-
-

InputProcessor
任务特定的输入处理

-
-

Model
大模型,支持多种框架

-
-

Sampler
采样逻辑

-
-

Loss
损失函数

-
-

Metric
训练指标收集

-
-

Reward
奖励函数

-
-

Advantage
优势函数

-
-

CheckpointEngine
权重同步

-
-

Patch
模型修复补丁

-
-

Module
组件,如优化器

-
-

Kernel
算子

-
-

Server
启动后端集群

-
-

Client
客户端代码

-
-

Infra
隔离 ray 和 torchrun 的差异

-
-

Plugin
使用 hub 组件

-
-

Hub
与 HF/MS 库对接

-
-
-
 ## Customizable Components in Twinkle
 
 In Twinkle's design, torchrun, Ray, and HTTP training all use the same API and share the same components and input/output structures. As a result, many of its components can be customized by developers to implement new algorithms.
 
 The components we recommend customizing are listed below:
 
-| Component | Base Class | Description |
-| --- | --- | --- |
-| Loss | twinkle.loss.Loss | Defines the loss function for model training |
-| Metric | twinkle.metric.Metric | Defines the evaluation metrics for model training |
-| Optimizer/LRScheduler | Based on PyTorch | Defines the optimizer and LR scheduler for model training |
-| Patch | twinkle.patch.Patch | Fixes issues encountered during model training |
+| Component             | Base Class                                 | Description                                                                 |
+| --------------------- | ------------------------------------------ | --------------------------------------------------------------------------- |
+| Loss                  | twinkle.loss.Loss                          | Defines the loss function for model training                                 |
+| Metric                | twinkle.metric.Metric                      | Defines the evaluation metrics for model training                            |
+| Optimizer/LRScheduler | Based on PyTorch                           | Defines the optimizer and LR scheduler for model training                    |
+| Patch                 | twinkle.patch.Patch                        | Fixes issues encountered during model training                               |
 | Preprocessor          | twinkle.preprocessor.Preprocessor          | Preprocesses data (ETL) and returns the standard format consumable by Template |
-| Filter | twinkle.preprocessor.Filter | Filters raw data for validity |
-| Task input processor | twinkle.processor.InputProcessor | Converts model inputs into the data each task needs, adding extra fields |
-| Model | twinkle.model.TwinkleModel | The large model itself |
-| Sampler | twinkle.sampler.Sampler | The sampler, e.g. vLLM |
-| Reward | twinkle.reward.Reward | Implements rewards for different RL training |
-| Advantage | twinkle.advantage.Advantage | Implements advantage estimation for different RL training |
-| Template | twinkle.template.Template | Processes standard inputs and converts them into the tokens the model needs |
-| Weight sync | twinkle.checkpoint_engine.CheckpointEngine | Weight synchronization during RL training |
+| Filter                | twinkle.preprocessor.Filter                | Filters raw data for validity                                                |
+| Task input processor  | twinkle.processor.InputProcessor           | Converts model inputs into the data each task needs, adding extra fields     |
+| Model                 | twinkle.model.TwinkleModel                 | The large model itself                                                       |
+| Sampler               | twinkle.sampler.Sampler                    | The sampler, e.g. vLLM                                                       |
+| Reward                | twinkle.reward.Reward                      | Implements rewards for different RL training                                 |
+| Advantage             | twinkle.advantage.Advantage                | Implements advantage estimation for different RL training                    |
+| Template              | twinkle.template.Template                  | Processes standard inputs and converts them into the tokens the model needs  |
+| Weight sync           | twinkle.checkpoint_engine.CheckpointEngine | Weight synchronization during RL training                                    |
 
 > Components not listed in the table above, such as Dataset and DataLoader, can also be customized; simply follow the base-class API design.
 
@@ -876,10 +77,10 @@ DeviceGroup: defines how many resource groups this training run needs. Once defined, components can
 ```python
 from twinkle.model import TransformersModel
-model = TransformersModel(model_id='ms://Qwen/Qwen3-4B', remote_group='default', device_mesh=device_mesh)
+model = TransformersModel(model_id='ms://Qwen/Qwen2.5-7B-Instruct', remote_group='default', device_mesh=device_mesh)
 # or
 from twinkle.model import MegatronModel
-model = MegatronModel(model_id='ms://Qwen/Qwen3-4B', remote_group='default', device_mesh=device_mesh)
+model = MegatronModel(model_id='ms://Qwen/Qwen2.5-7B-Instruct', remote_group='default', device_mesh=device_mesh)
 ```
 
 DeviceMesh specifies the topology of the model and other components within the resource group, i.e., how parallelism is carried out. This affects a series of framework decisions, such as how data is fetched, consumed, and returned.
 
@@ -905,7 +106,7 @@ def train():
     # 1000 samples
     dataset = Dataset(dataset_meta=DatasetMeta('ms://swift/self-cognition', data_slice=range(1000)))
     # Set template to prepare encoding
-    dataset.set_template('Template', model_id='ms://Qwen/Qwen3-4B')
+    dataset.set_template('Template', model_id='ms://Qwen/Qwen2.5-7B-Instruct')
     # Preprocess the dataset to standard format
     dataset.map(SelfCognitionProcessor('twinkle大模型', 'ModelScope社区'))
     # Encode dataset
@@ -913,7 +114,7 @@ def train():
     # Global batch size = 8 on 8 GPUs, so 1 sample per GPU
     dataloader = DataLoader(dataset=dataset, batch_size=8, min_batch_size=8)
     # Use a TransformersModel
-    model = TransformersModel(model_id='ms://Qwen/Qwen3-4B', remote_group='default')
+    model = TransformersModel(model_id='ms://Qwen/Qwen2.5-7B-Instruct', remote_group='default')
 
     lora_config = LoraConfig(
         r=8,
@@ -953,54 +154,26 @@ python3 train.py
 
 ## Supported Large Language Models
 
-| Model Type | Example Model ID | Requires | Support Megatron | HF Model ID |
-| --- | --- | --- | --- | --- |
-| qwen2 full series | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |
-| | [Qwen/Qwen2-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-72B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2-72B-Instruct](https://huggingface.co/Qwen/Qwen2-72B-Instruct) |
-| | [Qwen/Qwen2-1.5B](https://modelscope.cn/models/Qwen/Qwen2-1.5B) | transformers>=4.37 | ✔ | [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) |
-| | [Qwen/Qwen2-7B](https://modelscope.cn/models/Qwen/Qwen2-7B) | transformers>=4.37 | ✔ | [Qwen/Qwen2-7B](https://huggingface.co/Qwen/Qwen2-7B) |
-| | [Qwen/Qwen2-72B](https://modelscope.cn/models/Qwen/Qwen2-72B) | transformers>=4.37 | ✔ | [Qwen/Qwen2-72B](https://huggingface.co/Qwen/Qwen2-72B) |
-| | [Qwen/Qwen2.5-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) |
-| | [Qwen/Qwen2.5-1.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-1.5B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
-| | [Qwen/Qwen2.5-72B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-72B-Instruct) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-72B-Instruct](https://huggingface.co/Qwen/Qwen2.5-72B-Instruct) |
-| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
-| | [Qwen/Qwen2.5-32B](https://modelscope.cn/models/Qwen/Qwen2.5-32B) | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-32B](https://huggingface.co/Qwen/Qwen2.5-32B) |
-| qwen2_moe full series | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) |
-| | [Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B) | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) |
-| qwen3 full series | [Qwen/Qwen3-0.6B-Base](https://modelscope.cn/models/Qwen/Qwen3-0.6B-Base) | transformers>=4.51 | ✔ | [Qwen/Qwen3-0.6B-Base](https://huggingface.co/Qwen/Qwen3-0.6B-Base) |
-| | [Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base) | transformers>=4.51 | ✔ | [Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base) |
-| | [Qwen/Qwen3-0.6B](https://modelscope.cn/models/Qwen/Qwen3-0.6B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B) |
-| | [Qwen/Qwen3-1.7B](https://modelscope.cn/models/Qwen/Qwen3-1.7B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) |
-| | [Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen2.5-32B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
-| qwen3_moe full series | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) |
-| | [Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
-| | [Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B) | transformers>=4.51 | ✔ | [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
-| chatglm2 full series | [ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b) | transformers<4.42 | ✘ | [zai-org/chatglm2-6b](https://huggingface.co/zai-org/chatglm2-6b) |
-| | [ZhipuAI/chatglm2-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm2-6b-32k) | transformers<4.42 | ✘ | [zai-org/chatglm2-6b-32k](https://huggingface.co/zai-org/chatglm2-6b-32k) |
-| chatglm3 full series | [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b](https://huggingface.co/zai-org/chatglm3-6b) |
-| | [ZhipuAI/chatglm3-6b-base](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-base) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b-base](https://huggingface.co/zai-org/chatglm3-6b-base) |
-| | [ZhipuAI/chatglm3-6b-32k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-32k) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b-32k](https://huggingface.co/zai-org/chatglm3-6b-32k) |
-| | [ZhipuAI/chatglm3-6b-128k](https://modelscope.cn/models/ZhipuAI/chatglm3-6b-128k) | transformers<4.42 | ✘ | [zai-org/chatglm3-6b-128k](https://huggingface.co/zai-org/chatglm3-6b-128k) |
-| chatglm4 full series | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) |
-| | [ZhipuAI/glm-4-9b](https://modelscope.cn/models/ZhipuAI/glm-4-9b) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b](https://huggingface.co/zai-org/glm-4-9b) |
-| | [ZhipuAI/glm-4-9b-chat-1m](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat-1m) | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat-1m](https://huggingface.co/zai-org/glm-4-9b-chat-1m) |
-| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) |
-| glm_edge full series | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) |
-| | [ZhipuAI/glm-edge-4b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-4b-chat) | transformers>=4.46 | ✘ | [zai-org/glm-edge-4b-chat](https://huggingface.co/zai-org/glm-edge-4b-chat) |
-| internlm2 full series | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) |
-| | [Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-1_8b-sft) | transformers>=4.38 | ✘ | [internlm/internlm2-chat-1_8b-sft](https://huggingface.co/internlm/internlm2-chat-1_8b-sft) |
-| | [Shanghai_AI_Laboratory/internlm2-base-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-base-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-base-7b](https://huggingface.co/internlm/internlm2-base-7b) |
-| | [Shanghai_AI_Laboratory/internlm2-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-7b](https://huggingface.co/internlm/internlm2-7b) |
-| | [Shanghai_AI_Laboratory/internlm2-chat-7b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-chat-7b) | transformers>=4.38 | ✘ | [internlm/internlm2-chat-7b](https://huggingface.co/internlm/internlm2-chat-7b) |
-| deepseek_v1 | [deepseek-ai/deepseek-vl-7b-chat](https://modelscope.cn/models/deepseek-ai/deepseek-vl-7b-chat) | transformers>=4.39.4 | ✔ | |
-| | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
-| | [deepseek-ai/DeepSeek-V2-Lite-Chat](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite-Chat) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite-Chat) |
-| | [deepseek-ai/DeepSeek-V2](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2](https://huggingface.co/deepseek-ai/DeepSeek-V2) |
-| | [deepseek-ai/DeepSeek-V2-Chat](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Chat) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Chat](https://huggingface.co/deepseek-ai/DeepSeek-V2-Chat) |
-| | [deepseek-ai/DeepSeek-V2.5](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2.5) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2.5](https://huggingface.co/deepseek-ai/DeepSeek-V2.5) |
-| | [deepseek-ai/DeepSeek-Prover-V2-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-7B) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-Prover-V2-7B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B) |
-| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
-| deepseek_r1_distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B) |
-| | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |
-| | [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-14B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B) |
-| | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-32B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-32B) |
+| Model Type | Example Model ID | Model Size | Requires | Support Megatron | HF Model ID |
+| ------------------- | --------------------------------------------------------------------------------------------------------------- | :-------------------------------------: | -------------------- | :--------------: | :-----------------------------------------------------------------------------------------------------: |
+| qwen2 full series | [Qwen/Qwen2-0.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2-0.5B-Instruct) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct) |
+| | [Qwen/Qwen2-1.5B](https://modelscope.cn/models/Qwen/Qwen2-1.5B) | 0.5B/1.5B/7B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2-1.5B](https://huggingface.co/Qwen/Qwen2-1.5B) |
+| | [Qwen/Qwen2.5-1.5B-Instruct](https://modelscope.cn/models/Qwen/Qwen2.5-1.5B-Instruct) | 0.5B/1.5B/3B/7B/14B/32B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-1.5B-Instruct) |
+| | [Qwen/Qwen2.5-0.5B](https://modelscope.cn/models/Qwen/Qwen2.5-0.5B) | 0.5B/1.5B/3B/7B/14B/32B/72B | transformers>=4.37 | ✔ | [Qwen/Qwen2.5-0.5B](https://huggingface.co/Qwen/Qwen2.5-0.5B) |
+| qwen2_moe full series | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B-Chat) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B-Chat](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B-Chat) |
+| | [Qwen/Qwen1.5-MoE-A2.7B](https://modelscope.cn/models/Qwen/Qwen1.5-MoE-A2.7B) | - | transformers>=4.40 | ✔ | [Qwen/Qwen1.5-MoE-A2.7B](https://huggingface.co/Qwen/Qwen1.5-MoE-A2.7B) |
+| qwen3 full series | [Qwen/Qwen3-14B-Base](https://modelscope.cn/models/Qwen/Qwen3-14B-Base) | 0.6B/1.7B/4B/8B/14B | transformers>=4.51 | ✔ | [Qwen/Qwen3-14B-Base](https://huggingface.co/Qwen/Qwen3-14B-Base) |
+| | [Qwen/Qwen3-32B](https://modelscope.cn/models/Qwen/Qwen3-32B) | 0.6B/1.7B/4B/8B/14B/32B | transformers>=4.51 | ✔ | [Qwen/Qwen3-32B](https://huggingface.co/Qwen/Qwen3-32B) |
+| qwen3_moe full series | [Qwen/Qwen3-30B-A3B-Base](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B-Base) | - | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B-Base](https://huggingface.co/Qwen/Qwen3-30B-A3B-Base) |
+| | [Qwen/Qwen3-30B-A3B](https://modelscope.cn/models/Qwen/Qwen3-30B-A3B) | - | transformers>=4.51 | ✔ | [Qwen/Qwen3-30B-A3B](https://huggingface.co/Qwen/Qwen3-30B-A3B) |
+| | [Qwen/Qwen3-235B-A22B](https://modelscope.cn/models/Qwen/Qwen3-235B-A22B) | - | transformers>=4.51 | ✔ | [Qwen/Qwen3-235B-A22B](https://huggingface.co/Qwen/Qwen3-235B-A22B) |
+| chatglm2 full series | [ZhipuAI/chatglm2-6b](https://modelscope.cn/models/ZhipuAI/chatglm2-6b) | 6b/6b-32k | transformers<4.42 | ✘ | [zai-org/chatglm2-6b](https://huggingface.co/zai-org/chatglm2-6b) |
+| chatglm3 full series | [ZhipuAI/chatglm3-6b](https://modelscope.cn/models/ZhipuAI/chatglm3-6b) | 6b/6b-base/6b-32k/6b-128k | transformers<4.42 | ✘ | [zai-org/chatglm3-6b](https://huggingface.co/zai-org/chatglm3-6b) |
+| chatglm4 full series | [ZhipuAI/glm-4-9b-chat](https://modelscope.cn/models/ZhipuAI/glm-4-9b-chat) | glm-4-9b/glm-4-9b-chat/glm-4-9b-chat-1m | transformers>=4.42 | ✘ | [zai-org/glm-4-9b-chat](https://huggingface.co/zai-org/glm-4-9b-chat) |
+| | [ZhipuAI/LongWriter-glm4-9b](https://modelscope.cn/models/ZhipuAI/LongWriter-glm4-9b) | - | transformers>=4.42 | ✘ | [zai-org/LongWriter-glm4-9b](https://huggingface.co/zai-org/LongWriter-glm4-9b) |
+| glm_edge full series | [ZhipuAI/glm-edge-1.5b-chat](https://modelscope.cn/models/ZhipuAI/glm-edge-1.5b-chat) | 1.5b-chat/4b-chat | transformers>=4.46 | ✘ | [zai-org/glm-edge-1.5b-chat](https://huggingface.co/zai-org/glm-edge-1.5b-chat) |
+| internlm2 full series | [Shanghai_AI_Laboratory/internlm2-1_8b](https://modelscope.cn/models/Shanghai_AI_Laboratory/internlm2-1_8b) | 1_8b/chat-1_8b-sft/base-7b/7b/chat-7b | transformers>=4.38 | ✘ | [internlm/internlm2-1_8b](https://huggingface.co/internlm/internlm2-1_8b) |
+| deepseek_v1 | [deepseek-ai/DeepSeek-V2-Lite](https://modelscope.cn/models/deepseek-ai/DeepSeek-V2-Lite) | V2/V2-Lite/V2-Chat/V2-Lite-Chat/V2.5 | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-V2-Lite](https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite) |
+| | [deepseek-ai/DeepSeek-Prover-V2-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-Prover-V2-7B) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-Prover-V2-7B](https://huggingface.co/deepseek-ai/DeepSeek-Prover-V2-7B) |
+| | [deepseek-ai/DeepSeek-R1](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1) | - | transformers>=4.39.3 | ✔ | [deepseek-ai/DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1) |
+| deepseek_r1_distill | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://modelscope.cn/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) | 1.5B/7B/14B/32B | transformers>=4.37 | ✔ | [deepseek-ai/DeepSeek-R1-Distill-Qwen-7B](https://huggingface.co/deepseek-ai/DeepSeek-R1-Distill-Qwen-7B) |
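The `Requires` column gates each model family on the installed `transformers` version, using either a lower bound (e.g. `transformers>=4.51`) or an upper bound (e.g. `transformers<4.42`). A minimal standalone helper for checking an installed version against one of these table entries; this is an illustrative sketch, not part of twinkle:

```python
def satisfies(installed: str, requirement: str) -> bool:
    """Check an installed version (e.g. "4.51.0") against a table entry
    such as "transformers>=4.51" or "transformers<4.42".

    Versions are compared numerically, component by component, so that
    "4.9" is correctly treated as older than "4.51" (a plain string
    comparison would get this wrong).
    """
    for op in (">=", "<"):  # the only two operators used in the table
        if op in requirement:
            wanted = requirement.split(op, 1)[1].strip()
            inst = tuple(int(part) for part in installed.split("."))
            want = tuple(int(part) for part in wanted.split("."))
            return inst >= want if op == ">=" else inst < want
    raise ValueError(f"unrecognized requirement: {requirement!r}")


print(satisfies("4.51.0", "transformers>=4.51"))  # True
print(satisfies("4.41.2", "transformers<4.42"))   # True
```

In practice, `packaging.version.parse` from the `packaging` library is the robust way to compare versions; the sketch only covers the simple `X.Y[.Z]` forms that appear in the table above.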