
Commit 3d72cfe

Merge pull request #1 from modelscope/yunlin
[WIP] update pyproject.toml
2 parents 02b9eea + cd7a76b commit 3d72cfe

File tree

22 files changed: +8320 −63 lines changed


.github/copilot-instructions.md

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
# Twinkle AI Coding Agent Guidelines

These instructions help AI agents work productively in this repo. Focus on concrete repo patterns and workflows.

## Big Picture

- **Goal:** Training and serving LLMs with multi-adapter LoRA, efficient data handling, and distributed execution across Ray and Torch.
- **Core Modules:**
  - Infrastructure & distributed orchestration: [src/twinkle/infra/__init__.py](src/twinkle/infra/__init__.py)
  - Device layout & platform abstraction: [src/twinkle/utils/platform.py](src/twinkle/utils/platform.py), [src/twinkle/utils/framework.py](src/twinkle/utils/framework.py)
  - Model stack (Transformers + Multi-LoRA): [src/twinkle/model/multi_lora_transformers.py](src/twinkle/model/multi_lora_transformers.py)
  - Sampler (vLLM integration): [src/twinkle/sampler/vllm_sampler.py](src/twinkle/sampler/vllm_sampler.py)
  - Losses & metrics: [src/twinkle/loss](src/twinkle/loss), [src/twinkle/metric](src/twinkle/metric)
  - Templates & preprocessing: [src/twinkle/template](src/twinkle/template), [src/twinkle/preprocessor](src/twinkle/preprocessor)
  - Model/Processor HTTP services via Ray Serve: [src/twinkle/server/twinkle](src/twinkle/server/twinkle)
  - Hub integrations (ModelScope/HF): [src/twinkle/hub/hub.py](src/twinkle/hub/hub.py)
## Architecture & Patterns

- **Lazy import surface:** [src/twinkle/__init__.py](src/twinkle/__init__.py) exposes a small, lazy API (`_LazyModule`); import public symbols from here when possible.
- **Distributed mode selection:** `twinkle.infra.initialize()` toggles between local and Ray modes. Ray mode requires `TWINKLE_MODE=ray` or `initialize(mode='ray', ...)`.
- **Remote execution decorators:**
  - `remote_class()` wraps classes for Ray placement and auto-injects a `DeviceMesh` if one is missing.
  - `remote_function(dispatch='slice', execute='all', collect='none')` patches methods for distributed dispatch/collect.
  - See usage in [src/twinkle/model/multi_lora_transformers.py](src/twinkle/model/multi_lora_transformers.py) and [src/twinkle/sampler/vllm_sampler.py](src/twinkle/sampler/vllm_sampler.py).
- **Device topology:** Represented by `DeviceMesh`/`DeviceGroup`. Visualize with `twinkle.infra.get_device_placement()`; examples in [tests/infra/test_infra_graph.py](tests/infra/test_infra_graph.py).
- **Platform abstractions:** `GPU`/`NPU` selection via env and device discovery. Rank/world size are read from env (`RANK`, `WORLD_SIZE`, etc.). See [src/twinkle/utils/platform.py](src/twinkle/utils/platform.py).
- **Hub usage:** `HubOperation` routes to HF or ModelScope by `hf://` or `ms://` prefixes. Dataset/model download/push helpers live in [src/twinkle/hub/hub.py](src/twinkle/hub/hub.py).
- **Plugin loading:** Use `Plugin.load_plugin(id, Base)` for remote code from hubs; guarded by `trust_remote_code()` to prevent unsafe execution. See [src/twinkle/utils/plugin.py](src/twinkle/utils/plugin.py).
- **Multi-LoRA conventions:**
  - `MultiLoraTransformersModel` wraps a base Transformers model via `MultiAdapter` to manage multiple LoRA adapters.
  - FSDP is unsupported for Multi-LoRA (`fsdp_world_size == 1` is enforced). Adapter params are strictly controlled to avoid training base weights.
  - Adapter ops are routed through remote functions and grouped by DP process groups.
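The lazy import surface can be sketched in plain Python. `_LazyModule` below is a hypothetical reconstruction (the real class in src/twinkle/__init__.py may differ), shown resolving `json.dumps` only on first access:

```python
import importlib
import types

class _LazyModule(types.ModuleType):
    """Sketch of a lazy import surface (hypothetical structure; the real
    _LazyModule in src/twinkle/__init__.py may differ)."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # import_structure maps submodule name -> list of public symbols
        self._symbol_to_module = {
            sym: mod for mod, syms in import_structure.items() for sym in syms
        }

    def __getattr__(self, attr):
        # Called only when the attribute is not yet set on the module.
        if attr not in self._symbol_to_module:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        module = importlib.import_module(self._symbol_to_module[attr])
        value = getattr(module, attr)
        setattr(self, attr, value)  # cache so later accesses skip __getattr__
        return value

# Usage: expose json.dumps lazily under a toy package surface.
api = _LazyModule("toy_api", {"json": ["dumps"]})
print(api.dumps({"ok": True}))
```

The submodule `json` is imported only when `api.dumps` is first touched, which keeps the top-level import cheap.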
## Developer Workflows

- **Install:** Python 3.11+, via Poetry or pip.
  - Poetry: `poetry install --with transformers,ray`
  - Pip (editable): `pip install -e .[transformers,ray]`
- **Run tests:**
  - Unit tests: `python -m unittest tests/infra/test_infra_graph.py`
- **Local single-process dev:**
  - Initialize infra: `twinkle.initialize(mode='local', seed=42)`
  - Inspect device placement: call `twinkle.infra.get_device_placement()`.
- **Ray Serve demo (HTTP services):**
  - Config and launcher: [cookbook/client/server.py](cookbook/client/server.py), [cookbook/client/server_config.yaml](cookbook/client/server_config.yaml)
  - Start: `python cookbook/client/server.py`
  - Endpoints print on startup (default `localhost:8000`).
  - The model app binds `MultiLoraTransformersModel` and exposes routes like `/add_adapter_to_model`, `/forward`, `/calculate_loss`, etc. See [src/twinkle/server/twinkle/model.py](src/twinkle/server/twinkle/model.py).
- **vLLM inference:** Use `VLLMSampler` with engine args; LoRA weight sync via `patch.vllm_lora_weights`. See [src/twinkle/sampler/vllm_sampler.py](src/twinkle/sampler/vllm_sampler.py).
## Conventions & Gotchas

- **Safety:** Remote plugin code requires `trust_remote_code()` to be true; avoid loading arbitrary strings into adapter configs (enforced in Multi-LoRA).
- **Env-driven ranks:** Many utilities read rank/world size from env; set `WORLD_SIZE`, `RANK`, `LOCAL_RANK` when using torchrun.
- **Determinism:** `seed_everything(seed, full_determinism)` controls CUDA/NPU determinism; it may set envs like `CUDA_LAUNCH_BLOCKING`.
- **Adapter lifecycle:** The server auto-removes inactive adapters (a heartbeat is required); per-token adapter limits are enforced. See cleanup in [src/twinkle/server/twinkle/model.py](src/twinkle/server/twinkle/model.py).
- **Templates:** Tokenization/encoding via `Template` (e.g., `Qwen3Template`) produces `InputFeature` for the model forward. See [src/twinkle/template/base.py](src/twinkle/template/base.py).
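The env-driven rank convention can be sketched as a small helper; `get_dist_env` is a hypothetical name, not the repo's actual utility (the real logic lives in src/twinkle/utils/platform.py and may differ):

```python
import os

def get_dist_env(env=None):
    """Read torchrun-style rank info from the environment (hypothetical
    helper mirroring the convention; defaults to a single process)."""
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    # LOCAL_RANK falls back to the global rank when unset.
    local_rank = int(env.get("LOCAL_RANK", rank))
    return rank, world_size, local_rank

print(get_dist_env(env={}))                                # single-process fallback
print(get_dist_env(env={"RANK": "3", "WORLD_SIZE": "8"}))  # torchrun-style env
```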
## Examples

- **Visualize a custom mesh:** create `DeviceMesh` and call `get_device_placement()`; example in [tests/infra/test_infra_graph.py](tests/infra/test_infra_graph.py).
- **Add a LoRA adapter via HTTP:** POST to `/add_adapter_to_model` with a serialized `LoraConfig`; see server routes in [src/twinkle/server/twinkle/model.py](src/twinkle/server/twinkle/model.py).
- **Sample with vLLM:** configure `VLLMSampler`, set `Template`/`Processor`, then call `sample()` on a `Trajectory` list; see [src/twinkle/sampler/vllm_sampler.py](src/twinkle/sampler/vllm_sampler.py).
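A minimal sketch of building such a request. The field names (`adapter_name`, `lora_config`) and values here are illustrative assumptions; the actual request schema is defined by the routes in src/twinkle/server/twinkle/model.py:

```python
import json

# Hypothetical payload shape for the /add_adapter_to_model route.
lora_config = {"r": 8, "lora_alpha": 16, "target_modules": ["q_proj", "v_proj"]}
payload = json.dumps({"adapter_name": "demo-adapter", "lora_config": lora_config})

# Sending it requires a running server, e.g.:
# import urllib.request
# req = urllib.request.Request(
#     "http://localhost:8000/add_adapter_to_model",
#     data=payload.encode(),
#     headers={"Content-Type": "application/json"},
# )
# urllib.request.urlopen(req)
print(payload)
```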
---

Questions or gaps? Tell us where guidance is unclear (e.g., missing run scripts, Ray cluster setup), and we'll refine this document.

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -157,3 +157,4 @@ megatron_output/
 
 # ast template
 ast_index_file.py
+test_cookbook/

cookbook/sft/multi_lora.py

Lines changed: 1 addition & 1 deletion
@@ -30,7 +30,7 @@
 )
 
 
-twinkle.initialize(mode='ray', nproc_per_node=4, groups=device_group, global_device_mesh=device_mesh, lazy_collect=False)
+twinkle.initialize(mode='local', nproc_per_node=4, groups=device_group, global_device_mesh=device_mesh, lazy_collect=False)
 
 
 def train():

cookbook/sft/streaming_dataset.py

Lines changed: 9 additions & 3 deletions
@@ -19,10 +19,16 @@
 ]
 
 
+# device_mesh = DeviceMesh(
+#     device_type='cuda',
+#     mesh=np.array([[0,1], [2,3]]),
+#     mesh_dim_names=('dp', 'fsdp')
+# )
+
 device_mesh = DeviceMesh(
-    device_type='cuda',
-    mesh=np.array([[0,1], [2,3]]),
-    mesh_dim_names=('dp', 'fsdp')
+    device_type='cuda',
+    mesh=np.array([0,1,2,3]),
+    mesh_dim_names=('dp',)
 )
 
 twinkle.initialize(mode='ray', groups=device_group, global_device_mesh=device_mesh)
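The mesh change above (2-D `('dp', 'fsdp')` to 1-D `('dp',)`) can be sanity-checked with plain numpy, assuming `DeviceMesh` pairs each entry of `mesh_dim_names` with one axis of `mesh`:

```python
import numpy as np

# A 2-D mesh pairs with two dim names; the flattened mesh pairs with one.
mesh_2d = np.array([[0, 1], [2, 3]])   # mesh_dim_names=('dp', 'fsdp')
mesh_1d = np.array([0, 1, 2, 3])       # mesh_dim_names=('dp',)

assert mesh_2d.ndim == len(('dp', 'fsdp'))
assert mesh_1d.ndim == len(('dp',))
assert mesh_1d.size == mesh_2d.size == 4  # same four devices either way
print(mesh_2d.shape, mesh_1d.shape)
```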

cookbook/tinker/lora.py

Lines changed: 99 additions & 0 deletions
@@ -0,0 +1,99 @@
#%%
from twinkle_client import init_tinker_compat_client
service_client = init_tinker_compat_client(base_url='http://localhost:8000')

print("Available models:")
for item in service_client.get_server_capabilities().supported_models:
    print("- " + item.model_name)

#%%
base_model = "ms://Qwen/Qwen2.5-0.5B-Instruct"
training_client = service_client.create_lora_training_client(
    base_model=base_model
)

#%%
# Create some training examples
examples = [
    {"input": "banana split", "output": "anana-bay plit-say"},
    {"input": "quantum physics", "output": "uantum-qay ysics-phay"},
    {"input": "donut shop", "output": "onut-day op-shay"},
    {"input": "pickle jar", "output": "ickle-pay ar-jay"},
    {"input": "space exploration", "output": "ace-spay exploration-way"},
    {"input": "rubber duck", "output": "ubber-ray uck-day"},
    {"input": "coding wizard", "output": "oding-cay izard-way"},
]

# Convert examples into the format expected by the training client
from tinker import types
from modelscope import AutoTokenizer
# Get the tokenizer from the training client
# tokenizer = training_client.get_tokenizer()  # NOTE: network call to huggingface
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", trust_remote_code=True)

def process_example(example: dict, tokenizer) -> types.Datum:
    # Format the input with an Input/Output template.
    # For most real use cases you'll want a renderer / chat template
    # (see later docs), but here we keep it simple.
    prompt = f"English: {example['input']}\nPig Latin:"

    prompt_tokens = tokenizer.encode(prompt, add_special_tokens=True)
    prompt_weights = [0] * len(prompt_tokens)
    # Add a space before the output string, and finish with a double newline
    completion_tokens = tokenizer.encode(f" {example['output']}\n\n", add_special_tokens=False)
    completion_weights = [1] * len(completion_tokens)

    tokens = prompt_tokens + completion_tokens
    weights = prompt_weights + completion_weights

    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]  # We're predicting the next token, so targets are shifted.
    weights = weights[1:]

    # A Datum is a single training example for the loss function.
    # It has model_input, the input sequence that'll be passed into the LLM, and
    # loss_fn_inputs, a dictionary of extra inputs used by the loss function.
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens)
    )

processed_examples = [process_example(ex, tokenizer) for ex in examples]

# Visualize the first example for debugging purposes
datum0 = processed_examples[0]
print(f"{'Input':<20} {'Target':<20} {'Weight':<10}")
print("-" * 50)
for inp, tgt, wgt in zip(datum0.model_input.to_ints(), datum0.loss_fn_inputs['target_tokens'].tolist(), datum0.loss_fn_inputs['weights'].tolist()):
    print(f"{repr(tokenizer.decode([inp])):<20} {repr(tokenizer.decode([tgt])):<20} {wgt:<10}")

#%%
import numpy as np
for _ in range(6):
    fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))

    # Wait for the results
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()

    # fwdbwd_result contains the logprobs of all the tokens we put in.
    # Compute the weighted average log loss per token.
    logprobs = np.concatenate([output['logprobs'].tolist() for output in fwdbwd_result.loss_fn_outputs])
    weights = np.concatenate([example.loss_fn_inputs['weights'].tolist() for example in processed_examples])
    print(f"Loss per token: {-np.dot(logprobs, weights) / weights.sum():.4f}")

#%%
# First, create a sampling client. We need to transfer weights.
sampling_client = training_client.save_weights_and_get_sampling_client(name='pig-latin-model')

# Now we can sample from the model.
prompt = types.ModelInput.from_ints(tokenizer.encode("English: coffee break\nPig Latin:"))
params = types.SamplingParams(max_tokens=20, temperature=0.0, stop=["\n"])  # Greedy sampling
future = sampling_client.sample(prompt=prompt, sampling_params=params, num_samples=8)
result = future.result()
print("Responses:")
for i, seq in enumerate(result.sequences):
    print(f"{i}: {repr(tokenizer.decode(seq.tokens))}")
# %%
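The next-token shift inside `process_example` can be checked in isolation with hypothetical token ids standing in for tokenizer output:

```python
# Hypothetical ids: prompt "English: ...\nPig Latin:" and completion " output\n\n".
prompt_tokens = [101, 7, 8]
completion_tokens = [9, 10, 102]
tokens = prompt_tokens + completion_tokens
weights = [0] * len(prompt_tokens) + [1] * len(completion_tokens)

input_tokens = tokens[:-1]
target_tokens = tokens[1:]  # next-token prediction: targets shift left by one
weights = weights[1:]       # weights stay aligned with the targets

# All three sequences line up, and the completion is fully supervised.
assert len(input_tokens) == len(target_tokens) == len(weights)
assert target_tokens[-len(completion_tokens):] == completion_tokens
print(list(zip(input_tokens, target_tokens, weights)))
```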

cookbook/tinker/server.py

Lines changed: 48 additions & 0 deletions
@@ -0,0 +1,48 @@
import os
os.environ['RAY_DEBUG'] = '1'
import ray
from omegaconf import OmegaConf
from ray import serve
from twinkle.server.tinker import build_model_app, build_server_app

ray.init(namespace="twinkle_cluster")
serve.shutdown()
import time
time.sleep(5)

file_dir = os.path.abspath(os.path.dirname(__file__))
config = OmegaConf.load(os.path.join(file_dir, 'server_config.yaml'))

APP_BUILDERS = {
    'main:build_server_app': build_server_app,
    'main:build_model_app': build_model_app,
    # 'main:build_sampler_app': build_sampler_app,
}

for app_config in config.applications:
    print(f"Starting {app_config.name} at {app_config.route_prefix}...")

    builder = APP_BUILDERS[app_config.import_path]
    args = OmegaConf.to_container(app_config.args, resolve=True) if app_config.args else {}

    deploy_options = {}
    deploy_config = app_config.deployments[0]
    if 'autoscaling_config' in deploy_config:
        deploy_options['autoscaling_config'] = OmegaConf.to_container(deploy_config.autoscaling_config)
    if 'ray_actor_options' in deploy_config:
        deploy_options['ray_actor_options'] = OmegaConf.to_container(deploy_config.ray_actor_options)

    app = builder(
        deploy_options=deploy_options,
        route_prefix=app_config.route_prefix,
        **{k: v for k, v in args.items()}
    )

    serve.run(app, name=app_config.name, route_prefix=app_config.route_prefix)

print("\nAll applications started!")
print("Endpoints:")
for app_config in config.applications:
    print(f"  - http://localhost:8000{app_config.route_prefix}")

input("\nPress Enter to stop the server...")
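The builder-dispatch loop above reduces to a small pattern, sketched here with toy stand-ins for the app builders and a plain dict in place of the OmegaConf application entry:

```python
# Toy builders keyed by import_path, mimicking the APP_BUILDERS lookup.
APP_BUILDERS = {
    'main:build_server_app': lambda **kw: ('server_app', kw),
    'main:build_model_app': lambda **kw: ('model_app', kw),
}

# Plain-dict stand-in for one entry of config.applications.
app_config = {
    'name': 'models',
    'import_path': 'main:build_model_app',
    'route_prefix': '/api/v1/model',
    'args': {'nproc_per_node': 2},
}

builder = APP_BUILDERS[app_config['import_path']]
app = builder(route_prefix=app_config['route_prefix'], **app_config['args'])
print(app)
```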

cookbook/tinker/server_config.yaml

Lines changed: 42 additions & 0 deletions
@@ -0,0 +1,42 @@
proxy_location: EveryNode
http_options:
  host: 0.0.0.0
  port: 8000

applications:
- name: server
  route_prefix: /api/v1
  import_path: main:build_server_app
  args:

  deployments:
  - name: TinkerCompatServer
    autoscaling_config:
      min_replicas: 1
      max_replicas: 1
      target_ongoing_requests: 128
    ray_actor_options:
      num_cpus: 0.1

- name: models
  route_prefix: /api/v1/model
  import_path: main:build_model_app
  args:
    nproc_per_node: 2
    device_group:
      name: model
      ranks: [0, 1]
      device_type: cuda
    device_mesh:
      device_type: cuda
      mesh: [0, 1]
      mesh_dim_names: ['dp']
  deployments:
  - name: ModelManagement
    autoscaling_config:
      min_replicas: 1
      max_replicas: 1
      target_ongoing_requests: 16
    ray_actor_options:
      num_cpus: 0.1
