Commit 89fbf59

feat: Create tiny config for GPT-NeoX for quick tests
The main zero1.yaml uses a config that exceeds the allocation limit of existing AWS machines. This commit adds a tiny YAML config to allow quick testing.

Test:
```
det experiment create zero1_tiny.yaml .
```

Job: http://ec2-44-213-33-242.compute-1.amazonaws.com:8080/det/experiments/10/overview
ghstack-source-id: 2b558da
Pull Request resolved: #6669
1 parent 4b9971e commit 89fbf59
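The tiny config works by overriding values from the base GPT-NeoX config (see `overwrite_values` in the new YAML below). A minimal sketch of how such an override merge behaves — illustrative only; `merge_overrides` is a hypothetical helper, not part of GPT-NeoX or Determined, and the actual merging happens inside the launcher:

```python
# Illustrative sketch: apply overwrite_values on top of a base config dict.
# merge_overrides is a hypothetical helper, not a real GPT-NeoX/Determined API.
def merge_overrides(base: dict, overrides: dict) -> dict:
    merged = dict(base)       # shallow copy of the base config
    merged.update(overrides)  # override values win on key conflicts
    return merged

# Hypothetical base values standing in for 2-7B.yml; only the overrides
# below are taken from the actual commit.
base = {"train_batch_size": 512, "pipe_parallel_size": 4, "seq_length": 2048}
overrides = {
    "pipe_parallel_size": 2,
    "model_parallel_size": 2,
    "train_batch_size": 64,
    "train_micro_batch_size_per_gpu": 2,
}
cfg = merge_overrides(base, overrides)
```

Keys not mentioned in `overwrite_values` (like `seq_length` here) fall through from the base config unchanged.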

2 files changed: 49 additions & 1 deletion

File tree

examples/deepspeed/gpt_neox/README.md

Lines changed: 2 additions & 1 deletion

@@ -39,7 +39,8 @@ mounted at `/run/determined/workdir/shared_fs`. This is done by default for clu

 Once a cluster is available, run the following command:
 ```
-det experiment create zero1.yaml .
+det experiment create zero1.yaml . # For full training
+det experiment create zero1_tiny.yaml . # For quick tests
 ```

 **Note:** You will need to run on GPUs that support fp16 training.
examples/deepspeed/gpt_neox/zero1_tiny.yaml

Lines changed: 47 additions & 0 deletions

@@ -0,0 +1,47 @@
+name: gpt-neox-zero1-2-7B
+debug: false
+profiling:
+  enabled: false
+  begin_on_batch: 50
+  end_after_batch: 100
+  sync_timings: false
+hyperparameters:
+  search_world_size: false
+  conf_dir: /gpt-neox/configs
+  conf_file:
+    - 2-7B.yml
+    - determined_cluster.yml
+  overwrite_values:
+    pipe_parallel_size: 2
+    model_parallel_size: 2
+    train_batch_size: 64
+    train_micro_batch_size_per_gpu: 2
+  wandb_group: null
+  wandb_team: null
+  user_script: null
+  eval_tasks: null
+environment:
+  environment_variables:
+    - NCCL_DEBUG=INFO
+    # You may need to modify this to match your network configuration.
+    - NCCL_SOCKET_IFNAME=ens,eth,ib
+  force_pull_image: true
+  image:
+    gpu: determinedai/gpt-neox:4850e79
+resources:
+  slots_per_trial: 16
+searcher:
+  name: single
+  metric: lm_loss
+  smaller_is_better: false
+  max_length:
+    batches: 100
+min_validation_period:
+  batches: 5000
+max_restarts: 0
+entrypoint:
+  - python3
+  - -m
+  - determined.launch.deepspeed
+  - --trial
+  - gpt2_trial:GPT2Trial
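The parallelism settings in the new config imply a gradient-accumulation schedule, since DeepSpeed requires `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * data_parallel_size`. A quick sanity check of that arithmetic, assuming the standard DeepSpeed batch-size relation:

```python
# Sanity-check the batch-size arithmetic implied by zero1_tiny.yaml.
# Assumed relation (DeepSpeed): train_batch_size =
#   train_micro_batch_size_per_gpu * gradient_accumulation_steps * data_parallel_size
slots_per_trial = 16
pipe_parallel_size = 2
model_parallel_size = 2
train_batch_size = 64
train_micro_batch_size_per_gpu = 2

# GPUs not consumed by pipeline/model parallelism form the data-parallel group.
data_parallel_size = slots_per_trial // (pipe_parallel_size * model_parallel_size)

# Accumulation steps needed to reach the global batch size.
grad_accum_steps = train_batch_size // (
    train_micro_batch_size_per_gpu * data_parallel_size
)

print(data_parallel_size, grad_accum_steps)  # 4 8
```

With 16 slots split 2-way pipeline × 2-way model parallel, 4 data-parallel replicas remain, so each replica accumulates 8 micro-batches of 2 to reach the global batch of 64.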