b200 local NVMe caching to speed up server start time #822

@functionstackx

Currently, the B200 dgcx GCP cluster stores checkpoints on the shared, cluster-level Lustre storage instead of the node-local /raid/ NVMe, which leads to 1-2 hour load times for Kimi K2.5.

Switching to /raid/ will lead to 6-7x more job completions per hour on B200.
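
One possible shape for the caching step, as a rough sketch rather than the actual change: copy the checkpoint directory from the shared filesystem to local NVMe the first time a node needs it, and reuse the local copy afterwards. The function name `ensure_local_checkpoint` and the paths in the usage note are hypothetical; the real launch scripts and directory layout are not described in this issue.

```python
import os
import shutil
import tempfile
from pathlib import Path


def ensure_local_checkpoint(shared_dir: str, cache_root: str) -> Path:
    """Return a node-local copy of shared_dir, copying only on a cache miss.

    Hypothetical helper: the name and directory layout are assumptions.
    """
    src = Path(shared_dir)
    dst = Path(cache_root) / src.name
    if dst.is_dir():
        return dst  # cache hit: a completed copy already exists on this node

    Path(cache_root).mkdir(parents=True, exist_ok=True)
    # Copy into a temporary sibling directory first so that an interrupted
    # copy never leaves a half-written checkpoint at the final path.
    tmp = Path(tempfile.mkdtemp(prefix=src.name + ".partial-", dir=cache_root))
    try:
        shutil.copytree(src, tmp, dirs_exist_ok=True)
        os.rename(tmp, dst)  # publish the finished copy under its final name
    except OSError:
        shutil.rmtree(tmp, ignore_errors=True)
        # If another process on this node published the copy first, use it.
        if not dst.is_dir():
            raise
    return dst
```

A server start script could then call something like `ensure_local_checkpoint("/lustre/.../kimi-k2.5", "/raid/ckpt-cache")` (both paths are placeholders) and pass the returned path to the model loader. Only the first job on each node would pay the shared-storage read. This sketch does not check free space on /raid/ or detect a checkpoint that changed upstream, both of which a real implementation would likely need.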
