SVD-Flash

Setup Steps

Launch a Trainium instance on AWS EC2 with the following settings:
i. Name and tags: SVD-Flash
ii. Amazon Machine Image: Deep Learning AMI Neuron (Ubuntu 22.04)
iii. Instance type: trn1.2xlarge
iv. Key pair (login): create a new key pair
v. Metadata version [under “Advanced details”]: V2 only (otherwise, you will encounter a "not authorized" error)
vi. SSH username: ubuntu (use it when connecting to the instance via SSH)
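
For scripted launches, the console settings above can be approximated with boto3. This is a minimal sketch, not part of the repository: the AMI ID and key pair name are placeholders you must supply (the Deep Learning AMI Neuron ID differs per region), and the helper names are hypothetical. The console's "V2 only" metadata setting corresponds to requiring IMDSv2 tokens, which boto3 expresses as HttpTokens="required".

```python
def build_launch_params(ami_id: str, key_name: str) -> dict:
    """Return keyword arguments for boto3's EC2 run_instances call."""
    return {
        "ImageId": ami_id,  # placeholder: Neuron DLAMI (Ubuntu 22.04) ID for your region
        "InstanceType": "trn1.2xlarge",
        "KeyName": key_name,  # placeholder: an existing key pair name
        "MinCount": 1,
        "MaxCount": 1,
        # "V2 only" in the console means IMDSv2 session tokens are required.
        "MetadataOptions": {"HttpTokens": "required"},
        "TagSpecifications": [
            {
                "ResourceType": "instance",
                "Tags": [{"Key": "Name", "Value": "SVD-Flash"}],
            }
        ],
    }


def launch(ami_id: str, key_name: str) -> str:
    """Launch the instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so build_launch_params works without boto3

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(**build_launch_params(ami_id, key_name))
    return response["Instances"][0]["InstanceId"]


# Inspect the request without contacting AWS; both arguments are placeholders.
params = build_launch_params("ami-xxxxxxxxxxxxxxxxx", "svd-flash-key")
print(params["InstanceType"], params["MetadataOptions"])
```

Calling launch() with a real AMI ID and key pair name would start the instance; the last two lines only build and print the request.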

Activate the Neuron virtual environment for inference:

source /opt/aws_neuronx_venv_pytorch_2_7_nxd_inference/bin/activate

Download Llama-3.2-1B from Hugging Face (replace <YOUR_HF_TOKEN> with your own access token):

mkdir models

huggingface-cli download --token <YOUR_HF_TOKEN> meta-llama/Llama-3.2-1B --local-dir ./models/llama-3.2-1b

cd /home/ubuntu/models/llama-3.2-1b

mv model.safetensors model_ori.safetensors

Download the SVD-compressed, post-trained weights and save them as model.safetensors:

wget "https://huggingface.co/SVD-Flash/llama-3.2-1b_0.8_svd/resolve/main/llama-3.2-1b_svd_0.8_weights.safetensors?download=true" \
  -O model.safetensors
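
Before moving on, it can help to confirm that the model directory holds both weight files: the original checkpoint renamed to model_ori.safetensors and the SVD weights saved as model.safetensors. The check below is a small sketch using only the standard library; the helper name missing_weight_files is hypothetical, and the directory is assumed to be the one used in the commands above.

```python
from pathlib import Path


def missing_weight_files(model_dir):
    """Return the expected weight files that are absent from model_dir."""
    expected = ("model.safetensors", "model_ori.safetensors")
    return [name for name in expected if not (Path(model_dir) / name).is_file()]


# Assumed location, matching the cd command in the setup steps.
model_dir = Path.home() / "models" / "llama-3.2-1b"
missing = missing_weight_files(model_dir)
print("missing:", ", ".join(missing) if missing else "none")
```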

Clone the v0.0.1 release of the repository:

cd ~
git clone -b v0.0.1 --single-branch https://github.com/dinghongsong/SVD-Flash.git

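Testing Example (Without Tensor Parallelism): Llama inference with a logit-matching accuracy check using custom error tolerances. Run the command from the root of the cloned SVD-Flash directory, where llama_inference.py lives.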

python llama_inference.py \
 --model-type llama \
 --task-type causal-lm \
 run \
 --model-path /home/ubuntu/models/llama-3.2-1b \
 --compiled-model-path /home/ubuntu/traced_model/llama-3.2-1b \
 --torch-dtype bfloat16 \
 --batch-size 1 \
 --max-context-length 32 \
 --seq-len 64 \
 --check-accuracy-mode logit-matching \
 --divergence-difference-tol 0.005 \
 --tol-map "{5: (1e-5, 0.02)}" \
 --enable-bucketing \
 --top-k 1 \
 --pad-token-id 2 \
 --prompt "I believe the meaning of life is" \
 --prompt "The color of the sky is" \
 --compress-ratio 0.8
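
The --tol-map value is quoted on the command line because it is written as a Python dictionary literal. The sketch below shows how such a string can be parsed and sanity-checked before a run. How the script itself parses the flag is not shown in this README, so using ast.literal_eval is an assumption, and parse_tol_map is a hypothetical helper. Reading the key as a top-k cutoff and the pair as (absolute tolerance, relative tolerance) is also an assumption to verify against the Neuron documentation.

```python
import ast


def parse_tol_map(text):
    """Parse a tolerance map string into a dict of key -> 2-tuple."""
    parsed = ast.literal_eval(text)  # safe: evaluates literals only, never code
    if not isinstance(parsed, dict):
        raise ValueError("tolerance map must be a dict literal")
    for key, value in parsed.items():
        if not (isinstance(value, tuple) and len(value) == 2):
            raise ValueError(f"entry {key!r} must map to a 2-tuple")
    return parsed


tol_map = parse_tol_map("{5: (1e-5, 0.02)}")
for key, (first_tol, second_tol) in tol_map.items():
    # Assumed meaning: key = top-k cutoff, values = (absolute, relative) tolerance.
    print(key, first_tol, second_tol)
```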

Output Example

------------------------------------------------------------------------------------------
model:  /home/ubuntu/models/llama-3.2-1b
{
 "e2e_model": {
     "latency_ms_p50": 1299.866795539856,
     "latency_ms_p90": 1301.309323310852,
     "latency_ms_p95": 1302.1685719490051,
     "latency_ms_p99": 1302.9363083839417,
     "latency_ms_p100": 1303.1282424926758,
     "latency_ms_avg": 1300.0563144683838,
     "throughput": 49.22863670422672
 },
 "context_encoding_model": {
     "latency_ms_p50": 70.28031349182129,
     "latency_ms_p90": 70.3099250793457,
     "latency_ms_p95": 70.31856775283813,
     "latency_ms_p99": 70.3455662727356,
     "latency_ms_p100": 70.35231590270996,
     "latency_ms_avg": 70.27335166931152,
     "throughput": 455.3646473357901
 },
 "token_generation_model": {
     "latency_ms_p50": 39.081573486328125,
     "latency_ms_p90": 39.14194107055664,
     "latency_ms_p95": 39.16501998901367,
     "latency_ms_p99": 39.197025299072266,
     "latency_ms_p100": 39.25800323486328,
     "latency_ms_avg": 39.088456092342255,
     "throughput": 26.40825879839129
 }
}
------------------------------------------------------------------------------------------
model:  /home/ubuntu/models/llama-3.2-1b/svd_llama
{
 "e2e_model": {
     "latency_ms_p50": 893.8226699829102,
     "latency_ms_p90": 894.6243762969971,
     "latency_ms_p95": 894.8212623596191,
     "latency_ms_p99": 895.5112648010254,
     "latency_ms_p100": 895.683765411377,
     "latency_ms_avg": 893.8456416130066,
     "throughput": 71.60072950012662
 },
 "context_encoding_model": {
     "latency_ms_p50": 66.6283369064331,
     "latency_ms_p90": 66.76206588745117,
     "latency_ms_p95": 66.76559448242188,
     "latency_ms_p99": 66.77356719970703,
     "latency_ms_p100": 66.77556037902832,
     "latency_ms_avg": 66.64743423461914,
     "throughput": 480.1385134700057
 },
 "token_generation_model": {
     "latency_ms_p50": 26.091694831848145,
     "latency_ms_p90": 26.137852668762207,
     "latency_ms_p95": 26.164603233337402,
     "latency_ms_p99": 26.198863983154297,
     "latency_ms_p100": 26.267528533935547,
     "latency_ms_avg": 26.096261316730132,
     "throughput": 39.55578356560814
 }
}
e2e_model time wo svd:  1300.0563144683838
e2e_model time with svd:  893.8456416130066
E2E Speedup:  1.4544528204247231
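
The reported E2E speedup matches the ratio of the two latency_ms_avg values for e2e_model. The sketch below recomputes it from the numbers printed above and applies the same ratio to the other two stages; the dictionaries and the speedup helper are illustrative and not part of the repository.

```python
# Average latencies (ms) copied from the output above.
baseline_avg_ms = {
    "e2e_model": 1300.0563144683838,
    "context_encoding_model": 70.27335166931152,
    "token_generation_model": 39.088456092342255,
}
svd_avg_ms = {
    "e2e_model": 893.8456416130066,
    "context_encoding_model": 66.64743423461914,
    "token_generation_model": 26.096261316730132,
}


def speedup(baseline_ms, svd_ms):
    """Speedup is baseline latency divided by SVD latency."""
    return baseline_ms / svd_ms


speedups = {
    stage: speedup(baseline_avg_ms[stage], svd_avg_ms[stage])
    for stage in baseline_avg_ms
}
for stage, ratio in speedups.items():
    print(f"{stage}: {ratio:.3f}x")
```

On these numbers, token generation speeds up by about 1.50x and context encoding by about 1.05x, so most of the 1.45x end-to-end gain comes from the token generation stage.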
