I was able to run float16 and int4 TRT-LLM benchmarks with Mistral 7B on an L4 GPU (GCP). The reported performance is 40.96 ± 0.37 t/s in float16 and 166.02 ± 0.52 t/s in int4, which is significantly faster than both exllamav2 and vLLM with batch size 1 on Llama 3 8B (also int4), and also about 2x higher than is theoretically possible given the available memory bandwidth.
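For reference, here is a rough bandwidth-bound estimate for batch-size-1 decoding. The numbers are my assumptions, not measurements: roughly 300 GB/s of memory bandwidth for the L4 and about 7.2B parameters for Mistral 7B.

```python
# Rough roofline estimate: with batch size 1, every generated token has to
# stream all weights from GPU memory at least once, so
#   tokens/s <= memory_bandwidth / weight_bytes
# The constants below are approximate assumptions, not measured values.
L4_BANDWIDTH_GB_S = 300   # NVIDIA L4 peak memory bandwidth, approx.
PARAMS_B = 7.2            # Mistral 7B parameter count, in billions

for name, bytes_per_param in [("float16", 2.0), ("int4", 0.5)]:
    weight_gb = PARAMS_B * bytes_per_param
    upper_bound = L4_BANDWIDTH_GB_S / weight_gb
    print(f"{name}: ~{weight_gb:.1f} GB of weights -> <= {upper_bound:.0f} t/s")
```

Under these assumptions the bound is roughly 21 t/s for float16 and roughly 83 t/s for int4, i.e. the reported numbers are about 2x above it in both cases.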
I did some debugging and believe the reported results are incorrect in terms of the number of generated tokens. E.g., after this line:
benchmarks/bench_tensorrtllm/bench.py, line 101 in 0710037:

output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]
if I then compute num_output_tokens as output_tokens.index(2) (which is obviously not a general solution, but works for now for Mistral), I get values that are much closer to vLLM, and the same generation speed in the speed test as in the subsequent quality test.
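A minimal sketch of that workaround, assuming EOS token id 2 for Mistral and reusing the output_tokens / output_ids / num_input_tokens names from the snippet above:

```python
# Count only the tokens generated before the first EOS (id 2 for Mistral).
# This is a hack, not a general fix: it hard-codes the EOS id and assumes
# the rest of the output after EOS is padding that should not be counted.
EOS_TOKEN_ID = 2

output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]
if EOS_TOKEN_ID in output_tokens:
    num_output_tokens = output_tokens.index(EOS_TOKEN_ID)
else:
    num_output_tokens = len(output_tokens)
```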