Incorrect results reported for TensorRT-LLM #187

@lopuhin

Description

I was able to run the float16 and int4 trt-llm benchmarks with Mistral 7B on an L4 GPU (GCP). The reported performance is 40.96 ± 0.37 t/s in float16 and 166.02 ± 0.52 t/s in int4. The int4 figure is significantly faster than both exllamav2 and vllm at batch size 1 on Llama 3 8B (also int4), and is also about 2x higher than is theoretically possible given the available memory bandwidth.

I did some debugging and believe the reported results are incorrect because the number of generated tokens is overcounted. For example, after this line

output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]
the output_tokens variable contains only the value 2 (presumably the EOS token id) after a certain point. I suspect that generation actually stops once the first 2 is emitted and that the trailing 2s are never really generated, so they should not be counted. This also matches what happens in the second stage where quality is checked: speed is not measured there, but if it were measured with the same method used in this repo, it would come out even higher. If I instead compute num_output_tokens as output_tokens.index(2) (obviously not a general solution, but it works for Mistral for now), I get values much closer to vllm, and the generation speed in the speed test matches the speed in the subsequent quality test.
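For reference, a minimal sketch of the correction described above, assuming the EOS token id is 2 (as for Mistral) and a hypothetical generation_time_s variable holding the measured wall-clock generation time; the actual benchmark code may structure this differently:

```python
# Sketch of the proposed fix: count only tokens generated before the first EOS.
# Assumes EOS id 2 (Mistral); a general solution would read it from the tokenizer/config.
output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]

eos_token_id = 2
try:
    # Tokens after the first EOS are filler in the fixed-length output buffer,
    # not real generations, so they must be excluded from the throughput count.
    num_output_tokens = output_tokens.index(eos_token_id)
except ValueError:
    # No EOS found: the full requested length was actually generated.
    num_output_tokens = len(output_tokens)

tokens_per_second = num_output_tokens / generation_time_s  # generation_time_s: hypothetical timer value
```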
