Incorrect results reported for TensorRT-LLM #187

@lopuhin

Description

I was able to run the float16 and int4 trt-llm benchmarks with Mistral 7B on an L4 GPU (GCP). The reported performance is 40.96 ± 0.37 t/s in float16 and 166.02 ± 0.52 t/s in int4. The int4 figure is significantly faster than both exllamav2 and vllm at batch size 1 on Llama 3 8B (also int4), and is also about 2x higher than is theoretically possible given the available memory bandwidth.

I did some debugging and believe the reported results are incorrect because the number of generated tokens is overcounted. For example, after this line

output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]
the output_tokens variable contains only the value 2 (presumably the EOS token id) after a certain point. I suspect that generation actually stops once the first 2 is emitted and that the trailing 2s are never really generated, so they should not be counted. This also matches what happens in the second stage where quality is checked: speed is not measured there, but if it were measured with the same method used in this repo, it would come out even higher. If I instead compute num_output_tokens as output_tokens.index(2) (obviously not a general solution, but it works for Mistral for now), I get values much closer to vllm, and the generation speed in the speed test matches the speed in the subsequent quality test.
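For reference, a minimal sketch of the correction described above, assuming the EOS token id is 2 (as for Mistral) and a hypothetical generation_time_s variable holding the measured wall-clock generation time; the actual benchmark code may structure this differently:

```python
# Sketch of the proposed fix: count only tokens generated before the first EOS.
# Assumes EOS id 2 (Mistral); a general solution would read it from the tokenizer/config.
output_tokens = output_ids[0][0].detach().cpu().tolist()[num_input_tokens:]

eos_token_id = 2
try:
    # Tokens after the first EOS are filler in the fixed-length output buffer,
    # not real generations, so they must be excluded from the throughput count.
    num_output_tokens = output_tokens.index(eos_token_id)
except ValueError:
    # No EOS found: the full requested length was actually generated.
    num_output_tokens = len(output_tokens)

tokens_per_second = num_output_tokens / generation_time_s  # generation_time_s: hypothetical timer value
```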
