
Description
Hi team, thanks for the great work on llmperf — it's been super helpful for benchmarking LLM APIs.
I'm wondering if there are any plans to extend the correctness test framework to support multimodal models, i.e., models that accept both text and image inputs (such as Qwen2-VL-7B-Instruct or glm-4v-9b).
This would be especially useful for evaluating models on tasks like OCR, image-to-text, or visual question answering. A rough sketch of the kind of request I have in mind is below.
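To make the idea concrete, here is a minimal sketch of what a multimodal correctness check could look like, assuming an OpenAI-compatible vision endpoint (e.g., a vLLM deployment). The endpoint URL, model name, and helper names are hypothetical, not part of llmperf today:

```python
import base64

# Hypothetical deployment details -- substitute your own.
API_BASE = "http://localhost:8000/v1"
MODEL = "Qwen/Qwen2-VL-7B-Instruct"


def encode_image(path: str) -> str:
    """Read a local image and return it as a base64 data URL."""
    with open(path, "rb") as f:
        data = base64.b64encode(f.read()).decode("utf-8")
    return f"data:image/png;base64,{data}"


def build_ocr_correctness_request(image_path: str) -> dict:
    """Build an OpenAI-style chat completion payload that asks the model
    to transcribe the text in an image, so the response can later be
    compared against a known ground-truth string (an OCR-style check)."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Transcribe the text in this image."},
                    {"type": "image_url", "image_url": {"url": encode_image(image_path)}},
                ],
            }
        ],
        "max_tokens": 256,
    }


# Example usage (hypothetical image known to contain "hello world"):
# import requests
# resp = requests.post(f"{API_BASE}/chat/completions",
#                      json=build_ocr_correctness_request("hello.png"))
# answer = resp.json()["choices"][0]["message"]["content"]
# passed = "hello world" in answer.lower()
```

The correctness criterion could mirror the existing text-only tests: generate an image with known content, ask the model to read it back, and check the response against the ground truth.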
Would love to hear your thoughts!
Thanks 🙏