From 774f8011806d992eacf11b3661e92ccb968459a4 Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Tue, 12 Aug 2025 13:07:47 +0200
Subject: [PATCH 1/4] add leaderboards to docs

---
 docs/source/index.mdx | 51 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index 5b46bddb..bfe0b4c2 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -1,17 +1,52 @@


- +

+# Evaluate on the Hub
+
+You can evaluate AI models on the Hub in a multiple ways and this page will guide you through the different options:
+
+- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone.
+- **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
+- **Libaries and Packages** give you the tools to evaluate your models on the Hub.
+
+## Community Leaderboards
+
+Community leaderboard show how a model performs on a given task or domain. For example, their are leaderboards for question-answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.
+
+Here are some examples of community leaderboards:
+
+| Leaderboard | Task | Description |
+| --- | --- | --- |
+| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | MTEB leaderboard compares 100+ text and image embedding models across 1000+ languages. We refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help us improve zero-shot annotations or propose other changes to the leaderboard. |
+| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See our paper for more details.) |
+| [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
+| [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
+| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardwares, backends and optimizations using Optimum-Benhcmark. |
+
+There are tonnes more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.
+
+## Model Cards
+
+Model cards provide an overview of a model's capabilities evaluated by the model's author. They are a great way to understand a model's capabilities and limitations.
+
+![qwen-model-card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
+
+Unlike leaderboards, model card evaluation scores are not created openly by the community.
+
-For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
 For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).
+
+## Libaries and Packages
-# 🤗 Evaluate
+There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or with a custom evaluation task.
+
+### 🤗 Evaluate
 A library for easily evaluating machine learning models and datasets.
@@ -39,3 +74,13 @@ Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) f
+
+### LightEval
+
+LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.
+
+
+
+For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
+
+
\ No newline at end of file

From 587b8b366e90d5038d46c157a8d28ff4660c93af Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Tue, 12 Aug 2025 14:19:58 +0200
Subject: [PATCH 2/4] proof

---
 docs/source/index.mdx | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index bfe0b4c2..6fc5451c 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -1,20 +1,20 @@
+# Evaluate on the Hub
+


- + Evaluate on the Hub banner

-# Evaluate on the Hub
-
-You can evaluate AI models on the Hub in a multiple ways and this page will guide you through the different options:
+You can evaluate AI models on the Hub in multiple ways and this page will guide you through the different options:
 - **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone.
 - **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
-- **Libaries and Packages** give you the tools to evaluate your models on the Hub.
+- **Libraries and packages** give you the tools to evaluate your models on the Hub.
 ## Community Leaderboards
-Community leaderboard show how a model performs on a given task or domain. For example, their are leaderboards for question-answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.
+Community leaderboards show how a model performs on a given task or domain. For example, there are leaderboards for question answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.
 Here are some examples of community leaderboards:
@@ -24,15 +24,15 @@ Here are some examples of community leaderboards:
 | [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See our paper for more details.) |
 | [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
 | [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
-| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardwares, backends and optimizations using Optimum-Benhcmark. |
+| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark. |
-There are tonnes more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.
+There are many more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated Space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.
 ## Model Cards
 Model cards provide an overview of a model's capabilities evaluated by the model's author. They are a great way to understand a model's capabilities and limitations.
-![qwen-model-card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
+![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
 Unlike leaderboards, model card evaluation scores are not created openly by the community.
@@ -42,7 +42,7 @@ For information on reporting results, see details on [the Model Card Evaluation
-## Libaries and Packages
+## Libraries and packages
 There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or with a custom evaluation task.
@@ -50,7 +50,7 @@ There are a number of open-source libraries and packages that you can use to eva
 ### 🤗 Evaluate
 A library for easily evaluating machine learning models and datasets.
-With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!
+With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!
 Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) for a full list of available metrics. Each metric has a dedicated Space with an interactive demo for how to use the metric, and a documentation card detailing the metrics limitations and usage.

From 3136bf9bd3147ce50d724a9778d8520ec340824f Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Wed, 13 Aug 2025 14:49:50 +0200
Subject: [PATCH 3/4] change order of libraries

---
 docs/source/index.mdx | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index 6fc5451c..e3f319ba 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -46,6 +46,16 @@ There are a number of open-source libraries and packages that you can use to eva
+### LightEval
+
+LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.
+
+
+For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
+
+
+
 ### 🤗 Evaluate
 A library for easily evaluating machine learning models and datasets.
@@ -74,13 +84,3 @@ Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) f
-
-### LightEval
-
-LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.
-
-
-
-For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
-
-
\ No newline at end of file

From e3a88e8cddbfa0ab937c6dd44e2f76bca184cb8d Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 14 Aug 2025 10:13:48 +0200
Subject: [PATCH 4/4] respond to feedback

---
 docs/source/index.mdx | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index e3f319ba..720abcdc 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -8,9 +8,9 @@
 You can evaluate AI models on the Hub in multiple ways and this page will guide you through the different options:
-- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone.
+- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone by ranking them.
 - **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
-- **Libraries and packages** give you the tools to evaluate your models on the Hub.
+- **Libraries and Packages** give you the tools to evaluate your models on the Hub.
@@ -18,10 +18,10 @@ Community leaderboards show how a model performs on a given task or domain. For
 Here are some examples of community leaderboards:
-| Leaderboard | Task | Description |
+| Leaderboard | Model Type | Description |
-| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | MTEB leaderboard compares 100+ text and image embedding models across 1000+ languages. We refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help us improve zero-shot annotations or propose other changes to the leaderboard. |
-| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See our paper for more details.) |
+| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | The Massive Text Embedding Benchmark leaderboard compares 100+ text and image embedding models across 1000+ languages. Refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help improve zero-shot annotations, or propose other changes to the leaderboard. |
+| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See [the paper](https://arxiv.org/abs/2311.12983) for more details.) |
 | [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
 | [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
 | [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark. |
@@ -30,11 +30,11 @@ There are many more leaderboards on the Hub. Check out all the leaderboards via
 ## Model Cards
-Model cards provide an overview of a model's capabilities evaluated by the model's author. They are a great way to understand a model's capabilities and limitations.
+Model cards provide an overview of a model's capabilities evaluated by the community or the model's author. They are a great way to understand a model's capabilities and limitations.
 ![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
-Unlike leaderboards, model card evaluation scores are not created openly by the community.
+Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.
@@ -44,7 +44,7 @@ For information on reporting results, see details on [the Model Card Evaluation
 ## Libraries and packages
-There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or performance on a custom evaluation task.
+There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or performance on a custom evaluation task.
 ### LightEval
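For context, the docs added in these patches mention that 🤗 Evaluate gives you "a single line of code" access to evaluation methods. A minimal sketch of that usage is shown below; it is illustrative only and not part of the patch series above, and it assumes the `evaluate` package is installed (`pip install evaluate`) and uses the stock `accuracy` metric.

```python
import evaluate

# Load a metric from the Hub; each metric has a dedicated Space with an
# interactive demo and a documentation card describing usage and limitations.
accuracy = evaluate.load("accuracy")

# Compute the metric over predictions and references.
result = accuracy.compute(predictions=[0, 1, 0, 1], references=[0, 1, 1, 1])
print(result)  # {'accuracy': 0.75}
```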