From 774f8011806d992eacf11b3661e92ccb968459a4 Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Tue, 12 Aug 2025 13:07:47 +0200
Subject: [PATCH 1/4] add leaderboards to docs

---
 docs/source/index.mdx | 51 ++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 48 insertions(+), 3 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index 5b46bddb..bfe0b4c2 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -1,17 +1,52 @@


- +

+# Evaluate on the Hub
+
+You can evaluate AI models on the Hub in a multiple ways and this page will guide you through the different options:
+
+- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone.
+- **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
+- **Libaries and Packages** give you the tools to evaluate your models on the Hub.
+
+## Community Leaderboards
+
+Community leaderboard show how a model performs on a given task or domain. For example, their are leaderboards for question-answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.
+
+Here are some examples of community leaderboards:
+
+| Leaderboard | Task | Description |
+| --- | --- | --- |
+| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | MTEB leaderboard compares 100+ text and image embedding models across 1000+ languages. We refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help us improve zero-shot annotations or propose other changes to the leaderboard. |
+| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See our paper for more details.) |
+| [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
+| [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
+| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardwares, backends and optimizations using Optimum-Benhcmark. |
+
+There are tonnes more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.
+
+## Model Cards
+
+Model cards provide an overview of a model's capabilities evaluated by the model's author. They are a great way to understand a model's capabilities and limitations.
+
+![qwen-model-card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
+
+Unlike leaderboards, model card evaluation scores are not created openly by the community.
+
-For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
 For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).
+
+## Libaries and Packages
-# 🤗 Evaluate
+There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or with a custom evaluation task.
+
+### 🤗 Evaluate
 A library for easily evaluating machine learning models and datasets.
@@ -39,3 +74,13 @@ Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) f
+
+### LightEval
+
+LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.
+
+
+
+For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
+
+
\ No newline at end of file

From 587b8b366e90d5038d46c157a8d28ff4660c93af Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Tue, 12 Aug 2025 14:19:58 +0200
Subject: [PATCH 2/4] proof

---
 docs/source/index.mdx | 22 +++++++++++-----------
 1 file changed, 11 insertions(+), 11 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index bfe0b4c2..6fc5451c 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -1,20 +1,20 @@
+# Evaluate on the Hub
+


- + Evaluate on the Hub banner

-# Evaluate on the Hub
-
-You can evaluate AI models on the Hub in a multiple ways and this page will guide you through the different options:
+You can evaluate AI models on the Hub in multiple ways and this page will guide you through the different options:
 - **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone.
 - **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
-- **Libaries and Packages** give you the tools to evaluate your models on the Hub.
+- **Libraries and packages** give you the tools to evaluate your models on the Hub.
 ## Community Leaderboards
-Community leaderboard show how a model performs on a given task or domain. For example, their are leaderboards for question-answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.
+Community leaderboards show how a model performs on a given task or domain. For example, there are leaderboards for question answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.
 Here are some examples of community leaderboards:
@@ -24,15 +24,15 @@ Here are some examples of community leaderboards:
 | [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See our paper for more details.) |
 | [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
 | [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
-| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardwares, backends and optimizations using Optimum-Benhcmark. |
+| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark. |
-There are tonnes more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.
+There are many more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated Space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.
 ## Model Cards
 Model cards provide an overview of a model's capabilities evaluated by the model's author. They are a great way to understand a model's capabilities and limitations.
-![qwen-model-card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
+![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
 Unlike leaderboards, model card evaluation scores are not created openly by the community.
@@ -42,7 +42,7 @@ For information on reporting results, see details on [the Model Card Evaluation
-## Libaries and Packages
+## Libraries and packages
 There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or with a custom evaluation task.
@@ -50,7 +50,7 @@ There are a number of open-source libraries and packages that you can use to eva
 ### 🤗 Evaluate
 A library for easily evaluating machine learning models and datasets.
-With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!
+With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!
 Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) for a full list of available metrics. Each metric has a dedicated Space with an interactive demo for how to use the metric, and a documentation card detailing the metrics limitations and usage.

From 3136bf9bd3147ce50d724a9778d8520ec340824f Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Wed, 13 Aug 2025 14:49:50 +0200
Subject: [PATCH 3/4] change order of libraries

---
 docs/source/index.mdx | 20 ++++++++++----------
 1 file changed, 10 insertions(+), 10 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index 6fc5451c..e3f319ba 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -46,6 +46,16 @@ There are a number of open-source libraries and packages that you can use to eva
+### LightEval
+
+LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.
+
+
+For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
+
+
+
 ### 🤗 Evaluate
 A library for easily evaluating machine learning models and datasets.
@@ -74,13 +84,3 @@ Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) f
-
-### LightEval
-
-LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.
-
-
-
-For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
-
-
\ No newline at end of file

From e3a88e8cddbfa0ab937c6dd44e2f76bca184cb8d Mon Sep 17 00:00:00 2001
From: burtenshaw
Date: Thu, 14 Aug 2025 10:13:48 +0200
Subject: [PATCH 4/4] respond to feedback

---
 docs/source/index.mdx | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index e3f319ba..720abcdc 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -8,9 +8,9 @@
 You can evaluate AI models on the Hub in multiple ways and this page will guide you through the different options:
-- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone.
+- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone by ranking them.
 - **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
-- **Libraries and packages** give you the tools to evaluate your models on the Hub.
+- **Libraries and Packages** give you the tools to evaluate your models on the Hub.
@@ -18,10 +18,10 @@ Community leaderboards show how a model performs on a given task or domain. For
 Here are some examples of community leaderboards:
-| Leaderboard | Task | Description |
+| Leaderboard | Model Type | Description |
-| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | MTEB leaderboard compares 100+ text and image embedding models across 1000+ languages. We refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help us improve zero-shot annotations or propose other changes to the leaderboard. |
-| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See our paper for more details.) |
+| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | The Massive Text Embedding Benchmark leaderboard compares 100+ text and image embedding models across 1000+ languages. Refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help improve zero-shot annotations, or propose other changes to the leaderboard. |
+| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark which aims at evaluating next-generation LLMs (LLMs with augmented capabilities due to added tooling, efficient prompting, access to search, etc). (See [the paper](https://arxiv.org/abs/2311.12983) for more details.) |
 | [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
 | [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
 | [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ is a leaderboard at the intersection of quality and performance. Its aim is to benchmark the performance (latency, throughput, memory & energy) of Large Language Models (LLMs) with different hardware, backends and optimizations using Optimum-Benchmark. |
@@ -30,11 +30,11 @@ There are many more leaderboards on the Hub. Check out all the leaderboards via
 ## Model Cards
-Model cards provide an overview of a model's capabilities evaluated by the model's author. They are a great way to understand a model's capabilities and limitations.
+Model cards provide an overview of a model's capabilities evaluated by the community or the model's author. They are a great way to understand a model's capabilities and limitations.
 ![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)
-Unlike leaderboards, model card evaluation scores are not created openly by the community.
+Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.
@@ -44,7 +44,7 @@ For information on reporting results, see details on [the Model Card Evaluation
 ## Libraries and packages
-There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or performance on a custom evaluation task.
+There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or performance on a custom evaluation task.
 ### LightEval
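For context, the docs added in these patches mention that 🤗 Evaluate gives you "a single line of code" access to evaluation methods. A minimal sketch of that usage is shown below; it is illustrative only and not part of the patch series above, and it assumes the `evaluate` package is installed (`pip install evaluate`) and uses the stock `accuracy` metric.

```python
import evaluate

# Load a metric from the Hub; each metric has a dedicated Space with an
# interactive demo and a documentation card describing usage and limitations.
accuracy = evaluate.load("accuracy")

# Compute the metric over predictions and references.
result = accuracy.compute(predictions=[0, 1, 0, 1], references=[0, 1, 1, 1])
print(result)  # {'accuracy': 0.75}
```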