53 changes: 49 additions & 4 deletions docs/source/index.mdx
@@ -1,21 +1,66 @@
# Evaluate on the Hub

<p align="center">
<br>
<img src="https://huggingface.co/datasets/evaluate/media/resolve/main/evaluate-banner.png" width="400"/>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/evaluate-on-hub-banner.png" alt="Evaluate on the Hub banner" width="400"/>
<br>
</p>

You can evaluate AI models on the Hub in multiple ways, and this page will guide you through the different options:

- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone by ranking them.
- **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
- **Libraries and Packages** give you the tools to evaluate your models on the Hub.

## Community Leaderboards

Community leaderboards show how a model performs on a given task or domain. For example, there are leaderboards for question answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.

Here are some examples of community leaderboards:

| Leaderboard | Model Type | Description |
| --- | --- | --- |
| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | The Massive Text Embedding Benchmark leaderboard compares 100+ text and image embedding models across 1000+ languages. Refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help improve zero-shot annotations, or propose other changes to the leaderboard. |
| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities such as added tooling, efficient prompting, and access to search). See [the paper](https://arxiv.org/abs/2311.12983) for more details. |
| [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
| [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ sits at the intersection of quality and performance. It benchmarks the performance (latency, throughput, memory, and energy) of Large Language Models (LLMs) across different hardware, backends, and optimizations using Optimum-Benchmark. |

There are many more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated Space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.

## Model Cards

Model cards provide an overview of a model's capabilities evaluated by the community or the model's author. They are a great way to understand a model's capabilities and limitations.

![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)

Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.

<Tip>
For more recent, actively maintained evaluation approaches that are popular on the Hugging Face Hub, check out [LightEval](https://github.com/huggingface/lighteval).

For information on reporting results, see [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results); a short sketch of updating this metadata programmatically follows this tip.

</Tip>
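
If you want to report results for your own model, the sketch below shows one way to update a model card's evaluation metadata programmatically. It assumes the `metadata_eval_result` and `metadata_update` helpers from the `huggingface_hub` library; the repository ID, dataset, and score are hypothetical placeholders.

```python
# A minimal sketch of adding one evaluation result to a model card's metadata.
# The repo ID, dataset, and score are hypothetical placeholders; `metadata_eval_result`
# and `metadata_update` are helpers from the `huggingface_hub` library.
from huggingface_hub import metadata_eval_result, metadata_update

# Build a `model-index`-style metadata block describing a single evaluation result.
results = metadata_eval_result(
    model_pretty_name="My fine-tuned classifier",  # hypothetical display name
    task_pretty_name="Text Classification",
    task_id="text-classification",
    metrics_pretty_name="Accuracy",
    metrics_id="accuracy",
    metrics_value=0.91,  # hypothetical score
    dataset_pretty_name="IMDb",
    dataset_id="imdb",
)

# Merge the evaluation results into the model card metadata on the Hub.
metadata_update("username/my-model", results)  # hypothetical repo ID
```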

## Libraries and packages

There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or measure performance on a custom evaluation task.

### LightEval

LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.

<Tip>

Of the evaluation libraries described on this page, [LightEval](https://github.com/huggingface/lighteval) is the more recent and more actively maintained option.

</Tip>

### 🤗 Evaluate

A library for easily evaluating machine learning models and datasets.

With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!

Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) for a full list of available metrics. Each metric has a dedicated Space with an interactive demo for how to use the metric, and a documentation card detailing the metric's limitations and usage.
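
As an illustration of the single-line workflow, the sketch below loads the `accuracy` metric and computes it over toy predictions and references; any other metric from the organization above can be loaded the same way by name.

```python
import evaluate

# Load a metric by name from the Hub ("accuracy" is used here as a toy example).
accuracy = evaluate.load("accuracy")

# Compute the metric over toy predictions and references.
results = accuracy.compute(predictions=[0, 1, 0, 0], references=[0, 1, 1, 0])
print(results)  # {'accuracy': 0.75}
```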
