53 changes: 49 additions & 4 deletions docs/source/index.mdx
@@ -1,21 +1,66 @@
# Evaluate on the Hub

<p align="center">
<br>
<img src="https://huggingface.co/datasets/evaluate/media/resolve/main/evaluate-banner.png" width="400"/>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/evaluate-on-hub-banner.png" alt="Evaluate on the Hub banner" width="400"/>
<br>
</p>

You can evaluate AI models on the Hub in multiple ways, and this page will guide you through the different options:

- **Community Leaderboards** bring together the best models for a given task or domain and make them accessible to everyone by ranking them.
- **Model Cards** provide a comprehensive overview of a model's capabilities from the author's perspective.
- **Libraries and Packages** give you the tools to evaluate your models on the Hub.

## Community Leaderboards

Community leaderboards show how a model performs on a given task or domain. For example, there are leaderboards for question answering, reasoning, classification, vision, and audio. If you're tackling a new task, you can use a leaderboard to see how a model performs on it.

Here are some examples of community leaderboards:

| Leaderboard | Model Type | Description |
| --- | --- | --- |
| [MTEB](https://huggingface.co/spaces/mteb/leaderboard)| Embedding | The Massive Text Embedding Benchmark leaderboard compares 100+ text and image embedding models across 1000+ languages. Refer to the publication of each selectable benchmark for details on metrics, languages, tasks, and task types. Anyone is welcome to add a model, add benchmarks, help improve zero-shot annotations, or propose other changes to the leaderboard. |
| [GAIA](https://huggingface.co/spaces/gaia-benchmark/leaderboard)| Agentic | GAIA is a benchmark that aims to evaluate next-generation LLMs (LLMs with augmented capabilities such as added tooling, efficient prompting, and access to search). See [the paper](https://arxiv.org/abs/2311.12983) for more details. |
| [OpenVLM Leaderboard](https://huggingface.co/spaces/opencompass/open_vlm_leaderboard)| Vision Language Models | The OpenVLM Leaderboard evaluates 272+ Vision-Language Models (including GPT-4v, Gemini, QwenVLPlus, LLaVA) across 31 different multi-modal benchmarks using the VLMEvalKit framework. It focuses on open-source VLMs and publicly available API models. |
| [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard)| Audio | The Open ASR Leaderboard ranks and evaluates speech recognition models on the Hugging Face Hub. Models are ranked based on their Average WER, from lowest to highest. |
| [LLM-Perf Leaderboard](https://huggingface.co/spaces/llm-perf/leaderboard)| LLM Performance | The 🤗 LLM-Perf Leaderboard 🏋️ sits at the intersection of quality and performance. It benchmarks the performance (latency, throughput, memory, and energy) of Large Language Models (LLMs) across different hardware, backends, and optimizations using Optimum-Benchmark. |

There are many more leaderboards on the Hub. Check out all the leaderboards via this [search](https://huggingface.co/spaces?category=model-benchmarking) or use this [dedicated Space](https://huggingface.co/spaces/OpenEvals/find-a-leaderboard) to find a leaderboard for your task.

## Model Cards

Model cards provide an overview of a model's capabilities evaluated by the community or the model's author. They are a great way to understand a model's capabilities and limitations.

![Qwen model card](https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/evaluate-docs/qwen-model-card.png)

Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.

<Tip>
For more recent, actively maintained evaluation approaches that are popular on the Hugging Face Hub, check out [LightEval](https://github.com/huggingface/lighteval).

For information on reporting results, see [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results); a short sketch of updating this metadata programmatically follows this tip.

</Tip>
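
If you want to report results for your own model, the sketch below shows one way to update a model card's evaluation metadata programmatically. It assumes the `metadata_eval_result` and `metadata_update` helpers from the `huggingface_hub` library; the repository ID, dataset, and score are hypothetical placeholders.

```python
# A minimal sketch of adding one evaluation result to a model card's metadata.
# The repo ID, dataset, and score are hypothetical placeholders; `metadata_eval_result`
# and `metadata_update` are helpers from the `huggingface_hub` library.
from huggingface_hub import metadata_eval_result, metadata_update

# Build a `model-index`-style metadata block describing a single evaluation result.
results = metadata_eval_result(
    model_pretty_name="My fine-tuned classifier",  # hypothetical display name
    task_pretty_name="Text Classification",
    task_id="text-classification",
    metrics_pretty_name="Accuracy",
    metrics_id="accuracy",
    metrics_value=0.91,  # hypothetical score
    dataset_pretty_name="IMDb",
    dataset_id="imdb",
)

# Merge the evaluation results into the model card metadata on the Hub.
metadata_update("username/my-model", results)  # hypothetical repo ID
```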

## Libraries and packages

There are a number of open-source libraries and packages that you can use to evaluate your models on the Hub. These are useful if you want to evaluate a custom model or measure performance on a custom evaluation task.

### LightEval

LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.

<Tip>

Of the evaluation libraries described on this page, [LightEval](https://github.com/huggingface/lighteval) is the more recent and more actively maintained option.

</Tip>

### 🤗 Evaluate

A library for easily evaluating machine learning models and datasets.

With a single line of code, you get access to dozens of evaluation methods for different domains (NLP, Computer Vision, Reinforcement Learning, and more!). Be it on your local machine or in a distributed training setup, you can evaluate your models in a consistent and reproducible way!

Visit the 🤗 Evaluate [organization](https://huggingface.co/evaluate-metric) for a full list of available metrics. Each metric has a dedicated Space with an interactive demo for how to use the metric, and a documentation card detailing the metric's limitations and usage.
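
As an illustration of the single-line workflow, the sketch below loads the `accuracy` metric and computes it over toy predictions and references; any other metric from the organization above can be loaded the same way by name.

```python
import evaluate

# Load a metric by name from the Hub ("accuracy" is used here as a toy example).
accuracy = evaluate.load("accuracy")

# Compute the metric over toy predictions and references.
results = accuracy.compute(predictions=[0, 1, 0, 0], references=[0, 1, 1, 0])
print(results)  # {'accuracy': 0.75}
```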
