7 changes: 2 additions & 5 deletions docs/source/a_quick_tour.mdx
@@ -113,11 +113,8 @@ Before we can apply a metric or other evaluation module to a use-case, we need t
}
```

<Tip>

Note that features always describe the type of a single input element. In general we add lists of elements, so you can think of the types in `features` as being wrapped in a list. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.

</Tip>
> [!TIP]
> Note that features always describe the type of a single input element. In general we add lists of elements, so you can think of the types in `features` as being wrapped in a list. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.
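
For instance (a minimal sketch; the `accuracy` module and the toy labels below are just illustrative), the same inputs can be passed as plain Python lists or as NumPy arrays:

```python
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

# Plain Python lists ...
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))

# ... or NumPy arrays give the same result.
print(accuracy.compute(references=np.array([0, 1, 1, 0]), predictions=np.array([0, 1, 0, 0])))
```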

## Compute

14 changes: 4 additions & 10 deletions docs/source/base_evaluator.mdx
@@ -68,11 +68,8 @@ eval_results = task_evaluator.compute(
)
print(eval_results)
```
<Tip>

Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and otherwise the CPU. If you want to use a specific device you can pass `device` to `compute`, where -1 will use the CPU and a positive integer (starting with 0) will use the associated CUDA device.

</Tip>
> [!TIP]
> Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and otherwise the CPU. If you want to use a specific device you can pass `device` to `compute`, where -1 will use the CPU and a positive integer (starting with 0) will use the associated CUDA device.
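
For example, a sketch that reuses the setup from the snippet above and simply pins inference to the first CUDA device (the model name and label mapping are placeholders, not requirements):

```python
from evaluate import evaluator
from datasets import load_dataset

task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test[:100]")

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",  # placeholder model
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # assumed label names for this model
    device=0,  # first CUDA device; pass -1 to force CPU
)
```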


The results will look as follows:
@@ -87,11 +84,8 @@ The results will look as follows:

Note that evaluation results include both the requested metric and information about the time it took to obtain predictions through the pipeline.

<Tip>

The timing figures can give a useful indication of model speed for inference, but they should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which may differ depending on the model. Furthermore, they depend a lot on the hardware you are running the evaluation on, and you may be able to improve the performance by optimizing things like the batch size.

</Tip>
> [!TIP]
> The timing figures can give a useful indication of model speed for inference, but they should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which may differ depending on the model. Furthermore, they depend a lot on the hardware you are running the evaluation on, and you may be able to improve the performance by optimizing things like the batch size.
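
One knob worth trying (a sketch; the batch size and model name are arbitrary choices) is to build the pipeline yourself with batching enabled and hand it to the evaluator:

```python
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

# Batching pipeline calls can noticeably speed up GPU inference.
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0, batch_size=32)

task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test[:100]")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # assumed label names
)
```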

### Evaluate multiple metrics

7 changes: 3 additions & 4 deletions docs/source/choosing_a_metric.mdx
@@ -43,10 +43,9 @@ You can find the right metric for your task by:

Some datasets have specific metrics associated with them -- this is especially the case for popular benchmarks like [GLUE](https://huggingface.co/metrics/glue) and [SQuAD](https://huggingface.co/metrics/squad).

<Tip warning={true}>
💡
GLUE is actually a collection of different subsets for different tasks, so first you need to choose the one that corresponds to the NLI task, such as `mnli`, which is described as a “crowdsourced collection of sentence pairs with textual entailment annotations”.
</Tip>
> [!WARNING]
> 💡
> GLUE is actually a collection of different subsets for different tasks, so first you need to choose the one that corresponds to the NLI task, such as `mnli`, which is described as a “crowdsourced collection of sentence pairs with textual entailment annotations”.
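
For instance, loading the GLUE metric with the `mnli` configuration (a minimal sketch; the prediction and reference values are made up) looks like:

```python
import evaluate

# The second argument selects the GLUE subset; mnli reports accuracy.
mnli_metric = evaluate.load("glue", "mnli")
results = mnli_metric.compute(predictions=[0, 1, 2], references=[0, 1, 1])
print(results)  # e.g. {'accuracy': 0.666...}
```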


If you are evaluating your model on a benchmark dataset like the ones mentioned above, you can use its dedicated evaluation metric. Make sure you respect the format that it requires. For example, to evaluate your model on the [SQuAD](https://huggingface.co/datasets/squad) dataset, you need to feed the `question` and `context` into your model and return the `prediction_text`, which should be compared with the `references` (based on matching the `id` of the question):
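
A minimal sketch of that format with the `squad` metric (the id and answer text below are made up for illustration):

```python
import evaluate

squad_metric = evaluate.load("squad")

# Predictions and references are matched on the question "id".
predictions = [{"id": "56be4db0acb8001400a502ec", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "56be4db0acb8001400a502ec",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]
print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0}
```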
5 changes: 2 additions & 3 deletions docs/source/considerations.mdx
@@ -18,9 +18,8 @@ Some datasets on the 🤗 Hub are already separated into these three splits. How

If the dataset you're using doesn't have a predefined train-test split, it is up to you to define which part of the dataset you want to use for training your model and which you want to use for hyperparameter tuning or final evaluation.

<Tip warning={true}>
Training and evaluating on the same split can misrepresent your results! If you overfit on your training data, the evaluation results on that split will look great, but the model will perform poorly on new data.
</Tip>
> [!WARNING]
> Training and evaluating on the same split can misrepresent your results! If you overfit on your training data, the evaluation results on that split will look great, but the model will perform poorly on new data.

Depending on the size of the dataset, you can keep anywhere from 10-30% for evaluation and the rest for training, while aiming to set up the test set to reflect the production data as closely as possible. Check out [this thread](https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090) for a more in-depth discussion of dataset splitting!
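
A quick sketch of carving out such a split with 🤗 Datasets (the dataset name and the 20% test size are arbitrary choices):

```python
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

# Hold out 20% of the data for evaluation; the rest stays for training.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```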

14 changes: 4 additions & 10 deletions docs/source/index.mdx
@@ -36,11 +36,8 @@ Model cards provide an overview of a model's capabilities evaluated by the commu

Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.

<Tip>

For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).

</Tip>
> [!TIP]
> For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).

## Libraries and packages

@@ -50,11 +47,8 @@ There are a number of open-source libraries and packages that you can use to eva

LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.

<Tip>

For more recent evaluation approaches on the Hugging Face Hub that are more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).

</Tip>
> [!TIP]
> For more recent evaluation approaches on the Hugging Face Hub that are more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).

### 🤗 Evaluate

16 changes: 5 additions & 11 deletions src/evaluate/evaluator/audio_classification.py
@@ -30,11 +30,8 @@
TASK_DOCUMENTATION = r"""
Examples:

<Tip>

Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

</Tip>
> [!TIP]
> Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

```python
>>> from evaluate import evaluator
@@ -52,12 +49,9 @@
>>> )
```

<Tip>

The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

</Tip>
> [!TIP]
> The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
> the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

```python
>>> from evaluate import evaluator
9 changes: 3 additions & 6 deletions src/evaluate/evaluator/question_answering.py
@@ -53,12 +53,9 @@
>>> )
```

<Tip>

Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
the `compute()` call.

</Tip>
> [!TIP]
> Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
> the `compute()` call.

```python
>>> from evaluate import evaluator
76 changes: 35 additions & 41 deletions src/evaluate/evaluator/token_classification.py
@@ -43,47 +43,41 @@
>>> )
```

<Tip>

For example, the following dataset format is accepted by the evaluator:

```python
dataset = Dataset.from_dict(
mapping={
"tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
"ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
},
features=Features({
"tokens": Sequence(feature=Value(dtype="string")),
"ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
}),
)
```

</Tip>

<Tip warning={true}>

For example, the following dataset format is **not** accepted by the evaluator:

```python
dataset = Dataset.from_dict(
mapping={
"tokens": [["New York is a city and Felix a person."]],
"starts": [[0, 23]],
"ends": [[7, 27]],
"ner_tags": [["LOC", "PER"]],
},
features=Features({
"tokens": Value(dtype="string"),
"starts": Sequence(feature=Value(dtype="int32")),
"ends": Sequence(feature=Value(dtype="int32")),
"ner_tags": Sequence(feature=Value(dtype="string")),
}),
)
```

</Tip>
> [!TIP]
> For example, the following dataset format is accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
> "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
> },
> features=Features({
> "tokens": Sequence(feature=Value(dtype="string")),
> "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
> }),
> )
> ```

> [!WARNING]
> For example, the following dataset format is **not** accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New York is a city and Felix a person."]],
> "starts": [[0, 23]],
> "ends": [[7, 27]],
> "ner_tags": [["LOC", "PER"]],
> },
> features=Features({
> "tokens": Value(dtype="string"),
> "starts": Sequence(feature=Value(dtype="int32")),
> "ends": Sequence(feature=Value(dtype="int32")),
> "ner_tags": Sequence(feature=Value(dtype="string")),
> }),
> )
> ```
"""


17 changes: 7 additions & 10 deletions src/evaluate/utils/logging.py
@@ -87,16 +87,13 @@ def get_verbosity() -> int:
Returns:
Logging level, e.g., `evaluate.logging.DEBUG` and `evaluate.logging.INFO`.

<Tip>

The Hugging Face Evaluate library has the following logging levels:
- `evaluate.logging.CRITICAL`, `evaluate.logging.FATAL`
- `evaluate.logging.ERROR`
- `evaluate.logging.WARNING`, `evaluate.logging.WARN`
- `evaluate.logging.INFO`
- `evaluate.logging.DEBUG`

</Tip>
> [!TIP]
> The Hugging Face Evaluate library has the following logging levels:
> - `evaluate.logging.CRITICAL`, `evaluate.logging.FATAL`
> - `evaluate.logging.ERROR`
> - `evaluate.logging.WARNING`, `evaluate.logging.WARN`
> - `evaluate.logging.INFO`
> - `evaluate.logging.DEBUG`
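
Example (a minimal usage sketch):

```py
>>> import evaluate
>>> verbosity = evaluate.logging.get_verbosity()  # e.g. `evaluate.logging.WARNING` by default
```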
"""
return _get_library_root_logger().getEffectiveLevel()
