7 changes: 2 additions & 5 deletions docs/source/a_quick_tour.mdx
@@ -113,11 +113,8 @@ Before we can apply a metric or other evaluation module to a use-case, we need t
}
```

<Tip>

Note that features always describe the type of a single input element. In general we add lists of elements, so you can think of the types in `features` as being wrapped in a list. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.

</Tip>
> [!TIP]
> Note that features always describe the type of a single input element. In general we add lists of elements, so you can think of the types in `features` as being wrapped in a list. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.
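
For instance (a minimal sketch; the `accuracy` module and the toy labels below are just illustrative), the same inputs can be passed as plain Python lists or as NumPy arrays:

```python
import evaluate
import numpy as np

accuracy = evaluate.load("accuracy")

# Plain Python lists ...
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))

# ... or NumPy arrays give the same result.
print(accuracy.compute(references=np.array([0, 1, 1, 0]), predictions=np.array([0, 1, 0, 0])))
```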

## Compute

14 changes: 4 additions & 10 deletions docs/source/base_evaluator.mdx
@@ -68,11 +68,8 @@ eval_results = task_evaluator.compute(
)
print(eval_results)
```
<Tip>

Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and otherwise the CPU. If you want to use a specific device you can pass `device` to `compute`, where -1 will use the CPU and a positive integer (starting with 0) will use the associated CUDA device.

</Tip>
> [!TIP]
> Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and otherwise the CPU. If you want to use a specific device you can pass `device` to `compute`, where -1 will use the CPU and a positive integer (starting with 0) will use the associated CUDA device.
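
For example, a sketch that reuses the setup from the snippet above and simply pins inference to the first CUDA device (the model name and label mapping are placeholders, not requirements):

```python
from evaluate import evaluator
from datasets import load_dataset

task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test[:100]")

eval_results = task_evaluator.compute(
    model_or_pipeline="lvwerra/distilbert-imdb",  # placeholder model
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # assumed label names for this model
    device=0,  # first CUDA device; pass -1 to force CPU
)
```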


The results will look as follows:
@@ -87,11 +84,8 @@ The results will look as follows:

Note that evaluation results include both the requested metric and information about the time it took to obtain predictions through the pipeline.

<Tip>

The timing figures can give a useful indication of model speed for inference, but they should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which may differ depending on the model. Furthermore, they depend a lot on the hardware you are running the evaluation on, and you may be able to improve the performance by optimizing things like the batch size.

</Tip>
> [!TIP]
> The timing figures can give a useful indication of model speed for inference, but they should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which may differ depending on the model. Furthermore, they depend a lot on the hardware you are running the evaluation on, and you may be able to improve the performance by optimizing things like the batch size.
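
One knob worth trying (a sketch; the batch size and model name are arbitrary choices) is to build the pipeline yourself with batching enabled and hand it to the evaluator:

```python
from datasets import load_dataset
from evaluate import evaluator
from transformers import pipeline

# Batching pipeline calls can noticeably speed up GPU inference.
pipe = pipeline("text-classification", model="lvwerra/distilbert-imdb", device=0, batch_size=32)

task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test[:100]")
eval_results = task_evaluator.compute(
    model_or_pipeline=pipe,
    data=data,
    metric="accuracy",
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},  # assumed label names
)
```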

### Evaluate multiple metrics

7 changes: 3 additions & 4 deletions docs/source/choosing_a_metric.mdx
@@ -43,10 +43,9 @@ You can find the right metric for your task by:

Some datasets have specific metrics associated with them -- this is especially the case for popular benchmarks like [GLUE](https://huggingface.co/metrics/glue) and [SQuAD](https://huggingface.co/metrics/squad).

<Tip warning={true}>
💡
GLUE is actually a collection of different subsets for different tasks, so first you need to choose the one that corresponds to the NLI task, such as `mnli`, which is described as a “crowdsourced collection of sentence pairs with textual entailment annotations”.
</Tip>
> [!WARNING]
> 💡
> GLUE is actually a collection of different subsets for different tasks, so first you need to choose the one that corresponds to the NLI task, such as `mnli`, which is described as a “crowdsourced collection of sentence pairs with textual entailment annotations”.
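
For instance, loading the GLUE metric with the `mnli` configuration (a minimal sketch; the prediction and reference values are made up) looks like:

```python
import evaluate

# The second argument selects the GLUE subset; mnli reports accuracy.
mnli_metric = evaluate.load("glue", "mnli")
results = mnli_metric.compute(predictions=[0, 1, 2], references=[0, 1, 1])
print(results)  # e.g. {'accuracy': 0.666...}
```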


If you are evaluating your model on a benchmark dataset like the ones mentioned above, you can use its dedicated evaluation metric. Make sure you respect the format that it requires. For example, to evaluate your model on the [SQuAD](https://huggingface.co/datasets/squad) dataset, you need to feed the `question` and `context` into your model and return the `prediction_text`, which should be compared with the `references` (based on matching the `id` of the question):
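
A minimal sketch of that format with the `squad` metric (the id and answer text below are made up for illustration):

```python
import evaluate

squad_metric = evaluate.load("squad")

# Predictions and references are matched on the question "id".
predictions = [{"id": "56be4db0acb8001400a502ec", "prediction_text": "Denver Broncos"}]
references = [{
    "id": "56be4db0acb8001400a502ec",
    "answers": {"text": ["Denver Broncos"], "answer_start": [177]},
}]
print(squad_metric.compute(predictions=predictions, references=references))
# e.g. {'exact_match': 100.0, 'f1': 100.0}
```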
5 changes: 2 additions & 3 deletions docs/source/considerations.mdx
@@ -18,9 +18,8 @@ Some datasets on the 🤗 Hub are already separated into these three splits. How

If the dataset you're using doesn't have a predefined train-test split, it is up to you to define which part of the dataset you want to use for training your model and which you want to use for hyperparameter tuning or final evaluation.

<Tip warning={true}>
Training and evaluating on the same split can misrepresent your results! If you overfit on your training data, the evaluation results on that split will look great, but the model will perform poorly on new data.
</Tip>
> [!WARNING]
> Training and evaluating on the same split can misrepresent your results! If you overfit on your training data, the evaluation results on that split will look great, but the model will perform poorly on new data.

Depending on the size of the dataset, you can keep anywhere from 10-30% for evaluation and the rest for training, while aiming to set up the test set to reflect the production data as closely as possible. Check out [this thread](https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090) for a more in-depth discussion of dataset splitting!
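
A quick sketch of carving out such a split with 🤗 Datasets (the dataset name and the 20% test size are arbitrary choices):

```python
from datasets import load_dataset

dataset = load_dataset("imdb", split="train")

# Hold out 20% of the data for evaluation; the rest stays for training.
splits = dataset.train_test_split(test_size=0.2, seed=42)
train_ds, eval_ds = splits["train"], splits["test"]
```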

14 changes: 4 additions & 10 deletions docs/source/index.mdx
@@ -36,11 +36,8 @@ Model cards provide an overview of a model's capabilities evaluated by the commu

Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.

<Tip>

For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).

</Tip>
> [!TIP]
> For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).

## Libraries and packages

@@ -50,11 +47,8 @@ There are a number of open-source libraries and packages that you can use to eva

LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.

<Tip>

For more recent evaluation approaches on the Hugging Face Hub that are more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).

</Tip>
> [!TIP]
> For more recent evaluation approaches on the Hugging Face Hub that are more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).

### 🤗 Evaluate

16 changes: 5 additions & 11 deletions src/evaluate/evaluator/audio_classification.py
@@ -30,11 +30,8 @@
TASK_DOCUMENTATION = r"""
Examples:

<Tip>

Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

</Tip>
> [!TIP]
> Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

```python
>>> from evaluate import evaluator
@@ -52,12 +49,9 @@
>>> )
```

<Tip>

The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

</Tip>
> [!TIP]
> The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
> the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

```python
>>> from evaluate import evaluator
9 changes: 3 additions & 6 deletions src/evaluate/evaluator/question_answering.py
@@ -53,12 +53,9 @@
>>> )
```

<Tip>

Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
the `compute()` call.

</Tip>
> [!TIP]
> Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
> the `compute()` call.

```python
>>> from evaluate import evaluator
76 changes: 35 additions & 41 deletions src/evaluate/evaluator/token_classification.py
@@ -43,47 +43,41 @@
>>> )
```

<Tip>

For example, the following dataset format is accepted by the evaluator:

```python
dataset = Dataset.from_dict(
mapping={
"tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
"ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
},
features=Features({
"tokens": Sequence(feature=Value(dtype="string")),
"ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
}),
)
```

</Tip>

<Tip warning={true}>

For example, the following dataset format is **not** accepted by the evaluator:

```python
dataset = Dataset.from_dict(
mapping={
"tokens": [["New York is a city and Felix a person."]],
"starts": [[0, 23]],
"ends": [[7, 27]],
"ner_tags": [["LOC", "PER"]],
},
features=Features({
"tokens": Value(dtype="string"),
"starts": Sequence(feature=Value(dtype="int32")),
"ends": Sequence(feature=Value(dtype="int32")),
"ner_tags": Sequence(feature=Value(dtype="string")),
}),
)
```

</Tip>
> [!TIP]
> For example, the following dataset format is accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
> "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
> },
> features=Features({
> "tokens": Sequence(feature=Value(dtype="string")),
> "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
> }),
> )
> ```

> [!WARNING]
> For example, the following dataset format is **not** accepted by the evaluator:
>
> ```python
> dataset = Dataset.from_dict(
> mapping={
> "tokens": [["New York is a city and Felix a person."]],
> "starts": [[0, 23]],
> "ends": [[7, 27]],
> "ner_tags": [["LOC", "PER"]],
> },
> features=Features({
> "tokens": Value(dtype="string"),
> "starts": Sequence(feature=Value(dtype="int32")),
> "ends": Sequence(feature=Value(dtype="int32")),
> "ner_tags": Sequence(feature=Value(dtype="string")),
> }),
> )
> ```
"""


17 changes: 7 additions & 10 deletions src/evaluate/utils/logging.py
@@ -87,16 +87,13 @@ def get_verbosity() -> int:
Returns:
Logging level, e.g., `evaluate.logging.DEBUG` and `evaluate.logging.INFO`.

<Tip>

The Hugging Face Evaluate library has the following logging levels:
- `evaluate.logging.CRITICAL`, `evaluate.logging.FATAL`
- `evaluate.logging.ERROR`
- `evaluate.logging.WARNING`, `evaluate.logging.WARN`
- `evaluate.logging.INFO`
- `evaluate.logging.DEBUG`

</Tip>
> [!TIP]
> The Hugging Face Evaluate library has the following logging levels:
> - `evaluate.logging.CRITICAL`, `evaluate.logging.FATAL`
> - `evaluate.logging.ERROR`
> - `evaluate.logging.WARNING`, `evaluate.logging.WARN`
> - `evaluate.logging.INFO`
> - `evaluate.logging.DEBUG`
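
Example (a minimal usage sketch):

```py
>>> import evaluate
>>> verbosity = evaluate.logging.get_verbosity()  # e.g. `evaluate.logging.WARNING` by default
```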
"""
return _get_library_root_logger().getEffectiveLevel()
