diff --git a/docs/source/a_quick_tour.mdx b/docs/source/a_quick_tour.mdx
index 371f7f11..c8b28e67 100644
--- a/docs/source/a_quick_tour.mdx
+++ b/docs/source/a_quick_tour.mdx
@@ -113,11 +113,8 @@ Before we can apply a metric or other evaluation module to a use-case, we need t
 }
 ```

-<Tip>
-
-Note that features always describe the type of a single input element. In general we will add lists of elements so you can always think of a list around the types in `features`. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.
-
-</Tip>
+> [!TIP]
+> Note that features always describe the type of a single input element. In general we pass in lists of elements, so you can think of a list wrapped around the types in `features`. Evaluate accepts various input formats (Python lists, NumPy arrays, PyTorch tensors, etc.) and converts them to an appropriate format for storage and computation.

 ## Compute

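As a quick illustration of the tip above about input formats (a minimal sketch, not taken from the patch itself; it only assumes the `accuracy` metric is available on the Hub), Python lists and NumPy arrays can be passed interchangeably:

```python
import numpy as np
import evaluate

accuracy = evaluate.load("accuracy")

# Plain Python lists work...
print(accuracy.compute(references=[0, 1, 1, 0], predictions=[0, 1, 0, 0]))

# ...and so do NumPy arrays; both are converted internally before computation.
print(accuracy.compute(references=np.array([0, 1, 1, 0]), predictions=np.array([0, 1, 0, 0])))
```

Both calls return the same result (here `{'accuracy': 0.75}`), since the inputs are normalized to a common format before the metric is computed.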
diff --git a/docs/source/base_evaluator.mdx b/docs/source/base_evaluator.mdx
index 921fb936..9eea7cd9 100644
--- a/docs/source/base_evaluator.mdx
+++ b/docs/source/base_evaluator.mdx
@@ -68,11 +68,8 @@ eval_results = task_evaluator.compute(
 )
 print(eval_results)
 ```
-<Tip>
-
-Without specifying a device, the default for model inference will be the first GPU on the machine if one is available, and else CPU. If you want to use a specific device you can pass `device` to `compute` where -1 will use the GPU and a positive integer (starting with 0) will use the associated CUDA device.
-
-</Tip>
+> [!TIP]
+> Without specifying a device, model inference defaults to the first GPU on the machine if one is available, and to the CPU otherwise. If you want to use a specific device, you can pass `device` to `compute`: -1 will use the CPU, and a non-negative integer (starting at 0) will use the associated CUDA device.

 The results will look as follows:

@@ -87,11 +84,8 @@ The results will look as follows:

 Note that evaluation results include both the requested metric, and information about the time it took to obtain predictions through the pipeline.

-<Tip>
-
-The time performances can give useful indication on model speed for inference but should be taken with a grain of salt: they include all the processing that goes on in the pipeline. This may include tokenizing, post-processing, that may be different depending on the model. Furthermore, it depends a lot on the hardware you are running the evaluation on and you may be able to improve the performance by optimizing things like the batch size.
-
-</Tip>
+> [!TIP]
+> The timing figures give a useful indication of model speed for inference, but they should be taken with a grain of salt: they include all the processing that goes on in the pipeline, such as tokenization and post-processing, which can differ from model to model. Furthermore, they depend a lot on the hardware you run the evaluation on, and you may be able to improve performance by optimizing things like the batch size.

 ### Evaluate multiple metrics

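To make the `device` behaviour above concrete, here is a minimal sketch of passing it explicitly to `compute`. The checkpoint and dataset below are illustrative choices, not part of the patch; any text-classification model and dataset work the same way:

```python
from datasets import load_dataset
from evaluate import evaluator

# Illustrative checkpoint and dataset (assumptions for this sketch).
task_evaluator = evaluator("text-classification")
data = load_dataset("imdb", split="test").shuffle(seed=42).select(range(100))

eval_results = task_evaluator.compute(
    model_or_pipeline="distilbert-base-uncased-finetuned-sst-2-english",
    data=data,
    label_mapping={"NEGATIVE": 0, "POSITIVE": 1},
    device=0,  # first CUDA device; pass -1 to force CPU
)
print(eval_results)
```

Omitting `device` keeps the default behaviour described in the tip: the first available GPU, otherwise the CPU.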
diff --git a/docs/source/choosing_a_metric.mdx b/docs/source/choosing_a_metric.mdx
index 344f8305..49d9259a 100644
--- a/docs/source/choosing_a_metric.mdx
+++ b/docs/source/choosing_a_metric.mdx
@@ -43,10 +43,9 @@ You can find the right metric for your task by:

 Some datasets have specific metrics associated with them -- this is especially in the case of popular benchmarks like [GLUE](https://huggingface.co/metrics/glue) and [SQuAD](https://huggingface.co/metrics/squad).

-<Tip warning={true}>
-💡
-GLUE is actually a collection of different subsets on different tasks, so first you need to choose the one that corresponds to the NLI task, such as mnli, which is described as “crowdsourced collection of sentence pairs with textual entailment annotations”
-</Tip>
+> [!WARNING]
+> 💡
+> GLUE is actually a collection of different subsets for different tasks, so first you need to choose the one that corresponds to the NLI task, such as `mnli`, which is described as a “crowdsourced collection of sentence pairs with textual entailment annotations”.

 If you are evaluating your model on a benchmark dataset like the ones mentioned above, you can use its dedicated evaluation metric. Make sure you respect the format that they require. For example, to evaluate your model on the [SQuAD](https://huggingface.co/datasets/squad) dataset, you need to feed the `question` and `context` into your model and return the `prediction_text`, which should be compared with the `references` (based on matching the `id` of the question) :

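As a concrete follow-up to the GLUE note above (a minimal sketch, not part of the patch; the predictions and references are dummy values), the metric for a specific GLUE subset is loaded by passing the configuration name as the second argument to `evaluate.load`:

```python
import evaluate

# Load the GLUE metric for the MNLI subset by naming its configuration.
mnli_metric = evaluate.load("glue", "mnli")

# Dummy label ids, just to show the expected call signature.
results = mnli_metric.compute(predictions=[0, 1, 2], references=[0, 1, 1])
print(results)  # the MNLI subset reports accuracy
```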
diff --git a/docs/source/considerations.mdx b/docs/source/considerations.mdx
index dc8ca5cc..6589f871 100644
--- a/docs/source/considerations.mdx
+++ b/docs/source/considerations.mdx
@@ -18,9 +18,8 @@ Some datasets on the 🤗 Hub are already separated into these three splits. How

 If the dataset you're using doesn't have a predefined train-test split, it is up to you to define which part of the dataset you want to use for training your model and which you want to use for hyperparameter tuning or final evaluation.

-<Tip warning={true}>
-Training and evaluating on the same split can misrepresent your results! If you overfit on your training data the evaluation results on that split will look great but the model will perform poorly on new data.
-</Tip>
+> [!WARNING]
+> Training and evaluating on the same split can misrepresent your results! If you overfit on your training data, the evaluation results on that split will look great, but the model will perform poorly on new data.

 Depending on the size of the dataset, you can keep anywhere from 10-30% for evaluation and the rest for training, while aiming to set up the test set to reflect the production data as close as possible. Check out [this thread](https://discuss.huggingface.co/t/how-to-split-main-dataset-into-train-dev-test-as-datasetdict/1090) for a more in-depth discussion of dataset splitting!

diff --git a/docs/source/index.mdx b/docs/source/index.mdx
index 720abcdc..75c65c70 100644
--- a/docs/source/index.mdx
+++ b/docs/source/index.mdx
@@ -36,11 +36,8 @@ Model cards provide an overview of a model's capabilities evaluated by the commu

 Unlike leaderboards, model card evaluation scores are often created by the author, rather than by the community.

-<Tip>
-
-For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).
-
-</Tip>
+> [!TIP]
+> For information on reporting results, see details on [the Model Card Evaluation Results metadata](https://huggingface.co/docs/hub/en/model-cards#evaluation-results).

 ## Libraries and packages

@@ -50,11 +47,8 @@ There are a number of open-source libraries and packages that you can use to eva

 LightEval is a library for evaluating LLMs. It is designed to be comprehensive and customizable. Visit the LightEval [repository](https://github.com/huggingface/lighteval) for more information.

-<Tip>
-
-For more recent evaluation approaches that are popular on the Hugging Face Hub that are currently more actively maintained, check out [LightEval](https://github.com/huggingface/lighteval).
-
-</Tip>
+> [!TIP]
+> For more recent, more actively maintained evaluation approaches that are popular on the Hugging Face Hub, check out [LightEval](https://github.com/huggingface/lighteval).

 ### 🤗 Evaluate

diff --git a/src/evaluate/evaluator/audio_classification.py b/src/evaluate/evaluator/audio_classification.py
index 685fb9fd..6c39a896 100644
--- a/src/evaluate/evaluator/audio_classification.py
+++ b/src/evaluate/evaluator/audio_classification.py
@@ -30,11 +30,8 @@ TASK_DOCUMENTATION = r"""
     Examples:

-    <Tip>
-
-    Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)
-
-    </Tip>
+    > [!TIP]
+    > Remember that, in order to process audio files, you need ffmpeg installed (https://ffmpeg.org/download.html)

     ```python
     >>> from evaluate import evaluator
     >>> from datasets import load_dataset
     >>> task_evaluator = evaluator("audio-classification")
     >>> data = load_dataset("superb", 'ks', split="test[:40]")
     >>> results = task_evaluator.compute(
@@ -52,12 +49,9 @@
     >>> )
     ```

-    <Tip>
-
-    The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that calling
-    the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.
-
-    </Tip>
+    > [!TIP]
+    > The evaluator supports raw audio data as well, in the form of a numpy array. However, be aware that accessing
+    > the audio column automatically decodes and resamples the audio files, which can be slow for large datasets.

     ```python
     >>> from evaluate import evaluator

diff --git a/src/evaluate/evaluator/question_answering.py b/src/evaluate/evaluator/question_answering.py
index 99b4190e..0e4895cc 100644
--- a/src/evaluate/evaluator/question_answering.py
+++ b/src/evaluate/evaluator/question_answering.py
@@ -53,12 +53,9 @@
     >>> )
     ```

-    <Tip>
-
-    Datasets where the answer may be missing in the context are supported, for example SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
-    the compute() call.
-
-    </Tip>
+    > [!TIP]
+    > Datasets where the answer may be missing in the context are supported, for example the SQuAD v2 dataset. In this case, it is safer to pass `squad_v2_format=True` to
+    > the `compute()` call.

     ```python
     >>> from evaluate import evaluator

diff --git a/src/evaluate/evaluator/token_classification.py b/src/evaluate/evaluator/token_classification.py
index ba08ebd5..bb711192 100644
--- a/src/evaluate/evaluator/token_classification.py
+++ b/src/evaluate/evaluator/token_classification.py
@@ -43,47 +43,41 @@
     >>> )
     ```

-    <Tip>
-
-    For example, the following dataset format is accepted by the evaluator:
-
-    ```python
-    dataset = Dataset.from_dict(
-        mapping={
-            "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
-            "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
-        },
-        features=Features({
-            "tokens": Sequence(feature=Value(dtype="string")),
-            "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
-        }),
-    )
-    ```
-
-    </Tip>
-
-    <Tip warning={true}>
-
-    For example, the following dataset format is **not** accepted by the evaluator:
-
-    ```python
-    dataset = Dataset.from_dict(
-        mapping={
-            "tokens": [["New York is a city and Felix a person."]],
-            "starts": [[0, 23]],
-            "ends": [[7, 27]],
-            "ner_tags": [["LOC", "PER"]],
-        },
-        features=Features({
-            "tokens": Value(dtype="string"),
-            "starts": Sequence(feature=Value(dtype="int32")),
-            "ends": Sequence(feature=Value(dtype="int32")),
-            "ner_tags": Sequence(feature=Value(dtype="string")),
-        }),
-    )
-    ```
-
-    </Tip>
+    > [!TIP]
+    > For example, the following dataset format is accepted by the evaluator:
+    >
+    > ```python
+    > dataset = Dataset.from_dict(
+    >     mapping={
+    >         "tokens": [["New", "York", "is", "a", "city", "and", "Felix", "a", "person", "."]],
+    >         "ner_tags": [[1, 2, 0, 0, 0, 0, 3, 0, 0, 0]],
+    >     },
+    >     features=Features({
+    >         "tokens": Sequence(feature=Value(dtype="string")),
+    >         "ner_tags": Sequence(feature=ClassLabel(names=["O", "B-LOC", "I-LOC", "B-PER", "I-PER"])),
+    >     }),
+    > )
+    > ```
+
+    > [!WARNING]
+    > For example, the following dataset format is **not** accepted by the evaluator:
+    >
+    > ```python
+    > dataset = Dataset.from_dict(
+    >     mapping={
+    >         "tokens": [["New York is a city and Felix a person."]],
+    >         "starts": [[0, 23]],
+    >         "ends": [[7, 27]],
+    >         "ner_tags": [["LOC", "PER"]],
+    >     },
+    >     features=Features({
+    >         "tokens": Value(dtype="string"),
+    >         "starts": Sequence(feature=Value(dtype="int32")),
+    >         "ends": Sequence(feature=Value(dtype="int32")),
+    >         "ner_tags": Sequence(feature=Value(dtype="string")),
+    >     }),
+    > )
+    > ```
 """

diff --git a/src/evaluate/utils/logging.py b/src/evaluate/utils/logging.py
index d29b7f48..f84656ab 100644
--- a/src/evaluate/utils/logging.py
+++ b/src/evaluate/utils/logging.py
@@ -87,16 +87,13 @@ def get_verbosity() -> int:

     Returns:
         Logging level, e.g., `evaluate.logging.DEBUG` and `evaluate.logging.INFO`.

-    <Tip>
-
-    Hugging Face Evaluate library has following logging levels:
-    - `evaluate.logging.CRITICAL`, `evaluate.logging.FATAL`
-    - `evaluate.logging.ERROR`
-    - `evaluate.logging.WARNING`, `evaluate.logging.WARN`
-    - `evaluate.logging.INFO`
-    - `evaluate.logging.DEBUG`
-
-    </Tip>
+    > [!TIP]
+    > The Hugging Face Evaluate library has the following logging levels:
+    > - `evaluate.logging.CRITICAL`, `evaluate.logging.FATAL`
+    > - `evaluate.logging.ERROR`
+    > - `evaluate.logging.WARNING`, `evaluate.logging.WARN`
+    > - `evaluate.logging.INFO`
+    > - `evaluate.logging.DEBUG`
     """
     return _get_library_root_logger().getEffectiveLevel()
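To illustrate the verbosity levels listed above (a minimal sketch, not part of the patch; it assumes only the `get_verbosity`/`set_verbosity` helpers exposed by `evaluate.logging`):

```python
import evaluate

# Read the current threshold of the library's root logger.
current = evaluate.logging.get_verbosity()
print(current)  # an integer level such as evaluate.logging.WARNING

# Lower the threshold so INFO messages (and anything more severe) are emitted.
evaluate.logging.set_verbosity(evaluate.logging.INFO)
```

Passing any of the level constants from the list above to `set_verbosity` works the same way.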