Commit 6f9599e

update to use hf dataset
1 parent 724558d commit 6f9599e

4 files changed

Lines changed: 17 additions & 1783 deletions

README.md

Lines changed: 13 additions & 5 deletions
@@ -19,6 +19,14 @@ A detailed discussion of LiveDRBench, including how it was developed and tested,
 
 ## Usage
 
+To use LiveDRBench's questions, you can load the benchmark using the Hugging Face `datasets` library:
+
+```python
+from datasets import load_dataset
+
+livedrbench = load_dataset("microsoft/LiveDRBench", "v1-full")['test']
+```
+
 To evaluate predictions on LiveDRBench, provide a predictions file with the following JSON schema:
 
 ```
@@ -34,7 +42,7 @@ To evaluate predictions on LiveDRBench, provide a predictions file with the foll
 Then, run the evaluation script with an OpenAI API key. This script will compute **precision**, **recall**, and **F1** scores for each benchmark category.
 
 ```bash
-python evaluation.py \
+python src/evaluation.py \
 --openai_api_key YOUR_API_KEY \
 --preds_file path/to/your/predictions.json \
 [--openai_model_name gpt-4o] \
@@ -54,13 +62,13 @@ LiveDRBench repository is best suited for loading the companion benchmark and ev
 
 ## Out-of-scope Uses
 
-- LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).
+- LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).
 
-- LiveDRBench dataset is not as representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.
+- LiveDRBench dataset is not as representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.
 
-- We do not recommend using LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.
+- We do not recommend using LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.
 
-- LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
+- LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
 
 ## Data Creation: Problem Inversion
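
Note on the answer encryption mentioned under "Out-of-scope Uses": the README only says the answers are encrypted following BrowseComp's release procedure. That script appears to use a simple XOR scheme, where a key is derived by repeating the SHA-256 digest of a password and the ciphertext is stored base64-encoded. The sketch below illustrates that scheme under those assumptions. The `encrypt` and `xor_with_key` helpers and the sample strings are hypothetical, added only to make the round trip runnable. The dataset's actual column names and password are not shown in this diff and are not guessed here.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat SHA-256(password) until it is exactly `length` bytes long."""
    digest = hashlib.sha256(password.encode()).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def xor_with_key(data: bytes, password: str) -> bytes:
    """XOR `data` with the derived key; applying it twice restores the input."""
    key = derive_key(password, len(data))
    return bytes(a ^ b for a, b in zip(data, key))


def encrypt(plaintext: str, password: str) -> str:
    """Hypothetical helper for this demo: XOR, then base64-encode."""
    return base64.b64encode(xor_with_key(plaintext.encode(), password)).decode()


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Base64-decode, then XOR with the same derived key."""
    return xor_with_key(base64.b64decode(ciphertext_b64), password).decode()


# Made-up values: the benchmark's real fields and password are not shown here.
token = encrypt("example answer", "example-password")
restored = decrypt(token, "example-password")
```

A scheme like this keeps plaintext answers out of casual scrapes of the dataset. It is obfuscation rather than strong cryptography, since anyone holding the password can reverse it.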

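
Note on the metrics named in the evaluation hunk: the README states that the script reports precision, recall, and F1 per benchmark category, but `src/evaluation.py` itself is not part of this diff. The sketch below shows only the standard definitions of those three scores from match counts. The `Scores` and `score` names and the count-based inputs are assumptions for illustration. How the real script decides that a prediction matches a ground-truth item is not shown here and may differ.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    precision: float
    recall: float
    f1: float


def score(num_matched: int, num_predicted: int, num_gold: int) -> Scores:
    """Standard precision/recall/F1 from counts, returning 0.0 on empty denominators."""
    precision = num_matched / num_predicted if num_predicted else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return Scores(precision, recall, f1)


# Example: 3 of 4 predicted items match, out of 6 ground-truth items.
example = score(num_matched=3, num_predicted=4, num_gold=6)
```

Guarding the zero-denominator cases keeps an empty prediction list from raising an error. Scoring such a case as 0.0 is one common convention, not necessarily the one the script uses.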