To evaluate predictions on LiveDRBench, provide a predictions file with the following JSON schema:
```
@@ -34,7 +42,7 @@ To evaluate predictions on LiveDRBench, provide a predictions file with the foll
Then, run the evaluation script with an OpenAI API key. This script will compute **precision**, **recall**, and **F1** scores for each benchmark category.
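
As a rough illustration of the three reported metrics, here is a minimal sketch that assumes predicted and reference items can be compared by exact string match. That is a simplifying assumption: the evaluation script takes an OpenAI API key, so its matching is presumably model-assisted rather than exact. The function name `precision_recall_f1` and the sample items are hypothetical and do not come from the repository.

```python
from __future__ import annotations


def precision_recall_f1(predicted: set[str], reference: set[str]) -> tuple[float, float, float]:
    """Exact-match precision, recall, and F1 between two sets of items."""
    true_positives = len(predicted & reference)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical data: 2 of 3 predictions are correct, 2 of 4 references are found.
p, r, f1 = precision_recall_f1({"a", "b", "c"}, {"b", "c", "d", "e"})
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

F1 is the harmonic mean of precision and recall, so it stays low unless both are reasonably high.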
```bash
-python evaluation.py \
+python src/evaluation.py \
--openai_api_key YOUR_API_KEY \
--preds_file path/to/your/predictions.json \
[--openai_model_name gpt-4o] \
@@ -54,13 +62,13 @@ LiveDRBench repository is best suited for loading the companion benchmark and ev
## Out-of-scope Uses

-- LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).
+-LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).

-- LiveDRBench dataset is not as representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.
+-LiveDRBench dataset is not as representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.

-- We do not recommend using LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.
+-We do not recommend using LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.

-- LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
+-LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
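
For readers unfamiliar with the answer encryption mentioned in the first item, the linked BrowseComp script appears to obscure answers by XOR-ing the bytes against a key stretched from a SHA-256 digest and then base64-encoding the result. The sketch below assumes LiveDRBench follows the same general shape, which the text above states but this diff does not show. The `encrypt` helper, the sample answer, and the password are hypothetical, added only to make the round trip runnable.

```python
import base64
import hashlib


def derive_key(password: str, length: int) -> bytes:
    """Repeat the SHA-256 digest of the password until it covers `length` bytes."""
    digest = hashlib.sha256(password.encode()).digest()
    return digest * (length // len(digest)) + digest[: length % len(digest)]


def encrypt(plaintext: str, password: str) -> str:
    """Hypothetical helper: XOR the UTF-8 bytes with the key, then base64-encode."""
    data = plaintext.encode()
    key = derive_key(password, len(data))
    return base64.b64encode(bytes(a ^ b for a, b in zip(data, key))).decode()


def decrypt(ciphertext_b64: str, password: str) -> str:
    """Invert `encrypt`: base64-decode, XOR with the same key, decode as UTF-8."""
    encrypted = base64.b64decode(ciphertext_b64)
    key = derive_key(password, len(encrypted))
    return bytes(a ^ b for a, b in zip(encrypted, key)).decode()


# Round trip with made-up values.
token = encrypt("example answer", "example-password")
print(token)
print(decrypt(token, "example-password"))
```

A scheme of this kind keeps plain-text answers out of web crawls and training corpora; it is obfuscation against accidental leakage, not protection against a determined reader who has the key.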