diff --git a/README.md b/README.md
@@ -19,6 +19,14 @@ A detailed discussion of LiveDRBench, including how it was developed and tested,
 
 ## Usage
 
+To use LiveDRBench's questions, you can load the benchmark using the Hugging Face `datasets` library:
+
+```python
+from datasets import load_dataset
+
+livedrbench = load_dataset("microsoft/LiveDRBench", "v1-full")['test']
+```
+
 To evaluate predictions on LiveDRBench, provide a predictions file with the following JSON schema:
 
 ```
@@ -34,7 +42,7 @@ To evaluate predictions on LiveDRBench, provide a predictions file with the foll
 Then, run the evaluation script with an OpenAI API key. This script will compute **precision**, **recall**, and **F1** scores for each benchmark category.
 
 ```bash
-python evaluation.py \
+python src/evaluation.py \
   --openai_api_key YOUR_API_KEY \
   --preds_file path/to/your/predictions.json \
   [--openai_model_name gpt-4o] \
@@ -54,13 +62,13 @@ LiveDRBench repository is best suited for loading the companion benchmark and ev
 
 ## Out-of-scope Uses
 
-- LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).
+-   LiveDRBench is not well suited for training new Deep Research models. It only provides a test set. To avoid accidental test set leakage, we encrypt the answers in the benchmark, following the procedure of [BrowseComp benchmark's release](https://github.com/openai/simple-evals/blob/main/browsecomp_eval.py).
 
-- LiveDRBench dataset is not as representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports. 
+-   LiveDRBench dataset is not as representative of all kinds of Deep Research queries, especially those that require assessing the writing quality of long reports.
 
-- We do not recommend using LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.
+-   We do not recommend using LiveDRBench repo or the dataset in commercial or real-world applications without further testing and development. They are being released for research purposes.
 
-- LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
+-   LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.
 
 ## Data Creation: Problem Inversion