Commit 348cf84: update readme
1 parent 83b7c83 commit 348cf84

File tree: 2 files changed, +90 -27 lines

README.md

Lines changed: 87 additions & 24 deletions
# LiveDRBench: A novel benchmark for Deep Research

_Dataset and evaluation code accompanying our paper, available at [arXiv] (??)_

We propose a formal characterization of the deep research (DR) problem and introduce a new benchmark, _LiveDRBench_, to evaluate the performance of DR systems. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation.

## Dataset Details

The benchmark consists of 100 challenging DR tasks over scientific topics (e.g., dataset discovery, materials discovery, novelty search, prior-art discovery) and public-interest events (e.g., the Oscars). The data was collected between May and June 2025. We intend to keep the benchmark live and to release periodic updates with new tasks.

Each task consists of (a) a prompt with a short description of the task and the expected output format, and (b) a ground-truth JSON file containing the claims and references that should be uncovered. We also include an evaluation script that scores DR systems using standard information-retrieval metrics, namely precision, recall, and F1.
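
Once each predicted claim has been matched against the ground truth (in this benchmark, by an LLM judge), the three metrics reduce to simple counts. A minimal sketch of the arithmetic, not the repository's actual scoring code:

```python
def prf1(num_matched: int, num_pred: int, num_gold: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from matched-claim counts."""
    precision = num_matched / num_pred if num_pred else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a system that emits 4 predictions and recovers 2 of 5 ground-truth claims scores precision 0.5, recall 0.4, and F1 of about 0.44.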

**Note**: LiveDRBench does not contain links to external data sources. It includes data from an existing scientific dataset, [Curie](https://github.com/google/curie). All queries are answerable using publicly available information.

## Usage

To evaluate predictions on **LiveDRBench**, provide a predictions file with the following JSON schema:


```json
[
  {
    "key": str,                              // Unique identifier from livedrbench.csv
    "preds": List[List[dict | str] | dict]   // Predictions in the format specified by each question in livedrbench.csv
  },
  ...
]
```
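
For instance, a predictions file for two tasks can be assembled as follows; the keys and prediction payloads below are invented placeholders and must be replaced with the identifiers and formats from `livedrbench.csv`:

```python
import json

# Hypothetical entries: each "key" must match an identifier in livedrbench.csv,
# and each "preds" value follows the output format that task's prompt requests.
predictions = [
    {"key": "example/task_001", "preds": [{"field": "value"}]},
    {"key": "example/task_002", "preds": ["a short answer string"]},
]

# Serialize to the predictions file passed to the evaluation script.
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```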

Then, run the evaluation script with an OpenAI API key. The script computes **precision**, **recall**, and **F1** scores for each benchmark category.

```bash
python src/evaluate.py \
    --openai_api_key YOUR_API_KEY \
    --preds_file path/to/your/predictions.json \
    [--openai_model_name gpt-4o] \
    [--num_threads 8] \
    [--debug]
```

- `--openai_api_key` (required): Your OpenAI API key.
- `--preds_file` (required): Path to the predictions JSON file.
- `--openai_model_name` (optional): Model to use as the judge (default: `gpt-4o`).
- `--num_threads` (optional): Number of parallel threads (default: 8).
- `--debug` (optional): Enable debug mode, which runs without multithreading.
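
The judge calls behind `--num_threads` are I/O-bound API requests, which is why thread-level parallelism helps. The pattern can be sketched as follows; `judge_one` is a hypothetical stand-in for a single OpenAI judging request, not the repository's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def judge_one(item):
    # Stand-in for one LLM-judge API call; here we just compare strings.
    key, pred, gold = item
    return key, pred == gold

def judge_all(items, num_threads=8):
    # Threads overlap the network latency of concurrent judge requests.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(judge_one, items))

verdicts = judge_all([("t1", "BaTiO3", "BaTiO3"), ("t2", "1998", "2001")])
```

Passing `--debug` corresponds to skipping the pool and judging items sequentially, which makes failures easier to trace.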

## Intended Uses

The LiveDRBench repository is best suited for loading the companion benchmark and evaluating existing models, and the LiveDRBench dataset is intended to be used together with the GitHub repository. The code and the benchmark are shared with the research community to facilitate reproduction of our results and to foster further research in this area. LiveDRBench is intended for domain experts who are independently capable of evaluating the quality of outputs before acting on them.

## Out-of-scope Uses

> LiveDRBench is not well suited for training new Deep Research models; it provides only a test set.

> The LiveDRBench dataset is not representative of all kinds of Deep Research queries, especially those that require long-form reports, such as literature reviews.

> We do not recommend using the LiveDRBench repository or dataset in commercial or real-world applications without further testing and development. They are released for research purposes only.

> LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.

## Best Practices

Best performance is achieved by supplying an API key directly to the evaluation script, as described above. LiveDRBench should not be the only measure used to understand the performance of a DR model; additional methods specific to the model's use case should also be used to determine its overall performance.

We strongly encourage users to use LLMs that support robust Responsible AI mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations to the latest industry standards for responsible use. For more on AOAI's best practices when employing foundation models for scripts and applications, see:

- [Blog post on responsible AI features in AOAI presented at Ignite 2023](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-new-ai-safety-amp-responsible-ai-features-in-azure/ba-p/3983686)
- [Overview of Responsible AI practices for Azure OpenAI models](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/overview)
- [Azure OpenAI Transparency Note](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note)
- [OpenAI's Usage policies](https://openai.com/policies/usage-policies)
- [Azure OpenAI's Code of Conduct](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/code-of-conduct)

Users are reminded to be mindful of data privacy concerns and are encouraged to review the privacy policies associated with any models and data storage solutions interfacing with LiveDRBench.

It is the user's responsibility to ensure that use of the LiveDRBench repository and dataset complies with relevant data protection regulations and organizational guidelines.

## License

Code in this GitHub repository is licensed under the [MIT License](https://github.com/microsoft/livedrbench/blob/main/LICENSE).

## Contact

If you have suggestions or questions, please contact us at amshar@microsoft.com.

## Citing LiveDRBench

```bibtex
@inproceedings{livedrbench2025,
  title={LiveDRBench: A novel benchmark for Deep Research},
  author={Java, Abhinav and Khandelwal, Ashmit and Midigeshi, Sukruta and Halfaker, Aaron and Deshpande, Amit and Goyal, Navin and Gupta, Ankur and Natarajan, Nagarajan and Sharma, Amit},
  booktitle={arXiv preprint arXiv:2506.08626},
  year={2025}
}
```

src/evaluate.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -143,10 +143,10 @@ def __call__(self):
 if __name__ == "__main__":
     args = ArgumentParser()
     args.add_argument("--openai_api_key", type=str, required=True, help="OpenAI API key")
-    args.add_argument("--openai_model_name", type=str, default="gpt-4o", help="OpenAI model name to use for evaluation")
-    args.add_argument("--preds_file", type=str, required=True, help="Path to the CSV file containing predictions")
+    args.add_argument("--openai_model_name", type=str, default="gpt-4o", help="OpenAI model name to use as judge")
+    args.add_argument("--preds_file", type=str, required=True, help="Path to the JSON file containing predictions")
     args.add_argument("--num_threads", type=int, default=8, help="Number of threads to use for evaluation")
-    args.add_argument("--debug", action='store_true', help="Enable debug mode")
+    args.add_argument("--debug", action='store_true', help="Enable debug mode, without multithreading")
     args = args.parse_args()

     os.environ['OPENAI_API_KEY'] = args.openai_api_key
```
