Commit 348cf84: update readme
1 parent 83b7c83 commit 348cf84

File tree: 2 files changed, +90 -27 lines

README.md

Lines changed: 87 additions & 24 deletions
# LiveDRBench: A novel benchmark for Deep Research

_Dataset and evaluation code accompanying our paper, available at [arXiv] (??)_

We propose a formal characterization of the deep research (DR) problem and introduce a new benchmark, _LiveDRBench_, to evaluate the performance of DR systems. To enable objective evaluation, we define DR using an intermediate output representation that encodes key claims uncovered during search, separating the reasoning challenge from surface-level report generation.

## Dataset Details

The benchmark consists of 100 challenging DR tasks over scientific topics (e.g., dataset discovery, materials discovery, novelty search, prior-art discovery) and public-interest events (e.g., the Oscars). The data was collected between May and June 2025. We intend to keep the benchmark live and to release periodic updates with new tasks.

Each task consists of (a) a prompt with a short description of the task and the expected output format, and (b) a ground-truth JSON file containing the claims and references that should be uncovered. We also include an evaluation script that scores DR systems using standard information-retrieval metrics, namely precision, recall, and F1.
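
Once each predicted claim has been matched against the ground truth (in this benchmark, by an LLM judge), the three metrics reduce to simple counts. A minimal sketch of the arithmetic, not the repository's actual scoring code:

```python
def prf1(num_matched: int, num_pred: int, num_gold: int) -> tuple[float, float, float]:
    """Precision, recall, and F1 from matched-claim counts."""
    precision = num_matched / num_pred if num_pred else 0.0
    recall = num_matched / num_gold if num_gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

For example, a system that emits 4 predictions and recovers 2 of 5 ground-truth claims scores precision 0.5, recall 0.4, and F1 of about 0.44.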

**Note**: LiveDRBench does not contain links to external data sources. It includes data from an existing scientific dataset, [Curie](https://github.com/google/curie). All queries are answerable using publicly available information.

## Usage

To evaluate predictions on **LiveDRBench**, provide a predictions file with the following JSON schema:


```json
[
  {
    "key": str,                              // Unique identifier from livedrbench.csv
    "preds": List[List[dict | str] | dict]   // Predictions in the format specified by each question in livedrbench.csv
  },
  ...
]
```
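
For instance, a predictions file for two tasks can be assembled as follows; the keys and prediction payloads below are invented placeholders and must be replaced with the identifiers and formats from `livedrbench.csv`:

```python
import json

# Hypothetical entries: each "key" must match an identifier in livedrbench.csv,
# and each "preds" value follows the output format that task's prompt requests.
predictions = [
    {"key": "example/task_001", "preds": [{"field": "value"}]},
    {"key": "example/task_002", "preds": ["a short answer string"]},
]

# Serialize to the predictions file passed to the evaluation script.
with open("predictions.json", "w") as f:
    json.dump(predictions, f, indent=2)
```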

Then, run the evaluation script with an OpenAI API key. The script computes **precision**, **recall**, and **F1** scores for each benchmark category.

```bash
python src/evaluate.py \
    --openai_api_key YOUR_API_KEY \
    --preds_file path/to/your/predictions.json \
    [--openai_model_name gpt-4o] \
    [--num_threads 8] \
    [--debug]
```

- `--openai_api_key` (required): Your OpenAI API key.
- `--preds_file` (required): Path to the predictions JSON file.
- `--openai_model_name` (optional): Model to use as the judge (default: `gpt-4o`).
- `--num_threads` (optional): Number of parallel threads (default: 8).
- `--debug` (optional): Enable debug mode, which runs without multithreading.
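
The judge calls behind `--num_threads` are I/O-bound API requests, which is why thread-level parallelism helps. The pattern can be sketched as follows; `judge_one` is a hypothetical stand-in for a single OpenAI judging request, not the repository's actual code:

```python
from concurrent.futures import ThreadPoolExecutor

def judge_one(item):
    # Stand-in for one LLM-judge API call; here we just compare strings.
    key, pred, gold = item
    return key, pred == gold

def judge_all(items, num_threads=8):
    # Threads overlap the network latency of concurrent judge requests.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return dict(pool.map(judge_one, items))

verdicts = judge_all([("t1", "BaTiO3", "BaTiO3"), ("t2", "1998", "2001")])
```

Passing `--debug` corresponds to skipping the pool and judging items sequentially, which makes failures easier to trace.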

## Intended Uses

The LiveDRBench repository is best suited for loading the companion benchmark and evaluating existing models, and the LiveDRBench dataset is intended to be used together with the GitHub repository. The code and the benchmark are shared with the research community to facilitate reproduction of our results and to foster further research in this area. LiveDRBench is intended for domain experts who are independently capable of evaluating the quality of outputs before acting on them.

## Out-of-scope Uses

> LiveDRBench is not well suited for training new Deep Research models; it provides only a test set.

> The LiveDRBench dataset is not representative of all kinds of Deep Research queries, especially those that require long-form reports, such as literature reviews.

> We do not recommend using the LiveDRBench repository or dataset in commercial or real-world applications without further testing and development. They are released for research purposes only.

> LiveDRBench should not be used in highly regulated domains where inaccurate outputs could suggest actions that lead to injury or negatively impact an individual's legal, financial, or life opportunities.

## Best Practices

Best performance is achieved by supplying an API key directly to the evaluation script, as described above. LiveDRBench should not be the only measure used to understand the performance of a DR model; additional methods specific to the model's use case should also be used to determine its overall performance.

We strongly encourage users to use LLMs that support robust Responsible AI mitigations, such as Azure OpenAI (AOAI) services. Such services continually update their safety and RAI mitigations to the latest industry standards for responsible use. For more on AOAI's best practices when employing foundation models for scripts and applications, see:

- [Blog post on responsible AI features in AOAI presented at Ignite 2023](https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/announcing-new-ai-safety-amp-responsible-ai-features-in-azure/ba-p/3983686)
- [Overview of Responsible AI practices for Azure OpenAI models](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/overview)
- [Azure OpenAI Transparency Note](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/transparency-note)
- [OpenAI's Usage policies](https://openai.com/policies/usage-policies)
- [Azure OpenAI's Code of Conduct](https://learn.microsoft.com/en-us/legal/cognitive-services/openai/code-of-conduct)

Users are reminded to be mindful of data privacy concerns and are encouraged to review the privacy policies associated with any models and data storage solutions interfacing with LiveDRBench.

It is the user's responsibility to ensure that use of the LiveDRBench repository and dataset complies with relevant data protection regulations and organizational guidelines.

## License

Code in this GitHub repository is licensed under the [MIT License](https://github.com/microsoft/livedrbench/blob/main/LICENSE).

## Contact

If you have suggestions or questions, please contact us at amshar@microsoft.com.

## Citing LiveDRBench

```bibtex
@inproceedings{livedrbench2025,
  title={LiveDRBench: A novel benchmark for Deep Research},
  author={Java, Abhinav and Khandelwal, Ashmit and Midigeshi, Sukruta and Halfaker, Aaron and Deshpande, Amit and Goyal, Navin and Gupta, Ankur and Natarajan, Nagarajan and Sharma, Amit},
  booktitle={arXiv preprint arXiv:2506.08626},
  year={2025}
}
```

src/evaluate.py

Lines changed: 3 additions & 3 deletions

```diff
@@ -143,10 +143,10 @@ def __call__(self):
 if __name__ == "__main__":
     args = ArgumentParser()
     args.add_argument("--openai_api_key", type=str, required=True, help="OpenAI API key")
-    args.add_argument("--openai_model_name", type=str, default="gpt-4o", help="OpenAI model name to use for evaluation")
-    args.add_argument("--preds_file", type=str, required=True, help="Path to the CSV file containing predictions")
+    args.add_argument("--openai_model_name", type=str, default="gpt-4o", help="OpenAI model name to use as judge")
+    args.add_argument("--preds_file", type=str, required=True, help="Path to the JSON file containing predictions")
     args.add_argument("--num_threads", type=int, default=8, help="Number of threads to use for evaluation")
-    args.add_argument("--debug", action='store_true', help="Enable debug mode")
+    args.add_argument("--debug", action='store_true', help="Enable debug mode, without multithreading")
     args = args.parse_args()

     os.environ['OPENAI_API_KEY'] = args.openai_api_key
```
