Official repository for the accepted CHIL 2025 paper.

This GitHub repo accompanies our upcoming publication:
Zhang et al. CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports.
To appear in the Proceedings of the Conference on Health, Inference, and Learning (CHIL 2025), PMLR.
The official PMLR citation and link will be added upon publication.
CaseReportBench is the first benchmark designed for dense information extraction from clinical case reports, focused on rare diseases, especially Inborn Errors of Metabolism (IEMs). This benchmark evaluates how well large language models (LLMs) can extract structured, clinically relevant data across 14 system-level categories, such as Neurology, History, Lab/Imaging, and Musculoskeletal (MSK).
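To make the extraction target concrete, one record can be pictured as a category-keyed mapping like the sketch below. The four category names come from the list above; the findings are invented placeholders, and the released dataset's exact schema may differ.

```python
# Hypothetical shape of one dense-extraction record. The category names follow
# the paper's system-level scheme; the findings are invented placeholders.
extraction = {
    "Neurology": ["developmental delay", "seizures"],
    "History": ["consanguineous parents"],
    "Lab/Imaging": ["elevated plasma ammonia"],
    "MSK": [],  # a category with no reported findings stays empty
    # ...remaining system-level categories (14 in total)
}
```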
Key Contributions:
- A curated dataset of 138 expert-annotated case reports.
- Dense extractions across 14 predefined diagnostic categories.
- Evaluation of LLMs including Qwen2, Qwen2.5, LLaMA3, and GPT-4o.
- Novel prompting strategies: Filtered Category-Specific Prompting (FCSP), Uniform Category-Specific Prompting (UCP), and Unified Global Prompting (UGP) (sketched below).
- Expert clinical assessment of model outputs.
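The three strategies differ in how the 14 categories are presented to the model. The sketch below reflects one plausible reading of the names, not the repository's actual code: UGP issues a single prompt covering all categories, UCP issues one prompt per category for every category, and FCSP issues category prompts only for categories pre-filtered as relevant. The prompt wording and the `query_llm` and `is_relevant` helpers are hypothetical placeholders.

```python
# Hypothetical sketch of the three prompting strategies. The prompt text,
# query_llm(), and is_relevant() are illustrative placeholders, not the
# repository's actual implementation.
CATEGORIES = ["Neurology", "History", "Lab/Imaging", "MSK"]  # 4 of the 14

def query_llm(prompt: str) -> str:
    """Stand-in for a real model call (API request or local inference)."""
    raise NotImplementedError

def is_relevant(report: str, category: str) -> bool:
    """Stand-in for the filtering step that FCSP applies before prompting."""
    raise NotImplementedError

def ugp(report: str) -> str:
    # Unified Global Prompting: a single prompt covering every category.
    return query_llm(f"Extract findings for {CATEGORIES} from:\n{report}")

def ucp(report: str) -> dict:
    # Uniform Category-Specific Prompting: one prompt per category, always.
    return {c: query_llm(f"Extract {c} findings from:\n{report}") for c in CATEGORIES}

def fcsp(report: str) -> dict:
    # Filtered Category-Specific Prompting: category prompts only where relevant.
    return {c: query_llm(f"Extract {c} findings from:\n{report}")
            for c in CATEGORIES if is_relevant(report, c)}
```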
The `src/` folder contains all key components for dataset construction, prompting logic, and LLM evaluation:
| Folder | Description |
|---|---|
| `dataset_construction/` | Scripts to process PMC-OA case reports, filter IEM cases, and structure the data into prompt-ready JSON. Includes code for merging expert annotations and TSR-based filtering. |
| `benchmarking_llms/` | Evaluates LLM dense information extractions against gold expert-crafted annotations and computes all metrics (TSR, EM, hallucination, etc.). |
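As a minimal sketch of how two of the listed metrics could be computed, the snippet below assumes EM means normalized exact match and TSR means the token-set-ratio similarity score as implemented in the `rapidfuzz` package; the repository's actual metric code may define both differently.

```python
# Minimal scoring sketch, assuming EM is normalized exact match and TSR is
# rapidfuzz's token_set_ratio; the repository's metric definitions may differ.
from rapidfuzz import fuzz

def exact_match(pred: str, gold: str) -> bool:
    # EM: strings must match exactly after whitespace/case normalization.
    return pred.strip().lower() == gold.strip().lower()

def token_set_ratio(pred: str, gold: str) -> float:
    # TSR: order-insensitive token-overlap similarity in [0, 100].
    return fuzz.token_set_ratio(pred, gold)

print(exact_match("seizures", "Seizures"))  # True
print(token_set_ratio("MRI basal ganglia lesions",
                      "basal ganglia lesions on MRI"))  # 100.0 (order ignored)
```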
This dataset includes the following supplementary files:
- `65_Excluded_Subheadings_Casefilter.json`: Subheading-level case-filtering metadata.
- `65_Subheading_Category_Mapping.json`: Mapping of subheadings to clinical categories.
- `65_Excluded_Title_Manual_Review.txt`: Manually reviewed titles excluded from the dataset.
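The JSON files above can be inspected directly with the standard library; the snippet assumes only that they are plain JSON, as their internal structure is not specified here.

```python
import json

# Load the subheading-to-category mapping shipped with the dataset; only
# standard JSON is assumed, the internal structure is not specified here.
with open("65_Subheading_Category_Mapping.json") as f:
    subheading_to_category = json.load(f)
```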
These files support the CHIL 2025 submission and are referenced in the accompanying arXiv paper.
To set up the environment using Conda:
```bash
conda env create -f environment.yaml
conda activate CaseReportBench
```

The dataset is available on the Hugging Face Hub:
👉 https://huggingface.co/datasets/cxyzhang/caseReportBench_ClinicalDenseExtraction_Benchmark
To load it in Python:
```python
from datasets import load_dataset

dataset = load_dataset("cxyzhang/consolidated_expert_validated_denseExtractionDataset")
```
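Continuing from the load above, individual records can be inspected like ordinary dictionaries; the snippet assumes a default `train` split, but the actual split names and columns are defined by the Hub dataset.

```python
# Illustrative only: split and column names are defined by the Hub dataset.
example = dataset["train"][0]
print(example.keys())  # inspect the available columns
```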
License:
- Code: MIT License (LICENSE.txt)
- Dataset: CC BY-NC 4.0 (DATA_LICENSE.txt)
The dataset is derived from the PubMed Central Open Access Subset and is for non-commercial academic use only.
If you use this work in your research, please cite:
```bibtex
@inproceedings{zhang2025casereportbench,
  title={CaseReportBench: An LLM Benchmark Dataset for Dense Information Extraction in Clinical Case Reports},
  author={Zhang, Xiao Yu Cindy and Ferreira, Carlos R. and Rossignol, Francis and Ng, Raymond T. and Wasserman, Wyeth and Zhu, Jian},
  booktitle={Proceedings of the Sixth Conference on Health, Inference, and Learning},
  series={Proceedings of Machine Learning Research},
  volume={287},
  pages={527--542},
  year={2025},
  publisher={PMLR}
}
```