This is the official repo for the paper ClinDiag: Grounding Large Language Model in Clinical Diagnostics.
🔗 Demo website: https://clindiag.streamlit.app/
This installation requires no non-standard hardware.
Installation time depends on your internet bandwidth and typically takes less than 30 minutes.
When using pip, it is generally recommended to install packages in a virtual environment to avoid modifying the system state. We use conda as an example here:
Create and activate:
```
$ conda create -n clindiag python==3.12
$ conda activate clindiag
```

To deactivate later, run:

```
(clindiag) conda deactivate
```

Install the required packages:

```
(clindiag) pip install -r requirements.txt
```

Before running a script, go to `configs/OAI_Config_List.json` to fill in your model name and API key.
```json
{
    "model": "gpt-4o-mini",
    "api_key": "[YOUR_API_KEY]",
    "base_url": "[YOUR_BASE_URL](optional)",
    "tags": [
        "gpt-4o-mini"
    ]
}
```

The tags are used to filter the selected model(s) for each stage; see `parse_args()` in the code scripts for details.
- 🔗 Demo website: https://clindiag.streamlit.app/
Trained on our fine-tuning dataset, ClinDiag-GPT showed superior performance in clinical diagnostic procedures. Although we cannot provide direct API access to our fine-tuned model due to security and cost considerations, feel free to chat with ClinDiag-GPT on our demo website.
Running the demo on our website does not require any additional installation or configuration. When interacting with a human, a full demo case normally takes 10–20 minutes.
To test out the 2-agent ClinDiag-Framework, run:
```
(clindiag) python code/trial_doctor_provider.py --data_dir sample_data
```

- `--data_dir`: root directory of input case folders. Here we use `sample_data` for a quick demo.
- `--output_dir`: directory to save output files; defaults to `output`.
- `--model_name_{history/pe/test/diagnosis}`: models used for the doctor agent in each stage; default to `gpt-4o-mini`.
- `--model_name_provider`: model used for the provider agent; defaults to `gpt-4o-mini`.
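For instance, a run that overrides the diagnosis-stage model while keeping the other defaults could look like this (the model name here is only an illustration):

```
(clindiag) python code/trial_doctor_provider.py \
    --data_dir sample_data \
    --output_dir output \
    --model_name_diagnosis gpt-4o \
    --model_name_provider gpt-4o-mini
```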
Clindiag_Benchmark.zip
A comprehensive clinical dataset of 2,021 real-world cases, encompassing both rare and common diseases across 32 specialties.
Note: The full ClinDiag-Benchmark (n=4,421) used in our study comprises three subsets:
- Challenging Case Subset (n=1,719)
- Rare Disease Subset (n=302)
- Emergency Case Subset (n=2,400)
The provided `benchmark_dataset/` contains only the former two subsets. The Emergency Case Subset is derived from the MIMIC-IV-Ext Clinical Decision Making Dataset, which is officially available at https://physionet.org/. Users should follow the guidelines provided there to gain access to the MIMIC dataset and adhere to their data use policy.
./human_examiner_scripts/
A set of 35 patient scripts sourced from the hospital’s Objective Structured Clinical Examination (OSCE) test dataset for standardized patient training.
Below are instructions to run experiments on the full benchmark dataset.
This script implements a version of our human-LLM collaboration framework in which LLMs serve as an assistant that answers the physician's questions.
```
(clindiag) python code/human_as_doctor.py --data_dir benchmark_dataset --output_dir output
```

By default, output files are saved to `./output/human_as_doctor/...`. You can set your desired output directory by specifying `--output_dir` (same for all scripts below).
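For example, to collect results in a custom folder (the directory name is only an illustration):

```
(clindiag) python code/human_as_doctor.py --data_dir benchmark_dataset --output_dir my_results
```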
Our framework also allows a human to act as the information provider, while LLMs act as the doctors driving the diagnostic process.

```
(clindiag) python code/human_as_provider.py --data_dir benchmark_dataset
```

This simulates the human-alone scenario, where a physician performs the clinical diagnostic procedure entirely on their own within the ClinDiag-Framework.

```
(clindiag) python code/human_alone.py --data_dir benchmark_dataset
```

The following scripts were used for additional studies. We examined the effects of (1) multi-doctor collaboration, (2) introducing a critic agent, and (3) prompt engineering on diagnostic performance.
We tested the effect of having 2–3 doctor agents collaborate in the clinical decision-making process.
```
(clindiag) python code/trial_multidoctor.py --data_dir benchmark_dataset --num_specialists 2
```

- `--num_specialists`: number of doctor agents; defaults to `3`.
This framework incorporates a critic agent that suggests further revisions to the doctor agent's questions.

```
(clindiag) python code/trial_critic.py --data_dir benchmark_dataset --model_name_critic gpt-4o-mini
```

- `--model_name_critic`: model used for the critic agent; defaults to `gpt-4o-mini`.
This script adopts expert-generated prompts.
```
(clindiag) python code/trial_expert_prompt.py --data_dir benchmark_dataset
```

./finetune_example.json
The multi-turn chat dataset used for fine-tuning a chat model. Due to copyright concerns, the raw source materials are not distributed. Instead, we curated a synthetic example to illustrate the training data format.
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}