
ClinDiag

This is the official repo for the paper ClinDiag: Grounding Large Language Model in Clinical Diagnostics.

🔗 Demo website: https://clindiag.streamlit.app/

Table of Contents

  1. Installation
  2. Demo
  3. Datasets
  4. Usage

Installation

This installation requires no non-standard hardware.

The installation time depends on your internet bandwidth and typically takes less than 30 minutes.

Set up a virtual environment

When using pip, it is generally recommended to install packages in a virtual environment to avoid modifying system state. We use conda as an example here:

Create and activate:

$ conda create -n clindiag python==3.12
$ conda activate clindiag

To deactivate later, run:

(clindiag) conda deactivate

Install dependencies

(clindiag) pip install -r requirements.txt

Add API configs

Before running a script, open configs/OAI_Config_List.json and fill in your model name and API key:

{
    "model": "gpt-4o-mini",
    "api_key": "[YOUR_API_KEY]",
    "base_url": "[YOUR_BASE_URL](optional)",
    "tags": [
        "gpt-4o-mini"
    ]
}

The tags are used to filter the selected model(s) for each stage; see parse_args() in the code scripts for details.
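
For reference, the snippet below is a minimal sketch of how tag-based filtering could work, assuming the scripts load the config file with AutoGen's config_list_from_json helper (an assumption based on the OAI_Config_List.json naming convention; the actual loading logic lives in the code scripts):

import autogen

# Keep only the config entries tagged "gpt-4o-mini" (hypothetical example).
config_list = autogen.config_list_from_json(
    "configs/OAI_Config_List.json",
    filter_dict={"tags": ["gpt-4o-mini"]},
)
print(config_list)  # e.g. [{"model": "gpt-4o-mini", "api_key": "...", ...}]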

Demo

💬 ClinDiag-GPT

Trained on our fine-tuning dataset, ClinDiag-GPT showed superior performance in clinical diagnostic procedures. Although we cannot provide direct API access to our fine-tuned model due to security and cost considerations, feel free to chat with ClinDiag-GPT on our demo website.

Running the demo on our website does not require any additional installation or configuration. When interacting with a human, a full demo case normally takes 10-20 minutes.

🏗 ClinDiag-Framework

To test out the 2-agent ClinDiag-Framework, run:

(clindiag) python code/trial_doctor_provider.py --data_dir sample_data
  • --data_dir: root directory of input case folders. Here we use sample_data for a quick demo
  • --output_dir: directory to save output files, defaults to output
  • --model_name_{history/pe/test/diagnosis}: models used for the doctor agent in each stage, defaults to gpt-4o-mini
  • --model_name_provider: model used for the provider agent, defaults to gpt-4o-mini
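
For reference, a fully spelled-out invocation might look like the following; the per-stage flag names here assume the {history/pe/test/diagnosis} placeholder expands to four separate flags (e.g. --model_name_history), so check parse_args() in code/trial_doctor_provider.py for the exact names:

(clindiag) python code/trial_doctor_provider.py --data_dir sample_data --output_dir output --model_name_history gpt-4o-mini --model_name_diagnosis gpt-4o-mini --model_name_provider gpt-4o-mini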

Datasets

ClinDiag-Benchmark

Clindiag_Benchmark.zip

A comprehensive clinical dataset of 2,021 real-world cases, encompassing both rare and common diseases across 32 specialties.

Note: The full ClinDiag-Benchmark (n=4,421) used in our study comprises three subsets:

  1. Challenging Case Subset (n=1,719)
  2. Rare Disease Subset (n=302)
  3. Emergency Case Subset (n=2,400)

The provided benchmark_dataset/ contains only the first two subsets.

The Emergency Case Subset is derived from the MIMIC-IV-Ext Clinical Decision Making dataset, which is officially available at https://physionet.org/. Users should follow the guidelines provided there to gain access to the MIMIC data and adhere to its data use policy.

Standardized Patients (n=35)

./human_examiner_scripts/

A set of 35 patient scripts sourced from the hospital’s Objective Structured Clinical Examination (OSCE) test dataset for standardized patient training.

Usage

Below are instructions to run experiments on the full benchmark dataset.

Human+LLM

Human as Doctor

This script implements a version of our human-LLM collaboration framework in which an LLM serves as an assistant that answers the physician's questions.

(clindiag) python code/human_as_doctor.py --data_dir benchmark_dataset --output_dir output

By default, output files will be saved to ./output/human_as_doctor/.... You can set a different output directory by specifying --output_dir (the same applies to all scripts below).

Human as Provider

Our framework also allows a human to act as the information provider, while LLM agents act as the doctors driving the diagnostic process.

(clindiag) python code/human_as_provider.py --data_dir benchmark_dataset

Human Alone

This simulates the human-alone scenario, in which a physician performs the entire clinical diagnostic procedure on their own within the ClinDiag-Framework.

(clindiag) python code/human_alone.py --data_dir benchmark_dataset

Additional Study

The following scripts were used for additional studies. We examined the effects of (1) multi-doctor collaboration, (2) introducing a critic agent, and (3) prompt engineering on diagnostic performance.

1. Multi-doctor agents

We tested the effect of having 2–3 doctor agents collaborate in the clinical decision-making process.

(clindiag) python code/trial_multidoctor.py --data_dir benchmark_dataset --num_specialists 2
  • --num_specialists: number of doctor agents, defaults to 3

2. Critic agent

This framework incorporates a critic agent that suggests revisions to the doctor agent's questions.

(clindiag) python code/trial_critic.py --data_dir benchmark_dataset --model_name_critic gpt-4o-mini
  • --model_name_critic: model used for the critic agent, defaults to gpt-4o-mini

3. Expert prompt

This script adopts expert-generated prompts.

(clindiag) python code/trial_expert_prompt.py --data_dir benchmark_dataset

Fine-Tuning Data

./finetune_example.json

The multi-turn chat dataset used for fine-tuning a chat model. Due to copyright concerns, the raw source materials are not distributed. Instead, we provide a curated synthetic example that illustrates the training data format.

{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
{"messages": [{"role": "system", "content": "..."}, {"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
