To better understand the capability of Large Language Models (LLMs) for self-diagnosis, we prompted ChatGPT-4.0 with medical questions from the United States Medical Licensing Examination (USMLE). This repository contains the initial set of USMLE questions, along with the corresponding ChatGPT-4.0 responses, which can then be evaluated by humans.
This repository contains 3 major files across 2 experiments. Experiment 1 assesses the correctness and clarity of ChatGPT-4.0's responses, and Experiment 2 re-prompts ChatGPT-4.0 through sentence dropout, using the questions answered correctly in the first experiment.
- `generate_responses.ipynb`: The notebook used to generate LLM responses for a specific experiment.
- `output.xlsx`: The final ChatGPT-4.0 responses.
- `questions.xlsx`: The initial USMLE questions.
To directly use the results of this repository, download the `output.xlsx` Excel files from Experiments 1 and 2. Refer to Data File Contents for details on the contents of each data file.
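For example, the results can be loaded with `pandas`. A minimal sketch follows, assuming `pandas` and `openpyxl` are installed and that each file sits in its experiment folder as described in this README:

```python
import pandas as pd

# Load the final responses from both experiments.
# Paths assume the repository layout described in this README.
exp1 = pd.read_excel("experiment_1/output.xlsx")
exp2 = pd.read_excel("experiment_2/output.xlsx")

print(exp1.columns.tolist())  # e.g. usmle_1_q_num, usmle_1_q, correct_answer, usmle_1_a
print(exp2.filter(like="usmle_1_a_").shape)  # number of dropout responses per question
```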
If you would like to replicate the experiment and reproduce the results, please follow Replicate the Experiment to begin setting up your local environment.
Experiment 1 contains a single ChatGPT-4.0 response for each question, while Experiment 2 contains multiple responses per question, generated through sentence dropout.
**Experiment 1**

| Column Name | Description |
|---|---|
| `usmle_1_q_num` | An ID referring to each Step 1 question. Any missing numbers are questions that were dropped. |
| `usmle_1_q` | The full multiple-choice textual question from the USMLE Step 1 test. |
| `correct_answer` | The correct answer to the multiple-choice question. |
| `usmle_1_a` | ChatGPT-4.0's answer to the open-ended version of the question. |
**Experiment 2**

| Column Name | Description |
|---|---|
| `usmle_1_q_num` | An ID referring to each Step 1 question. Any missing numbers are questions that were dropped. |
| `usmle_1_q` | The full multiple-choice textual question from the USMLE Step 1 test. |
| `correct_answer` | The correct answer to the multiple-choice question. |
| `non_expert_match` | The assessor categorization from all non-experts, where C = correct, PC = partially correct, I = incorrect, and A = ambiguous. For example, `CCC` represents 3 assessors each categorizing ChatGPT-4.0's response to the particular question as correct. |
| `expert_match` | The assessor categorization from all experts, encoded the same way as `non_expert_match`. |
| `usmle_1_a_*` | Multiple columns sharing this prefix, each representing one ChatGPT-4.0 response per dropped sentence. For example, `usmle_1_a_0` represents ChatGPT-4.0's answer to the open-ended question with the first sentence dropped. |
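To illustrate how the `usmle_1_a_*` variants arise, here is a minimal sketch of sentence dropout. The sentence splitter and variant format are illustrative assumptions, not the exact logic of `generate_responses.ipynb`:

```python
import re

def sentence_dropout_variants(question: str) -> list[str]:
    """Return one variant of the question per sentence, each with that sentence removed."""
    # Naive split on terminal punctuation; the notebook may use a more
    # robust sentence tokenizer -- this splitter is an assumption.
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", question.strip()) if s]
    return [
        " ".join(sentences[:i] + sentences[i + 1:])
        for i in range(len(sentences))
    ]

# Variant 0 corresponds to usmle_1_a_0: the question with its first sentence dropped.
variants = sentence_dropout_variants(
    "A 45-year-old man presents with chest pain. His troponin is elevated. What is the diagnosis?"
)
```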
To set up your local environment and replicate the experiment:

- Create a local virtual environment:
  ```bash
  python3 -m venv .venv
  ```
- Activate the virtual environment:
  ```bash
  source .venv/bin/activate
  ```
- Install dependencies:
  ```bash
  pip install -r requirements.txt
  ```
- Attain your API key from OpenAI. You may refer to this guide to locate your OpenAI key.
- Create an `.env` file with the key `OPENAI_API_KEY`:
  ```bash
  touch .env
  ```
- Open `generate_responses.ipynb` in the `experiment_1` folder and run each cell of code (a sketch of the notebook's core loop appears after this list).
- Have humans evaluate the `output.xlsx` file from Experiment 1 to determine which questions the LLM answered correctly and clearly.
- Update the `non_expert_match` and `expert_match` columns in the `questions.xlsx` file from Experiment 2 with the results from the previous step.
- Open `generate_responses.ipynb` in the `experiment_2` folder and run each cell of code.
- Have humans evaluate the `output.xlsx` file from Experiment 2 to assess the robustness of the LLM by determining which questions it continued to answer correctly.
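As a rough guide, the core of each notebook looks something like the following. This is a hedged sketch: the model name, prompt wording, and file paths are assumptions, not the notebook's exact code:

```python
import os

import pandas as pd
from dotenv import load_dotenv
from openai import OpenAI

# Read OPENAI_API_KEY from the .env file created above.
load_dotenv()
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

questions = pd.read_excel("questions.xlsx")

responses = []
for question in questions["usmle_1_q"]:
    # Ask the open-ended version of each USMLE question; the actual
    # prompt used by the notebook may differ.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
    )
    responses.append(completion.choices[0].message.content)

questions["usmle_1_a"] = responses
questions.to_excel("output.xlsx", index=False)
```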