A lightweight, reproducible pipeline for CAARMS-aligned symptom scoring from CHR-P interview transcripts using locally deployed open-weight LLMs, plus evaluation scripts for (i) item-level agreement, (ii) subject–visit CHR-P detection, (iii) domain-stratified performance, (iv) fairness diagnostics, and (v) clinician/expert review.
```
.
├── prompts/                                   # Symptom-domain prompt templates (Q1–Q15)
├── run_allqs.py                               # Run LLM scoring across all domains (Q1–Q15)
├── eval_allqs.py                              # Core evaluation: item-level + subject–visit metrics
├── 1. Fairness & Question-level evaluation.r  # Fairness + domain/question-level analyses
└── 2. Expert evaluation.r                     # Clinician/expert review analyses
```
**LLM scoring (Q1–Q15).** Runs CAARMS-aligned, symptom-specific prompts over preprocessed interview transcripts to generate:

- severity/intensity score (0–6)
- frequency score (0–6)
- short evidence-based summary (3–5 sentences)
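For illustration only, a single per-domain output record might look like this (the field names are assumptions, not the pipeline's exact schema, which is defined by the templates in `prompts/` and `run_allqs.py`):

```python
import json

# Made-up example of one per-domain output record; actual field names and
# schema come from the prompt templates in prompts/ and run_allqs.py.
record = json.loads("""
{
  "question": "Q1",
  "severity": 3,
  "frequency": 4,
  "summary": "Participant reports occasional perceptual changes... (3-5 sentence evidence-based summary)."
}
""")

# Both scores are on the 0-6 CAARMS-aligned scale.
assert 0 <= record["severity"] <= 6 and 0 <= record["frequency"] <= 6
```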
**Item-level evaluation.** Computes agreement between model outputs and clinician ratings:

- Pearson correlation
- ICC (absolute agreement)
- output completeness + repair rate
- AUROC for CHR-P discrimination (severity-only and severity+frequency)
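A minimal illustration of these item-level metrics on made-up data (not the shipped `eval_allqs.py`; the ICC here is the classic two-way absolute-agreement ICC(2,1) computed from ANOVA mean squares):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Made-up item-level data: model vs. clinician severity, plus a CHR-P label.
model = np.array([0, 2, 3, 5, 4, 1, 6, 2])
clin  = np.array([0, 3, 3, 4, 4, 1, 5, 2])
chrp  = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # clinician CHR-P label

r, _ = pearsonr(model, clin)

# ICC(2,1), absolute agreement, two raters, from two-way ANOVA mean squares.
ratings = np.stack([model, clin], axis=1).astype(float)
n, k = ratings.shape
grand = ratings.mean()
msr = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)  # rows (subjects)
msc = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)  # columns (raters)
sst = np.sum((ratings - grand) ** 2)
mse = (sst - (n - 1) * msr - (k - 1) * msc) / ((n - 1) * (k - 1))
icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Severity-only discrimination of the CHR-P label.
auroc = roc_auc_score(chrp, model)
print(f"pearson={r:.2f} icc={icc:.2f} auroc={auroc:.2f}")
```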
**Subject–visit CHR-P detection.** Aggregates domain-level outputs to subject–visit events and evaluates:

- accuracy, sensitivity, specificity
- precision, F1, MCC
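As a sketch, the subject–visit metrics above follow from a confusion matrix (labels below are made up; the aggregation rule itself lives in `eval_allqs.py`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, f1_score, matthews_corrcoef

# Made-up subject-visit CHR-P labels: 1 = CHR-P, 0 = not.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # recall on CHR-P positives
specificity = tn / (tn + fp)
precision   = precision_score(y_true, y_pred)
f1  = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"acc={accuracy:.2f} sens={sensitivity:.2f} spec={specificity:.2f} "
      f"prec={precision:.2f} f1={f1:.2f} mcc={mcc:.2f}")
```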
**Domain/question-level performance.** Breaks down performance by CAARMS-aligned symptom domain.
**Fairness diagnostics.** Reports predicted positive rate (PPR), true positive rate (TPR), and false positive rate (FPR) across demographic/site strata to inform demographic-parity and equalized-odds checks.
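The per-stratum rates can be sketched as follows (column names are illustrative, not the repo's; PPR differences speak to demographic parity, TPR/FPR differences to equalized odds):

```python
import pandas as pd

# Made-up subject-visit predictions with a site stratum.
df = pd.DataFrame({
    "site":   ["A", "A", "A", "B", "B", "B", "B", "A"],
    "y_true": [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 0, 1, 0],
})

def rates(g: pd.DataFrame) -> pd.Series:
    pos, neg = g[g.y_true == 1], g[g.y_true == 0]
    return pd.Series({
        "PPR": g.y_pred.mean(),                                  # demographic parity
        "TPR": pos.y_pred.mean() if len(pos) else float("nan"),  # equalized odds
        "FPR": neg.y_pred.mean() if len(neg) else float("nan"),
    })

table = df.groupby("site")[["y_true", "y_pred"]].apply(rates)
print(table)
```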
**Expert evaluation.** Supports clinician/expert review of summaries and hard cases (e.g., confabulations).
Recommended environment:
- Python 3.11+
- PyTorch 2.8.0
- transformers 4.57.1
You will need:
- Preprocessed transcript files (de-identified/redacted) with consistent speaker turns
- A mapping of transcripts to Q1–Q15 symptom domains (PSYCHS/CAARMS ordering)
- Clinician ground-truth labels (severity/frequency) for evaluation
Run scoring across all questions/domains (Q1–Q15):

```bash
python run_allqs.py
```

Notes:

- Deterministic decoding is recommended (`temperature=0`) for reproducibility.
- Small models typically run on a single GPU; large models may require sharding/multi-GPU.
Item-level + subject–visit metrics:

```bash
python eval_allqs.py
```

This produces summary tables (e.g., overall item agreement, AUROC, subject–visit detection).
Fairness, domain/question-level, and expert analyses run in R:

```r
source("1. Fairness & Question-level evaluation.r")
source("2. Expert evaluation.r")
```

If the model output is not valid JSON, evaluation applies a three-stage fallback:

1. parse a fenced JSON block
2. parse the first balanced JSON object in the text
3. regex extraction of severity/frequency
Recovered predictions are tracked in evaluation outputs.
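A minimal stdlib sketch of such a three-stage fallback (the shipped implementation is in `eval_allqs.py`; the function name and regexes here are illustrative):

```python
import json
import re

def recover_scores(text: str):
    """Illustrative three-stage recovery of severity/frequency from raw model text."""
    # Stage 1: fenced JSON block (non-greedy; assumes a flat object).
    m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1)), "fenced"
        except json.JSONDecodeError:
            pass
    # Stage 2: first balanced JSON object anywhere in the text.
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            depth += ch == "{"
            depth -= ch == "}"
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1]), "balanced"
                except json.JSONDecodeError:
                    break
        start = text.find("{", start + 1)
    # Stage 3: regex extraction of the two scores.
    sev = re.search(r"severity\D{0,20}?(\d)", text, re.IGNORECASE)
    frq = re.search(r"frequency\D{0,20}?(\d)", text, re.IGNORECASE)
    if sev and frq:
        return {"severity": int(sev.group(1)), "frequency": int(frq.group(1))}, "regex"
    return None, "failed"
```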
Recommended settings:
- do_sample = False (no sampling)
- batch size = 1
- single query per transcript–domain instance
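As a sketch, these settings map onto generation kwargs like the following (the exact call in `run_allqs.py` may differ):

```python
# Illustrative generation settings matching the recommendations above; the
# exact kwargs passed to the model in run_allqs.py may differ.
gen_kwargs = {
    "do_sample": False,     # greedy decoding -> deterministic, reproducible runs
    "max_new_tokens": 512,  # illustrative cap for two scores + short summary
}
batch_size = 1              # one transcript-domain instance per query
```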
If you use this code, please cite the associated manuscript (once available).
- Preprint: TBA
BSD 3-Clause License
Copyright (c) 2025, University of Oxford & King's College London. All rights reserved.