tndrg/CHRP_LLM

Evaluating Assessment of Psychosis Risk Using Large Language Models

A lightweight, reproducible pipeline for CAARMS-aligned symptom scoring from clinical high risk for psychosis (CHR-P) interview transcripts using locally deployed open-weight LLMs, plus evaluation scripts for (i) item-level agreement, (ii) subject–visit CHR-P detection, (iii) domain-stratified performance, (iv) fairness diagnostics, and (v) clinician/expert review.

Repository structure

.
├── prompts/                                  # Symptom-domain prompt templates (Q1–Q15)
├── run_allqs.py                              # Run LLM scoring across all domains (Q1–Q15)
├── eval_allqs.py                             # Core evaluation: item-level + subject–visit metrics
├── 1. Fairness & Question-level evaluation.r # Fairness + domain/question-level analyses
└── 2. Expert evaluation.r                    # Clinician/expert review analyses

What this repo does

  1. LLM scoring (Q1–Q15)
    Runs CAARMS-aligned, symptom-specific prompts over preprocessed interview transcripts to generate:

    • severity/intensity score (0–6)
    • frequency score (0–6)
    • short evidence-based summary (3–5 sentences)
  2. Item-level evaluation
    Computes agreement between model outputs and clinician ratings:

    • Pearson correlation
    • ICC (absolute agreement)
    • output completeness + repair rate
    • AUROC for CHR-P discrimination (severity-only and severity+frequency)
  3. Subject–visit CHR-P detection
    Aggregates domain-level outputs to subject–visit events and evaluates:

    • accuracy, sensitivity, specificity
    • precision, F1, MCC
  4. Domain/question-level performance
    Breaks down performance by CAARMS-aligned symptom domain.

  5. Fairness diagnostics
    Reports predicted positive rate (PPR), true positive rate (TPR), and false positive rate (FPR) across demographic/site strata to inform demographic parity and equalized odds.

  6. Expert evaluation
    Supports clinician/expert review of summaries and hard cases (e.g., confabulations).
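For reference, the subject–visit detection metrics listed in step 3 can be computed from a 2×2 confusion matrix as below. This is a minimal stand-alone sketch, not the implementation in `eval_allqs.py`:

```python
import math

def subject_visit_metrics(tp, fp, tn, fn):
    """Subject-visit CHR-P detection metrics from confusion-matrix counts."""
    sens = tp / (tp + fn) if tp + fn else 0.0   # sensitivity (recall)
    spec = tn / (tn + fp) if tn + fp else 0.0   # specificity
    prec = tp / (tp + fp) if tp + fp else 0.0   # precision (PPV)
    acc = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * prec * sens / (prec + sens) if prec + sens else 0.0
    # Matthews correlation coefficient; 0.0 when any marginal is empty.
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / den if den else 0.0
    return {"accuracy": acc, "sensitivity": sens, "specificity": spec,
            "precision": prec, "f1": f1, "mcc": mcc}
```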

Setup

Python

Recommended environment:

  • Python 3.11+
  • PyTorch 2.8.0
  • transformers 4.57.1
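If helpful, the pins above can be captured in a `requirements.txt` (a suggestion; no requirements file is shipped with the repo, and the exact torch build may vary with your CUDA setup):

```text
torch==2.8.0
transformers==4.57.1
```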

Inputs

You will need:

  • Preprocessed transcript files (de-identified/redacted) with consistent speaker turns
  • A mapping of transcripts to Q1–Q15 symptom domains (PSYCHS/CAARMS ordering)
  • Clinician ground-truth labels (severity/frequency) for evaluation
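The repo does not pin down an exact input schema; a hypothetical JSONL record consistent with the requirements above (de-identified speaker turns, a Q1–Q15 domain tag, clinician labels) might look like this. All field names here are illustrative assumptions:

```python
import json

# Hypothetical transcript record (field names are illustrative, not the
# repo's schema): de-identified speaker turns plus a CAARMS/PSYCHS-ordered
# domain tag and clinician ground-truth labels for evaluation.
record = {
    "subject_id": "SUB-001",   # de-identified subject code
    "visit": "baseline",
    "domain": "Q1",            # one of Q1-Q15
    "turns": [
        {"speaker": "interviewer", "text": "Have you noticed any unusual thoughts?"},
        {"speaker": "participant", "text": "Sometimes my mind plays tricks on me."},
    ],
    "clinician": {"severity": 2, "frequency": 3},  # ground-truth labels
}
line = json.dumps(record)  # one JSON object per line in a JSONL file
```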

Run LLM scoring

Run scoring across all questions/domains (Q1–Q15):

python run_allqs.py

Notes:

  • Deterministic decoding is recommended (temperature=0) for reproducibility.
  • Small models typically run on a single GPU; large models may require sharding/multi-GPU.
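The deterministic setting can be expressed as a transformers generation-kwargs dict, sketched below. Model/tokenizer loading is assumed, and the token budget is an illustrative choice, not a value taken from `run_allqs.py`:

```python
# Generation settings for reproducible decoding with transformers.
# do_sample=False makes generate() greedy and deterministic; the sampling
# temperature is then irrelevant.
DETERMINISTIC_GENERATION = {
    "do_sample": False,     # greedy decoding, no sampling
    "num_beams": 1,         # plain greedy rather than beam search
    "max_new_tokens": 512,  # illustrative budget: two scores + short summary
}

# Usage (assuming `model` and `tokenizer` are a loaded causal LM):
# inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# output_ids = model.generate(**inputs, **DETERMINISTIC_GENERATION)
```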

Evaluate outputs

Item-level + subject–visit metrics:

python eval_allqs.py 

This produces summary tables (e.g., overall item agreement, AUROC, subject–visit detection).

Fairness and expert evaluation (R)

source("1. Fairness & Question-level evaluation.r")
source("2. Expert evaluation.r")

Output repair (schema recovery)

If the model output is not valid JSON, evaluation applies a three-stage fallback:

  1. parse fenced JSON block
  2. parse first balanced JSON object in text
  3. regex extraction of severity/frequency

Recovered predictions are tracked in evaluation outputs.
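The three stages above can be sketched as follows. This is a minimal illustration of the fallback order, not the exact logic in `eval_allqs.py`:

```python
import json
import re

def repair_output(text):
    """Recover a {"severity": int, "frequency": int} prediction from raw
    model output via a three-stage fallback. Returns (prediction_or_None,
    stage_label)."""
    # Stage 1: parse a fenced ```json ... ``` block.
    m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1)), "fenced"
        except json.JSONDecodeError:
            pass
    # Stage 2: parse the first balanced {...} object anywhere in the text.
    # (Simple brace counting; ignores braces inside JSON strings, which is
    # acceptable for short score objects.)
    start = text.find("{")
    while start != -1:
        depth = 0
        for i in range(start, len(text)):
            if text[i] == "{":
                depth += 1
            elif text[i] == "}":
                depth -= 1
                if depth == 0:
                    try:
                        return json.loads(text[start:i + 1]), "balanced"
                    except json.JSONDecodeError:
                        break  # try the next candidate start
        start = text.find("{", start + 1)
    # Stage 3: regex extraction of the two 0-6 scores from free text.
    sev = re.search(r"severity\D{0,20}?([0-6])", text, re.IGNORECASE)
    freq = re.search(r"frequency\D{0,20}?([0-6])", text, re.IGNORECASE)
    if sev and freq:
        return {"severity": int(sev.group(1)),
                "frequency": int(freq.group(1))}, "regex"
    return None, "failed"
```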

Reproducibility

Recommended settings:

  • do_sample = False (no sampling)
  • batch size = 1
  • single query per transcript–domain instance

Citation

If you use this code, please cite the associated manuscript (once available).

  • Preprint: TBA

License

BSD 3-Clause License

Copyright (c) 2025, University of Oxford & King's College London. All rights reserved.
