A lightweight, reproducible pipeline for CAARMS-aligned symptom scoring from CHR-P interview transcripts using locally deployed open-weight LLMs, plus evaluation scripts for (i) item-level agreement, (ii) subject–visit CHR-P detection, (iii) domain-stratified performance, (iv) fairness diagnostics, and (v) clinician/expert review.
```
.
├── prompts/                                   # Symptom-domain prompt templates (Q1–Q15)
├── run_allqs.py                               # Run LLM scoring across all domains (Q1–Q15)
├── eval_allqs.py                              # Core evaluation: item-level + subject–visit metrics
├── 1. Fairness & Question-level evaluation.r  # Fairness + domain/question-level analyses
└── 2. Expert evaluation.r                     # Clinician/expert review analyses
```
**LLM scoring (Q1–Q15).** Runs CAARMS-aligned, symptom-specific prompts over preprocessed interview transcripts to generate:

- severity/intensity score (0–6)
- frequency score (0–6)
- short evidence-based summary (3–5 sentences)
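For illustration only, a single per-domain output record might look like this (the field names are assumptions, not the pipeline's exact schema, which is defined by the templates in `prompts/` and `run_allqs.py`):

```python
import json

# Made-up example of one per-domain output record; actual field names and
# schema come from the prompt templates in prompts/ and run_allqs.py.
record = json.loads("""
{
  "question": "Q1",
  "severity": 3,
  "frequency": 4,
  "summary": "Participant reports occasional perceptual changes... (3-5 sentence evidence-based summary)."
}
""")

# Both scores are on the 0-6 CAARMS-aligned scale.
assert 0 <= record["severity"] <= 6 and 0 <= record["frequency"] <= 6
```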
**Item-level evaluation.** Computes agreement between model outputs and clinician ratings:

- Pearson correlation
- ICC (absolute agreement)
- output completeness + repair rate
- AUROC for CHR-P discrimination (severity-only and severity+frequency)
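A minimal illustration of these item-level metrics on made-up data (not the shipped `eval_allqs.py`; the ICC here is the classic two-way absolute-agreement ICC(2,1) computed from ANOVA mean squares):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

# Made-up item-level data: model vs. clinician severity, plus a CHR-P label.
model = np.array([0, 2, 3, 5, 4, 1, 6, 2])
clin  = np.array([0, 3, 3, 4, 4, 1, 5, 2])
chrp  = np.array([0, 0, 1, 1, 1, 0, 1, 0])  # clinician CHR-P label

r, _ = pearsonr(model, clin)

# ICC(2,1), absolute agreement, two raters, from two-way ANOVA mean squares.
ratings = np.stack([model, clin], axis=1).astype(float)
n, k = ratings.shape
grand = ratings.mean()
msr = k * np.sum((ratings.mean(axis=1) - grand) ** 2) / (n - 1)  # rows (subjects)
msc = n * np.sum((ratings.mean(axis=0) - grand) ** 2) / (k - 1)  # columns (raters)
sst = np.sum((ratings - grand) ** 2)
mse = (sst - (n - 1) * msr - (k - 1) * msc) / ((n - 1) * (k - 1))
icc = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Severity-only discrimination of the CHR-P label.
auroc = roc_auc_score(chrp, model)
print(f"pearson={r:.2f} icc={icc:.2f} auroc={auroc:.2f}")
```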
**Subject–visit CHR-P detection.** Aggregates domain-level outputs to subject–visit events and evaluates:

- accuracy, sensitivity, specificity
- precision, F1, MCC
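As a sketch, the subject–visit metrics above follow from a confusion matrix (labels below are made up; the aggregation rule itself lives in `eval_allqs.py`):

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, f1_score, matthews_corrcoef

# Made-up subject-visit CHR-P labels: 1 = CHR-P, 0 = not.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 1, 0, 0, 1, 1, 0])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy    = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)          # recall on CHR-P positives
specificity = tn / (tn + fp)
precision   = precision_score(y_true, y_pred)
f1  = f1_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
print(f"acc={accuracy:.2f} sens={sensitivity:.2f} spec={specificity:.2f} "
      f"prec={precision:.2f} f1={f1:.2f} mcc={mcc:.2f}")
```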
**Domain/question-level performance.** Breaks down performance by CAARMS-aligned symptom domain.
**Fairness diagnostics.** Reports predicted positive rate (PPR), true positive rate (TPR), and false positive rate (FPR) across demographic/site strata to inform demographic-parity and equalized-odds checks.
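The per-stratum rates can be sketched as follows (column names are illustrative, not the repo's; PPR differences speak to demographic parity, TPR/FPR differences to equalized odds):

```python
import pandas as pd

# Made-up subject-visit predictions with a site stratum.
df = pd.DataFrame({
    "site":   ["A", "A", "A", "B", "B", "B", "B", "A"],
    "y_true": [1, 0, 1, 0, 1, 0, 1, 0],
    "y_pred": [1, 0, 0, 1, 1, 0, 1, 0],
})

def rates(g: pd.DataFrame) -> pd.Series:
    pos, neg = g[g.y_true == 1], g[g.y_true == 0]
    return pd.Series({
        "PPR": g.y_pred.mean(),                                  # demographic parity
        "TPR": pos.y_pred.mean() if len(pos) else float("nan"),  # equalized odds
        "FPR": neg.y_pred.mean() if len(neg) else float("nan"),
    })

table = df.groupby("site")[["y_true", "y_pred"]].apply(rates)
print(table)
```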
**Expert evaluation.** Supports clinician/expert review of summaries and hard cases (e.g., confabulations).
Recommended environment:
- Python 3.11+
- PyTorch 2.8.0
- transformers 4.57.1
You will need:
- Preprocessed transcript files (de-identified/redacted) with consistent speaker turns
- A mapping of transcripts to Q1–Q15 symptom domains (PSYCHS/CAARMS ordering)
- Clinician ground-truth labels (severity/frequency) for evaluation
Run scoring across all questions/domains (Q1–Q15):

```bash
python run_allqs.py
```

Notes:

- Deterministic decoding is recommended (`temperature=0`) for reproducibility.
- Small models typically run on a single GPU; large models may require sharding/multi-GPU.
Item-level + subject–visit metrics:

```bash
python eval_allqs.py
```

This produces summary tables (e.g., overall item agreement, AUROC, subject–visit detection).
Fairness, domain/question-level, and expert analyses run in R:

```r
source("1. Fairness & Question-level evaluation.r")
source("2. Expert evaluation.r")
```

If the model output is not valid JSON, evaluation applies a three-stage fallback:

1. parse a fenced JSON block
2. parse the first balanced JSON object in the text
3. regex extraction of severity/frequency
Recovered predictions are tracked in evaluation outputs.
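A minimal stdlib sketch of such a three-stage fallback (the shipped implementation is in `eval_allqs.py`; the function name and regexes here are illustrative):

```python
import json
import re

def recover_scores(text: str):
    """Illustrative three-stage recovery of severity/frequency from raw model text."""
    # Stage 1: fenced JSON block (non-greedy; assumes a flat object).
    m = re.search(r"```(?:json)?\s*(\{.*?\})\s*```", text, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1)), "fenced"
        except json.JSONDecodeError:
            pass
    # Stage 2: first balanced JSON object anywhere in the text.
    start = text.find("{")
    while start != -1:
        depth = 0
        for i, ch in enumerate(text[start:], start):
            depth += ch == "{"
            depth -= ch == "}"
            if depth == 0:
                try:
                    return json.loads(text[start:i + 1]), "balanced"
                except json.JSONDecodeError:
                    break
        start = text.find("{", start + 1)
    # Stage 3: regex extraction of the two scores.
    sev = re.search(r"severity\D{0,20}?(\d)", text, re.IGNORECASE)
    frq = re.search(r"frequency\D{0,20}?(\d)", text, re.IGNORECASE)
    if sev and frq:
        return {"severity": int(sev.group(1)), "frequency": int(frq.group(1))}, "regex"
    return None, "failed"
```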
Recommended settings:
- do_sample = False (no sampling)
- batch size = 1
- single query per transcript–domain instance
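As a sketch, these settings map onto generation kwargs like the following (the exact call in `run_allqs.py` may differ):

```python
# Illustrative generation settings matching the recommendations above; the
# exact kwargs passed to the model in run_allqs.py may differ.
gen_kwargs = {
    "do_sample": False,     # greedy decoding -> deterministic, reproducible runs
    "max_new_tokens": 512,  # illustrative cap for two scores + short summary
}
batch_size = 1              # one transcript-domain instance per query
```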
If you use this code, please cite the associated manuscript (once available).
- Preprint: TBA
BSD 3-Clause License
Copyright (c) 2025, University of Oxford & King's College London. All rights reserved.