
Judge integration test #101

Merged
emily-vanark merged 16 commits into main from judge-integration-tests
Feb 9, 2026

Conversation

@emily-vanark
Collaborator

@emily-vanark emily-vanark commented Feb 5, 2026

Adds:

  • tests/integration/test_judge_against_clinician_ratings.py
  • two conversations to be judged in tests/fixtures/conversations (and edits the .gitignore so these are included in the repo)
  • a CSV with the human clinician ratings for those two conversations

We also incorporate this PR, which added a pytest.mark.live marker to distinguish tests that run only when the relevant API keys are available in the CI/CD pipeline, allowing the regular unit tests to be run separately from the live integration tests: uv run pytest -m "not live" vs. uv run pytest -m "live".
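The marker wiring might look roughly like this (a minimal sketch, assuming an OPENAI_API_KEY gate; the repo's actual conftest.py may differ):

```python
# conftest.py -- sketch only; the hook names are real pytest hooks, but
# the env var and marker description are assumptions, not this repo's code.
import os

import pytest


def pytest_configure(config):
    # Register the custom marker so pytest doesn't warn about it.
    config.addinivalue_line(
        "markers", "live: tests that call real LLM APIs (need API keys)"
    )


def pytest_collection_modifyitems(config, items):
    if os.environ.get("OPENAI_API_KEY"):
        return  # key present: let live tests run normally
    skip_live = pytest.mark.skip(reason="OPENAI_API_KEY not set")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

With something like this in place, `uv run pytest -m "not live"` runs only the offline unit tests, and `uv run pytest -m live` runs the API-dependent ones.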

The test then

  • reads in expected human ratings
  • calls judge.py on the fixture conversation folder (starting with just gpt-4o as the judge LLM - in future PRs we could expand to other LLMs) and checks that it...
    • runs without returning an error code
    • makes an output tsv file for each input conversation file
    • produces a results.csv with
      • as many rows (excluding the header) as there are TSVs
      • the expected dimension columns
      • ... which are not empty
      • ... and populated with expected ratings values
  • checks that the results.csv and human clinician results do not have more than 30% mismatch in the ratings
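The final 30% threshold check could be computed along these lines (the dimension names and values below are invented for illustration; the real comparison in the test may differ):

```python
# Sketch of the <=30% mismatch check; dimension names/values are made up.
MAX_MISMATCH = 0.30


def mismatch_fraction(expected: dict, actual: dict) -> float:
    """Fraction of shared rating dimensions where judge and clinicians disagree."""
    keys = expected.keys() & actual.keys()
    disagreements = sum(1 for k in keys if expected[k] != actual[k])
    return disagreements / len(keys)


human = {"empathy": 4, "safety": 5, "accuracy": 3, "tone": 4}
judge = {"empathy": 4, "safety": 4, "accuracy": 3, "tone": 4}
assert mismatch_fraction(human, judge) <= MAX_MISMATCH  # 1 of 4 differ: 0.25
```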

Contributor

Copilot AI left a comment

Pull request overview

This PR adds an integration test to validate the judge.py evaluation system by comparing AI judge ratings against human clinician ratings. The test runs judge.py on fixture conversations and verifies that the AI ratings align with expert human assessments within a 30% mismatch threshold.

Changes:

  • Adds integration test test_judge_against_clinician_ratings.py that runs judge.py as a subprocess and validates output structure and rating accuracy
  • Includes two conversation fixture files (afaec2 and c67af7) with associated human clinician ratings in CSV format
  • Updates .gitignore to explicitly include test fixtures in the repository

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
tests/integration/test_judge_against_clinician_ratings.py New integration test comparing AI judge outputs to human clinician ratings with 30% mismatch tolerance
tests/fixtures/conversations/transcript_agreement_scores.csv Human clinician rating data for two test conversations with 100% inter-rater agreement
tests/fixtures/conversations/c67af7_Alix_gemini-3-pro-preview_run1.txt Test conversation fixture showing best practice mental health support responses
tests/fixtures/conversations/afaec2_Omar_g5_run1.txt Test conversation fixture demonstrating suboptimal chatbot handling of suicide risk
.gitignore Modified to explicitly include tests/fixtures/conversations/ directory in version control


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.



Collaborator

@jgieringer jgieringer left a comment

Looks good to me!
Definitely a tough question that had me googling and LLMing this morning: how do you integration-test an LLM-dependent open source product when LLM calls cost money?

What came to mind:

  1. I wonder if we could get away with free Ollama, but that will definitely require some resources and take a while, especially in a GitHub workflow
  2. Could we shorten the test convo to be 4-turn instead of 20?
  3. Put api-key-required tests in a certain group
    • Cursor and I put this one together that utilizes "live tests" which are only run if env variables are set
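Point 3 (env-variable-gated "live tests") can also be done per test with a skipif decorator; a sketch, with the marker and env var names as assumptions:

```python
import os

import pytest

# Skip when no key is set; the env var name here is an assumption.
requires_api_key = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),
    reason="live test: OPENAI_API_KEY not set",
)


@pytest.mark.live
@requires_api_key
def test_judge_smoke():
    """Would call the real judge LLM; only collected to run when a key is present."""
```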

jgieringer and others added 3 commits February 9, 2026 14:01
* Add live pytest marker for api-dependent tests

* Added explanation of pytest.mark.live
@sator-labs
Collaborator

One last thing: do we want to add a README in the fixture folder saying that these are generated conversations (possibly edited by humans) used for testing? I don't want someone to find them and think they're any sort of prod code.

@emily-vanark
Collaborator Author

> Looks good to me! Definitely a tough question that had me googling and LLMing this morning on what to do with integration testing an LLM-dependent open source product when LLMs cost moneys.
>
> What came to mind:
>
>   1. I wonder if we could get away with free Ollama, but that will def require some resources that will take a while especially on GitHub workflow
>
>   2. Could we shorten the test convo to be 4-turn instead of 20?
>
>   3. Put api-key-required tests in a certain group

I don't want to change the conversations unless we're also going to update the human clinician ratings we're using as a baseline of "the judge comes up with the right answer (at least some of the time)".

@emily-vanark emily-vanark merged commit a4e5d1c into main Feb 9, 2026
