
Judge integration test #101

Merged
emily-vanark merged 16 commits into main from judge-integration-tests
Feb 9, 2026

Conversation

@emily-vanark
Collaborator

@emily-vanark emily-vanark commented Feb 5, 2026

Adds:

  • tests/integration/test_judge_against_clinician_ratings.py
  • two conversations to be judged in tests/fixtures/conversations (and edits the .gitignore so these are included in the repo)
  • a CSV with the human clinician ratings for those two conversations

We also incorporate this PR, which added a pytest.mark.live marker to distinguish tests that run only when the relevant API keys are available in the CI/CD pipeline, allowing the regular unit tests to be run separately from the live integration tests: uv run pytest -m "not live" vs. uv run pytest -m "live".
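The marker wiring might look roughly like this (a minimal sketch, assuming an OPENAI_API_KEY gate; the repo's actual conftest.py may differ):

```python
# conftest.py -- sketch only; the hook names are real pytest hooks, but
# the env var and marker description are assumptions, not this repo's code.
import os

import pytest


def pytest_configure(config):
    # Register the custom marker so pytest doesn't warn about it.
    config.addinivalue_line(
        "markers", "live: tests that call real LLM APIs (need API keys)"
    )


def pytest_collection_modifyitems(config, items):
    if os.environ.get("OPENAI_API_KEY"):
        return  # key present: let live tests run normally
    skip_live = pytest.mark.skip(reason="OPENAI_API_KEY not set")
    for item in items:
        if "live" in item.keywords:
            item.add_marker(skip_live)
```

With something like this in place, `uv run pytest -m "not live"` runs only the offline unit tests, and `uv run pytest -m live` runs the API-dependent ones.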

The test then

  • reads in expected human ratings
  • calls judge.py on the fixture conversation folder (starting with just gpt-4o as the judge LLM - in future PRs we could expand to other LLMs) and checks that it...
    • runs without returning an error code
    • makes an output tsv file for each input conversation file
    • produces a results.csv with
      • as many rows (excluding the header) as there are TSVs
      • the expected dimension columns
      • ... which are not empty
      • ... and populated with expected ratings values
  • checks that the results.csv and human clinician results do not have more than 30% mismatch in the ratings
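The final 30% threshold check could be computed along these lines (the dimension names and values below are invented for illustration; the real comparison in the test may differ):

```python
# Sketch of the <=30% mismatch check; dimension names/values are made up.
MAX_MISMATCH = 0.30


def mismatch_fraction(expected: dict, actual: dict) -> float:
    """Fraction of shared rating dimensions where judge and clinicians disagree."""
    keys = expected.keys() & actual.keys()
    disagreements = sum(1 for k in keys if expected[k] != actual[k])
    return disagreements / len(keys)


human = {"empathy": 4, "safety": 5, "accuracy": 3, "tone": 4}
judge = {"empathy": 4, "safety": 4, "accuracy": 3, "tone": 4}
assert mismatch_fraction(human, judge) <= MAX_MISMATCH  # 1 of 4 differ: 0.25
```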

Contributor

Copilot AI left a comment

Pull request overview

This PR adds an integration test to validate the judge.py evaluation system by comparing AI judge ratings against human clinician ratings. The test runs judge.py on fixture conversations and verifies that the AI ratings align with expert human assessments within a 30% mismatch threshold.

Changes:

  • Adds integration test test_judge_against_clinician_ratings.py that runs judge.py as a subprocess and validates output structure and rating accuracy
  • Includes two conversation fixture files (afaec2 and c67af7) with associated human clinician ratings in CSV format
  • Updates .gitignore to explicitly include test fixtures in the repository

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 11 comments.

Show a summary per file
File Description
tests/integration/test_judge_against_clinician_ratings.py New integration test comparing AI judge outputs to human clinician ratings with 30% mismatch tolerance
tests/fixtures/conversations/transcript_agreement_scores.csv Human clinician rating data for two test conversations with 100% inter-rater agreement
tests/fixtures/conversations/c67af7_Alix_gemini-3-pro-preview_run1.txt Test conversation fixture showing best practice mental health support responses
tests/fixtures/conversations/afaec2_Omar_g5_run1.txt Test conversation fixture demonstrating suboptimal chatbot handling of suicide risk
.gitignore Modified to explicitly include tests/fixtures/conversations/ directory in version control


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.



Collaborator

@jgieringer jgieringer left a comment

Looks good to me!
Definitely a tough question that had me googling and LLMing this morning: how do you integration-test an LLM-dependent open source product when LLM calls cost money?

What came to mind:

  1. I wonder if we could get away with free Ollama, but that will definitely require some resources and take a while, especially in a GitHub workflow
  2. Could we shorten the test convo to be 4-turn instead of 20?
  3. Put api-key-required tests in a certain group
    • Cursor and I put this one together that utilizes "live tests" which are only run if env variables are set
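Point 3 (env-variable-gated "live tests") can also be done per test with a skipif decorator; a sketch, with the marker and env var names as assumptions:

```python
import os

import pytest

# Skip when no key is set; the env var name here is an assumption.
requires_api_key = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),
    reason="live test: OPENAI_API_KEY not set",
)


@pytest.mark.live
@requires_api_key
def test_judge_smoke():
    """Would call the real judge LLM; only collected to run when a key is present."""
```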

jgieringer and others added 3 commits February 9, 2026 14:01
* Add live pytest marker for api-dependent tests

* Added explanation of pytest.mark.live
@sator-labs
Collaborator

One last thing: do we want to add a README in the fixture folder saying that these are generated conversations (possibly edited by humans) used for testing? I don't want someone to find them and think they're any sort of prod code.

@emily-vanark
Collaborator Author

> Looks good to me! Definitely a tough question that had me googling and LLMing this morning on what to do with integration testing an LLM-dependent open source product when LLMs cost moneys.
>
> What came to mind:
>
>   1. I wonder if we could get away with free Ollama, but that will def require some resources that will take a while especially on GitHub workflow
>
>   2. Could we shorten the test convo to be 4-turn instead of 20?
>
>   3. Put api-key-required tests in a certain group

I don't want to change the conversations unless we're also going to update the human clinician ratings we're using as a baseline of "the judge comes up with the right answer (at least some of the time)".

@emily-vanark emily-vanark merged commit a4e5d1c into main Feb 9, 2026
