Conversation
…ge-integration-tests
Pull request overview
This PR adds an integration test to validate the judge.py evaluation system by comparing AI judge ratings against human clinician ratings. The test runs judge.py on fixture conversations and verifies that the AI ratings align with expert human assessments within a 30% mismatch threshold.
Changes:
- Adds integration test tests/integration/test_judge_against_clinician_ratings.py that runs judge.py as a subprocess and validates output structure and rating accuracy
- Includes two conversation fixture files (afaec2 and c67af7) with associated human clinician ratings in CSV format
- Updates .gitignore to explicitly include test fixtures in the repository
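The core of the comparison described above can be sketched as follows. This is a hypothetical illustration, not the PR's actual code: the real test first runs judge.py as a subprocess and parses its output, and the function and key names here are assumptions.

```python
# Hypothetical sketch of the rating-comparison step: given AI judge
# ratings and human clinician ratings keyed by the same question ids,
# compute the fraction that disagree and enforce the 30% threshold.
MISMATCH_THRESHOLD = 0.30  # at most 30% of ratings may disagree

def mismatch_rate(ai_ratings: dict, human_ratings: dict) -> float:
    """Fraction of shared questions where the AI judge disagrees with clinicians."""
    shared = set(ai_ratings) & set(human_ratings)
    if not shared:
        raise ValueError("no overlapping rating keys")
    mismatches = sum(1 for k in shared if ai_ratings[k] != human_ratings[k])
    return mismatches / len(shared)

def assert_within_threshold(ai_ratings: dict, human_ratings: dict) -> None:
    rate = mismatch_rate(ai_ratings, human_ratings)
    assert rate <= MISMATCH_THRESHOLD, f"mismatch rate {rate:.0%} exceeds 30%"
```

Comparing only the shared keys keeps the check robust if either side emits an extra field.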
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 11 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/integration/test_judge_against_clinician_ratings.py | New integration test comparing AI judge outputs to human clinician ratings with 30% mismatch tolerance |
| tests/fixtures/conversations/transcript_agreement_scores.csv | Human clinician rating data for two test conversations with 100% inter-rater agreement |
| tests/fixtures/conversations/c67af7_Alix_gemini-3-pro-preview_run1.txt | Test conversation fixture showing best practice mental health support responses |
| tests/fixtures/conversations/afaec2_Omar_g5_run1.txt | Test conversation fixture demonstrating suboptimal chatbot handling of suicide risk |
| .gitignore | Modified to explicitly include tests/fixtures/conversations/ directory in version control |
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
jgieringer left a comment
Looks good to me!
Definitely a tough question that had me googling and prompting LLMs this morning about how to integration-test an LLM-dependent open source product when LLM calls cost money.
What came to mind:
- I wonder if we could get away with free Ollama, but that will definitely require some resources and will take a while, especially in a GitHub workflow
- Could we shorten the test conversation to 4 turns instead of 20?
- Put API-key-required tests in a dedicated group
- Cursor and I put this one together that uses "live tests", which only run if the relevant env variables are set
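The env-gated "live tests" idea in the last bullet can be sketched like this. The marker and env variable names are assumptions for illustration, not what the linked PR necessarily uses:

```python
# Hypothetical sketch: skip "live" tests unless an API key is present,
# so the suite passes on machines and CI runners without credentials.
import os

import pytest

requires_api_key = pytest.mark.skipif(
    not os.environ.get("OPENAI_API_KEY"),  # assumed env var name
    reason="live test: requires OPENAI_API_KEY to be set",
)

@requires_api_key
def test_judge_live():
    # Would invoke the real judge here; placeholder body for the sketch.
    assert True
```

With this pattern the live tests self-skip rather than fail, which keeps a contributor's default `pytest` run green.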
one last thing: do we want to add a README next to the fixture files saying that these are generated conversations (possibly edited by humans) used for testing? I don't want someone to find them and think they're any sort of prod code
I don't want to change the conversations unless we're also going to update the human clinician ratings we're using as a baseline for "the judge comes up with the right answer (at least some of the time)".
Adds:
- tests/integration/test_judge_against_clinician_ratings.py
- tests/fixtures/conversations (and edits the .gitignore so these are included in the repo)

We also incorporate this PR, which added a pytest.mark.live marker to distinguish tests that will run only if the relevant API keys are available in the CI/CD pipeline, and allows regular unit tests to be run separately from the live integration tests: uv run pytest -m "not live" vs. uv run pytest -m "live". The test then
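For the `-m "live"` / `-m "not live"` split mentioned above to work without warnings, the custom marker has to be registered. A minimal conftest.py sketch, assuming the marker name live from the text (the description string is mine):

```python
# conftest.py sketch: register the custom "live" marker so that
# `pytest -m "not live"` and `pytest -m "live"` can filter on it
# without triggering PytestUnknownMarkWarning.
def pytest_configure(config):
    config.addinivalue_line(
        "markers",
        "live: tests that call real LLM APIs and require API keys to be set",
    )
```

The same registration can instead live under `[tool.pytest.ini_options]` markers in pyproject.toml.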