gather benchmark stats across three models (claude, gpt, gemini) & human raters

- [ ] fix reference-required eval: PDF source still missing
- [ ] add gemini to the eval
- [ ] run all three models and gather stats (see the sketch below)
- [ ] collect human rater results
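
A minimal sketch of the "run all three models and gather stats" step, assuming a hypothetical `call_model` helper for the actual API clients and an assumed `benchmark.jsonl` eval file with one `{"prompt": ..., "reference": ...}` record per line; the exact-match scorer is a stand-in for the eval's real grader.

```python
"""Sketch: run a benchmark across three models and aggregate accuracy stats."""
import json
from statistics import mean

MODELS = ["claude", "gpt", "gemini"]


def call_model(model: str, prompt: str) -> str:
    """Placeholder for the real API call (hypothetical helper)."""
    raise NotImplementedError(f"wire up the {model} client here")


def is_correct(answer: str, reference: str) -> bool:
    """Naive exact-match scoring; swap in the eval's real grader."""
    return answer.strip().lower() == reference.strip().lower()


def run_benchmark(path: str = "benchmark.jsonl") -> dict[str, float]:
    """Return per-model accuracy over the benchmark file (assumed format)."""
    with open(path) as f:
        items = [json.loads(line) for line in f]
    stats = {}
    for model in MODELS:
        scores = [
            is_correct(call_model(model, item["prompt"]), item["reference"])
            for item in items
        ]
        stats[model] = mean(scores)  # fraction of items answered correctly
    return stats


if __name__ == "__main__":
    print(run_benchmark())
```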