Releases: GoogleCloudPlatform/evalbench
v1.1.0
1.1.0 (2026-03-20)
Features
- Add a Gemini-powered dataset translation tool. (#257) (a5c0359)
- Add Cloud Run support and make the server port configurable via… (#234) (34110b1)
- add evalbench release pipeline and bundling (#276) (a68b348)
- Add Gemini 3.0 Pro and 3.1 Pro preview model configurations (f8f036c)
- add QueryData API generator and refactor SQLGenWork (#281) (44d07dc)
- Add remote MCP server connectivity verification (7bf5716)
- Add remote MCP server connectivity verification (a64aa37)
- Add support for syncing Gemini CLI skills to fake home (7e2265b)
- Configure a dedicated home directory and user for evalbench within the Docker container. (89238f5)
- Configure GCS FUSE for session management and expose new ports for UI and metrics. (b02489e)
- Enable session-specific fake home directories for Gemini CLI and improve JSON parsing, while passing the session ID to the generator configuration. (0e0c06b)
- Enhance Evalbench Viewer UI (#252) (e3a2f95)
- Enhance results directory discovery in the viewer and ensure the CSV reporter outputs to a shared volume when running in server mode. (a4761e1)
- Install Node.js via NodeSource PPA, consolidating package installations and removing NVM. (a9f2741)
- Introduce Horizontal Pod Autoscaler, offload blocking evaluatio… (#269) (a639282)
- Introduce Horizontal Pod Autoscaler, offload blocking evaluation tasks to a thread pool, and enhance session manager robustness. (6024fb3)
- Multi run orchestrator (#258) (aec92c9)
- Schema, Database Instantiation (#259) (dcb8bf6)
- spanner: Improve and extend support for Spanner Client (#247) (ac6625a)
- Sync Gemini CLI skills into fake_home (93e6265)
Bug Fixes
- Configure absl.logging to output to stdout and initialize its handler. (560d0ee)
- Correct Gemini CLI response parsing to strip markdown code blocks and remove a redundant prompt argument, and update Makefile container names, pre-run cleanup, and volume mount paths. (#275) (daa0821)
- dataset: preserve multi-dialect golden_sql for BIRD (#262) (12ccf98)
- handle empty MySQL passwords and add Cloud SQL support to ensure_database_exists (#268) (beef7ec)
- implement timeouts to prevent thread hanging in evaluator (#266) (bb77c2f)
- prevent execution thread deadlocks and db connection leaks (#267) (265fee8)
- Prevent logging handler from closing sys.stdout by wrapping it in an UncloseableStream. (d7c453e)
- various improvements, fixes to the SpannerDB driver (#264) (5c6f425)
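The UncloseableStream fix listed above can be pictured with a small wrapper. This is only a sketch of the general idea, not the code from commit d7c453e: the class body and the `make_stdout_handler` helper are assumptions made for illustration, and the real implementation may differ.

```python
import logging
import sys


class UncloseableStream:
    """Proxy a text stream but turn close() into a flush.

    Hypothetical sketch; the class in the actual commit may look different.
    """

    def __init__(self, stream):
        self._stream = stream

    def close(self):
        # Flush instead of closing so the wrapped stream stays usable.
        self._stream.flush()

    def __getattr__(self, name):
        # Delegate everything else (write, flush, ...) to the wrapped stream.
        return getattr(self._stream, name)


def make_stdout_handler(stream=None):
    """Return a StreamHandler whose stream survives a close() call."""
    target = stream if stream is not None else sys.stdout
    return logging.StreamHandler(UncloseableStream(target))
```

With a wrapper like this, any code that calls `close()` on the handler's stream only flushes it, so later writes to `sys.stdout` keep working.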
EvalBench 1.0
First release of EvalBench.
What's Changed
- Add a gRPC server that can handle eval requests from a client using SQLGen by @tommyang in #1
- Fix Gemini generator by @IsmailMehdi in #2
- Fixed the path for eval_output.json and score_result.json by @viditchopra1500 in #4
- productize the service by @IsmailMehdi in #6
- local service by @IsmailMehdi in #10
- Added job_id to score table for analyzing run data using it. by @viditchopra1500 in #9
- Service by @IsmailMehdi in #12
- 5 add llm rater implementation v1 by @viditchopra1500 in #11
- Add support for new evalbench database format by @hardikgu23 in #16
- Service by @IsmailMehdi in #19
- Service by @IsmailMehdi in #20
- Service work by @IsmailMehdi in #22
- Applied Rate limiting of 30 api calls to LLM, per minute. by @viditchopra1500 in #23
- Add initial config changes for nl2code by @prernakakkar-google in #25
- Fix error due to golden error being passed as list in new database eval format by @hardikgu23 in #27
- Setup-Teardown by @hardikgu23 in #18
- making setup_teardown a package/module by @hardikgu23 in #31
- Post insertion check by @hardikgu23 in #32
- Update README.md by @IsmailMehdi in #30
- G3client by @IsmailMehdi in #33
- Fix setup files for mysql dialect by @hardikgu23 in #35
- fix: spurious empty sql_generator_error and BQ OOM by @tommyang in #38
- Include job_id in EvalResponse to make results lookup easier by @tommyang in #39
- Add tmp_dql, tmp_dml user creation in setup files by @hardikgu23 in #36
- Added backoff and retry logic to llmrater by @viditchopra1500 in #42
- Return correct golden query based on dialect for bird by @hardikgu23 in #47
- DB Execution rate limiting and backoff by @mahyareb in #44
- make style by @mahyareb in #45
- Update the proto to represent the new eval format by @mahyareb in #48
- Fix issues with parsing of proto for evalitem by @mahyareb in #49
- Fix crash case when generated sql is empty by @mahyareb in #50
- delete json outputs at end of Eval by @tommyang in #52
- Add Support for SQLServer, fix MySQL auth issue by @mahyareb in #53
- Add support for google3 resources by @mahyareb in #54
- Support Gob Cloning in containers by @IsmailMehdi in #55
- Mount the tmp_session_files to kube by @mahyareb in #56
- Use dataset from git-on-borg for eval run by @hardikgu23 in #40
- Db query evaluation by @hardikgu23 in #37
- Fix: Use id field from prompt for dataset in newFormat by @hardikgu23 in #57
- Add new scorer returned_sql. by @hardikgu23 in #58
- Gob support by @IsmailMehdi in #60
- Gob support by @IsmailMehdi in #62
- Encountered bugs while setting up local evalbench run with direct path. by @viditchopra1500 in #64
- Fix db_config missing in SqlServerDB by @hardikgu23 in #67
- Fix missing parameter in sqlServerDB by @hardikgu23 in #68
- LLM Rater improvements by @viditchopra1500 in #70
- Accept schema, data and db config paths in experiment config for setup-teardown. by @hardikgu23 in #69
- Update Mysql Checksum by @hardikgu23 in #71
- Remove None entries from eval_query, setup_sql, cleanup_sql by @hardikgu23 in #72
- Check for None values by @hardikgu23 in #73
- Moving dataset filtering logic based on query_type inside Evaluator by @hardikgu23 in #75
- Fix logic for distributing temp databases between runners by @hardikgu23 in #76
- Handle None entries in golden_sql list by @hardikgu23 in #78
- Fix Executability score to not include punted queries by @hardikgu23 in #83
- Separate execution for dml by @hardikgu23 in #87
- Use eval_result/metadata for comparison in case of dml/ddl by @hardikgu23 in #89
- Fix: Backticks being removed causing mysql queries to fail by @hardikgu23 in #91
- Update run_service.sh by @IsmailMehdi in #93
- Rate limiting strategy for LLM Rater by @viditchopra1500 in #94
- Fix: Missing eval result in case of dql by @hardikgu23 in #96
- Allow limiting the number of results written to BigQuery by @mahyareb in #97
- Fix issues with truncation by @mahyareb in #100
- Fix Exact Match Issue by @mahyareb in #101
- Fix deployment from Cloudtop permission by @mahyareb in #99
- Skip llmRater in case of exact match by @hardikgu23 in #92
- Increase Chunk Size for BQ for QuotaError by @mahyareb in #102
- Reduce the time complexity of the remove duplicate function. by @viditchopra1500 in #95
- Add vertical autoscaling and test deploy by @mahyareb in #98
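Several entries above mention rate limiting, including a cap of 30 LLM API calls per minute (#23). The notes do not describe the strategy used, so the following is a generic sliding-window sketch under that assumption; `SlidingWindowLimiter` and its parameters are hypothetical names, not part of EvalBench.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls within any window of `period` seconds.

    Hypothetical illustration; not EvalBench's actual limiter.
    """

    def __init__(self, max_calls=30, period=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period = period
        self._clock = clock    # injectable for testing
        self._sleep = sleep    # injectable for testing
        self._calls = deque()  # timestamps of recent calls, oldest first

    def acquire(self):
        """Block until a call slot is free, then record the call."""
        while True:
            now = self._clock()
            # Drop timestamps that have left the window.
            while self._calls and now - self._calls[0] >= self.period:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return
            # Wait until the oldest recorded call expires.
            self._sleep(self.period - (now - self._calls[0]))
```

A caller would invoke `acquire()` immediately before each LLM request. The defaults mirror the 30-calls-per-minute figure from the entry above.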
New Contributors
- @tommyang made their first contribution in #1
- @IsmailMehdi made their first contribution in #2
- @viditchopra1500 made their first contribution in #4
- @hardikgu23 made their first contribution in #16
- @mahyareb made their first contribution in #44
Full Changelog: https://github.com/GoogleCloudPlatform/evalbench/commits/v1.0