Releases · GoogleCloudPlatform/evalbench

1.1.0 (2026-03-20)

Features

Add a Gemini-powered dataset translation tool. (#257) (a5c0359)

Add Cloud Run support and make the server port configurable via… (#234) (34110b1)

Add evalbench release pipeline and bundling (#276) (a68b348)

Add Gemini 3.0 Pro and 3.1 Pro preview model configurations (f8f036c)

Add QueryData API generator and refactor SQLGenWork (#281) (44d07dc)

Add remote MCP server connectivity verification (7bf5716)

Add remote MCP server connectivity verification (a64aa37)

Add support for syncing Gemini CLI skills to fake home (7e2265b)

Configure a dedicated home directory and user for evalbench within the Docker container. (89238f5)

Configure GCS FUSE for session management and expose new ports for UI and metrics. (b02489e)

Enable session-specific fake home directories for Gemini CLI and improve JSON parsing, while passing the session ID to the generator configuration. (0e0c06b)

Enhance Evalbench Viewer UI (#252) (e3a2f95)

Enhance results directory discovery in the viewer and ensure the CSV reporter outputs to a shared volume when running in server mode. (a4761e1)

Install Node.js via NodeSource PPA, consolidating package installations and removing NVM. (a9f2741)

Introduce Horizontal Pod Autoscaler, offload blocking evaluatio… (#269) (a639282)

Introduce Horizontal Pod Autoscaler, offload blocking evaluation tasks to a thread pool, and enhance session manager robustness. (6024fb3)

Multi run orchestrator (#258) (aec92c9)

Schema, Database Instantiation (#259) (dcb8bf6)

spanner: Improve and extend support for Spanner Client (#247) (ac6625a)

Sync Gemini CLI skills into fake_home (93e6265)
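
One feature above mentions offloading blocking evaluation tasks to a thread pool. The sketch below illustrates that general pattern only, assuming an asyncio-based server loop; `blocking_evaluation` and `evaluate_all` are hypothetical names and are not taken from the evalbench code.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def blocking_evaluation(item_id):
    """Hypothetical stand-in for a blocking call (DB query, LLM request)."""
    time.sleep(0.05)
    return f"evaluated:{item_id}"


async def evaluate_all(item_ids, max_workers=4):
    """Run blocking evaluations in worker threads so the event loop stays free."""
    loop = asyncio.get_running_loop()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [
            loop.run_in_executor(pool, blocking_evaluation, item_id)
            for item_id in item_ids
        ]
        # gather() returns results in the same order as the submitted futures.
        return await asyncio.gather(*futures)


results = asyncio.run(evaluate_all([1, 2, 3]))
```

While the worker threads sleep, the event loop remains available to serve other requests, which is the point of moving blocking work off the main thread.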

Bug Fixes

Configure absl.logging to output to stdout and initialize its handler. (560d0ee)

Correct Gemini CLI response parsing to strip markdown code blocks and remove a redundant prompt argument, and update Makefile container names, pre-run cleanup, and volume mount paths. (#275) (daa0821)

dataset: preserve multi-dialect golden_sql for BIRD (#262) (12ccf98)

Handle empty MySQL passwords and add Cloud SQL support to ensure_database_exists (#268) (beef7ec)

Implement timeouts to prevent thread hanging in evaluator (#266) (bb77c2f)

Prevent execution thread deadlocks and db connection leaks (#267) (265fee8)

Prevent logging handler from closing sys.stdout by wrapping it in an UncloseableStream. (d7c453e)

Various improvements and fixes to the SpannerDB driver (#264) (5c6f425)
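
One fix above wraps sys.stdout in an UncloseableStream so that a logging handler cannot close it. The following is a minimal sketch of how such a wrapper might look; the class body is an assumption based only on the name in the commit message, not the project's actual implementation, and `make_stdout_handler` is a hypothetical helper.

```python
import io
import logging
import sys


class UncloseableStream:
    """Proxy that forwards to a wrapped stream but ignores close().

    Assumed design: only the class name comes from the release notes.
    """

    def __init__(self, stream):
        self._stream = stream

    def close(self):
        # Deliberately a no-op so the underlying stream stays open.
        pass

    def __getattr__(self, name):
        # Delegate write(), flush(), and everything else to the real stream.
        return getattr(self._stream, name)


def make_stdout_handler(stream=None):
    """Hypothetical helper: build a handler whose stream cannot be closed."""
    if stream is None:
        stream = sys.stdout
    return logging.StreamHandler(UncloseableStream(stream))


# Demonstration with an in-memory buffer standing in for sys.stdout.
buffer = io.StringIO()
wrapped = UncloseableStream(buffer)
wrapped.write("still open")
wrapped.close()       # ignored by the wrapper
print(buffer.closed)  # False
```

Code that later calls close() on the handler's stream then leaves the process-wide stdout usable.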

First release of EvalBench.

What's Changed

Add a gRPC server that can handle eval requests from a client using SQLGen by @tommyang in #1
Fix Gemini generator by @IsmailMehdi in #2
Fixed the path for eval_output.json and score_result.json by @viditchopra1500 in #4
productize the service by @IsmailMehdi in #6
local service by @IsmailMehdi in #10
Added job_id to the score table so run data can be analyzed by job. by @viditchopra1500 in #9
Service by @IsmailMehdi in #12
5 add llm rater implementation v1 by @viditchopra1500 in #11
Add support for new evalbench database format by @hardikgu23 in #16
Service by @IsmailMehdi in #19
Service by @IsmailMehdi in #20
Service work by @IsmailMehdi in #22
Applied rate limiting of 30 API calls per minute to the LLM. by @viditchopra1500 in #23
Add initial config changes for nl2code by @prernakakkar-google in #25
Fix error due to golden error being passed as list in new database eval format by @hardikgu23 in #27
Setup-Teardown by @hardikgu23 in #18
making setup_teardown a package/module by @hardikgu23 in #31
Post insertion check by @hardikgu23 in #32
Update README.md by @IsmailMehdi in #30
G3client by @IsmailMehdi in #33
Fix setup files for mysql dialect by @hardikgu23 in #35
fix: spurious empty sql_generator_error and BQ OOM by @tommyang in #38
Include job_id in EvalResponse to make results lookup easier by @tommyang in #39
Add tmp_dql, tmp_dml user creation in setup files by @hardikgu23 in #36
Added backoff and retry logic to llmrater by @viditchopra1500 in #42
Return correct golden query based on dialect for bird by @hardikgu23 in #47
DB Execution rate limiting and backoff by @mahyareb in #44
make style by @mahyareb in #45
Update the proto to represent the new eval format by @mahyareb in #48
Fix issues with parsing of proto for evalitem by @mahyareb in #49
Fix crash case when generated sql is empty by @mahyareb in #50
delete json outputs at end of Eval by @tommyang in #52
Add Support for SQLServer, fix MySQL auth issue by @mahyareb in #53
Add support for google3 resources by @mahyareb in #54
Support Gob Cloning in containers by @IsmailMehdi in #55
Mount the tmp_session_files to kube by @mahyareb in #56
Use dataset from git-on-borg for eval run by @hardikgu23 in #40
Db query evaluation by @hardikgu23 in #37
Fix: Use id field from prompt for dataset in newFormat by @hardikgu23 in #57
Add new scorer returned_sql. by @hardikgu23 in #58
Gob support by @IsmailMehdi in #60
Gob support by @IsmailMehdi in #62
Encountered bugs while setting up local evalbench run with direct path. by @viditchopra1500 in #64
Fix db_config missing in SqlServerDB by @hardikgu23 in #67
Fix missing parameter in sqlServerDB by @hardikgu23 in #68
LLM Rater improvements by @viditchopra1500 in #70
Accept schema, data and db config paths in experiment config for setup-teardown. by @hardikgu23 in #69
Update Mysql Checksum by @hardikgu23 in #71
Remove None entries from eval_query, setup_sql, cleanup_sql by @hardikgu23 in #72
Check for None values by @hardikgu23 in #73
Moving dataset filtering logic based on query_type inside Evaluator by @hardikgu23 in #75
Fix logic for distributing temp databases between runners by @hardikgu23 in #76
Handle None entries in golden_sql list by @hardikgu23 in #78
Fix Executability score to not include punted queries by @hardikgu23 in #83
Separate execution for dml by @hardikgu23 in #87
Use eval_result/metadata for comparison in case of dml/ddl by @hardikgu23 in #89
Fix: Backticks being removed causing mysql queries to fail by @hardikgu23 in #91
Update run_service.sh by @IsmailMehdi in #93
Rate limiting strategy for LLM Rater by @viditchopra1500 in #94
Fix: Missing eval result in case of dql by @hardikgu23 in #96
Allow limiting the number of results written to BigQuery by @mahyareb in #97
Fix issues with truncation by @mahyareb in #100
Fix Exact Match Issue by @mahyareb in #101
Fix deployment from Cloudtop permission by @mahyareb in #99
Skip llmRater in case of exact match by @hardikgu23 in #92
Increase Chunk Size for BQ for QuotaError by @mahyareb in #102
Reduce the time complexity of the remove duplicate function. by @viditchopra1500 in #95
Add vertical autoscaling and test deploy by @mahyareb in #98
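
Several entries above concern rate limiting, including a cap of 30 LLM API calls per minute. As an illustration only, a sliding-window limiter could be sketched as below; `SlidingWindowRateLimiter` and its parameters are hypothetical and do not describe how evalbench implements its limits. The sketch is not thread-safe.

```python
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allow at most max_calls within any window of period_s seconds."""

    def __init__(self, max_calls=30, period_s=60.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.max_calls = max_calls
        self.period_s = period_s
        self._clock = clock    # injectable so the limiter can be tested
        self._sleep = sleep
        self._calls = deque()  # timestamps of calls still inside the window

    def acquire(self):
        """Block until a call is allowed; return the seconds spent waiting."""
        waited = 0.0
        while True:
            now = self._clock()
            # Forget calls that have left the window.
            while self._calls and now - self._calls[0] >= self.period_s:
                self._calls.popleft()
            if len(self._calls) < self.max_calls:
                self._calls.append(now)
                return waited
            # Wait until the oldest recorded call leaves the window.
            delay = self.period_s - (now - self._calls[0])
            self._sleep(delay)
            waited += delay
```

A caller would invoke `acquire()` before each LLM request; with the defaults, the 31st call within a minute blocks until the oldest call is 60 seconds old.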