3 changes: 2 additions & 1 deletion evalbench/generators/models/gemini.py
@@ -1,5 +1,6 @@
from google import genai
from google.genai.types import GenerateContentResponse
from google.genai.errors import ServerError
from util.rate_limit import ResourceExhaustedError
from util.gcp import get_gcp_project, get_gcp_region
from google.api_core.exceptions import ResourceExhausted
@@ -36,7 +37,7 @@ def generate_internal(self, prompt):
if isinstance(response, GenerateContentResponse):
r = sanitize_sql(response.text)
return r
except ResourceExhausted as e:
except (ResourceExhausted, ServerError) as e:
raise ResourceExhaustedError(e)
except Exception as e:
logger.exception("Unhandled exception during generate_content")
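The gemini.py hunk above widens the `except` clause so that transient `ServerError`s are wrapped into the same retryable `ResourceExhaustedError` as quota errors. A minimal sketch of how a caller might retry on that wrapped error; the helper name `generate_with_backoff` and its parameters are hypothetical, not part of this PR:

```python
import time

class ResourceExhaustedError(Exception):
    """Raised when the model is rate-limited or temporarily unavailable."""

def generate_with_backoff(generate, prompt, retries=3, base_delay=1.0):
    """Call generate(prompt), retrying with exponential backoff on
    ResourceExhaustedError; re-raise after the final attempt."""
    for attempt in range(retries):
        try:
            return generate(prompt)
        except ResourceExhaustedError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrapping both error types into one exception keeps this retry policy in a single place instead of duplicating it per error class.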
69 changes: 37 additions & 32 deletions evalbench/scorers/llmrater.py
@@ -2,15 +2,15 @@
LLMRater
In this comparison strategy, an LLM compares the golden execution results with the generated sql execution results.
It returns a score between 0 and 100, with a score of 100 for concrete positive cases,
where either there is a mismatch of columns names or extra relevant columns in Generated SQL exists.
where either there is a mismatch of column names, extra relevant columns, or harmless unrequested sorting/limits in Generated SQL.

Evaluation rules given to LLM:
1. Assume OUTPUT #1 is the gold standard and is ALWAYS correct.
2. The order of columns in OUTPUT #2 does not matter.
3. Allow slight variations due to differences in rounding or precision, for calculated values.
Otherwise ensure exact matches for numbers, dates, timestamps, measurements, and metrics
between the two outputs.
4. The mapped column names might differ, do not make any assumptions based on them.
3. The order of rows in OUTPUT #2 does not matter UNLESS explicitly requested in the prompt.
4. Allow slight variations due to differences in rounding or precision, for calculated values.
5. Allow acceptable divergences based on relaxed criteria (ambiguous counts, null handling, relative dates, IDs vs Names, ambiguous limits, etc.).
6. The mapped column names might differ, do not make any assumptions based on them.

Run Configuration Options:
1. model_config: Required
@@ -45,7 +45,7 @@
1. [EntityError] - Wrong table or entity was used.
2. [ValueLinkingError] - Wrong literal value (string/number) was used.
3. [ColumnLinkingError] - Wrong column was selected or used in a condition.
4. [OrderingError] - Sorting order (ASC/DESC) or column is incorrect.
4. [OrderingError] - Sorting order (ASC/DESC) or column is incorrect (only flag if prompt explicitly requested sorting).
5. [InstructionError] - Failed to follow specific constraints in the prompt (e.g., "return top 5").
6. [IntentError] - Misinterpreted the user's fundamental request.
7. [DataTypesError] - Incorrect handling of data types (e.g., casting, dates).
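The taxonomy above labels failure modes with bracketed tags such as `[OrderingError]`. A minimal sketch of pulling those tags out of the rater's free-text response; the helper `extract_error_tags` is hypothetical and not part of this PR:

```python
import re

# The seven categories listed in the rater prompt.
ERROR_TAGS = {
    "EntityError", "ValueLinkingError", "ColumnLinkingError", "OrderingError",
    "InstructionError", "IntentError", "DataTypesError",
}

def extract_error_tags(rater_output):
    """Return the known bracketed error categories, in order of appearance."""
    found = re.findall(r"\[([A-Za-z]+)\]", rater_output)
    return [tag for tag in found if tag in ERROR_TAGS]
```

Restricting matches to the known set keeps stray bracketed text in the LLM's reasoning from being miscounted as an error category.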
@@ -187,46 +187,49 @@ def compare(

The correct answer to this question is:

OUTPUT #1:
OUTPUT #1 (Gold Standard):

{golden_execution_result}


We get the following answer from a second query.
We get the following answer from a generated query:

OUTPUT #2
OUTPUT #2 (Generated Result):

{generated_execution_result}


Thinking step by step, compare the two outputs and look for differences in data presentation.
Thinking step by step, compare the two outputs and look for differences in data and presentation.
Here are steps to follow:

1. Ensure that every column in OUTPUT #1 has a corresponding column in OUTPUT #2 that represents
the same information, even if the column names are different.
2. All columns present in OUTPUT #1 must also be present in OUTPUT #2. OUTPUT #2 is allowed to
have additional columns relevant to the query.
3. Compare the data within each mapped column pair between OUTPUT #1 and OUTPUT #2.
Ensure that OUTPUT #2 contains all the data from OUTPUT #1, with no missing or extra rows.
4. Minor string transformations are allowed (e.g., concatenating first and last name), but the
underlying information must be preserved.

RULES - These MUST be strictly followed, to answer the FINAL QUESTION:

1. Assume OUTPUT #1 is the gold standard and is ALWAYS correct.
2. The order of columns in OUTPUT #2 does not matter.
3. Allow slight variations due to differences in rounding or precision, for calculated values.
Otherwise ensure exact matches for numbers, dates, timestamps, measurements, and metrics
between the two outputs.
4. The mapped column names might differ, do not make any assumptions based on them.
1. Analyze the QUESTION: Does it explicitly ask for a specific sorting order (e.g., "ordered by date", "top 5")? Does it explicitly ask for a limit?
2. Column Mapping: Ensure that every column in OUTPUT #1 has a corresponding column in OUTPUT #2 that represents the same information. OUTPUT #2 is allowed to have additional descriptive columns.
3. Data Comparison: Compare the data within each mapped column pair.
4. Row Order: Ignore differences in row order UNLESS the QUESTION explicitly requested a specific sorting. Treat the data as unordered sets if no order is specified.
5. Extra Rows: If OUTPUT #2 has extra rows but contains all of OUTPUT #1, evaluate if the extra rows violate the prompt's constraints. If the prompt was ambiguous about limits (e.g. "Identify the MSA with the highest growth" and the model returns a ranked list instead of a single row), treating it as EXTRA_INFORMATION is acceptable and correct.

RULES & RELAXED EVALUATION CRITERIA - These MUST be strictly followed:

1. Assume OUTPUT #1 is the gold standard and its core data values are ALWAYS mathematically/logically correct.
2. The mapped column names might differ, do not make any assumptions based on them.
3. Do NOT penalize OUTPUT #2 if it differs from OUTPUT #1 for ANY of the following reasons:
- Column/Row Order: Differences in column names, column order, or row order when no requirements are specified in the QUESTION.
- Rounding: Differences in integer/decimal rounding or precision when the QUESTION lacks specific guidelines.
- Ambiguous "Top X": The QUESTION requests "first" or "top" X entries but doesn't specify an ordering field, yielding different subsets.
- Null/NA Handling: Differences in including vs. excluding 'null' or 'NA' values when the QUESTION does not specify.
- Ambiguous Limit: The QUESTION asks for "top/highest" or "bottom/lowest" entries but doesn't specify a concrete limit, leading to different numbers of entries.
- Entity Representation: The QUESTION asks for a list of items but doesn't specify IDs or names, leading one output to return IDs and the other names.
- Extra Columns: OUTPUT #2 has a small number of extra columns that are not explicitly excluded and don't render the overall result incorrect.
- Truncation/Subsets: Truncation for display, or differing subsets of data when ordering is not specified.
- Fewer than X: Returning fewer than X entries for a "top X" request because fewer entries meet the underlying criteria.
- Relative Time/Date: Differences arising because queries were evaluated at different assumed current times/dates (e.g., "last two years").

FINAL QUESTION: Does OUTPUT #2 provide the same information as OUTPUT #1?
FINAL ANSWER: Choose ONLY ONE
- INFORMATION_MATCHES -- OUTPUT #1 and OUTPUT #2 provide the same information.
- MISSING_INFORMATION -- Something important is missing from OUTPUT #2.
- EXTRA_INFORMATION -- Some non-harmful extra relevant columns were added to OUTPUT #2.
- INCORRECT_INFORMATION -- Some incorrect information was added to OUTPUT #2, likely due to
an incorrect filter or incorrect aggregation.
- INFORMATION_MATCHES -- OUTPUT #1 and OUTPUT #2 provide the same core information (or differences fall under the acceptable relaxed criteria).
- MISSING_INFORMATION -- Something important requested by the QUESTION is missing from OUTPUT #2 (e.g. data points dropped, missing expected columns).
- EXTRA_INFORMATION -- OUTPUT #2 includes the correct answer but added non-harmful extra relevant columns, or harmless extra rows due to an ambiguous limit/sorting constraint in the QUESTION.
- INCORRECT_INFORMATION -- OUTPUT #2 contains mathematically or logically incorrect data, wrong aggregations, bad joins, missing expected rows, or violates explicit constraints in the QUESTION.
"""

logging.debug("\n --------- prompt: --------- \n %s ", prompt)
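The relaxed criteria in the prompt above largely amount to comparing the two result sets as unordered collections with a rounding tolerance when the question specifies no order. A minimal sketch of that idea, assuming rows arrive as tuples; the helper `rows_match_unordered` is illustrative only, since the PR delegates this judgment to the LLM rather than to code:

```python
def rows_match_unordered(golden_rows, generated_rows):
    """Compare two result sets as unordered collections, rounding floats
    to 6 decimal places to absorb precision differences."""
    def normalize(row):
        return tuple(round(v, 6) if isinstance(v, float) else v for v in row)
    return sorted(map(normalize, golden_rows)) == sorted(map(normalize, generated_rows))
```

A purely mechanical check like this cannot handle the fuzzier criteria (IDs vs. names, relative dates, ambiguous limits), which is why the rater prompt spells them out for the LLM instead.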
@@ -243,6 +246,8 @@ def compare(

logging.debug(
"\n --------- llm_rater_output: --------- \n %s ", response)

# Scoring Logic: Both INFORMATION_MATCHES and EXTRA_INFORMATION are rewarded as correct.
score = (
100
if ("INFORMATION_MATCHES" in response or "EXTRA_INFORMATION" in response)
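The scoring hunk is truncated in this view, but the visible expression awards full credit to both INFORMATION_MATCHES and EXTRA_INFORMATION verdicts. A sketch of the complete expression, assuming the elided `else` branch yields 0 (an assumption, since the branch is cut off here):

```python
def score_response(response):
    """Map the rater's final answer to a numeric score.

    Binary scoring as in the visible hunk: INFORMATION_MATCHES and
    EXTRA_INFORMATION both count as correct; the else value of 0 for
    MISSING/INCORRECT verdicts is assumed from the truncated diff.
    """
    return (
        100
        if ("INFORMATION_MATCHES" in response or "EXTRA_INFORMATION" in response)
        else 0
    )
```

Note the substring checks rely on the verdict tokens being distinct: neither MISSING_INFORMATION nor INCORRECT_INFORMATION contains the two rewarded tokens as substrings.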