3 changes: 2 additions & 1 deletion evalbench/generators/models/gemini.py
@@ -1,5 +1,6 @@
from google import genai
from google.genai.types import GenerateContentResponse
from google.genai.errors import ServerError
from util.rate_limit import ResourceExhaustedError
from util.gcp import get_gcp_project, get_gcp_region
from google.api_core.exceptions import ResourceExhausted
@@ -36,7 +37,7 @@ def generate_internal(self, prompt):
if isinstance(response, GenerateContentResponse):
r = sanitize_sql(response.text)
return r
except ResourceExhausted as e:
except (ResourceExhausted, ServerError) as e:
raise ResourceExhaustedError(e)
except Exception as e:
logger.exception("Unhandled exception during generate_content")
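The gemini.py hunk above widens the `except` clause so that transient `ServerError`s are wrapped into the same retryable `ResourceExhaustedError` as quota errors. A minimal sketch of how a caller might retry on that wrapped error; the helper name `generate_with_backoff` and its parameters are hypothetical, not part of this PR:

```python
import time

class ResourceExhaustedError(Exception):
    """Raised when the model is rate-limited or temporarily unavailable."""

def generate_with_backoff(generate, prompt, retries=3, base_delay=1.0):
    """Call generate(prompt), retrying with exponential backoff on
    ResourceExhaustedError; re-raise after the final attempt."""
    for attempt in range(retries):
        try:
            return generate(prompt)
        except ResourceExhaustedError:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
```

Wrapping both error types into one exception keeps this retry policy in a single place instead of duplicating it per error class.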
69 changes: 37 additions & 32 deletions evalbench/scorers/llmrater.py
@@ -2,15 +2,15 @@
LLMRater
In this comparison strategy, an LLM compares the golden execution results with the generated sql execution results.
It returns a score between 0 and 100, with a score of 100 for concrete positive cases,
where either there is a mismatch of columns names or extra relevant columns in Generated SQL exists.
where either there is a mismatch of column names, extra relevant columns, or harmless unrequested sorting/limits in Generated SQL.

Evaluation rules given to LLM:
1. Assume OUTPUT #1 is the gold standard and is ALWAYS correct.
2. The order of columns in OUTPUT #2 does not matter.
3. Allow slight variations due to differences in rounding or precision, for calculated values.
Otherwise ensure exact matches for numbers, dates, timestamps, measurements, and metrics
between the two outputs.
4. The mapped column names might differ, do not make any assumptions based on them.
3. The order of rows in OUTPUT #2 does not matter UNLESS explicitly requested in the prompt.
4. Allow slight variations due to differences in rounding or precision, for calculated values.
5. Allow acceptable divergences based on relaxed criteria (ambiguous counts, null handling, relative dates, IDs vs Names, ambiguous limits, etc.).
6. The mapped column names might differ, do not make any assumptions based on them.

Run Configuration Options:
1. model_config: Required
@@ -45,7 +45,7 @@
1. [EntityError] - Wrong table or entity was used.
2. [ValueLinkingError] - Wrong literal value (string/number) was used.
3. [ColumnLinkingError] - Wrong column was selected or used in a condition.
4. [OrderingError] - Sorting order (ASC/DESC) or column is incorrect.
4. [OrderingError] - Sorting order (ASC/DESC) or column is incorrect (only flag if prompt explicitly requested sorting).
5. [InstructionError] - Failed to follow specific constraints in the prompt (e.g., "return top 5").
6. [IntentError] - Misinterpreted the user's fundamental request.
7. [DataTypesError] - Incorrect handling of data types (e.g., casting, dates).
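The taxonomy above labels failure modes with bracketed tags such as `[OrderingError]`. A minimal sketch of pulling those tags out of the rater's free-text response; the helper `extract_error_tags` is hypothetical and not part of this PR:

```python
import re

# The seven categories listed in the rater prompt.
ERROR_TAGS = {
    "EntityError", "ValueLinkingError", "ColumnLinkingError", "OrderingError",
    "InstructionError", "IntentError", "DataTypesError",
}

def extract_error_tags(rater_output):
    """Return the known bracketed error categories, in order of appearance."""
    found = re.findall(r"\[([A-Za-z]+)\]", rater_output)
    return [tag for tag in found if tag in ERROR_TAGS]
```

Restricting matches to the known set keeps stray bracketed text in the LLM's reasoning from being miscounted as an error category.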
@@ -187,46 +187,49 @@ def compare(

The correct answer to this question is:

OUTPUT #1:
OUTPUT #1 (Gold Standard):

{golden_execution_result}


We get the following answer from a second query.
We get the following answer from a generated query:

OUTPUT #2
OUTPUT #2 (Generated Result):

{generated_execution_result}


Thinking step by step, compare the two outputs and look for differences in data presentation.
Thinking step by step, compare the two outputs and look for differences in data and presentation.
Here are steps to follow:

1. Ensure that every column in OUTPUT #1 has a corresponding column in OUTPUT #2 that represents
the same information, even if the column names are different.
2. All columns present in OUTPUT #1 must also be present in OUTPUT #2. OUTPUT #2 is allowed to
have additional columns relevant to the query.
3. Compare the data within each mapped column pair between OUTPUT #1 and OUTPUT #2.
Ensure that OUTPUT #2 contains all the data from OUTPUT #1, with no missing or extra rows.
4. Minor string transformations are allowed (e.g., concatenating first and last name), but the
underlying information must be preserved.

RULES - These MUST be strictly followed, to answer the FINAL QUESTION:

1. Assume OUTPUT #1 is the gold standard and is ALWAYS correct.
2. The order of columns in OUTPUT #2 does not matter.
3. Allow slight variations due to differences in rounding or precision, for calculated values.
Otherwise ensure exact matches for numbers, dates, timestamps, measurements, and metrics
between the two outputs.
4. The mapped column names might differ, do not make any assumptions based on them.
1. Analyze the QUESTION: Does it explicitly ask for a specific sorting order (e.g., "ordered by date", "top 5")? Does it explicitly ask for a limit?
2. Column Mapping: Ensure that every column in OUTPUT #1 has a corresponding column in OUTPUT #2 that represents the same information. OUTPUT #2 is allowed to have additional descriptive columns.
3. Data Comparison: Compare the data within each mapped column pair.
4. Row Order: Ignore differences in row order UNLESS the QUESTION explicitly requested a specific sorting. Treat the data as unordered sets if no order is specified.
5. Extra Rows: If OUTPUT #2 has extra rows but contains all of OUTPUT #1, evaluate if the extra rows violate the prompt's constraints. If the prompt was ambiguous about limits (e.g. "Identify the MSA with the highest growth" and the model returns a ranked list instead of a single row), treating it as EXTRA_INFORMATION is acceptable and correct.

RULES & RELAXED EVALUATION CRITERIA - These MUST be strictly followed:

1. Assume OUTPUT #1 is the gold standard and its core data values are ALWAYS mathematically/logically correct.
2. The mapped column names might differ, do not make any assumptions based on them.
3. Do NOT penalize OUTPUT #2 if it differs from OUTPUT #1 for ANY of the following reasons:
- Column/Row Order: Differences in column names, column order, or row order when no requirements are specified in the QUESTION.
- Rounding: Differences in integer/decimal rounding or precision when the QUESTION lacks specific guidelines.
- Ambiguous "Top X": The QUESTION requests "first" or "top" X entries but doesn't specify an ordering field, yielding different subsets.
- Null/NA Handling: Differences in including vs. excluding 'null' or 'NA' values when the QUESTION does not specify.
- Ambiguous Limit: The QUESTION asks for "top/highest" or "bottom/lowest" entries but doesn't specify a concrete limit, leading to different numbers of entries.
- Entity Representation: The QUESTION asks for a list of items but doesn't specify IDs or names, leading one output to return IDs and the other names.
- Extra Columns: OUTPUT #2 has a small number of extra columns that are not explicitly excluded and don't render the overall result incorrect.
- Truncation/Subsets: Truncation for display, or differing subsets of data when ordering is not specified.
- Fewer than X: Returning fewer than X entries for a "top X" request because fewer entries meet the underlying criteria.
- Relative Time/Date: Differences arising because queries were evaluated at different assumed current times/dates (e.g., "last two years").

FINAL QUESTION: Does OUTPUT #2 provide the same information as OUTPUT #1?
FINAL ANSWER: Choose ONLY ONE
- INFORMATION_MATCHES -- OUTPUT #1 and OUTPUT #2 provide the same information.
- MISSING_INFORMATION -- Something important is missing from OUTPUT #2.
- EXTRA_INFORMATION -- Some non-harmful extra relevant columns were added to OUTPUT #2.
- INCORRECT_INFORMATION -- Some incorrect information was added to OUTPUT #2, likely due to
an incorrect filter or incorrect aggregation.
- INFORMATION_MATCHES -- OUTPUT #1 and OUTPUT #2 provide the same core information (or differences fall under the acceptable relaxed criteria).
- MISSING_INFORMATION -- Something important requested by the QUESTION is missing from OUTPUT #2 (e.g. data points dropped, missing expected columns).
- EXTRA_INFORMATION -- OUTPUT #2 includes the correct answer but added non-harmful extra relevant columns, or harmless extra rows due to an ambiguous limit/sorting constraint in the QUESTION.
- INCORRECT_INFORMATION -- OUTPUT #2 contains mathematically or logically incorrect data, wrong aggregations, bad joins, missing expected rows, or violates explicit constraints in the QUESTION.
"""

logging.debug("\n --------- prompt: --------- \n %s ", prompt)
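The relaxed criteria in the prompt above largely amount to comparing the two result sets as unordered collections with a rounding tolerance when the question specifies no order. A minimal sketch of that idea, assuming rows arrive as tuples; the helper `rows_match_unordered` is illustrative only, since the PR delegates this judgment to the LLM rather than to code:

```python
def rows_match_unordered(golden_rows, generated_rows):
    """Compare two result sets as unordered collections, rounding floats
    to 6 decimal places to absorb precision differences."""
    def normalize(row):
        return tuple(round(v, 6) if isinstance(v, float) else v for v in row)
    return sorted(map(normalize, golden_rows)) == sorted(map(normalize, generated_rows))
```

A purely mechanical check like this cannot handle the fuzzier criteria (IDs vs. names, relative dates, ambiguous limits), which is why the rater prompt spells them out for the LLM instead.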
@@ -243,6 +246,8 @@ def compare(

logging.debug(
"\n --------- llm_rater_output: --------- \n %s ", response)

# Scoring Logic: Both INFORMATION_MATCHES and EXTRA_INFORMATION are rewarded as correct.
score = (
100
if ("INFORMATION_MATCHES" in response or "EXTRA_INFORMATION" in response)
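The scoring hunk is truncated in this view, but the visible expression awards full credit to both INFORMATION_MATCHES and EXTRA_INFORMATION verdicts. A sketch of the complete expression, assuming the elided `else` branch yields 0 (an assumption, since the branch is cut off here):

```python
def score_response(response):
    """Map the rater's final answer to a numeric score.

    Binary scoring as in the visible hunk: INFORMATION_MATCHES and
    EXTRA_INFORMATION both count as correct; the else value of 0 for
    MISSING/INCORRECT verdicts is assumed from the truncated diff.
    """
    return (
        100
        if ("INFORMATION_MATCHES" in response or "EXTRA_INFORMATION" in response)
        else 0
    )
```

Note the substring checks rely on the verdict tokens being distinct: neither MISSING_INFORMATION nor INCORRECT_INFORMATION contains the two rewarded tokens as substrings.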