
Add minimal IPL data supplement to distinguish two semantically different SQL queries #181

Open
tohpinren wants to merge 1 commit into xlang-ai:main from tohpinren:ipl-data


@tohpinren

Hi Spider2 maintainers,

While analyzing the Spider2-Lite (SQLite) IPL dataset, I found a case where the current database contents cannot distinguish two semantically different SQL queries under execution.

Since Spider2 relies on execution-based evaluation, this means the dataset currently treats both queries as equivalent even though they implement different logic.

The original natural-language question is:

"Retrieve the names of players who scored no less than 100 runs in a match while playing for the team that lost that match."

On the current IPL sample, both queries return the same result (an empty set). The root cause is that the IPL sample rows do not contain any overlapping keys between ball_by_ball and batsman_scored on (match_id, over_id, ball_id, innings_no).

Because of this, the join between these two tables produces zero rows, which causes the first CTE to be empty and the rest of the query pipeline to also be empty. As a result, both queries return no rows regardless of their final join predicate, masking their semantic difference.
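
The empty-join condition described above can be checked directly. The following is a minimal sketch using Python's built-in sqlite3 module on an in-memory database; the two rows are hypothetical stand-ins for the IPL sample (not its actual contents), chosen so that the tables share no (match_id, over_id, ball_id, innings_no) key.

```python
import sqlite3

# In-memory database with only the two tables involved in the first CTE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ball_by_ball (
    match_id INT, over_id INT, ball_id INT, innings_no INT, striker INT
);
CREATE TABLE batsman_scored (
    match_id INT, over_id INT, ball_id INT, innings_no INT, runs_scored INT
);
-- Hypothetical rows: the keys differ in match_id, so they never overlap.
INSERT INTO ball_by_ball   VALUES (1, 1, 1, 1, 10);
INSERT INTO batsman_scored VALUES (2, 1, 1, 1, 4);
""")

# Count rows produced by the same four-column join used in player_runs.
overlap = conn.execute("""
SELECT COUNT(*)
FROM ball_by_ball AS bbb
JOIN batsman_scored AS bsc
  ON bbb.match_id = bsc.match_id
 AND bbb.over_id = bsc.over_id
 AND bbb.ball_id = bsc.ball_id
 AND bbb.innings_no = bsc.innings_no
""").fetchone()[0]

print(overlap)
```

Running the same COUNT(*) against the real IPL sample is a quick way to confirm whether player_runs can ever produce a row there. A count of zero means every later CTE is empty regardless of the final join predicate.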

This PR introduces a minimal counterexample so that this semantic difference becomes observable under execution.

Query description

The query identifies players who scored at least 100 runs in a match and were on the losing team, and returns their names.

It works in four steps:

  1. Aggregate runs scored by each player in each match.
  2. Identify the losing team for each match.
  3. Select players who both scored ≥100 runs and belong to the losing team in that match.
  4. Return the corresponding player names.

Query 1

WITH player_runs AS (
    SELECT 
        bbb.striker AS player_id, 
        bbb.match_id, 
        SUM(bsc.runs_scored) AS total_runs 
    FROM ball_by_ball AS bbb
    JOIN batsman_scored AS bsc
      ON bbb.match_id = bsc.match_id 
     AND bbb.over_id = bsc.over_id 
     AND bbb.ball_id = bsc.ball_id 
     AND bbb.innings_no = bsc.innings_no
    GROUP BY bbb.striker, bbb.match_id
    HAVING SUM(bsc.runs_scored) >= 100
),
losing_teams AS (
    SELECT 
        match_id,
        CASE 
            WHEN match_winner = team_1 THEN team_2 
            ELSE team_1 
        END AS loser
    FROM match
),
players_in_losing_teams AS (
    SELECT 
        pr.player_id,
        pr.match_id
    FROM player_runs AS pr
    JOIN losing_teams AS lt
      ON pr.match_id = lt.match_id
    JOIN player_match AS pm
      ON pr.player_id = pm.player_id
     AND pr.match_id = pm.match_id
     AND lt.loser = pm.team_id
)
SELECT DISTINCT
    p.player_name
FROM player AS p
JOIN players_in_losing_teams AS plt
  ON p.player_id = plt.player_id
ORDER BY p.player_name;

Query 2

This query is identical except for the final join condition:

WITH player_runs AS (
    SELECT 
        bbb.striker AS player_id, 
        bbb.match_id, 
        SUM(bsc.runs_scored) AS total_runs 
    FROM ball_by_ball AS bbb
    JOIN batsman_scored AS bsc
      ON bbb.match_id = bsc.match_id 
     AND bbb.over_id = bsc.over_id 
     AND bbb.ball_id = bsc.ball_id 
     AND bbb.innings_no = bsc.innings_no
    GROUP BY bbb.striker, bbb.match_id
    HAVING SUM(bsc.runs_scored) >= 100
),
losing_teams AS (
    SELECT 
        match_id,
        CASE 
            WHEN match_winner = team_1 THEN team_2 
            ELSE team_1 
        END AS loser
    FROM match
),
players_in_losing_teams AS (
    SELECT 
        pr.player_id,
        pr.match_id
    FROM player_runs AS pr
    JOIN losing_teams AS lt
      ON pr.match_id = lt.match_id
    JOIN player_match AS pm
      ON pr.player_id = pm.player_id
     AND pr.match_id = pm.match_id
     AND lt.loser = pm.team_id
)
SELECT DISTINCT
    p.player_name
FROM player AS p
JOIN players_in_losing_teams AS plt
  ON p.player_id <> plt.player_id
ORDER BY p.player_name;

Why the queries are different

The queries differ only in the final join predicate:

  • Query 1: p.player_id = plt.player_id
  • Query 2: p.player_id <> plt.player_id

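Query 1 returns exactly the players whose ID appears in players_in_losing_teams.
Query 2 pairs each player with every qualifying record that has a different player_id, so it returns every player who differs from at least one qualifying player. Whenever players_in_losing_teams is non-empty, that is typically almost the entire player table; when it is empty, both queries return no rows.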
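
To illustrate the effect of the two predicates in isolation, here is a small sketch in Python with sqlite3. The table `qualifying` and the three player rows are hypothetical; `qualifying` stands in for players_in_losing_teams.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (player_id INT, player_name TEXT);
CREATE TABLE qualifying (player_id INT);  -- hypothetical stand-in for players_in_losing_teams
INSERT INTO player VALUES (1, 'A'), (2, 'B'), (3, 'C');
INSERT INTO qualifying VALUES (2);
""")

def names(op):
    """Run the final join with the given comparison operator ('=' or '<>')."""
    sql = (
        "SELECT DISTINCT p.player_name "
        "FROM player AS p "
        f"JOIN qualifying AS q ON p.player_id {op} q.player_id "
        "ORDER BY p.player_name"
    )
    return [row[0] for row in conn.execute(sql)]

equal_names = names("=")      # only the qualifying player
unequal_names = names("<>")   # every player except the qualifying one

# With no qualifying rows, both predicates yield the same (empty) result.
conn.execute("DELETE FROM qualifying")
equal_when_empty = names("=")
unequal_when_empty = names("<>")

print(equal_names, unequal_names, equal_when_empty, unequal_when_empty)
```

The last two results show why an empty intermediate CTE hides the difference: an inner join against an empty relation is empty under any predicate.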

Change introduced in this PR

This PR adds a minimal counterexample consisting of 8 rows across 8 IPL tables.

With these rows added:

  • Query 1 remains empty
  • Query 2 becomes non-empty due to the inequality join

This makes the semantic difference between the two queries observable under execution.
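
As an illustration of how such a counterexample can behave, the sketch below builds one possible configuration in an in-memory SQLite database and runs both queries. The rows are hypothetical and are not the rows added by this PR; only the five tables the queries reference are created, and the column lists are reduced to what the queries use. The assumption here is that the qualifying striker has no row in player, which keeps Query 1 empty while Query 2 matches another player.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE player (player_id INT, player_name TEXT);
CREATE TABLE match (match_id INT, team_1 INT, team_2 INT, match_winner INT);
CREATE TABLE player_match (match_id INT, player_id INT, team_id INT);
CREATE TABLE ball_by_ball (match_id INT, over_id INT, ball_id INT, innings_no INT, striker INT);
CREATE TABLE batsman_scored (match_id INT, over_id INT, ball_id INT, innings_no INT, runs_scored INT);

-- Hypothetical rows (NOT the rows from the PR).
INSERT INTO player         VALUES (2, 'Other Player');  -- striker 1 has no player row
INSERT INTO match          VALUES (1, 1, 2, 2);         -- team 2 wins, so team 1 loses
INSERT INTO player_match   VALUES (1, 1, 1);            -- striker 1 plays for team 1
INSERT INTO ball_by_ball   VALUES (1, 1, 1, 1, 1);      -- striker 1 faces the ball
INSERT INTO batsman_scored VALUES (1, 1, 1, 1, 100);    -- contrived: 100 runs on one ball
""")

# Shared query text; {op} is the only part that differs between Query 1 and Query 2.
QUERY = """
WITH player_runs AS (
    SELECT bbb.striker AS player_id, bbb.match_id,
           SUM(bsc.runs_scored) AS total_runs
    FROM ball_by_ball AS bbb
    JOIN batsman_scored AS bsc
      ON bbb.match_id = bsc.match_id
     AND bbb.over_id = bsc.over_id
     AND bbb.ball_id = bsc.ball_id
     AND bbb.innings_no = bsc.innings_no
    GROUP BY bbb.striker, bbb.match_id
    HAVING SUM(bsc.runs_scored) >= 100
),
losing_teams AS (
    SELECT match_id,
           CASE WHEN match_winner = team_1 THEN team_2 ELSE team_1 END AS loser
    FROM match
),
players_in_losing_teams AS (
    SELECT pr.player_id, pr.match_id
    FROM player_runs AS pr
    JOIN losing_teams AS lt ON pr.match_id = lt.match_id
    JOIN player_match AS pm
      ON pr.player_id = pm.player_id
     AND pr.match_id = pm.match_id
     AND lt.loser = pm.team_id
)
SELECT DISTINCT p.player_name
FROM player AS p
JOIN players_in_losing_teams AS plt
  ON p.player_id {op} plt.player_id
ORDER BY p.player_name
"""

def run(op):
    """Execute the shared query with '=' (Query 1) or '<>' (Query 2)."""
    return [row[0] for row in conn.execute(QUERY.format(op=op))]

query1_rows = run("=")
query2_rows = run("<>")
print(query1_rows, query2_rows)
```

Under these assumed rows, Query 1 returns nothing and Query 2 returns the one unrelated player, so an execution-based comparison no longer treats the two queries as equivalent.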

@tohpinren

Hi @lfy79001, following up on this PR. Would appreciate any feedback when convenient. Happy to clarify or make changes if needed.
