Add minimal IPL data supplement to distinguish two semantically different SQL queries by tohpinren · Pull Request #181 · xlang-ai/Spider2

tohpinren · 2026-03-12T08:16:45Z

Hi Spider2 maintainers,

While analyzing the Spider2-Lite (SQLite) IPL dataset, I found a case where the current database contents cannot distinguish two semantically different SQL queries under execution.

Since Spider2 relies on execution-based evaluation, this means the dataset currently treats both queries as equivalent even though they implement different logic.

The original natural-language question is:

"Retrieve the names of players who scored no less than 100 runs in a match while playing for the team that lost that match."

On the current IPL sample, both queries return the same result (an empty set). The root cause is that the IPL sample rows do not contain any overlapping keys between ball_by_ball and batsman_scored on (match_id, over_id, ball_id, innings_no).

Because of this, the join between these two tables produces zero rows, which causes the first CTE to be empty and the rest of the query pipeline to also be empty. As a result, both queries return no rows regardless of their final join predicate, masking their semantic difference.
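The masking effect described above can be reproduced in isolation. The following is a minimal sketch using Python's built-in sqlite3 module; the table layouts and rows are hypothetical stand-ins reduced to the columns the queries use, not the actual IPL sample, and the final SELECT joins player directly to player_runs (skipping the losing-team step) to keep the example short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ball_by_ball (match_id INT, over_id INT, ball_id INT, innings_no INT, striker INT);
    CREATE TABLE batsman_scored (match_id INT, over_id INT, ball_id INT, innings_no INT, runs_scored INT);
    CREATE TABLE player (player_id INT, player_name TEXT);

    -- Hypothetical rows: same match, but over_id differs, so the composite keys never overlap.
    INSERT INTO ball_by_ball VALUES (1, 1, 1, 1, 10);
    INSERT INTO batsman_scored VALUES (1, 2, 1, 1, 100);
    INSERT INTO player VALUES (10, 'A'), (11, 'B');
""")

# {op} is replaced by "=" or "<>" to mimic the two final join predicates.
TEMPLATE = """
    WITH player_runs AS (
        SELECT bbb.striker AS player_id, bbb.match_id, SUM(bsc.runs_scored) AS total_runs
        FROM ball_by_ball AS bbb
        JOIN batsman_scored AS bsc
          ON bbb.match_id = bsc.match_id
         AND bbb.over_id = bsc.over_id
         AND bbb.ball_id = bsc.ball_id
         AND bbb.innings_no = bsc.innings_no
        GROUP BY bbb.striker, bbb.match_id
        HAVING SUM(bsc.runs_scored) >= 100
    )
    SELECT DISTINCT p.player_name
    FROM player AS p
    JOIN player_runs AS pr ON p.player_id {op} pr.player_id
"""

rows_eq = conn.execute(TEMPLATE.format(op="=")).fetchall()
rows_ne = conn.execute(TEMPLATE.format(op="<>")).fetchall()
print(rows_eq, rows_ne)  # both lists are empty because player_runs has no rows
```

Because the first CTE yields no rows, the choice of predicate in the final join never gets a chance to matter.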

This PR introduces a minimal counterexample so that this semantic difference becomes observable under execution.

Query description

The query identifies players who scored at least 100 runs in a match and were on the losing team, and returns their names.

It works in four steps:

1. Aggregate runs scored by each player in each match.
2. Identify the losing team for each match.
3. Select players who both scored ≥100 runs and belong to the losing team in that match.
4. Return the corresponding player names.

Query 1

WITH player_runs AS (
    SELECT 
        bbb.striker AS player_id, 
        bbb.match_id, 
        SUM(bsc.runs_scored) AS total_runs 
    FROM ball_by_ball AS bbb
    JOIN batsman_scored AS bsc
      ON bbb.match_id = bsc.match_id 
     AND bbb.over_id = bsc.over_id 
     AND bbb.ball_id = bsc.ball_id 
     AND bbb.innings_no = bsc.innings_no
    GROUP BY bbb.striker, bbb.match_id
    HAVING SUM(bsc.runs_scored) >= 100
),
losing_teams AS (
    SELECT 
        match_id,
        CASE 
            WHEN match_winner = team_1 THEN team_2 
            ELSE team_1 
        END AS loser
    FROM match
),
players_in_losing_teams AS (
    SELECT 
        pr.player_id,
        pr.match_id
    FROM player_runs AS pr
    JOIN losing_teams AS lt
      ON pr.match_id = lt.match_id
    JOIN player_match AS pm
      ON pr.player_id = pm.player_id
     AND pr.match_id = pm.match_id
     AND lt.loser = pm.team_id
)
SELECT DISTINCT
    p.player_name
FROM player AS p
JOIN players_in_losing_teams AS plt
  ON p.player_id = plt.player_id
ORDER BY p.player_name;

Query 2

This query is identical except for the final join condition:

WITH player_runs AS (
    SELECT 
        bbb.striker AS player_id, 
        bbb.match_id, 
        SUM(bsc.runs_scored) AS total_runs 
    FROM ball_by_ball AS bbb
    JOIN batsman_scored AS bsc
      ON bbb.match_id = bsc.match_id 
     AND bbb.over_id = bsc.over_id 
     AND bbb.ball_id = bsc.ball_id 
     AND bbb.innings_no = bsc.innings_no
    GROUP BY bbb.striker, bbb.match_id
    HAVING SUM(bsc.runs_scored) >= 100
),
losing_teams AS (
    SELECT 
        match_id,
        CASE 
            WHEN match_winner = team_1 THEN team_2 
            ELSE team_1 
        END AS loser
    FROM match
),
players_in_losing_teams AS (
    SELECT 
        pr.player_id,
        pr.match_id
    FROM player_runs AS pr
    JOIN losing_teams AS lt
      ON pr.match_id = lt.match_id
    JOIN player_match AS pm
      ON pr.player_id = pm.player_id
     AND pr.match_id = pm.match_id
     AND lt.loser = pm.team_id
)
SELECT DISTINCT
    p.player_name
FROM player AS p
JOIN players_in_losing_teams AS plt
  ON p.player_id <> plt.player_id
ORDER BY p.player_name;

Why the queries are different

The queries differ only in the final join predicate:

Query 1: p.player_id = plt.player_id
Query 2: p.player_id <> plt.player_id

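The first joins each player to the qualifying records with the same ID, so it returns only the qualifying players.
The second joins each player to every qualifying record with a different ID, so it returns players other than the one named in that record. With a single qualifying record, that is every other player in the player table.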
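The difference between the two predicates is easy to see on a toy example. This sketch uses Python's sqlite3 module with a hypothetical `qualifying` table standing in for the players_in_losing_teams CTE; the player rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE player (player_id INTEGER, player_name TEXT);
    INSERT INTO player VALUES (1, 'A'), (2, 'B'), (3, 'C');

    -- Hypothetical stand-in for players_in_losing_teams: one qualifying player.
    CREATE TABLE qualifying (player_id INTEGER);
    INSERT INTO qualifying VALUES (2);
""")

def names(op):
    """Run the final join with the given comparison operator and return the names."""
    sql = f"""
        SELECT DISTINCT p.player_name
        FROM player AS p
        JOIN qualifying AS q ON p.player_id {op} q.player_id
        ORDER BY p.player_name
    """
    return [row[0] for row in conn.execute(sql)]

equal_names = names("=")       # ['B']
not_equal_names = names("<>")  # ['A', 'C']
print(equal_names, not_equal_names)
```

With one qualifying row, the two result sets are complements of each other, which is why any non-empty intermediate result exposes the difference.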

Change introduced in this PR

This PR adds a minimal counterexample consisting of 8 rows across 8 IPL tables.

With these rows added:

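Query 1 remains empty
Query 2 becomes non-empty due to the inequality join

For both outcomes to hold at once, players_in_losing_teams must contain at least one row whose player_id matches no row in player, and player must contain at least one row. If the qualifying player_id were present in player, Query 1 would return that player's name.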

This makes the semantic difference between the two queries observable under execution.
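One way to reproduce this outcome end to end is sketched below with Python's sqlite3 module. The schemas are reduced to the columns the queries reference, and every row is hypothetical; these are not the rows added by this PR. The table name is quoted as "match" as a precaution, since MATCH is also an SQL keyword.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ball_by_ball (match_id INT, over_id INT, ball_id INT, innings_no INT, striker INT);
    CREATE TABLE batsman_scored (match_id INT, over_id INT, ball_id INT, innings_no INT, runs_scored INT);
    CREATE TABLE "match" (match_id INT, team_1 INT, team_2 INT, match_winner INT);
    CREATE TABLE player_match (match_id INT, player_id INT, team_id INT);
    CREATE TABLE player (player_id INT, player_name TEXT);

    -- Hypothetical rows. A single ball worth 100 runs is unrealistic but keeps the data minimal.
    INSERT INTO ball_by_ball VALUES (1, 1, 1, 1, 99);
    INSERT INTO batsman_scored VALUES (1, 1, 1, 1, 100);
    INSERT INTO "match" VALUES (1, 1, 2, 1);        -- team 1 wins, so team 2 is the loser
    INSERT INTO player_match VALUES (1, 99, 2);     -- striker 99 played for the losing team
    INSERT INTO player VALUES (7, 'Other Player');  -- deliberately no player row for id 99
""")

# Same pipeline as Query 1 and Query 2; {op} is the final join operator.
TEMPLATE = """
WITH player_runs AS (
    SELECT bbb.striker AS player_id, bbb.match_id, SUM(bsc.runs_scored) AS total_runs
    FROM ball_by_ball AS bbb
    JOIN batsman_scored AS bsc
      ON bbb.match_id = bsc.match_id
     AND bbb.over_id = bsc.over_id
     AND bbb.ball_id = bsc.ball_id
     AND bbb.innings_no = bsc.innings_no
    GROUP BY bbb.striker, bbb.match_id
    HAVING SUM(bsc.runs_scored) >= 100
),
losing_teams AS (
    SELECT match_id,
           CASE WHEN match_winner = team_1 THEN team_2 ELSE team_1 END AS loser
    FROM "match"
),
players_in_losing_teams AS (
    SELECT pr.player_id, pr.match_id
    FROM player_runs AS pr
    JOIN losing_teams AS lt ON pr.match_id = lt.match_id
    JOIN player_match AS pm
      ON pr.player_id = pm.player_id
     AND pr.match_id = pm.match_id
     AND lt.loser = pm.team_id
)
SELECT DISTINCT p.player_name
FROM player AS p
JOIN players_in_losing_teams AS plt ON p.player_id {op} plt.player_id
ORDER BY p.player_name;
"""

query1_rows = [r[0] for r in conn.execute(TEMPLATE.format(op="="))]
query2_rows = [r[0] for r in conn.execute(TEMPLATE.format(op="<>"))]
print(query1_rows)  # []
print(query2_rows)  # ['Other Player']
```

An execution-based comparison of the two result lists now reports a mismatch, where on data with no overlapping ball keys it would report equality.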

tohpinren · 2026-03-19T08:13:15Z

Hi @lfy79001, following up on this PR. Would appreciate any feedback when convenient. Happy to clarify or make changes if needed.

Add IPL rows to distinguish = and <> queries

2636255

tohpinren wants to merge 1 commit into xlang-ai:main from tohpinren:ipl-data
