Skip to content

Scalar COUNT(*) Returns 0 with WHERE Clauses #102

@patdevinwilson

Description

@patdevinwilson

Bug description

Scalar COUNT(*) (without GROUP BY) incorrectly returns 0 when querying with WHERE clauses, while COUNT(*) ... GROUP BY works correctly. This affects both CPU and GPU execution modes.

This bug significantly impacts TPC-DS benchmark results and likely affects production queries using scalar COUNT(*) with filters. The workaround is straightforward but requires query rewrites.

Environment

  • Presto Version: Presto Native with Velox (testversion)
  • Build Type: GPU-enabled (cuDF ON)
  • Dataset: TPC-DS SF100, Parquet format, DOUBLE types
  • Platform: Linux 6.8.0-54-generic
  • Reproduces on: Both CPU-only and GPU-enabled modes

Minimal Reproduction

Setup

-- Table: date_dim (73,049 rows)
-- Column: d_month_seq (INTEGER)
-- Known data: 366 rows with d_month_seq BETWEEN 1200 AND 1211

Broken: Scalar COUNT(*) returns 0

SELECT count(*) 
FROM date_dim 
WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Expected: 366
-- Actual: 0 

Works: COUNT(*) with GROUP BY returns correct results

SELECT d_month_seq, count(*) 
FROM date_dim 
WHERE d_month_seq BETWEEN 1200 AND 1211 
GROUP BY d_month_seq;
-- Expected: 12 rows totaling 366
-- Actual: 12 rows totaling 366 

Works: COUNT(column) returns correct results

SELECT count(d_date_sk) 
FROM date_dim 
WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Expected: 366
-- Actual: 366 

Works: SUM(1) returns correct results

SELECT sum(1) 
FROM date_dim 
WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Expected: 366
-- Actual: 366 

Verification

This proves the data exists and the issue is specific to scalar COUNT(*):

-- Grouped aggregation works
SELECT d_month_seq, count(*) cnt
FROM date_dim 
WHERE d_month_seq BETWEEN 1195 AND 1215 
GROUP BY d_month_seq 
ORDER BY d_month_seq;
-- Returns correct counts for each group 

-- But scalar aggregation fails
SELECT count(*) FROM date_dim WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Returns 0 

Real-World Impact

TPC-DS SF100 Benchmark: 4 queries affected, all using scalar COUNT(*):

Q38: COUNT(*) from INTERSECT subquery

SELECT count(*) FROM (
    SELECT DISTINCT c_last_name, c_first_name, d_date 
    FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
    JOIN customer ON ss_customer_sk = c_customer_sk
    WHERE d_month_seq BETWEEN 1200 AND 1211
    INTERSECT ...
) subquery;
-- Expected: 25,818, Actual: 0

Q87: COUNT(*) from EXCEPT subquery

SELECT count(*) FROM (
    SELECT DISTINCT c_last_name, c_first_name, d_date 
    FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
    JOIN customer ON ss_customer_sk = c_customer_sk
    WHERE d_month_seq BETWEEN 1200 AND 1211
    EXCEPT ...
) subquery;
-- Expected: 4,250,952, Actual: 0

Q88: Multiple scalar COUNT(*) aggregations

SELECT 
    count(*) h8_30_to_9,   -- Returns 0, expected 188,601
    count(*) h9_to_9_30,   -- Returns 0, expected 380,247
    -- ... 6 more scalar COUNT(*) aggregations
FROM store_sales 
JOIN time_dim ON ss_sold_time_sk = t_time_sk
WHERE t_hour = 8 AND t_minute >= 30;
-- All return 0 instead of correct values

Q96: Simple scalar COUNT(*)

SELECT count(*) 
FROM store_sales 
JOIN time_dim ON ss_sold_time_sk = t_time_sk
JOIN store ON ss_store_sk = s_store_sk
WHERE t_hour = 20 AND t_minute < 30 
AND s_store_name = 'ese';
-- Expected: 70,344, Actual: 0

Pattern Analysis

TPC-DS Evidence (20 queries use COUNT(*)):

  • 5 queries PASS (Q10, Q23, Q41, Q64, Q69) - All use COUNT(*) ... GROUP BY
  • 4 queries return 0 (Q38, Q87, Q88, Q96) - All use scalar COUNT(*)
  • 10 queries fail for unrelated reasons (cuDF limitations, result mismatches)
  • 1 query not tested (disabled)

System information

Basic Information:
OS: Linux
Kernel: 6.8.0-54-generic
Architecture: aarch64
Testing Branch: https://github.com/rapidsai/velox-testing/tree/pwilson/tpcds-test-improvements

Relevant logs

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions