forked from facebookincubator/velox
-
Notifications
You must be signed in to change notification settings - Fork 9
Open
Labels
bugSomething isn't workingSomething isn't working
Description
Bug description
Scalar COUNT(*) (without GROUP BY) incorrectly returns 0 when querying with WHERE clauses, while COUNT(*) ... GROUP BY works correctly. This affects both CPU and GPU execution modes.
This bug significantly impacts TPC-DS benchmark results and likely affects production queries using scalar COUNT(*) with filters. The workaround is straightforward but requires query rewrites.
Environment
- Presto Version: Presto Native with Velox (testversion)
- Build Type: GPU-enabled (cuDF ON)
- Dataset: TPC-DS SF100, Parquet format, DOUBLE types
- Platform: Linux 6.8.0-54-generic
- Reproduces on: Both CPU-only and GPU-enabled modes
Minimal Reproduction
Setup
-- Table: date_dim (73,049 rows)
-- Column: d_month_seq (INTEGER)
-- Known data: 366 rows with d_month_seq BETWEEN 1200 AND 1211Broken: Scalar COUNT(*) returns 0
SELECT count(*)
FROM date_dim
WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Expected: 366
-- Actual: 0 Works: COUNT(*) with GROUP BY returns correct results
SELECT d_month_seq, count(*)
FROM date_dim
WHERE d_month_seq BETWEEN 1200 AND 1211
GROUP BY d_month_seq;
-- Expected: 12 rows totaling 366
-- Actual: 12 rows totaling 366 Works: COUNT(column) returns correct results
SELECT count(d_date_sk)
FROM date_dim
WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Expected: 366
-- Actual: 366 Works: SUM(1) returns correct results
SELECT sum(1)
FROM date_dim
WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Expected: 366
-- Actual: 366 Verification
This proves the data exists and the issue is specific to scalar COUNT(*):
-- Grouped aggregation works
SELECT d_month_seq, count(*) cnt
FROM date_dim
WHERE d_month_seq BETWEEN 1195 AND 1215
GROUP BY d_month_seq
ORDER BY d_month_seq;
-- Returns correct counts for each group
-- But scalar aggregation fails
SELECT count(*) FROM date_dim WHERE d_month_seq BETWEEN 1200 AND 1211;
-- Returns 0 Real-World Impact
TPC-DS SF100 Benchmark: 4 queries affected, all using scalar COUNT(*):
Q38: COUNT(*) from INTERSECT subquery
SELECT count(*) FROM (
SELECT DISTINCT c_last_name, c_first_name, d_date
FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
JOIN customer ON ss_customer_sk = c_customer_sk
WHERE d_month_seq BETWEEN 1200 AND 1211
INTERSECT ...
) subquery;
-- Expected: 25,818, Actual: 0Q87: COUNT(*) from EXCEPT subquery
SELECT count(*) FROM (
SELECT DISTINCT c_last_name, c_first_name, d_date
FROM store_sales JOIN date_dim ON ss_sold_date_sk = d_date_sk
JOIN customer ON ss_customer_sk = c_customer_sk
WHERE d_month_seq BETWEEN 1200 AND 1211
EXCEPT ...
) subquery;
-- Expected: 4,250,952, Actual: 0Q88: Multiple scalar COUNT(*) aggregations
SELECT
count(*) h8_30_to_9, -- Returns 0, expected 188,601
count(*) h9_to_9_30, -- Returns 0, expected 380,247
-- ... 6 more scalar COUNT(*) aggregations
FROM store_sales
JOIN time_dim ON ss_sold_time_sk = t_time_sk
WHERE t_hour = 8 AND t_minute >= 30;
-- All return 0 instead of correct valuesQ96: Simple scalar COUNT(*)
SELECT count(*)
FROM store_sales
JOIN time_dim ON ss_sold_time_sk = t_time_sk
JOIN store ON ss_store_sk = s_store_sk
WHERE t_hour = 20 AND t_minute < 30
AND s_store_name = 'ese';
-- Expected: 70,344, Actual: 0Pattern Analysis
TPC-DS Evidence (20 queries use COUNT(*)):
- 5 queries PASS (Q10, Q23, Q41, Q64, Q69) - All use
COUNT(*) ... GROUP BY - 4 queries return 0 (Q38, Q87, Q88, Q96) - All use scalar
COUNT(*) - 10 queries fail for unrelated reasons (cuDF limitations, result mismatches)
- 1 query not tested (disabled)
System information
Basic Information:
OS: Linux
Kernel: 6.8.0-54-generic
Architecture: aarch64
Testing Branch: https://github.com/rapidsai/velox-testing/tree/pwilson/tpcds-test-improvements
Relevant logs
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
bugSomething isn't workingSomething isn't working