Add option to select custom queries for tpch and tpcds benchmarks#247

Open
devavret wants to merge 4 commits into main from devavret/custom-queries
@devavret
Contributor

This PR adds an option to pass the path of a custom JSON file containing TPC-H and TPC-DS queries.
It also adds queries_duckdb.json, which contains queries converted to SQL from DuckDB plans.
Finally, it adds queries_best.json, a clone of queries.json in which the SQL for Q17 and Q21 is replaced with the queries generated from DuckDB plans. Q17 and Q21 showed a notable performance improvement with the DuckDB-converted SQL, while the other queries performed the same with either version.
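
As a rough illustration of how a file such as the "best" query set could be derived, the sketch below overlays selected queries from one set onto another. The helper name `overlay_queries` is hypothetical, and the assumption that each file is a flat JSON object mapping query IDs like "Q17" to SQL strings is inferred from the diff excerpt in this PR, not confirmed by it. The SQL strings in the usage lines are placeholders.

```python
def overlay_queries(base: dict, override: dict, query_ids: list) -> dict:
    """Return a copy of `base` where the listed query IDs come from `override`.

    Hypothetical helper: assumes both inputs map query IDs ("Q17") to SQL text.
    """
    missing = [q for q in query_ids if q not in override]
    if missing:
        raise KeyError(f"override set lacks queries: {missing}")
    merged = dict(base)  # shallow copy so the base set is left untouched
    for q in query_ids:
        merged[q] = override[q]
    return merged


# Placeholder SQL, standing in for the contents of the two query files.
standard = {"Q16": "SELECT 16", "Q17": "SELECT 17", "Q21": "SELECT 21"}
from_duckdb_plans = {"Q17": "SELECT 17 /* rewritten */", "Q21": "SELECT 21 /* rewritten */"}
best = overlay_queries(standard, from_duckdb_plans, ["Q17", "Q21"])
```

Only the IDs named in `query_ids` change, so the remaining queries stay identical to the standard set.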

also adds queries converted from duckdb plans
"Q20": "SELECT s_name, s_address FROM supplier, nation WHERE s_suppkey IN ( SELECT ps_suppkey FROM partsupp WHERE ps_partkey IN ( SELECT p_partkey FROM part WHERE p_name LIKE 'forest%') AND ps_availqty > ( SELECT 0.5 * sum(l_quantity) FROM lineitem WHERE l_partkey = ps_partkey AND l_suppkey = ps_suppkey AND l_shipdate >= CAST('1994-01-01' AS date) AND l_shipdate < CAST('1995-01-01' AS date))) AND s_nationkey = n_nationkey AND n_name = 'CANADA' ORDER BY s_name",
"Q21": "WITH multi_line_orders AS (SELECT l_orderkey FROM lineitem GROUP BY l_orderkey HAVING count(*) > 1), late_lines AS (SELECT l.l_orderkey, l.l_suppkey FROM multi_line_orders m JOIN lineitem l ON m.l_orderkey = l.l_orderkey WHERE l.l_receiptdate > l.l_commitdate), single_late_orders AS (SELECT l_orderkey FROM late_lines GROUP BY l_orderkey HAVING count(*) = 1) SELECT s_name, count(*) AS numwait FROM single_late_orders slo JOIN late_lines ll ON slo.l_orderkey = ll.l_orderkey JOIN supplier ON ll.l_suppkey = s_suppkey JOIN nation ON s_nationkey = n_nationkey AND n_name = 'SAUDI ARABIA' JOIN orders ON slo.l_orderkey = o_orderkey AND o_orderstatus = 'F' GROUP BY s_name ORDER BY numwait DESC, s_name LIMIT 100",
"Q22": "SELECT cntrycode, count(*) AS numcust, sum(c_acctbal) AS totacctbal FROM ( SELECT substring(c_phone FROM 1 FOR 2) AS cntrycode, c_acctbal FROM customer WHERE substring(c_phone FROM 1 FOR 2) IN ('13', '31', '23', '29', '30', '18', '17') AND c_acctbal > ( SELECT avg(c_acctbal) FROM customer WHERE c_acctbal > 0.00 AND substring(c_phone FROM 1 FOR 2) IN ('13', '31', '23', '29', '30', '18', '17')) AND NOT EXISTS ( SELECT * FROM orders WHERE o_custkey = c_custkey)) AS custsale GROUP BY cntrycode ORDER BY cntrycode"
} No newline at end of file

Contributor

A newline is needed at the end of the file.

Contributor

I believe adding multiple query files in the codebase could be a source of confusion, as it is no longer clear what the authoritative source of queries is. Also, can the benchmarks still be considered standard if we are running queries that are different from those in the TPC-H/TPC-DS specification? Ideally, query rewrites should be done by the query engine we are publishing benchmarks for and not as a preprocessing step.

Contributor

The original 22 queries are part of the official TPC-H specification.
Maybe consider renaming the query file (queries_rapidsmpf.json or something better).
We are testing new SQL queries for TPC-H that give us insight into better plans, so this SQL should be saved.

if queries_file:
    path = queries_file if os.path.isabs(queries_file) else get_abs_file_path(queries_file)
else:
    path = get_abs_file_path(f"./queries/{benchmark_type}/queries_best.json")
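
For readers without the repository at hand, here is a self-contained sketch of the same resolution logic followed by a load step. It is not the PR's implementation: `resolve_queries_path` and `load_queries` are hypothetical names, `base_dir` stands in for whatever directory `get_abs_file_path` resolves against (its behavior is not shown in this excerpt), and the default file name is left as a parameter.

```python
import json
import os
from typing import Optional


def resolve_queries_path(
    queries_file: Optional[str],
    benchmark_type: str,
    base_dir: str,
    default_name: str = "queries.json",  # assumption: the default is configurable
) -> str:
    """Pick the custom file if one is given, else the per-benchmark default."""
    if queries_file:
        if os.path.isabs(queries_file):
            return queries_file
        # Relative custom paths are resolved against base_dir (an assumption).
        return os.path.normpath(os.path.join(base_dir, queries_file))
    return os.path.normpath(
        os.path.join(base_dir, "queries", benchmark_type, default_name)
    )


def load_queries(path: str) -> dict:
    """Load a query file and check it is a flat mapping of ID -> SQL string."""
    with open(path, encoding="utf-8") as f:
        queries = json.load(f)
    if not isinstance(queries, dict) or not all(
        isinstance(v, str) for v in queries.values()
    ):
        raise ValueError(f"{path}: expected a JSON object of query ID -> SQL")
    return queries
```

Validating the shape at load time means a malformed custom file fails before any benchmark query runs.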
Contributor Author

@paul-aiyedun How about this: we use the official queries by default here but still package queries_best.json in velox testing. The primary aim of this PR is to allow us to bring our own queries.json.

Contributor

Yes, I think it makes sense to keep the original queries as is. We can create an experimental directory in the queries directory that contains non-standard query sets we want to try. I would also suggest putting each query set in its own directory with a README that describes how the queries were derived.
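
One way to picture the suggested layout is a small discovery helper that lists the experimental query sets and flags any that lack a README. This is only a sketch: the function name `find_experimental_sets` and the file names `queries.json` and `README.md` inside each set directory are assumptions, since the comment above does not fix them.

```python
from pathlib import Path


def find_experimental_sets(benchmark_dir) -> dict:
    """Map each experimental query set name to its query file and README status.

    Assumed layout (hypothetical):
        <benchmark_dir>/experimental/<set_name>/queries.json
        <benchmark_dir>/experimental/<set_name>/README.md
    """
    experimental = Path(benchmark_dir) / "experimental"
    sets = {}
    if not experimental.is_dir():
        return sets
    for entry in sorted(experimental.iterdir()):
        queries = entry / "queries.json"
        if entry.is_dir() and queries.is_file():
            sets[entry.name] = {
                "queries": queries,
                "has_readme": (entry / "README.md").is_file(),
            }
    return sets
```

A check like `has_readme` could back a CI rule that every non-standard query set documents how its queries were derived.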

Contributor Author

@devavret Feb 27, 2026

1c91baa
Edit: I realize JSON files cannot have comments.
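
The point about comments can be checked with Python's standard `json` module, which rejects `//` comments. The helper name `is_strict_json` below is made up for illustration; notes on how a query set was derived would have to live elsewhere, for example in a README next to the file.

```python
import json


def is_strict_json(text: str) -> bool:
    """Return True if `text` parses with the standard-library JSON parser."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


plain = '{"Q1": "SELECT 1"}'
commented = '{\n  // derived from a DuckDB plan\n  "Q1": "SELECT 1"\n}'
```

Here `is_strict_json(plain)` succeeds while `is_strict_json(commented)` does not, because the comment line is not valid JSON.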

@karthikeyann
Contributor

@paul-aiyedun please resolve the conflict and merge this PR

4 participants