"Metimur" (Latin for "We Measure") is a Databricks SQL Benchmark Accelerator that streamlines data generation and query benchmarking in Databricks SQL warehouses.
- Use Case 1: Benchmark existing data
- Use Case 2: Generate Synthetic Data based on table schemas
- Use Case 3: Generate TPC Data
- Benchmark Output: Metrics Dashboard
- Limitations
- Streamline data generation and query benchmarking
- Evaluate query performance across different Databricks SQL warehouse types
- Generate synthetic data based on table schemas
- Built-in support for TPC-DS and TPC-H benchmarks
- Automated AI/BI Metrics Dashboard for performance analysis
- Databricks workspace with Serverless and Unity Catalog enabled
- Python 3.9+
- Databricks Runtime 14.3+
Add the repo to your Databricks Workspace using Databricks Repos
Prerequisite: You must have existing data available in your Databricks workspace to use this feature.
Compare query performance across different Databricks SQL warehouse types and sizes with varying concurrency levels. This use case is ideal when you want to benchmark your actual production or test data.
- Single Warehouse Benchmark: View average query duration for one warehouse.
- Multiple Warehouses Benchmark: Automatically creates serverless, pro, and classic warehouses (same size, names prefixed by `warehouse_prefix`).
- Multiple Warehouses, Varying Sizes: Select multiple sizes for one warehouse type (set by `warehouse_type`), with names prefixed by `warehouse_prefix`.
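As a sketch of how these modes expand into concrete warehouse configurations, the snippet below enumerates the warehouses a multi-warehouse benchmark would compare. The `prefix_type_size` name format is an assumption for illustration, not Metimur's actual naming convention:

```python
# Hypothetical sketch: list the warehouse configurations a benchmark run
# would compare. The name format (prefix_type_size) is an assumption.
def benchmark_warehouses(prefix, types, sizes):
    return [
        {"name": f"{prefix}_{wh_type}_{size}", "type": wh_type, "size": size}
        for wh_type in types
        for size in sizes
    ]

# Multiple Warehouses Benchmark: three warehouse types, one size.
configs = benchmark_warehouses("bench", ["serverless", "pro", "classic"], ["Small"])
for cfg in configs:
    print(cfg["name"])
```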
1. Verify data availability:
   - Ensure your data is already loaded into your Databricks workspace
   - Data should be accessible through Unity Catalog
   - If you don't have data, consider using Use Case 2 (Generate Synthetic Data) or Use Case 3 (TPC Benchmark Data)
2. Open the quickstarts notebook in your Databricks workspace
3. Connect to a Single User Cluster (DBR 14.3 LTS or above)
4. Upload your queries to the `queries` directory
5. Follow the notebook instructions
- Basic Query Format
  - Each query must be identified using `--` markers (for example, `--Q01--`)
  - Each query must end with a semicolon (`;`)
  - Example: `--Q01-- SELECT * FROM lineitem;`
- Multiple Queries in One File
  - You can include multiple queries in a single file
  - Each query must follow the format above
  - Example: `--Q01-- SELECT * FROM lineitem; --Q02-- SELECT * FROM orders;`
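Given that format, a small parser can split one file into individual queries. Below is a minimal sketch keyed on the `--Qxx--` markers; it is illustrative only, not Metimur's actual parser:

```python
import re

def split_queries(file_text):
    """Split a query file into {query_id: sql} using --Qxx-- markers.

    Illustrative sketch of the documented file format only.
    """
    queries = {}
    # Capture the id between the double-dash markers, then the SQL that
    # follows, up to the next marker or the end of the file.
    for match in re.finditer(r"--(Q\d+)--\s*(.*?)(?=--Q\d+--|$)", file_text, re.S):
        queries[match.group(1)] = match.group(2).strip()
    return queries

text = "--Q01-- SELECT * FROM lineitem; --Q02-- SELECT * FROM orders;"
print(split_queries(text))
# {'Q01': 'SELECT * FROM lineitem;', 'Q02': 'SELECT * FROM orders;'}
```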
- Parameterized Queries
  - Use `:param_name` syntax for parameters in your queries
  - Create a `params.json` file in the same directory as your queries
  - Format your `params.json` as follows:

    ```json
    {
      "Q01": [
        { "l_shipdate": "1998-12-01" },
        { "l_shipdate": "1998-11-01" }
      ],
      "Q02": [
        { "p_size": 15, "p_type": "%BRASS", "r_name": "EUROPE" }
      ]
    }
    ```
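To illustrate how a parameter set from `params.json` pairs with the `:param_name` placeholders, here is a minimal substitution sketch; `bind_params` is a hypothetical helper written for this example, not part of Metimur:

```python
import json
import re

def bind_params(sql, params):
    """Hypothetical helper: replace :name placeholders with SQL literals."""
    def repl(match):
        value = params[match.group(1)]
        # Quote string values; render numbers as-is.
        return f"'{value}'" if isinstance(value, str) else str(value)
    return re.sub(r":(\w+)", repl, sql)

# One benchmark run is executed per parameter set listed under a query id.
params = json.loads('{"Q01": [{"l_shipdate": "1998-12-01"}]}')
bound = bind_params("SELECT * FROM lineitem WHERE l_shipdate <= :l_shipdate;",
                    params["Q01"][0])
print(bound)  # SELECT * FROM lineitem WHERE l_shipdate <= '1998-12-01';
```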
- Example Files
  - See the `queries/tpch` or `queries/tpcds` folders for complete examples
  - Each folder contains:
    - SQL query files
    - `params.json` (if using parameterized queries)
Note: The query duration is fetched from the Query History API and should be consistent with the query duration shown in the Databricks monitoring UI.
Generate synthetic data based on table schemas and benchmark query performance. Note: This feature is currently under development and may not work as expected.
1. Open the advanced notebook in your Databricks workspace
2. Connect the notebook to a Single User Cluster (DBR 14.3 LTS or above)
3. Run each cell in the "Set Up" section
4. Choose `BYOD` in the Benchmarks drop-down widget
5. Upload your user-defined schema file for each table to the schemas folder, following the example in `schemas/tpch`

   Table Schema Example:

   ```json
   {
     "table_name": "customer",
     "rows": 750000,
     "fields": [
       {
         "colName": "c_custkey",
         "colType": "bigint",
         "uniqueValues": null,
         "values": [],
         "minValue": null,
         "maxValue": null
       },
       {
         "colName": "c_mktsegment",
         "colType": "string",
         "uniqueValues": 5,
         "values": ["MACHINERY", "AUTOMOBILE", "BUILDING", "HOUSEHOLD", "FURNITURE"],
         "minValue": null,
         "maxValue": null
       }
     ]
   }
   ```

6. Upload the queries to a separate folder under the `queries` directory and provide the path in the Query Path widget, following the instructions in Use Case 1: Benchmark existing data
- An automated Workflow job is created with two tasks: Generate_Data and Run_Benchmarking
- In the Generate_Data task, data is generated in the `serverless_benchmark` catalog under a user-specified schema name
- In the Run_Benchmarking task, benchmark queries are executed on the generated data
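As an illustration of how the schema format above can drive row generation, here is a minimal sketch; Metimur's actual generator, including its handling of `uniqueValues`, `minValue`, and `maxValue`, may differ:

```python
import random

def generate_rows(schema, seed=0):
    """Synthesize rows from a Metimur-style table schema (illustrative only)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(schema["rows"]):
        row = {}
        for field in schema["fields"]:
            if field.get("values"):
                # Categorical column: draw from the listed values.
                row[field["colName"]] = rng.choice(field["values"])
            elif field["colType"] == "bigint":
                # Unconstrained key column: draw a random integer.
                row[field["colName"]] = rng.randrange(10**9)
            else:
                row[field["colName"]] = None
        rows.append(row)
    return rows

schema = {
    "table_name": "customer",
    "rows": 3,
    "fields": [
        {"colName": "c_custkey", "colType": "bigint", "values": []},
        {"colName": "c_mktsegment", "colType": "string",
         "values": ["MACHINERY", "AUTOMOBILE", "BUILDING"]},
    ],
}
rows = generate_rows(schema)
```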
Use this option to test Databricks SQL warehouse performance at different scale factors of TPC industry benchmark data.
- Open the advanced notebook in your Databricks workspace
- Connect the notebook to a Single User Cluster (DBR 14.3 LTS or above)
- Run each cell in the "Set Up" section
- Choose `TPCH` or `TPCDS` in the Benchmarks drop-down widget and select the scale factors
- Set the Query Path widget to `queries/tpch` or `queries/tpcds`
- An automated Workflow job is created with two tasks: Generate_Data and Run_Benchmarking
- In the Generate_Data task, TPC data is generated in the `serverless_benchmark` catalog under the `tpch_{scale_factor}` schema
- In the Run_Benchmarking task, benchmark queries are executed on the generated data, similar to Use Case 2
Note: To avoid regenerating the industry benchmark data, all users in the workspace can query the tables and run benchmark queries on them once the data is generated. If the schemas and tables already exist, the Generate_Data task is skipped.
All benchmark results are stored in a Delta table (`serverless_benchmark.default._metimur_metrics_{user_name}`) and visualized in an automated Databricks AI/BI Metrics Dashboard.
- Filter benchmark runs by benchmark timestamp, warehouse configuration (warehouse size, warehouse type), or query ID
- Analyze query duration based on query history table
- Analyze query cost (DBUs, $DBUs) using UC system tables
- Metimur exclusively supports Unity Catalog managed tables. A default catalog named `serverless_benchmark` is automatically provisioned to store all TPC and synthetic data.
- Workspace users receive the following Unity Catalog permissions:
  - `USE CATALOG`, `CREATE SCHEMA`, `USE SCHEMA`, and `CREATE TABLE` on the `serverless_benchmark` catalog
  - `SELECT` permission on all TPC benchmark tables

  For more details, see the Unity Catalog Privileges documentation.
- TPC benchmark data is generated only if it doesn't already exist in the catalog.
- For synthetic data generation, table permissions are restricted to the creating user only.
- Query Compatibility:
  - Validate queries on Databricks before benchmarking
  - No automatic query conversion from other EDW systems is provided
  - Ensure queries follow Databricks SQL syntax
- Performance Optimization:
  - Generated Delta tables require manual optimization
  - Recommended techniques:
    - Liquid Clustering (recommended)
    - Z-Ordering and Partitioning
- The default Unity Catalog `serverless_benchmark` is created automatically
- All workspace users get basic permissions on the catalog
- Synthetic data tables are accessible only to the creator
- TPC data tables are accessible to all workspace users
Contributions are welcome! Please feel free to submit a Pull Request.
This project is licensed under the MIT License - see the LICENSE file for details.


