TPC-C benchmark CPU and memory optimizations to lower hardware requirements and better logic for initial data upload #145

@eivanov89

Description

Hi,

According to your documentation, "for 10k warehouses, you would need ten clients of type c5.4xlarge to drive the benchmark. For multiple clients, you need to perform three steps". In total, that is 160 CPU cores and 320 GiB of RAM, i.e.

  • 1 CPU core per 62.5 warehouses
  • 32.768 MiB RAM per warehouse

At YDB we followed the same path and also forked and adapted TPC-C from Benchbase, so we ended up with similarly high hardware requirements for the TPC-C clients. Fortunately, we found very simple yet effective optimizations (and since we share the same codebase, you can easily employ them too). In this post we discuss our TPC-C implementation and later describe some pitfalls, which again can be easily fixed. Here is a summary of the changes:

  1. Switch to Java 21 and use virtual threads instead of platform threads. You still spawn one thread per terminal, so for 10K warehouses you run 100K threads (TPC-C uses 10 terminals per warehouse), but virtual threads make that cheap. Here is the commit. See the first sketch after this list.
  2. Don't aggregate full information about each transaction (the LatencyRecord class). Just count OKs/fails and use a histogram for execution times. Here is the commit. See the second sketch after this list.
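
To illustrate item 1, here is a minimal sketch of running one virtual thread per terminal on Java 21. The class and method names (TerminalRunner, runTerminal) are illustrative, not the actual Benchbase classes, and the real commit may structure this differently:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: one virtual thread per TPC-C terminal instead of a platform thread.
public class TerminalRunner {
    public static void main(String[] args) {
        int warehouses = 10_000;
        int terminalsPerWarehouse = 10;              // TPC-C: 10 terminals per warehouse
        int terminals = warehouses * terminalsPerWarehouse;

        // newVirtualThreadPerTaskExecutor() (Java 21) starts a cheap virtual thread
        // per submitted task, so 100K terminals no longer require 100K OS threads
        // with large stacks.
        try (ExecutorService pool = Executors.newVirtualThreadPerTaskExecutor()) {
            for (int i = 0; i < terminals; i++) {
                final int terminalId = i;
                pool.submit(() -> runTerminal(terminalId));
            }
        } // close() waits for all submitted tasks to finish
    }

    private static void runTerminal(int terminalId) {
        // placeholder for the terminal's transaction loop
    }
}
```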
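And a sketch of item 2: keep only success/failure counters plus a fixed-size latency histogram instead of a LatencyRecord per transaction, so memory stays constant no matter how many transactions run. HdrHistogram is used here purely for illustration; the actual commit may use a different histogram implementation:

```java
import org.HdrHistogram.Histogram;
import java.util.concurrent.atomic.LongAdder;

// Sketch: constant-memory transaction statistics.
class TxnStats {
    private final LongAdder ok = new LongAdder();
    private final LongAdder failed = new LongAdder();
    // track latencies from 1 us up to 60 s with 3 significant digits
    private final Histogram latencyUs = new Histogram(60_000_000L, 3);

    void recordSuccess(long latencyMicros) {
        ok.increment();
        synchronized (latencyUs) {            // plain Histogram is not thread-safe
            latencyUs.recordValue(latencyMicros);
        }
    }

    void recordFailure() {
        failed.increment();
    }

    void report() {
        synchronized (latencyUs) {
            System.out.printf("ok=%d failed=%d p50=%dus p99=%dus%n",
                    ok.sum(), failed.sum(),
                    latencyUs.getValueAtPercentile(50.0),
                    latencyUs.getValueAtPercentile(99.0));
        }
    }
}
```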

After these changes, you will have the following requirements:

  • 1 CPU core per 1000 warehouses
  • 6 MiB RAM per warehouse

Now, to run 10K warehouses you need 10 cores and ~60 GiB of RAM, i.e. 2 c5.4xlarge instances instead of 10 (a significant cost reduction), or even a single memory-optimized instance.

Another issue that affects the price of measurements is loading time. You specify ~5.5 hours for 10K warehouses on a cluster of 30 c5d.4xlarge nodes. If you kept the original code there, you probably use YSQL to upload the data; if you try YCQL instead, you can probably cut that time in half. Initially we needed 2.7 hours to load 15K warehouses, but after changing the TPC-C code we do it in 1.6 hours: we simply switched the loader from the default inserts to bulk upserts, which are blind writes. A loader sketch follows below.
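
For reference, here is a rough sketch of the loader-side idea for a JDBC/YSQL setup: batch many rows per round trip instead of issuing one single-row INSERT at a time. On YDB we used BULK UPSERT, which is a blind write; the connection URL, batch size, and column values below are placeholders, and whether/how to make the writes blind on YugabyteDB is left to you:

```java
import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

// Sketch: batched loading of the TPC-C ITEM table via JDBC.
public class ItemLoader {
    public static void main(String[] args) throws Exception {
        final int BATCH = 1000;
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://host:5433/yugabyte", "yugabyte", "");
             PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO item (i_id, i_im_id, i_name, i_price, i_data) VALUES (?, ?, ?, ?, ?)")) {
            conn.setAutoCommit(false);
            for (int i = 1; i <= 100_000; i++) {
                ps.setInt(1, i);
                ps.setInt(2, i % 10_000);
                ps.setString(3, "item-" + i);
                ps.setBigDecimal(4, BigDecimal.valueOf(1 + (i % 100)));
                ps.setString(5, "original");
                ps.addBatch();
                if (i % BATCH == 0) {
                    ps.executeBatch();   // one round trip per 1000 rows
                    conn.commit();
                }
            }
            ps.executeBatch();           // flush the remaining rows
            conn.commit();
        }
    }
}
```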

We're very interested in trying YugabyteDB with a high number of warehouses (e.g. 40K and 80K), and these optimizations would help a lot to cut the spending.
