Hi KVIKIO team,
I’d like to share a performance debugging story from this afternoon that I think is somewhat unique, and hopefully useful to others.
Let me start by describing the setup:
- Hardware: a DGX A100 system with 4 PCIe 4.0 SSDs, each capable of ~6.4 GB/s bandwidth as measured by GDSIO.
- Previously, these SSDs were configured as a single RAID-0 volume.
- This week, we split the RAID and mounted each SSD as an independent filesystem.
- My workload is multistream, multithreaded Parquet reading: I have implemented a multistream, multithreaded Parquet reader on top of libcudf. In my design, each Parquet read call retrieves only a small chunk (e.g., a ~32 MB row group), and multiple worker threads issue these read calls concurrently. This differs from traditional readers, which typically process larger chunks.
In other words, my reader operates at the row-group level, which means I typically launch around 60 separate cudf read_parquet calls per file.
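The fan-out pattern looks roughly like this. This is a minimal, self-contained sketch: `read_row_group` is a hypothetical stand-in for the real `cudf.read_parquet(path, row_groups=[...])` call, and the path is illustrative.

```python
# Minimal sketch of the row-group-level fan-out described above.
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8           # cudf worker threads issuing read calls
ROW_GROUPS_PER_FILE = 60  # ~60 row groups of ~32 MB each per file

def read_row_group(path: str, row_group: int) -> int:
    # Real code would be roughly:
    #   return cudf.read_parquet(path, row_groups=[row_group])
    # Here we just simulate the read by returning the chunk size in bytes.
    return 32 * 1024 * 1024

def read_file(path: str) -> int:
    # Each file turns into ~60 small, concurrent read calls
    # instead of one large sequential read.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = [
            pool.submit(read_row_group, path, rg)
            for rg in range(ROW_GROUPS_PER_FILE)
        ]
        return sum(f.result() for f in futures)

total = read_file("/mnt/ssd0/data.parquet")  # hypothetical path
print(total // (1024 * 1024), "MB read")
```

Each of these small read calls ends up on KvikIO's internal thread pool, which is why `KVIKIO_NTHREADS` matters so much later in this story.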
With RAID-0
Life was simple with RAID-0. I could easily saturate the SSD bandwidth:
- GDSIO: Using 4 threads and large read chunks
- In my parquet reader: with KVIKIO_NTHREADS=16 and 8 worker threads issuing cudf read_parquet calls
Performance was excellent: reading Parquet at ~23 GB/s into one A100. Decoding and decompression were pipelined, and the overhead from 16 KvikIO threads plus 8 cudf workers was minimal. Everything worked beautifully thanks to KvikIO and GDS.
Without RAID
For various reasons, we decided to un-RAID the SSDs and mount them individually. I reran my previous setup:
- With GDSIO, I could still saturate a single SSD (~6.4 GB/s) using 4 threads.
- By issuing 4 concurrent GDSIO commands, each targeting a different SSD and using 4 threads, I again reached ~23 GB/s. That was expected.
- Using my Parquet reader with KVIKIO_NTHREADS=16 and 8 cudf worker threads, I could read from a single SSD at ~6.5 GB/s. Also expected.
So far, everything looked good.
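For reference, the four concurrent GDSIO runs were along these lines. This is a sketch: the mount points, file size, I/O size, and duration are illustrative, so check `gdsio -h` for the flags in your GDS version.

```shell
# Sketch: one gdsio read job per SSD mount point, run concurrently.
# Flags (per the GPUDirect Storage docs): -D target directory, -d GPU index,
# -w worker threads, -s file size, -i I/O size, -x 0 = GPUDirect transfer,
# -I 0 = read test, -T duration in seconds.
for n in 0 1 2 3; do
    gdsio -D /mnt/nvme${n} -d 0 -w 4 -s 10G -i 1M -x 0 -I 0 -T 60 &
done
wait  # aggregate bandwidth should approach ~4 x 6.4 GB/s
```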
Performance hunting
Then I tried to scale up and saturate all 4 SSDs simultaneously while reading Parquet files. In theory, this shouldn't be hard. I launched 8 threads per SSD to read Parquet files, and scaled KVIKIO_NTHREADS linearly to 64. But with this setup, I only achieved ~15 GB/s, with huge fluctuations in bandwidth.
(I spent two hours checking thread pool setup, PCIe traffic, iostat, RMM MR, and nsys profiling—but I’ll skip those details here.)
Eventually, I found the root cause: KVIKIO_NTHREADS was simply too low. To saturate all 4 SSDs and PCIe, I needed to set KVIKIO_NTHREADS=128. With that, I saw stable bandwidth and full utilization.
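A back-of-envelope calculation (my own numbers from the runs above, not a documented KvikIO sizing rule) suggests why 64 threads was borderline: if 16 KvikIO threads sustain ~6.5 GB/s on one SSD, each thread contributes roughly 0.4 GB/s, so four SSDs at ~25.6 GB/s already need ~63 threads with zero contention, and any scheduling gap pushes the requirement higher.

```python
# Back-of-envelope thread-count estimate (numbers from the runs above;
# a heuristic, not a documented KvikIO sizing rule).
per_ssd_bw = 6.5          # GB/s, one SSD saturated with 16 KvikIO threads
threads_per_ssd = 16
per_thread_bw = per_ssd_bw / threads_per_ssd   # ~0.4 GB/s per thread

target_bw = 4 * 6.4       # GB/s, four SSDs combined
min_threads = target_bw / per_thread_bw        # ideal, zero-contention case

print(f"per-thread throughput: {per_thread_bw:.3f} GB/s")
print(f"ideal minimum threads: {min_threads:.0f}")
# With scheduling gaps and pool contention, KVIKIO_NTHREADS=128
# (about 2x the ideal minimum) was what kept all four SSDs busy.
```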
Discussions
I’ll share more observations later, but I wanted to start a discussion around a few questions (you can also add anything you like):
- What's the rule of thumb for choosing a good KVIKIO_NTHREADS value?
- What are the standard workarounds when working with multiple SSDs?
- How do you handle thread pool contention in this context? (I will share an nsys profile.)
- And a comment: I now realize that removing RAID comes with a significant CPU cost, something I hadn't anticipated when making the change.
Looking forward to hearing your thoughts and experiences!