
[Discussion] What’s the appropriate KVIKIO_NTHREADS setting for GDS? #850

@JigaoLuo


Hi KVIKIO team,

I’d like to share a performance debugging story from this afternoon that I think is somewhat unique, and hopefully useful to others.

Let me start by describing the setup:

  • Hardware: a DGX A100 system with 4 PCIe 4.0 SSDs, each capable of ~6.4 GB/s read bandwidth as measured by GDSIO.
    • Previously, these SSDs were configured as a single RAID-0 volume.
    • This week, we split the RAID and mounted each SSD as an independent filesystem.
  • Workload: multistream, multithreaded Parquet reading. I have implemented a multistream, multithreaded Parquet reader based on libcudf. In my design, each Parquet read call retrieves only a small chunk (e.g., a single ~32 MB row group), and multiple worker threads issue these read calls concurrently. This differs from traditional readers, which typically process larger chunks.

Because my reader operates at the row-group level, I typically launch around 60 separate cudf read_parquet calls per file.
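To make the dispatch pattern concrete, here is a minimal sketch (simplified, not my actual reader; `read_fn` is a placeholder for a wrapper that would call `cudf.read_parquet(path, row_groups=[rg])`):

```python
from concurrent.futures import ThreadPoolExecutor

def read_file_by_row_groups(path, num_row_groups, read_fn, workers=8):
    """Issue one small read per row group, concurrently.

    read_fn(path, rg) stands in for a wrapper around
    cudf.read_parquet(path, row_groups=[rg]); it is a placeholder,
    not the actual reader implementation.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(read_fn, path, rg)
                   for rg in range(num_row_groups)]
        # One table per row group; downstream code concatenates or
        # pipelines them into decode/decompress.
        return [f.result() for f in futures]
```

With ~60 row groups per file, each worker thread ends up issuing many small reads, which is what puts pressure on the KvikIO thread pool underneath.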

With RAID-0

Life was simple with RAID-0. I could easily saturate the SSD bandwidth:

  • GDSIO: Using 4 threads and large read chunks
  • In my parquet reader: with KVIKIO_NTHREADS=16 and 8 worker threads issuing cudf read_parquet calls

Performance was excellent: reading Parquet at ~23 GB/s into one A100. Decoding and decompression were pipelined, and the overhead from 16 KvikIO threads plus 8 cudf workers was minimal. Everything worked beautifully thanks to KvikIO and GDS.
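For reference, the thread-pool size here is controlled by the KVIKIO_NTHREADS environment variable. A sketch of how I set it (as I understand it, the variable needs to be set before KvikIO creates its pool, i.e. before the first KvikIO read in the process):

```python
import os

# KVIKIO_NTHREADS sizes the KvikIO thread pool. Set it before the pool
# is created (before the first KvikIO-backed read in this process).
os.environ["KVIKIO_NTHREADS"] = "16"
```

Setting it in the shell before launching the process works equally well; the Python form is just convenient when the reader is driven from a notebook or script.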

Without RAID

For various reasons, we decided to un-RAID the SSDs and mount them individually. I reran my previous setup:

  • With GDSIO, I could still saturate a single SSD (~6.4 GB/s) using 4 threads.
    • By issuing 4 concurrent GDSIO commands, each targeting a different SSD with 4 threads, I again reached ~23 GB/s. That was expected.
  • Using my Parquet reader with KVIKIO_NTHREADS=16 and 8 cudf worker threads, I could read from a single SSD at ~6.5 GB/s. Also expected.

So far, everything looked good.

Performance hunting

Then I tried to scale up and saturate all 4 SSDs simultaneously while reading Parquet files. Theoretically, this shouldn’t be hard. I launched 8 threads per SSD to read Parquet files, and scaled KVIKIO_NTHREADS linearly to 64. But with this setup, I only achieved ~15 GB/s, with huge fluctuations in bandwidth.

(I spent two hours checking the thread pool setup, PCIe traffic, iostat, the RMM memory resource, and nsys profiling, but I’ll skip those details here.)

Eventually, I found the root cause: KVIKIO_NTHREADS was simply too low. To saturate all 4 SSDs and PCIe, I needed to set KVIKIO_NTHREADS=128. With that, I saw stable bandwidth and full utilization.
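In case it helps others reproduce the hunt, a sketch of the sweep I ended up doing, written as a hypothetical helper (`bench_cmd` is a placeholder for whatever command launches the reader and reports bandwidth, not a real script of mine):

```python
import os

def envs_for_sweep(candidates=(16, 32, 64, 128)):
    """Build one child-process environment per candidate pool size.

    Each env would be passed to something like
    subprocess.run(bench_cmd, env=env), where bench_cmd is a
    placeholder for the actual reader benchmark.
    """
    return [dict(os.environ, KVIKIO_NTHREADS=str(n)) for n in candidates]
```

On my 4-SSD setup, the sweep flattened out at 128: below that, bandwidth fluctuated and never reached the ~25 GB/s aggregate the SSDs can deliver.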

Discussions

I’ll share more observations later, but I wanted to start a discussion around a few questions (you can also add anything you like):

  • What’s the rule of thumb for choosing a good KVIKIO_NTHREADS value?
  • What are standard workarounds when working with multiple SSDs?
  • How do you handle thread-pool contention in this context? (I will share an nsys profile.)
  • And a comment: I now realize that removing RAID comes with a significant CPU cost, something I hadn’t anticipated when making the change.

Looking forward to hearing your thoughts and experiences!
