Hi KVIKIO team,
I’d like to share a performance debugging story from this afternoon that I think is somewhat unique, and hopefully useful to others.
Let me start by describing the setup:
- Hardware: a DGX A100 system with 4 PCIe 4.0 SSDs, each capable of ~6.4 GB/s bandwidth as measured by GDSIO.
- Previously, these SSDs were configured as a single RAID-0 volume.
- This week, we split the RAID and mounted each SSD as an independent filesystem.
- My workload is multistream, multithreaded Parquet reading: I have implemented a multistream, multithreaded Parquet reader on top of libcudf. In my design, each Parquet read call retrieves only a small chunk (e.g., a ~32 MB row group), and multiple worker threads issue these read calls concurrently. This differs from traditional readers, which typically process larger chunks.
In other words, my reader operates at the row-group level, which means I typically launch around 60 separate cudf read_parquet calls per file.
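The fan-out pattern looks roughly like this. This is a minimal, self-contained sketch: `read_row_group` is a hypothetical stand-in for the real `cudf.read_parquet(path, row_groups=[...])` call, and the path is illustrative.

```python
# Minimal sketch of the row-group-level fan-out described above.
from concurrent.futures import ThreadPoolExecutor

NUM_WORKERS = 8           # cudf worker threads issuing read calls
ROW_GROUPS_PER_FILE = 60  # ~60 row groups of ~32 MB each per file

def read_row_group(path: str, row_group: int) -> int:
    # Real code would be roughly:
    #   return cudf.read_parquet(path, row_groups=[row_group])
    # Here we just simulate the read by returning the chunk size in bytes.
    return 32 * 1024 * 1024

def read_file(path: str) -> int:
    # Each file turns into ~60 small, concurrent read calls
    # instead of one large sequential read.
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        futures = [
            pool.submit(read_row_group, path, rg)
            for rg in range(ROW_GROUPS_PER_FILE)
        ]
        return sum(f.result() for f in futures)

total = read_file("/mnt/ssd0/data.parquet")  # hypothetical path
print(total // (1024 * 1024), "MB read")
```

Each of these small read calls ends up on KvikIO's internal thread pool, which is why `KVIKIO_NTHREADS` matters so much later in this story.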
With RAID-0
Life was simple with RAID-0. I could easily saturate the SSD bandwidth:
- GDSIO: Using 4 threads and large read chunks
- In my parquet reader: with KVIKIO_NTHREADS=16 and 8 worker threads issuing cudf read_parquet calls
Performance was excellent: reading Parquet at ~23 GB/s into one A100. Decoding and decompression were pipelined, and the overhead from 16 KvikIO threads plus 8 cudf workers was minimal. Everything worked beautifully thanks to KvikIO and GDS.
Without RAID
For various reasons, we decided to un-RAID the SSDs and mount them individually. I reran my previous setup:
- With GDSIO, I could still saturate a single SSD (~6.4 GB/s) using 4 threads.
- By issuing 4 concurrent GDSIO commands, each targeting a different SSD and using 4 threads, I again reached ~23 GB/s. That was expected.
- Using my Parquet reader with KVIKIO_NTHREADS=16 and 8 cudf worker threads, I could read from a single SSD at ~6.5 GB/s. Also expected.
So far, everything looked good.
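For reference, the four concurrent GDSIO runs were along these lines. This is a sketch: the mount points, file size, I/O size, and duration are illustrative, so check `gdsio -h` for the flags in your GDS version.

```shell
# Sketch: one gdsio read job per SSD mount point, run concurrently.
# Flags (per the GPUDirect Storage docs): -D target directory, -d GPU index,
# -w worker threads, -s file size, -i I/O size, -x 0 = GPUDirect transfer,
# -I 0 = read test, -T duration in seconds.
for n in 0 1 2 3; do
    gdsio -D /mnt/nvme${n} -d 0 -w 4 -s 10G -i 1M -x 0 -I 0 -T 60 &
done
wait  # aggregate bandwidth should approach ~4 x 6.4 GB/s
```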
Performance hunting
Then I tried to scale up and saturate all 4 SSDs simultaneously while reading Parquet files. In theory, this shouldn't be hard. I launched 8 threads per SSD to read Parquet files, and scaled KVIKIO_NTHREADS linearly to 64. But with this setup, I only achieved ~15 GB/s, with huge fluctuations in bandwidth.
(I spent two hours checking thread pool setup, PCIe traffic, iostat, RMM MR, and nsys profiling—but I’ll skip those details here.)
Eventually, I found the root cause: KVIKIO_NTHREADS was simply too low. To saturate all 4 SSDs and PCIe, I needed to set KVIKIO_NTHREADS=128. With that, I saw stable bandwidth and full utilization.
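A back-of-envelope calculation (my own numbers from the runs above, not a documented KvikIO sizing rule) suggests why 64 threads was borderline: if 16 KvikIO threads sustain ~6.5 GB/s on one SSD, each thread contributes roughly 0.4 GB/s, so four SSDs at ~25.6 GB/s already need ~63 threads with zero contention, and any scheduling gap pushes the requirement higher.

```python
# Back-of-envelope thread-count estimate (numbers from the runs above;
# a heuristic, not a documented KvikIO sizing rule).
per_ssd_bw = 6.5          # GB/s, one SSD saturated with 16 KvikIO threads
threads_per_ssd = 16
per_thread_bw = per_ssd_bw / threads_per_ssd   # ~0.4 GB/s per thread

target_bw = 4 * 6.4       # GB/s, four SSDs combined
min_threads = target_bw / per_thread_bw        # ideal, zero-contention case

print(f"per-thread throughput: {per_thread_bw:.3f} GB/s")
print(f"ideal minimum threads: {min_threads:.0f}")
# With scheduling gaps and pool contention, KVIKIO_NTHREADS=128
# (about 2x the ideal minimum) was what kept all four SSDs busy.
```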
Discussions
I’ll share more observations later, but I wanted to start a discussion around a few questions (you can also add anything you like):
- What's the rule of thumb for choosing a good KVIKIO_NTHREADS value?
- What are the standard workarounds when working with multiple SSDs?
- How do you handle thread pool contention in this context? (I will share an nsys profile.)
- And a comment: I now realize that removing RAID comes with a significant CPU cost, something I hadn't anticipated when making the change.
Looking forward to hearing your thoughts and experiences!