Optimize opportunistic Direct I/O for device reads #939

Open
kingcrimsontianyu wants to merge 27 commits into rapidsai:main from kingcrimsontianyu:direct-io-improve

@kingcrimsontianyu (Contributor) commented Mar 2, 2026

This PR improves the existing opportunistic Direct I/O path for POSIX device reads in two ways:

  • Page-align the first task in parallel_io: When file_offset is not page-aligned, the first task is shortened so that all subsequent tasks start at a page-aligned boundary. This eliminates the per-task alignment overhead (BIO prefix/suffix) for the majority of tasks in a parallel read.

  • Optional pure Direct I/O with over-read (KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD, default: off): A new posix_device_read_aligned function aligns the offset down and the size up to page boundaries, ensuring that all disk I/O goes through Direct I/O.
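
As an illustration of the first bullet, here is a minimal sketch of one way the shortened first task could be sized. The helper name `first_task_size` mirrors the new parallel_io parameter, but the function itself and its signature are assumptions made for this example, not the code in this PR. Later tasks stay page-aligned only when `task_size` is itself a multiple of `page_size`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not KvikIO's implementation): size of the first task
// so that the second task starts on a page boundary.
std::size_t first_task_size(std::size_t file_offset,
                            std::size_t size,
                            std::size_t task_size,
                            std::size_t page_size)
{
  assert(page_size > 0 && task_size > 0);
  // Number of bytes by which file_offset overshoots the previous page boundary.
  std::size_t const misalignment = file_offset % page_size;
  // Aligned offset: keep the full task size. Otherwise read only up to the
  // next page boundary.
  std::size_t const first = (misalignment == 0) ? task_size : page_size - misalignment;
  // Never request more than the total read size.
  return std::min(first, size);
}
```

For example, with a 4096-byte page, `file_offset = 100` gives a first task of 3996 bytes, so the second task starts at offset 4096.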

The performance results are in the comments below.

@kingcrimsontianyu added the improvement, non-breaking, and c++ labels Mar 2, 2026
@copy-pr-bot (bot) commented Mar 2, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@kingcrimsontianyu changed the title from "Improve Direct I/O read" to "Optimize opportunistic Direct I/O for device reads" Mar 3, 2026
@kingcrimsontianyu (Contributor, Author) commented Mar 5, 2026

Visualization

Below is a comparison of nsys profiles from the PDS-H SF-1K benchmark.

  • Buffered I/O (default)
[image: nsys profile]
  • Opportunistic direct I/O (KVIKIO_AUTO_DIRECT_IO_READ=ON): The prefix and suffix are unaligned and read with buffered I/O (BIO), whereas the middle segment is aligned and read with direct I/O (DIO). Note that there are now 3 H2D copies. A future PR could optimize this with a batched copy.
[image: nsys profile]
  • Opportunistic direct I/O with over-read (KVIKIO_AUTO_DIRECT_IO_READ=ON, KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD=ON): The prefix and suffix are aligned down and up, respectively, so the whole range is read with DIO. Only the useful bytes are then copied H2D; the over-read segments are ignored.
[image: nsys profile]
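
The prefix/middle/suffix split described above can be sketched as follows. This is only an illustration under stated assumptions: `Segment`, `Split`, and `split_for_direct_io` are hypothetical names, not KvikIO symbols, and the sketch covers only the offset arithmetic (Direct I/O generally also requires an aligned buffer).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical types for illustration only.
struct Segment {
  std::size_t offset;
  std::size_t size;
};

struct Split {
  Segment prefix;  // unaligned head, read with buffered I/O
  Segment middle;  // page-aligned body, read with direct I/O
  Segment suffix;  // unaligned tail, read with buffered I/O
};

Split split_for_direct_io(std::size_t offset, std::size_t size, std::size_t page_size)
{
  assert(page_size > 0);
  std::size_t const end = offset + size;
  // First page boundary at or after offset, capped at the end of the read.
  std::size_t const middle_begin =
    std::min((offset + page_size - 1) / page_size * page_size, end);
  // Last page boundary at or before end, but never before middle_begin.
  std::size_t const middle_end = std::max(end / page_size * page_size, middle_begin);
  return Split{{offset, middle_begin - offset},
               {middle_begin, middle_end - middle_begin},
               {middle_end, end - middle_end}};
}
```

With a 4096-byte page, a 10000-byte read at offset 100 splits into a 3996-byte prefix, a 4096-byte middle starting at offset 4096, and a 1908-byte suffix starting at offset 8192. A read that fits inside one page has an empty middle and is served entirely by the prefix.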

Performance result

The PDS-H SF-1K benchmark was run on server umb-b200-220, which has 2 CPU NUMA nodes and 8 B200 GPUs. Each query is run 4 times. The first iteration is discarded, and the reported time is the average of the remaining 3 iterations.

  • main, off: Baseline. The cache is dropped before each query and each iteration.
  • main, oppo: Opportunistic direct I/O enabled on the main branch. KvikIO splits an I/O segment into 3 components: prefix, middle segment, and suffix. The prefix and suffix are unaligned and read with buffered I/O, whereas the middle segment is page-aligned and read with direct I/O.
  • PR, off: A sanity check. Expected to match "main, off".
  • PR, oppo: The opportunistic read is optimized so that the first read task in KvikIO ends at a page boundary, allowing all subsequent tasks to be aligned.
  • PR, oppo+OR: On top of the previous optimization, adds the "over-read" feature, which extends the prefix and suffix of a segment to the page boundaries so that all read operations use direct I/O.
+-----+-----------+------------------+------------------+------------------+------------------+
|   Q |  main,off |        main,oppo |           PR,off |          PR,oppo |       PR,oppo+OR |
+-----+-----------+------------------+------------------+------------------+------------------+
|   1 |    8.9723 |   4.8934 (1.83x) |   9.0808 (0.99x) |   4.5584 (1.97x) |   4.4108 (2.03x) |
|   2 |    1.2006 |   0.7308 (1.64x) |   1.2182 (0.99x) |   0.6996 (1.72x) |   0.6144 (1.95x) |
|   3 |    9.9261 |   5.7821 (1.72x) |   9.9175 (1.00x) |   5.6295 (1.76x) |   4.7235 (2.10x) |
|   4 |    4.9632 |   3.3372 (1.49x) |   4.9935 (0.99x) |   3.2400 (1.53x) |   2.6128 (1.90x) |
|   5 |   13.3386 |   6.4830 (2.06x) |  14.4169 (0.93x) |   6.3693 (2.09x) |   5.3710 (2.48x) |
|   6 |    6.4417 |   3.9650 (1.62x) |   6.5030 (0.99x) |   4.0286 (1.60x) |   3.3147 (1.94x) |
|   7 |   14.2337 |   8.2538 (1.72x) |  14.1590 (1.01x) |   8.0352 (1.77x) |   6.7896 (2.10x) |
|   8 |   16.8859 |   7.8435 (2.15x) |  16.7968 (1.01x) |   7.3436 (2.30x) |   6.9214 (2.44x) |
|   9 |   17.9825 |  11.2841 (1.59x) |  17.7636 (1.01x) |  10.8812 (1.65x) |  10.6265 (1.69x) |
|  10 |   11.8223 |   7.6007 (1.56x) |  11.9399 (0.99x) |   7.3471 (1.61x) |   6.5892 (1.79x) |
|  11 |    1.2852 |   0.6537 (1.97x) |   1.2870 (1.00x) |   0.6072 (2.12x) |   0.5765 (2.23x) |
|  12 |    7.0303 |   5.0595 (1.39x) |   6.9463 (1.01x) |   4.9740 (1.41x) |   4.1803 (1.68x) |
|  13 |   28.1904 |  26.1344 (1.08x) |  28.0242 (1.01x) |  25.4519 (1.11x) |  25.1725 (1.12x) |
|  14 |   11.0458 |   5.8779 (1.88x) |  11.0733 (1.00x) |   5.6576 (1.95x) |   4.7738 (2.31x) |
|  15 |    9.9825 |   5.2593 (1.90x) |  10.0793 (0.99x) |   5.1335 (1.94x) |   4.2799 (2.33x) |
|  16 |    1.9128 |   1.5106 (1.27x) |   1.9170 (1.00x) |   1.5030 (1.27x) |   1.4729 (1.30x) |
|  17 |   10.0350 |   4.9818 (2.01x) |  10.1210 (0.99x) |   4.8223 (2.08x) |   4.1964 (2.39x) |
|  18 |   85.8377 |  88.1720 (0.97x) |  82.1853 (1.04x) |  83.2519 (1.03x) |  86.9057 (0.99x) |
|  19 |   13.5745 |   8.0006 (1.70x) |  13.2990 (1.02x) |   7.7437 (1.75x) |   6.9383 (1.96x) |
|  20 |   10.8586 |   5.9373 (1.83x) |  11.0034 (0.99x) |   5.5818 (1.95x) |   4.8016 (2.26x) |
|  21 |   50.8339 |  49.6382 (1.02x) |  50.5383 (1.01x) |  50.6136 (1.00x) |  53.1957 (0.96x) |
|  22 |    1.5246 |   0.7084 (2.15x) |   1.4778 (1.03x) |   0.6919 (2.20x) |   0.5938 (2.57x) |
+-----+-----------+------------------+------------------+------------------+------------------+
| Tot |  337.8784 | 262.1073 (1.29x) | 334.7410 (1.01x) | 254.1651 (1.33x) | 249.0610 (1.36x) |
+-----+-----------+------------------+------------------+------------------+------------------+
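
The factors in parentheses are consistent with baseline time divided by measured time, where the baseline is the "main,off" column. The sketch below shows that arithmetic together with the warm-up handling described above. The function names are made up for this example, and the per-iteration times are not part of this comment, so any values passed to `mean_after_warmup` are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <vector>

// Average of all iterations except the first (warm-up) one.
double mean_after_warmup(std::vector<double> const& times)
{
  assert(times.size() > 1);
  double const sum = std::accumulate(times.begin() + 1, times.end(), 0.0);
  return sum / static_cast<double>(times.size() - 1);
}

// Speedup relative to the baseline: values above 1 mean faster than baseline.
double speedup(double baseline_seconds, double seconds)
{
  return baseline_seconds / seconds;
}
```

For Q1, `speedup(8.9723, 4.8934)` is about 1.83, matching the "main,oppo" column, and `speedup(337.8784, 249.0610)` is about 1.36, matching the total for "PR,oppo+OR".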

Observations:

  • The sanity check passes cleanly.
  • Opportunistic direct I/O improves performance fairly consistently: PR,oppo is faster than main,oppo, which is faster than main,off.
  • The over-read feature (PR,oppo+OR) improves performance further for every query except Q18 and Q21.
  • The Q18 and Q21 timing data are very noisy. For example, on main,off Q18 ranges from 76.10 s to 95.78 s, and Q21 ranges from 44.04 s to 54.83 s. Their speedup numbers are not as meaningful.

@kingcrimsontianyu marked this pull request as ready for review March 6, 2026 00:34
@kingcrimsontianyu requested a review from a team as a code owner March 6, 2026 00:34
@kingcrimsontianyu requested review from bdice and vuule March 6, 2026 00:34
# Enable Direct I/O for reads, and disable it for writes
kvikio.defaults.set({"auto_direct_io_read": True, "auto_direct_io_write": False})

Over-read Alignment ``KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD``
Contributor Author

For now I wish to keep this a C++ only setting, so have not updated the Python API in this PR.

@kingcrimsontianyu requested a review from madsbk March 6, 2026 16:41
CUdeviceptr devPtr = convert_void2deviceptr(devPtr_base) + devPtr_offset;
std::size_t const bounce_buffer_size = bounce_buffer.size();
std::size_t cur_file_offset = file_offset;
std::size_t bytes_remaining = size;
Contributor Author

Integer types are a bit tricky to handle here. We have POSIX off_t (used for the syscall's offset parameter), size_t/std::size_t (used for the syscall's size parameter), and ssize_t (returned by the syscall). I tried to simplify things by using std::size_t wherever possible and casting to off_t (using our overflow checker convert_size2off) only when the variable is passed as the offset argument.
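
For readers unfamiliar with this pattern, a checked std::size_t to off_t conversion can look roughly like the sketch below. It is not the convert_size2off implementation; `checked_size_to_off` is a hypothetical stand-in, and the choice of exception type is an assumption.

```cpp
#include <sys/types.h>  // off_t

#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Hypothetical stand-in for an overflow-checked conversion (not KvikIO's code).
off_t checked_size_to_off(std::size_t value)
{
  // Compare in the widest unsigned type so neither side is truncated.
  auto const max_off = static_cast<std::uintmax_t>(std::numeric_limits<off_t>::max());
  if (static_cast<std::uintmax_t>(value) > max_off) {
    throw std::overflow_error("std::size_t value does not fit in off_t");
  }
  auto const result = static_cast<off_t>(value);
  assert(result >= 0);  // off_t is signed; a checked conversion never goes negative
  return result;
}
```

Keeping the arithmetic in std::size_t and converting once, at the point where the offset is handed to the syscall, confines the signed/unsigned boundary to a single checked location.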

Copilot AI left a comment

Pull request overview

This PR optimizes opportunistic Direct I/O (DIO) for device reads in two ways:

  1. First-task alignment in parallel_io: When file_offset is unaligned, the first parallel task is shortened so subsequent tasks start at page-aligned offsets, eliminating per-task BIO prefix overhead for most tasks.
  2. Pure DIO with over-read (KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD): A new posix_device_read_aligned function that aligns all reads to page boundaries by over-reading (reading extra bytes from disk and discarding the prefix/suffix), ensuring all disk I/O uses O_DIRECT.
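
The over-read arithmetic in item 2 can be sketched as follows. This illustrates only the offset computation and is not the posix_device_read_aligned implementation; `AlignedRead` and `overread_range` are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical result type for illustration only.
struct AlignedRead {
  std::size_t aligned_offset;  // page-aligned file offset where the read starts
  std::size_t aligned_size;    // page-multiple number of bytes requested from disk
  std::size_t skip;            // leading over-read bytes to discard
};

AlignedRead overread_range(std::size_t offset, std::size_t size, std::size_t page_size)
{
  assert(page_size > 0);
  // Round the start down and the end up to page boundaries.
  std::size_t const aligned_offset = offset / page_size * page_size;
  std::size_t const aligned_end = (offset + size + page_size - 1) / page_size * page_size;
  return AlignedRead{aligned_offset, aligned_end - aligned_offset, offset - aligned_offset};
}
```

With a 4096-byte page, a 10000-byte read at offset 100 becomes a 12288-byte read at offset 0. The first 100 bytes are skipped, the next 10000 bytes are the useful payload, and the trailing 2188 bytes are discarded. A real implementation also has to cope with the rounded-up end extending past the end of the file, which this sketch does not model.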

Changes:

  • New posix_device_read_aligned function implementing pure DIO with alignment via over-read
  • first_task_size parameter added to parallel_io for first-task shortening
  • New KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD setting in defaults
  • Refactored EnvVarContext to accept unordered_map in addition to initializer_list
  • New parameterized correctness tests for all unaligned offset/size combinations

Reviewed changes

Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.

File                                              Description
docs/source/runtime_settings.rst                  Documents new KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD env var
cpp/tests/utils/env.hpp                           Adds new EnvVarContext constructor accepting unordered_map
cpp/tests/utils/env.cpp                           Refactors shared logic into add_entry(), implements both constructors
cpp/tests/test_basic_io.cpp                       New parameterized OpportunisticDirectIOTest correctness tests
cpp/src/file_handle.cpp                           Computes first_task_size for page-aligned subsequent tasks
cpp/src/detail/posix_io.cpp                       Implements posix_device_read_aligned and adds lower_bound check
cpp/include/kvikio/detail/posix_io.hpp            Declares posix_device_read_aligned, adds doc notes, type cast fixes
cpp/include/kvikio/detail/parallel_operation.hpp  Adds first_task_size optional parameter to parallel_io
cpp/include/kvikio/defaults.hpp                   Declares auto_direct_io_read_overread setting
cpp/src/defaults.cpp                              Implements auto_direct_io_read_overread setting with env var initialization

@vuule (Contributor) left a comment

Partial review; will revisit after the currently pending comments are addressed.

@kingcrimsontianyu requested a review from vuule March 11, 2026 04:12
@vuule (Contributor) left a comment

A few small comments.

  auto const aligned_cur_offset = detail::align_up(cur_offset, page_size);
- auto const bytes_requested = std::min(aligned_cur_offset - cur_offset, bytes_remaining);
+ auto const bytes_requested =
+   std::min(aligned_cur_offset - static_cast<std::size_t>(cur_offset), bytes_remaining);
Contributor

Why is the cast needed? It might be better to make aligned_cur_offset a size_t, if that also solves the issue.

Contributor Author

Done. Further simplified the internal use of integer types.

*
* @param nbytes The default task size in bytes.
*/
static void set_task_size(std::size_t nbytes);
Contributor

Can we enforce this?

Contributor Author

Some unit tests use sub-page task sizes (https://github.com/rapidsai/kvikio/blob/main/python/kvikio/tests/test_basic_io.py#L25). For simplicity, I think we can treat the alignment as performance advice rather than a hard requirement. I updated the Python doc accordingly.

@kingcrimsontianyu requested a review from vuule March 13, 2026 15:11
Labels

c++ (Affects the C++ API of KvikIO), improvement (Improves an existing functionality), non-breaking (Introduces a non-breaking change)

Projects

Status: Burndown

3 participants