Optimize opportunistic Direct I/O for device reads#939
Optimize opportunistic Direct I/O for device reads#939kingcrimsontianyu wants to merge 27 commits intorapidsai:mainfrom
Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
| # Enable Direct I/O for reads, and disable it for writes | ||
| kvikio.defaults.set({"auto_direct_io_read": True, "auto_direct_io_write": False}) | ||
|
|
||
| Over-read Alignment ``KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD`` |
There was a problem hiding this comment.
For now I wish to keep this a C++ only setting, so have not updated the Python API in this PR.
| CUdeviceptr devPtr = convert_void2deviceptr(devPtr_base) + devPtr_offset; | ||
| std::size_t const bounce_buffer_size = bounce_buffer.size(); | ||
| std::size_t cur_file_offset = file_offset; | ||
| std::size_t bytes_remaining = size; |
There was a problem hiding this comment.
Integer type is a bit tricky to handle. We have POSIX off_t (used for the syscall's offset parameter) and size_t/std::size_t (used for the syscall's size parameter, and ssize_t (returned by syscall). Here I tried to simplify things a bit by using std::size_t wherever possible and cast to off_t (using our overflow checker convert_size2off) when the variable is passed as the offset argument.
There was a problem hiding this comment.
Pull request overview
This PR optimizes opportunistic Direct I/O (DIO) for device reads in two ways:
- First-task alignment in
parallel_io: Whenfile_offsetis unaligned, the first parallel task is shortened so subsequent tasks start at page-aligned offsets, eliminating per-task BIO prefix overhead for most tasks. - Pure DIO with over-read (
KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD): A newposix_device_read_alignedfunction that aligns all reads to page boundaries by over-reading (reading extra bytes from disk and discarding the prefix/suffix), ensuring all disk I/O uses O_DIRECT.
Changes:
- New
posix_device_read_alignedfunction implementing pure DIO with alignment via over-read first_task_sizeparameter added toparallel_iofor first-task shortening- New
KVIKIO_AUTO_DIRECT_IO_READ_OVERREADsetting indefaults - Refactored
EnvVarContextto acceptunordered_mapin addition toinitializer_list - New parameterized correctness tests for all unaligned offset/size combinations
Reviewed changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
docs/source/runtime_settings.rst |
Documents new KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD env var |
cpp/tests/utils/env.hpp |
Adds new EnvVarContext constructor accepting unordered_map |
cpp/tests/utils/env.cpp |
Refactors shared logic into add_entry(), implements both constructors |
cpp/tests/test_basic_io.cpp |
New parameterized OpportunisticDirectIOTest correctness tests |
cpp/src/file_handle.cpp |
Computes first_task_size for page-aligned subsequent tasks |
cpp/src/detail/posix_io.cpp |
Implements posix_device_read_aligned and adds lower_bound check |
cpp/include/kvikio/detail/posix_io.hpp |
Declares posix_device_read_aligned, adds doc notes, type cast fixes |
cpp/include/kvikio/detail/parallel_operation.hpp |
Adds first_task_size optional parameter to parallel_io |
cpp/include/kvikio/defaults.hpp |
Declares auto_direct_io_read_overread setting |
cpp/src/defaults.cpp |
Implements auto_direct_io_read_overread setting with env var initialization |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
vuule
left a comment
There was a problem hiding this comment.
partial review, will revisit after the currently pending comments are addressed.
| auto const aligned_cur_offset = detail::align_up(cur_offset, page_size); | ||
| auto const bytes_requested = std::min(aligned_cur_offset - cur_offset, bytes_remaining); | ||
| auto const bytes_requested = | ||
| std::min(aligned_cur_offset - static_cast<std::size_t>(cur_offset), bytes_remaining); |
There was a problem hiding this comment.
Why is the cast needed? might be better to make aligned_cur_offset size_t, if that also solves the issue.
There was a problem hiding this comment.
Done. Further simplified the internal use of integer types.
| * | ||
| * @param nbytes The default task size in bytes. | ||
| */ | ||
| static void set_task_size(std::size_t nbytes); |
There was a problem hiding this comment.
Some unit tests use sub-page task sizes (https://github.com/rapidsai/kvikio/blob/main/python/kvikio/tests/test_basic_io.py#L25). For simplicity I think we can just make the alignment a performance advice. Updated the python doc on this matter.



This PR improves the existing opportunistic direct I/O for POSIX device read in two ways:
Page-align first task in parallel_io: When file_offset is not page-aligned, the first task is shortened so that all subsequent tasks start at a page-aligned boundary. This eliminates per-task alignment overhead (BIO prefix/suffix) for the majority of tasks in a parallel read.
Optional pure Direct I/O with over-read (
KVIKIO_AUTO_DIRECT_IO_READ_OVERREAD, default: off): A new posix_device_read_aligned function that aligns offset down and size up to page boundaries, ensuring all disk I/O goes through Direct I/O.The performance result is available at the comment section below.