Conversation
**tfeher** left a comment
Thanks Artem for the PR! This is great. As we try to maximize memory utilization, we are prone to run out of memory. This PR will be very useful to debug those issues and understand memory usage of various algorithms.
The extra memory usage tracking layer is only created if the user explicitly requests it. Therefore I do not see any issue merging this into raft. We should get this in 26.04.
I have a few comments below.
My wishlist of follow-up PRs:
- Python API to enable memory_tracking_resource
- Command-line argument for cuvs-bench to enable memory tracking
```cpp
 * a shared resource_stats object. The stats are co-owned via shared_ptr so
 * they survive type-erasure of this adaptor.
 *
 * @note Make sure to call stats() before type-erasing the adaptor to get the statistics.
```
Suggested change:
```diff
- * @note Make sure to call stats() before type-erasing the adaptor to get the statistics.
+ * @note Make sure to call get_stats() before type-erasing the adaptor to get the statistics.
```
```cpp
out_ << us << ',' << depth << ",\"" << range << '"';
for (auto const& [name, stats] : sources_) {
  out_ << ',' << stats->bytes_current.load(std::memory_order_relaxed) << ','
```
I would like to have information about peak memory usage during the last sampling interval. E.g. if we use 1 sec interval for a long cagra build, then the peak in each interval would be the most informative number.
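One way to get per-interval peaks could look like the following sketch (illustrative only, not the PR's implementation; the names `bytes_peak` and `sample_peak` are hypothetical): allocations max-update an atomic high-water mark, and the sampler reads and resets it once per row.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical stats object tracking current bytes plus the peak observed
// since the last sample. Field names are illustrative only.
struct interval_stats {
  std::atomic<std::size_t> bytes_current{0};
  std::atomic<std::size_t> bytes_peak{0};

  void on_alloc(std::size_t n) {
    auto cur = bytes_current.fetch_add(n, std::memory_order_relaxed) + n;
    // CAS loop: raise the peak if this allocation set a new high-water mark.
    auto peak = bytes_peak.load(std::memory_order_relaxed);
    while (cur > peak &&
           !bytes_peak.compare_exchange_weak(peak, cur, std::memory_order_relaxed)) {}
  }

  void on_dealloc(std::size_t n) { bytes_current.fetch_sub(n, std::memory_order_relaxed); }

  // Called once per sampling interval: report the interval's peak, then
  // reset it to the current level so the next interval starts fresh.
  std::size_t sample_peak() {
    return bytes_peak.exchange(bytes_current.load(std::memory_order_relaxed),
                               std::memory_order_relaxed);
  }
};
```

The reset-to-current (rather than reset-to-zero) keeps the peak meaningful when memory stays allocated across interval boundaries.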
```cpp
private:
  // NB: using `cuda::std` in place of `std`,
  // because this may happen to be included in pre-C++20 code downstream.
  cuda::std::atomic_flag flag_;  // Note, meaning of the flag is inverted
```
Could you specify the meaning?
```cpp
notifier_->wait();  // waits indefinitely until notify() is called
write_row();
// sleep for the minimum time interval
std::this_thread::sleep_for(sample_interval_);
```
Do I understand correctly that while the thread is sleeping we ignore notifications?
```cpp
// Prevent recursive concept satisfaction when Upstream is a __basic_any type (GCC C++20).
template <typename U, std::enable_if_t<std::is_same_v<std::decay_t<U>, Upstream>, int> = 0>
explicit notifying_adaptor(U&& upstream,
                           std::shared_ptr<notifier> n = std::make_shared<notifier>())
```
In practice it is expected to have multiple notifying_adaptors writing to the same report. That means they should share the same notifier object. Wouldn't it be better to remove the default argument, to make the user conscious about sharing the same notifier object among the adaptors (like it is done in memory_tracking_resources.hpp)?
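The sharing concern can be pictured with this simplified sketch (hypothetical types, not the PR's actual classes): with no default argument, every adaptor must be handed a notifier explicitly, which makes accidental per-adaptor notifiers impossible.

```cpp
#include <atomic>
#include <cstddef>
#include <memory>
#include <utility>

// Minimal stand-in for the PR's notifier: records whether any allocation
// activity happened since the last check.
struct notifier {
  std::atomic<bool> active{false};
  void notify() { active.store(true, std::memory_order_relaxed); }
};

// Upstream stand-in: a do-nothing memory resource for illustration.
struct dummy_mr {
  void* allocate(std::size_t) { return nullptr; }  // placeholder, no real allocation
};

// Simplified adaptor: note there is NO default argument for the notifier,
// so callers must consciously pass (and share) one.
template <typename Upstream>
struct notifying_adaptor {
  Upstream upstream;
  std::shared_ptr<notifier> n;
  notifying_adaptor(Upstream u, std::shared_ptr<notifier> n_)
    : upstream(std::move(u)), n(std::move(n_)) {}
  void* allocate(std::size_t bytes) {
    n->notify();  // wake the sampler
    return upstream.allocate(bytes);
  }
};
```

With this shape, two adaptors constructed from the same `std::shared_ptr<notifier>` both wake the same sampling thread.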
```cpp
struct nvtx_range_name_stack;
}  // namespace detail

/** Shared, read-only handle to the current NVTX range name of another thread. */
```
Isn't this the NVTX range name of the current thread?
```cpp
auto us = std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::steady_clock::now() - start_time_)
            .count();
auto [range, depth] = nvtx_range_->get();
```
Here nvtx_range_ refers to the thread that constructed the resource_monitor object. Do I understand correctly that this is not accurate, since the notification about an allocation can come from a different thread?
Most of the time we have just the main thread scheduling the work for the GPU, so logging the allocated bytes to the main thread's NVTX ranges is good enough.
Exactly! This is a compromise I made in the interest of reducing the tracking overheads:
- the overhead on allocations amounts to a few atomic CASes: bump the counter and touch the activity flag
- once per sample period we do a more expensive operation: copy the shared string (involves a mutex lock) and read the current state of all counters.

The user decides to "follow" a specific thread, and all resource states (samples) are correlated to the NVTX range of that thread.
If we were to get the NVTX range state on each allocation in its corresponding thread, that would lead to two problems:
- string-copy overhead on each allocation
- no clear way to aggregate the allocations within a sample period
Detailed tracking of (almost) all allocations on device and host.
This produces a CSV file of sampled allocations, with a timeline and NVTX correlation.
It can later be visualized (the visualization script is not included in the PR):

Implementation overview
NVTX
Added thread-local tracking of the NVTX range stack; the calling thread shares a handle with the sampling thread to correlate the NVTX range state with allocations.
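That sharing scheme can be sketched roughly as follows (names are illustrative, not the PR's API): each thread keeps its own range stack and hands the sampler a shared, mutex-guarded handle that is safe to read from another thread.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Illustrative only: a stack of NVTX range names owned by one thread,
// observable by the sampling thread through a shared handle.
struct range_name_handle {
  std::mutex m;
  std::vector<std::string> stack;  // innermost range is at the back

  void push(std::string name) {  // called by the followed thread on range entry
    std::lock_guard<std::mutex> lk(m);
    stack.push_back(std::move(name));
  }
  void pop() {  // called by the followed thread on range exit
    std::lock_guard<std::mutex> lk(m);
    if (!stack.empty()) stack.pop_back();
  }
  // Sampler side: copy the innermost name and the nesting depth.
  std::pair<std::string, std::size_t> get() {
    std::lock_guard<std::mutex> lk(m);
    return {stack.empty() ? std::string{} : stack.back(), stack.size()};
  }
};
```

The mutex is only contended once per sample period, which is why the per-allocation hot path described in the review discussion stays lock-free.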
Memory resource adaptors
cuda::mr-compatible resources.
Resource monitor
A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream.
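The CSV output of that sampling loop can be pictured with this small sketch (a free function under assumed names; the real columns come from the quoted `write_row` hunk above): one row per sample, with the elapsed microseconds, NVTX depth, quoted range name, and one byte count per registered source.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Illustrative CSV row formatter matching the quoted output format:
// <microseconds>,<depth>,"<range>",<bytes_0>,<bytes_1>,...
std::string write_row(long long us, std::size_t depth, const std::string& range,
                      const std::vector<std::size_t>& bytes_per_source) {
  std::ostringstream out;
  out << us << ',' << depth << ",\"" << range << '"';
  for (auto b : bytes_per_source) out << ',' << b;  // one column per resource
  return out.str();
}
```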
Memory tracking resources
`raft::memory_tracking_resources` is a child of `raft::resources`, and thus can be used as a drop-in replacement. It replaces all known memory resources for the duration of its lifetime and manages the output file or stream if necessary.

Depends on (and includes all changes of) #2968
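The "replaces for the duration of its lifetime" behavior is the usual RAII swap pattern; a generic sketch under assumed names (not raft's or rmm's actual classes) looks like this:

```cpp
// Illustrative global "current resource" pointer, standing in for a
// get/set_current_device_resource-style registry.
struct resource { int id; };
inline resource default_mr{0};
inline resource* current_mr = &default_mr;

// RAII guard: installs a replacement resource on construction and restores
// the previous one on destruction, mirroring the lifetime-scoped
// replacement described above.
struct scoped_resource_swap {
  resource* previous;
  explicit scoped_resource_swap(resource* replacement) : previous(current_mr) {
    current_mr = replacement;
  }
  ~scoped_resource_swap() { current_mr = previous; }
};
```

Destroying the guard at scope exit guarantees the original resources are restored even if the tracked code throws.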