
Tracking memory resources#2973

Open
achirkin wants to merge 29 commits intorapidsai:mainfrom
achirkin:fea-tracking-memory-resources

Conversation


@achirkin achirkin commented Mar 4, 2026

Detailed tracking of (almost) all allocations on device and host.

  // optionally pass an existing resource handle
  raft::resources res;

  // The tracking handle is a child of the resource handle; it wraps all memory resources with statistics adaptors
  raft::memory_tracking_resources tracked(res, "allocations.csv", std::chrono::milliseconds(1));

  // All allocations are logged to a .csv as long as `tracked` is alive
  cuvs::neighbors::cagra::build(tracked, ...);

This produces a CSV file of sampled allocations with a timeline and NVTX range correlation:

timestamp_us,nvtx_depth,nvtx_range,host_current,host_total,pinned_current,pinned_total,managed_current,managed_total,device_current,device_total,workspace_current,workspace_total,large_workspace_current,large_workspace_total
198809,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,148304,148304,0,0,0,0
199961,1,"hnsw::build<ACE>",20008,20008,0,0,0,0,15588304,15588304,0,0,0,0
201350,1,"hnsw::build<ACE>",0,20008,0,0,0,0,0,40385488,0,0,0,0
222216,3,"cagra::build_knn_graph<IVF-PQ>(5000000, 1536, 72)",1440000000,1440020008,0,0,0,0,0,40385488,0,0,0,0
273892,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,0,0
304183,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,40385488,80770976,0,0,4388567040,4388567040
309064,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,53860384,94245872,0,0,4388567040,4388567040
334655,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,67339295,107724783,0,0,4388567040,4388567040
385037,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,74076743,114462231,0,0,4388567040,4388567040
386129,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,80814199,121199687,0,0,4388567040,4388567040
402750,4,"ivf_pq::build(5000000, 1536)",1440020008,1440040016,0,0,0,0,46099768,126913967,0,0,4388567040,4388567040
...

This can later be visualized (the visualization script is not included in the PR):
(allocation timeline plot)

Implementation overview

NVTX

Added thread-local tracking of the NVTX range stack; the calling thread shares a handle with the sampling thread to correlate the NVTX range state with allocations.
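A minimal sketch of the shared range-stack idea (hypothetical names; the PR's actual `nvtx_range_name_stack` lives in `detail` and hooks into NVTX push/pop): the traced thread pushes and pops range names, and the sampler reads a (name, depth) snapshot through a mutex-guarded handle.

```cpp
#include <cstddef>
#include <mutex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the PR's shared NVTX range handle.
class nvtx_range_handle {
 public:
  void push(std::string name) {
    std::lock_guard<std::mutex> lk{mtx_};
    stack_.push_back(std::move(name));
  }
  void pop() {
    std::lock_guard<std::mutex> lk{mtx_};
    if (!stack_.empty()) { stack_.pop_back(); }
  }
  // Copies the innermost range name plus the stack depth. This is the
  // mutex-guarded string copy the sampler pays once per sample row,
  // not once per allocation.
  std::pair<std::string, std::size_t> get() const {
    std::lock_guard<std::mutex> lk{mtx_};
    return {stack_.empty() ? std::string{} : stack_.back(), stack_.size()};
  }

 private:
  mutable std::mutex mtx_;
  std::vector<std::string> stack_;
};
```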

Memory resource adaptors
  • statistics adaptor: atomically counts allocations/deallocations for any cuda::mr-compatible resource
  • notifying adaptor: sets a shared "notifier" state on each event
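The statistics-adaptor idea can be sketched as follows (hypothetical names, plain `std` types; the real adaptor wraps any `cuda::mr`-compatible resource): every allocate/deallocate bumps relaxed atomic counters, and the counters are co-owned via `shared_ptr` so they survive type-erasure of the adaptor.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <utility>

// Hypothetical stand-in for the PR's resource_stats object.
struct resource_stats {
  std::atomic<std::size_t> bytes_current{0};  // currently allocated
  std::atomic<std::size_t> bytes_total{0};    // cumulative allocated
};

template <typename Upstream>
class statistics_adaptor {
 public:
  explicit statistics_adaptor(Upstream upstream,
                              std::shared_ptr<resource_stats> stats =
                                  std::make_shared<resource_stats>())
    : upstream_{std::move(upstream)}, stats_{std::move(stats)} {}

  void* allocate(std::size_t bytes) {
    void* p = upstream_.allocate(bytes);
    stats_->bytes_current.fetch_add(bytes, std::memory_order_relaxed);
    stats_->bytes_total.fetch_add(bytes, std::memory_order_relaxed);
    return p;
  }

  void deallocate(void* p, std::size_t bytes) {
    stats_->bytes_current.fetch_sub(bytes, std::memory_order_relaxed);
    upstream_.deallocate(p, bytes);
  }

  // The shared_ptr keeps the stats alive after the adaptor is type-erased.
  std::shared_ptr<resource_stats> get_stats() const { return stats_; }

 private:
  Upstream upstream_;
  std::shared_ptr<resource_stats> stats_;
};

// Trivial host upstream used for illustration only.
struct host_resource {
  void* allocate(std::size_t b) { return std::malloc(b); }
  void deallocate(void* p, std::size_t) { std::free(p); }
};
```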
Resource monitor

A resource monitor registers a collection of resource statistics objects, a single NVTX range handle, and a single notifier state. It spawns a new thread to sample the resource statistics at a given rate (but only when the notifier is triggered). This thread writes to a CSV output stream.
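The notifier handshake between allocating threads and the sampling thread might look roughly like this (a sketch using `std` primitives; the PR itself uses `cuda::std::atomic_flag`): `notify()` latches an event, and `wait()` blocks until at least one event arrived, then resets, so events arriving between samples are coalesced into the next wake-up rather than queued.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical sketch of the notifier state shared between memory resources
// (producers) and the sampling thread (single consumer).
class notifier {
 public:
  // Called by memory resources on each allocation/deallocation event.
  void notify() {
    {
      std::lock_guard<std::mutex> lk{mtx_};
      triggered_ = true;  // latch: repeated notifies before a wait coalesce
    }
    cv_.notify_one();
  }

  // Called by the sampling thread; blocks until an event is pending, then
  // resets the latch so the next wait() blocks again.
  void wait() {
    std::unique_lock<std::mutex> lk{mtx_};
    cv_.wait(lk, [this] { return triggered_; });
    triggered_ = false;
  }

 private:
  std::mutex mtx_;
  std::condition_variable cv_;
  bool triggered_{false};
};
```

The monitor's sampling loop then alternates `wait()`, writing one CSV row, and sleeping for the minimum sample interval, so rows are only emitted while allocation activity is happening.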

Memory tracking resources

raft::memory_tracking_resources is a child of raft::resources and can therefore be used as a drop-in replacement. It replaces all known memory resources for the duration of its lifetime and manages the output file or stream if necessary.
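The lifetime-scoped "replace and restore" pattern can be illustrated with a toy sketch (all names hypothetical; the real class wraps the raft::resources registry with the adaptors above): the tracker swaps a counting wrapper into the parent handle on construction and restores the original resource on destruction.

```cpp
#include <cstddef>
#include <cstdlib>
#include <functional>

// Toy stand-in for a resource registry holding one allocator callback.
struct resources {
  std::function<void*(std::size_t)> device_alloc;
};

class scoped_tracker {
 public:
  explicit scoped_tracker(resources& res)
    : res_{res}, saved_{res.device_alloc} {
    auto saved = saved_;
    // Swap in a wrapper that counts bytes before forwarding upstream.
    res_.device_alloc = [this, saved](std::size_t n) {
      bytes_total_ += n;
      return saved(n);
    };
  }
  // Restore the original allocator: tracking ends with this object's scope.
  ~scoped_tracker() { res_.device_alloc = saved_; }

  std::size_t bytes_total() const { return bytes_total_; }

 private:
  resources& res_;
  std::function<void*(std::size_t)> saved_;
  std::size_t bytes_total_{0};
};
```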

Depends on (and includes all changes of) #2968

achirkin and others added 25 commits February 26, 2026 09:20
@achirkin achirkin self-assigned this Mar 4, 2026
@achirkin achirkin requested review from a team as code owners March 4, 2026 17:44
@achirkin achirkin added feature request New feature or request non-breaking Non-breaking change labels Mar 4, 2026

@tfeher tfeher left a comment


Thanks Artem for the PR! This is great. As we try to maximize memory utilization, we are prone to run out of memory. This PR will be very useful to debug those issues and understand memory usage of various algorithms.

The extra memory usage tracking layer is only created if the user explicitly requests it. Therefore I do not see any issue merging this into raft. We should get this in 26.04.

I have a few comments below.

My wishlist of follow-up PRs:

  • Python API to enable memory_tracking_resource
  • Command line argument for cuvs-bench to enable memory tracking

* a shared resource_stats object. The stats are co-owned via shared_ptr so
* they survive type-erasure of this adaptor.
*
* @note Make sure to call stats() before type-erasing the adaptor to get the statistics.


Suggested change
* @note Make sure to call stats() before type-erasing the adaptor to get the statistics.
* @note Make sure to call get_stats() before type-erasing the adaptor to get the statistics.


out_ << us << ',' << depth << ",\"" << range << '"';
for (auto const& [name, stats] : sources_) {
out_ << ',' << stats->bytes_current.load(std::memory_order_relaxed) << ','


I would like to have information about peak memory usage during the last sampling interval. E.g. if we use 1 sec interval for a long cagra build, then the peak in each interval would be the most informative number.
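One way to implement this suggestion (an assumption, not part of the PR): allocations update an interval peak via a lock-free fetch-max CAS loop, and the sampler atomically reads-and-resets the peak when writing each row.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch of per-interval peak tracking.
struct interval_peak {
  std::atomic<std::size_t> current{0};
  std::atomic<std::size_t> peak{0};

  void on_allocate(std::size_t bytes) {
    auto now = current.fetch_add(bytes, std::memory_order_relaxed) + bytes;
    // lock-free fetch-max: raise peak to `now` if it is larger
    auto prev = peak.load(std::memory_order_relaxed);
    while (prev < now &&
           !peak.compare_exchange_weak(prev, now, std::memory_order_relaxed)) {
    }
  }

  void on_deallocate(std::size_t bytes) {
    current.fetch_sub(bytes, std::memory_order_relaxed);
  }

  // Called once per sample row: returns the peak since the previous sample
  // and re-seeds it with the current usage (approximate under concurrent
  // updates, which is acceptable for a sampled report).
  std::size_t sample_peak() {
    return peak.exchange(current.load(std::memory_order_relaxed),
                         std::memory_order_relaxed);
  }
};
```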

private:
// NB: using `cuda::std` in place of `std`,
// because this may happen to be included in pre-C++20 code downstream.
cuda::std::atomic_flag flag_; // Note, meaning of the flag is inverted


Could you specify the meaning?

notifier_->wait(); // waits indefinitely until notify() is called
write_row();
// sleep for the minimum time interval
std::this_thread::sleep_for(sample_interval_);


Do I understand correctly that while the thread is sleeping we ignore notifications?

// Prevent recursive concept satisfaction when Upstream is a __basic_any type (GCC C++20).
template <typename U, std::enable_if_t<std::is_same_v<std::decay_t<U>, Upstream>, int> = 0>
explicit notifying_adaptor(U&& upstream,
std::shared_ptr<notifier> n = std::make_shared<notifier>())


In practice it is expected to have multiple notifying_adaptors writing to the same report. That means they should share the same notifier object. Wouldn't it be better to remove the default argument, to make the user conscious about sharing the same notifier object among the adaptors (like it is done in memory_tracking_resources.hpp)?

struct nvtx_range_name_stack;
} // namespace detail

/** Shared, read-only handle to the current NVTX range name of another thread. */


Isn't this the NVTX range name of the current thread?

auto us = std::chrono::duration_cast<std::chrono::microseconds>(
std::chrono::steady_clock::now() - start_time_)
.count();
auto [range, depth] = nvtx_range_->get();


Here nvtx_range_ refers to thread that constructed the resource_monitor object. Do I understand correctly, that this is not accurate, since the notification about an allocation can come from a different thread?

Most of the time we have just the main thread scheduling the work for the GPU, so logging the allocated bytes to the main thread's NVTX ranges is good enough.


@achirkin achirkin Mar 13, 2026


Exactly! This is a compromise I made in the interest of reducing the tracking overheads:

  1. the overhead on each allocation amounts to a few atomic CASes: bump the counter and touch the activity flag
  2. once per sample period we do a more expensive operation: copy the shared string (which involves a mutex lock) and read the current state of all counters.

The user decides to "follow" a specific thread, and all resource states (samples) are correlated to the NVTX range of that thread.

If we were to get the NVTX range state on each allocation in its corresponding thread, that would lead to two problems:

  1. string copy overhead on each allocation
  2. not clear how to aggregate the allocations within a sample period
