Fragmentation of the global CuPy memory pool #667

@mfep

Description

Problem statement

As seen in #666, the globally shared CuPy memory pool is prone to fragmentation. This can lead to allocation failures when running the pipeline. The following example illustrates the problem:

  1. Let the pipeline run on a single GPU with 8 GiB of available memory.
  2. Method "A" runs. It allocates a 1 GiB output array, and also uses a 4 GiB array for temporary calculations.
  3. After method "A" returns, the memory pool holds 5 GiB overall: a 1 GiB chunk holding the output of method "A", and a 4 GiB chunk that is now unused. 3 GiB remains free on the device.
  4. Method "B" runs. For the sake of example, it just allocates a 1 GiB output array. To avoid a new device allocation, the pool serves the request from the unused 4 GiB chunk, splitting off 1 GiB of it for the output array.
  5. After method "B" returns, there are in theory still 6 GiB free: 3 GiB inside the pool's 4 GiB chunk, and 3 GiB free on the device.
  6. Method "A" runs again. This should be possible, because it needs 5 GiB in total and, as shown above, 6 GiB is free. However, while the 1 GiB output allocation can be served by the remaining 3 GiB in the pool, the subsequent allocation of the 4 GiB temporary array fails, as no contiguous 4 GiB is available either in the pool or on the device.

This is analogous to what happens in #666: the smaller darks/flats output takes hold of a larger backing chunk in the memory pool, effectively limiting the maximum size of subsequent allocations.
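For illustration, here is a minimal sketch that mimics the sequence above with raw CuPy allocations. The sizes and the exact failure point are hypothetical; the precise behaviour depends on the CuPy version and the pool's chunk-splitting strategy:

```python
import cupy as cp

pool = cp.get_default_memory_pool()
GiB = 1024**3

# Method "A": 1 GiB output plus a 4 GiB temporary.
out_a = cp.empty(1 * GiB, dtype=cp.uint8)
tmp_a = cp.empty(4 * GiB, dtype=cp.uint8)
del tmp_a  # the 4 GiB chunk stays cached in the pool

print(pool.used_bytes() // GiB, pool.total_bytes() // GiB)  # 1 5

# Method "B": the 1 GiB output is split off the cached 4 GiB chunk.
out_b = cp.empty(1 * GiB, dtype=cp.uint8)
print(pool.used_bytes() // GiB, pool.total_bytes() // GiB)  # 2 5

# Method "A" again: the 1 GiB output still fits in the pool, but the
# 4 GiB temporary can be served neither by the fragmented pool nor by
# the ~3 GiB left on the device, so the allocation fails.
out_a2 = cp.empty(1 * GiB, dtype=cp.uint8)
tmp_a2 = cp.empty(4 * GiB, dtype=cp.uint8)  # raises cupy.cuda.memory.OutOfMemoryError
```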

Possible solutions

A: Avoid pool fragmentation on a case-by-case basis

By reorganizing the code, or by invoking free_all_blocks on the global pool, it is possible to solve these problems one by one, as was done in #666. However, this does not guarantee that new cases will not be introduced unexpectedly. Also, free_all_blocks can introduce an execution bottleneck and reduce performance.
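The mitigation itself is a one-liner on CuPy's public API; the cost is that the next method has to re-allocate from the device instead of reusing cached blocks:

```python
import cupy as cp

# Return all cached, currently unused blocks to the device, so that the
# next large allocation can be served by a fresh contiguous cudaMalloc.
cp.get_default_memory_pool().free_all_blocks()
```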

B: Make sure that auxiliary data is loaded to the GPU only on demand

A more generic, preventive solution is to make sure that auxiliary data, such as darks and flats, is loaded to the device only when the next method requires it. After the method returns, the auxiliary data would be freed from the device and kept in host memory until it is needed again. This would guarantee that no unneeded data occupies the GPU and causes a method to run out of memory.
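A minimal sketch of this pattern, assuming hypothetical names (run_with_aux, normalize, darks, flats are illustrative, not part of the codebase):

```python
import cupy as cp

def run_with_aux(method, data, **aux_host):
    """Copy auxiliary host arrays (e.g. darks/flats) to the device only
    for the duration of one method, then drop the device copies."""
    aux_dev = {name: cp.asarray(arr) for name, arr in aux_host.items()}
    try:
        return method(data, **aux_dev)
    finally:
        aux_dev.clear()  # the host copies remain; only the device copies are dropped
        cp.get_default_memory_pool().free_all_blocks()  # hand the chunks back

# e.g.: data = run_with_aux(normalize, data, darks=darks, flats=flats)
```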

C: Isolate the method's execution by using a separate memory pool

Even if the auxiliary data is loaded only on demand, the output of one method can still hold a backing chunk in the pool that is larger than itself, thereby fragmenting the memory. A preventive solution would be to allocate the inputs/outputs of each method up front and run each method in the pipeline with a separate memory pool, which is released after the method returns. This would require more changes to the program structure, but would also make certain optimizations possible, such as pre-allocating memory based on the memory estimator.
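A minimal sketch of the isolation idea, using CuPy's public allocator hooks (run_isolated is a hypothetical helper; copying the output out of the private pool stands in for the up-front input/output allocation described above):

```python
import cupy as cp

def run_isolated(method, *args):
    """Execute one pipeline method inside a private memory pool, so its
    temporaries cannot fragment the global pool."""
    pool = cp.cuda.MemoryPool()
    with cp.cuda.using_allocator(pool.malloc):
        result = method(*args)  # all allocations land in the private pool
    out = result.copy()         # re-allocate the output via the default allocator
    del result                  # drop the last reference into the private pool
    pool.free_all_blocks()      # release every block of the private pool at once
    return out
```

Pre-allocating the output with the default allocator before entering the private pool would avoid the extra device-to-device copy, which is where the memory estimator could come in.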
