Fragmentation of the global CuPy memory pool #667

@mfep

Description

Problem statement

As seen in #666, the globally shared CuPy memory pool is prone to fragmentation. This can lead to allocation failures when running the pipeline. The following example illustrates the problem:

  1. Let the pipeline run on a single GPU with 8 GiB of available memory.
  2. Method "A" runs. It allocates a 1 GiB output array, and also uses a 4 GiB array for temporary calculations.
  3. After method "A" returns, the memory pool holds 5 GiB overall: a 1 GiB chunk holding the output of method "A", and a 4 GiB chunk that is now unused. 3 GiB remains free on the device.
  4. Method "B" runs. For the sake of example, it just allocates a 1 GiB output array. To avoid a new device allocation, the pool serves the request from the unused 4 GiB chunk, splitting off 1 GiB of it for the output array.
  5. After method "B" returns, there are in theory still 6 GiB free: 3 GiB inside the pool's 4 GiB chunk, and 3 GiB free on the device.
  6. Method "A" runs again. This should be possible, because it needs 5 GiB in total and, as shown above, 6 GiB is free. However, while the 1 GiB output allocation can be served by the remaining 3 GiB in the pool, the subsequent allocation of the 4 GiB temporary array fails, as no contiguous 4 GiB is available either in the pool or on the device.

This is analogous to what happens in #666: the smaller darks/flats output takes hold of a larger backing chunk in the memory pool, effectively limiting the maximum size of subsequent allocations.
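For illustration, here is a minimal sketch that mimics the sequence above with raw CuPy allocations. The sizes and the exact failure point are hypothetical; the precise behaviour depends on the CuPy version and the pool's chunk-splitting strategy:

```python
import cupy as cp

pool = cp.get_default_memory_pool()
GiB = 1024**3

# Method "A": 1 GiB output plus a 4 GiB temporary.
out_a = cp.empty(1 * GiB, dtype=cp.uint8)
tmp_a = cp.empty(4 * GiB, dtype=cp.uint8)
del tmp_a  # the 4 GiB chunk stays cached in the pool

print(pool.used_bytes() // GiB, pool.total_bytes() // GiB)  # 1 5

# Method "B": the 1 GiB output is split off the cached 4 GiB chunk.
out_b = cp.empty(1 * GiB, dtype=cp.uint8)
print(pool.used_bytes() // GiB, pool.total_bytes() // GiB)  # 2 5

# Method "A" again: the 1 GiB output still fits in the pool, but the
# 4 GiB temporary can be served neither by the fragmented pool nor by
# the ~3 GiB left on the device, so the allocation fails.
out_a2 = cp.empty(1 * GiB, dtype=cp.uint8)
tmp_a2 = cp.empty(4 * GiB, dtype=cp.uint8)  # raises cupy.cuda.memory.OutOfMemoryError
```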

Possible solutions

A: Avoid pool fragmentation on a case-by-case basis

By reorganizing the code, or by invoking free_all_blocks on the global pool, it is possible to solve these problems one by one, as was done in #666. However, this does not guarantee that new cases will not be introduced unexpectedly. Also, free_all_blocks can introduce an execution bottleneck and reduce performance.
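The mitigation itself is a one-liner on CuPy's public API; the cost is that the next method has to re-allocate from the device instead of reusing cached blocks:

```python
import cupy as cp

# Return all cached, currently unused blocks to the device, so that the
# next large allocation can be served by a fresh contiguous cudaMalloc.
cp.get_default_memory_pool().free_all_blocks()
```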

B: Make sure that auxiliary data is loaded to the GPU only on demand

A more generic, preventive solution is to make sure that auxiliary data, such as darks and flats, is loaded to the device only when the next method requires it. After the method returns, the auxiliary data would be freed from the device and kept in host memory until it is needed again. This would guarantee that no unneeded data occupies the GPU and causes a method to run out of memory.
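A minimal sketch of this pattern, assuming hypothetical names (run_with_aux, normalize, darks, flats are illustrative, not part of the codebase):

```python
import cupy as cp

def run_with_aux(method, data, **aux_host):
    """Copy auxiliary host arrays (e.g. darks/flats) to the device only
    for the duration of one method, then drop the device copies."""
    aux_dev = {name: cp.asarray(arr) for name, arr in aux_host.items()}
    try:
        return method(data, **aux_dev)
    finally:
        aux_dev.clear()  # the host copies remain; only the device copies are dropped
        cp.get_default_memory_pool().free_all_blocks()  # hand the chunks back

# e.g.: data = run_with_aux(normalize, data, darks=darks, flats=flats)
```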

C: Isolate the method's execution by using a separate memory pool

Even if the auxiliary data is loaded only on demand, the output of one method can still hold a backing chunk in the pool that is larger than itself, thereby fragmenting the memory. A preventive solution would be to allocate the inputs/outputs of each method up front and run each method in the pipeline with a separate memory pool, which is released after the method returns. This would require more changes to the program structure, but would also make certain optimizations possible, such as pre-allocating memory based on the memory estimator.
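A minimal sketch of the isolation idea, using CuPy's public allocator hooks (run_isolated is a hypothetical helper; copying the output out of the private pool stands in for the up-front input/output allocation described above):

```python
import cupy as cp

def run_isolated(method, *args):
    """Execute one pipeline method inside a private memory pool, so its
    temporaries cannot fragment the global pool."""
    pool = cp.cuda.MemoryPool()
    with cp.cuda.using_allocator(pool.malloc):
        result = method(*args)  # all allocations land in the private pool
    out = result.copy()         # re-allocate the output via the default allocator
    del result                  # drop the last reference into the private pool
    pool.free_all_blocks()      # release every block of the private pool at once
    return out
```

Pre-allocating the output with the default allocator before entering the private pool would avoid the extra device-to-device copy, which is where the memory estimator could come in.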
