Question about stuck condition, probably due to constant GPU memory purging #38

@Muxas

Hi, again!

As you already know, I like StarPU! This time I ran into a hang. Setting STARPU_WATCHDOG_TIMEOUT=1000000 (1 second) showed that, on a server with GPUs, no task finishes within that interval. In the end, no task completes for several minutes: the watchdog prints more than 100 messages, one per second, each reporting that no task finished in the last second.

I believe the problem is the constant loading and purging of some memory buffers. A task requires two input buffers, but there is not enough GPU memory. The memory manager purges buffer 1 to make space for buffer 2, then purges buffer 2 to make space for buffer 1, and so on.

The hang starts to appear as I increase the problem size on the same hardware: the number of tasks grows while the size of each task stays the same. The more tasks there are to compute, the more likely the run is to get stuck.

Any advice on how to solve this? Have you seen such a problem before, and if so, how did you solve it?
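To make the suspected purge cycle concrete, here is a toy model in Python. It is only a sketch of my hypothesis, not StarPU's actual memory manager: the function `run_task`, the oldest-first eviction rule, and the idea of measuring capacity in whole buffers are all assumptions made for illustration.

```python
from collections import OrderedDict


def run_task(inputs, capacity, max_fetches=100):
    """Toy model (hypothetical, not StarPU code).

    Fetch missing input buffers one at a time into a memory node that
    holds at most `capacity` buffers, evicting the oldest resident
    buffer when full. Return the number of fetches needed until all
    inputs are resident at once, or None if `max_fetches` is reached.
    """
    if capacity < 1:
        raise ValueError("capacity must be at least 1")
    resident = OrderedDict()  # buffer name -> None, oldest first
    fetches = 0
    while True:
        missing = [b for b in inputs if b not in resident]
        if not missing:
            return fetches  # every input is resident: the task can run
        if fetches >= max_fetches:
            return None  # gave up: the task never became runnable
        if len(resident) >= capacity:
            resident.popitem(last=False)  # purge the oldest buffer
        resident[missing[0]] = None  # load one missing buffer
        fetches += 1


# Room for both buffers: the task becomes runnable after two fetches.
print(run_task(["buffer1", "buffer2"], capacity=2))
# Room for only one buffer: each load purges the other input, forever.
print(run_task(["buffer1", "buffer2"], capacity=1))
```

In this model the second call never succeeds because nothing keeps an already loaded input resident while the other one is being fetched. Whether something similar happens inside StarPU is exactly what I am asking.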

During the hang, nvidia-smi shows no change: memory usage stays the same and GPU utilization is 0 percent, so no computation is happening.
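The nvidia-smi observation can also be checked automatically. The sketch below polls nvidia-smi with its CSV query options and flags a stall when every sample shows 0 percent utilization and unchanged memory use. The helper names (`parse_sample`, `looks_stalled`, `take_sample`) are hypothetical, and the parsing assumes every queried field is reported as a plain integer.

```python
import subprocess

# Query GPU utilization (%) and used memory (MiB), one CSV line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_sample(text):
    """Parse nvidia-smi CSV output into [(utilization, memory_used), ...]."""
    sample = []
    for line in text.strip().splitlines():
        util, mem = (int(field) for field in line.split(","))
        sample.append((util, mem))
    return sample


def looks_stalled(samples):
    """True if all samples show 0 % utilization and identical memory use."""
    if len(samples) < 2:
        return False  # one sample cannot show that nothing changes
    idle = all(util == 0 for sample in samples for util, _ in sample)
    frozen = all(sample == samples[0] for sample in samples)
    return idle and frozen


def take_sample():
    """Run nvidia-smi once and return the parsed sample (needs a GPU host)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_sample(result.stdout)
```

Calling `take_sample()` every second or so and passing the collected list to `looks_stalled` would reproduce the manual check described above.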

Thank you!
