[BUG] Segmentation fault using cudf after HugeCTR model run #356

@oliverholworthy

Describe the bug

After instantiating a HugeCTR model, attempting to use cudf afterwards results in a segmentation fault.

We encountered this with two separate pytest test functions. The first creates a HugeCTR model (and runs OK). The second does something else that uses cudf, and passes in isolation. However, if the test involving HugeCTR runs first, the second test fails with a segmentation fault.

Presumably some global state needs to be cleaned up, and this does not happen automatically when the model reference goes out of scope.
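To narrow down whether the Python-side model object is actually released, a minimal sketch like the following may help. It uses only the standard library. `FakeModel` is a hypothetical stand-in for `hugectr.Model` (whose constructor arguments are elided in this report), and `is_released` is an illustrative helper name, not part of HugeCTR or cudf. It also assumes the object supports weak references, which has not been verified for HugeCTR models.

```python
import gc
import weakref


class FakeModel:
    """Hypothetical stand-in for hugectr.Model; swap in the real model."""


def is_released(factory):
    """Create an object, drop the only reference, and report whether it was collected."""
    obj = factory()
    ref = weakref.ref(obj)  # a weak reference does not keep the object alive
    del obj                 # drop the only strong reference held here
    gc.collect()            # also collect any reference cycles
    return ref() is None


print(is_released(FakeModel))
```

If this reports that the object is collected and the crash still occurs, that would suggest the leftover state lives below the Python layer rather than in a lingering Python reference.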

Steps To Reproduce

  1. Instantiate a HugeCTR model and fit it on a dataset
  2. Delete the model object (pytest cleanup)
  3. (In the same Python session) call cudf.DataFrame.from_pandas(x)
  4. A segmentation fault is encountered in a from_arrow function in cudf

import hugectr
import cudf

def test_hugectr_model():
    model = hugectr.Model(...)
    ...
    model.compile(...)
    model.fit(...)


def test_something_else():
    ...
    cudf.DataFrame.from_pandas(x)

Expected behavior

No segmentation fault calling cudf in a context where the HugeCTR model should be cleaned up.
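One way to check this expectation outside of pytest is to run the reproduction in a fresh interpreter and inspect how it exits. The sketch below is an assumption-laden helper, not part of HugeCTR, cudf, or the Merlin test suite: `run_isolated` and `died_from_segfault` are hypothetical names, the script string is a placeholder for the steps above, and the return-code check relies on POSIX behaviour (a child killed by a signal reports the negated signal number). The `-X faulthandler` option asks CPython to dump a Python traceback to stderr on a fatal signal.

```python
import signal
import subprocess
import sys


def run_isolated(source):
    """Run Python source in a fresh interpreter and return the completed process."""
    return subprocess.run(
        [sys.executable, "-X", "faulthandler", "-c", source],
        capture_output=True,
        text=True,
    )


def died_from_segfault(result):
    """On POSIX, a child killed by SIGSEGV has returncode == -SIGSEGV."""
    return result.returncode == -signal.SIGSEGV


# Placeholder script: substitute the HugeCTR steps followed by the cudf call.
script = "print('replace with the reproduction steps')"
result = run_isolated(script)
print(result.returncode, died_from_segfault(result))
```

Running the cudf call alone and then the HugeCTR steps followed by the cudf call, each through `run_isolated`, would show whether the crash depends on HugeCTR having run first in the same process; `result.stderr` holds any traceback that faulthandler emitted.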

Additional context

I don't currently have an environment where I can reproduce this locally (#337), so the reproduction steps may not be the most minimal.

Examples of this segmentation fault can be found in the nvidia-merlin-bot comments on PR NVIDIA-Merlin/systems#129.

Metadata

Labels

P1 (Should have), bug (It's a bug / potential bug and need verification)
