
Converting LargeArray to arrow::Buffer can cause memory leaks #189

@arthurp

The obvious way to convert a katana::LargeArray into an arrow::Buffer leaks memory, because a Buffer often doesn't own its underlying data. It assumes that some other object owns the data and that the Buffer's lifetime is contained in the data's lifetime. Specifically, the following code leaks two blocks of memory:

    auto numeric_array_out_indices =
        std::make_shared<arrow::NumericArray<arrow::UInt64Type>>(
            static_cast<int64_t>(num_nodes),
            arrow::MutableBuffer::Wrap(out_indices.release()->data(), num_nodes));

This code, and other code like it that probably exists in the codebase, leaks:

  • The LargeArray instance out_indices, because release is called without passing ownership of the instance to some other object.
  • The buffer out_indices->data(), because arrow::MutableBuffer does not own its data by default, so it will never deallocate the buffer passed to arrow::MutableBuffer::Wrap.

Further, Arrow would not be able to correctly deallocate the LargeArray data pointer anyway, since Arrow doesn't make assumptions about the allocator used for any given buffer. So there is no simple take_ownership flag.
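The leak pattern can be reproduced without Arrow or Katana. The following is only a sketch built on stand-in types, all of which are assumptions made for illustration: TrackedArray plays the role of katana::LargeArray, ViewBuffer plays the role of a non-owning arrow::MutableBuffer, and live_blocks is a hypothetical counter of data blocks that have been allocated but not yet freed.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical counter of data blocks that are allocated but not yet freed.
static int live_blocks = 0;

// Stand-in for katana::LargeArray: owns a block obtained from its own allocator.
class TrackedArray {
public:
  explicit TrackedArray(std::size_t n)
      : data_(static_cast<uint64_t*>(std::malloc(n * sizeof(uint64_t)))),
        size_(n) {
    ++live_blocks;
  }
  ~TrackedArray() {
    std::free(data_);
    --live_blocks;
  }
  TrackedArray(const TrackedArray&) = delete;
  TrackedArray& operator=(const TrackedArray&) = delete;

  uint64_t* data() { return data_; }
  std::size_t size() const { return size_; }

private:
  uint64_t* data_;
  std::size_t size_;
};

// Stand-in for a non-owning buffer: it stores the pointer and never frees it.
struct ViewBuffer {
  uint64_t* data;
  std::size_t size;
};

// Mirrors the pattern in the issue: release() gives up ownership of the array,
// and the view that receives the data pointer does not take ownership either.
int LiveBlocksAfterLeakyConversion() {
  auto array = std::make_unique<TrackedArray>(16);
  TrackedArray* released = array.release();
  {
    ViewBuffer view{released->data(), released->size()};
    (void)view;
  }  // the view is gone, but nothing freed the block
  int remaining = live_blocks;
  delete released;  // manual cleanup so this sketch does not itself leak
  return remaining;
}

// Contrast: the view's lifetime is contained in the owner's lifetime.
int LiveBlocksAfterScopedConversion() {
  {
    auto array = std::make_unique<TrackedArray>(16);
    ViewBuffer view{array->data(), array->size()};
    (void)view;
  }  // the owner is destroyed here and frees the block
  return live_blocks;
}
```

In this model the first function still sees one outstanding block after the view is destroyed, while the second sees none. That is the lifetime containment a non-owning buffer relies on.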

The solution will be to do the following:

  • Implement a KatanaMemoryPool subclass of arrow::MemoryPool, which would use Katana's NUMA-aware allocator and deallocator. Our memory pool can be passed to arrow::ArrayBuilders, arrow::AllocateBuffer, and even parquet to control how its memory is allocated and freed.
  • Make katana::LargeArray into a subclass of arrow::MutableBuffer, so that it can be passed directly into Arrow without conversion. The semantics of LargeArray are simple enough they should fit into the Buffer interface just fine.
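The ownership model behind these two steps can be sketched with stand-in types. Nothing below is Arrow's or Katana's actual API: Pool, CountingPool, and PoolBuffer are hypothetical names, the pool interface is only loosely modeled on an allocate/free pair, and malloc stands in for a NUMA-aware allocator. The point is that the buffer remembers which pool produced its data and frees through that same pool, exactly once, when the last shared owner goes away.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Hypothetical pool interface (an assumption, not arrow::MemoryPool itself).
class Pool {
public:
  virtual ~Pool() = default;
  virtual uint8_t* Allocate(std::size_t size) = 0;
  virtual void Free(uint8_t* ptr, std::size_t size) = 0;
};

// Stand-in for a NUMA-aware pool: uses malloc/free and tracks outstanding bytes.
class CountingPool : public Pool {
public:
  uint8_t* Allocate(std::size_t size) override {
    bytes_ += size;
    return static_cast<uint8_t*>(std::malloc(size));
  }
  void Free(uint8_t* ptr, std::size_t size) override {
    std::free(ptr);
    bytes_ -= size;
  }
  std::size_t bytes_allocated() const { return bytes_; }

private:
  std::size_t bytes_ = 0;
};

// Owning buffer: remembers its pool so the matching deallocator is always used.
class PoolBuffer {
public:
  PoolBuffer(Pool* pool, std::size_t size)
      : pool_(pool), data_(pool->Allocate(size)), size_(size) {}
  ~PoolBuffer() { pool_->Free(data_, size_); }
  PoolBuffer(const PoolBuffer&) = delete;
  PoolBuffer& operator=(const PoolBuffer&) = delete;

  uint8_t* data() { return data_; }
  std::size_t size() const { return size_; }

private:
  Pool* pool_;
  uint8_t* data_;
  std::size_t size_;
};

// Two shared owners, as with an array and a zero-copy slice of it: dropping
// one owner must not free the data while the other still refers to it.
std::size_t BytesOutstandingWhileShared(CountingPool* pool) {
  auto buffer = std::make_shared<PoolBuffer>(pool, 64);
  std::shared_ptr<PoolBuffer> slice_owner = buffer;
  buffer.reset();
  return pool->bytes_allocated();  // data is still held by slice_owner
}
```

With a CountingPool named pool, BytesOutstandingWhileShared(&pool) reports 64 bytes while the second owner is alive, and pool.bytes_allocated() drops back to 0 once the function returns. A real implementation would need to follow the actual arrow::MemoryPool and arrow::Buffer interfaces of the Arrow version in use.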

In fixing this, be careful about the ownership assumptions in Arrow. A Buffer often doesn't own its data. ArrayData and Array share their data with other instances (to implement zero-copy slicing). These objects generally use shared_ptr to manage the underlying data, but check who owns each buffer rather than assuming it.

Labels: bug
