-
Notifications
You must be signed in to change notification settings - Fork 67
Description
The obvious way to convert a katana::LargeArray into an arrow::Buffer leaks memory because Buffer often doesn't own it's underlying data and assumes some other object owns it and that the Buffer lifetime is contained in the data's lifetime. Specifically, the following code leaks two blocks of memory:
katana/libgalois/src/analytics/subgraph_extraction/subgraph_extraction.cpp
Lines 92 to 95 in 5a8c284
| auto numeric_array_out_indices = | |
| std::make_shared<arrow::NumericArray<arrow::UInt64Type>>( | |
| static_cast<int64_t>(num_nodes), | |
| arrow::MutableBuffer::Wrap(out_indices.release()->data(), num_nodes)); |
This code, and other code like it that probably exists in the codebase, leaks:
- The LargeBuffer instance
out_indicesbecause of the call to release without passing ownership of the instance to some other object. - The buffer
out_indices->data()is leaking becausearrow::MutableBufferdoes not own it's data by default, so it will never deallocate the buffer passed toarrow::MutableBuffer::Wrap.
Further, arrow would not be able to correctly deallocate the LargeArray data pointer anyway, since arrow doesn't make assumptions about the allocator used for any given buffer. So there is no simple take_ownership flag.
The solution will be to do the following:
- Implement a
KatanaMemoryPoolsubclass ofarrow::MemoryPool, which would use Katana's NUMA aware allocator and deallocator. Our memory pool can be passed toarrow::ArrayBuilders,arrow::AllocateBuffer, and even parquet to control how it's memory is allocated and freed. - Make
katana::LargeArrayinto a subclass ofarrow::MutableBuffer, so that it can be passed directly into Arrow without conversion. The semantics ofLargeArrayare simple enough they should fit into theBufferinterface just fine.
In fixing this, be careful of the ownership assumptions in Arrow. Buffer often doesn't own it's data. ArrayData and Array share their data with other instances (to implement zero-copy slicing). These objects generally use shared_ptr to manage the underlying data, but be a little careful.