Support efficient conversion of Tensor and Complex Data in Python #2

@sjperkins

Currently, converting Tensor and Complex data to NumPy in Python is inefficient:

import casa_arrow as ca
casa_table = ca.table("~/data/WSRT_polar.MS_p0/")
arrow_table = casa_table.to_arrow()

print(arrow_table.column("DATA").to_numpy())

produces the following output:

array([array([array([array([0., 1.], dtype=float32), array([0., 1.], dtype=float32),
                     array([0., 1.], dtype=float32), array([0., 1.], dtype=float32)],
                    dtype=object)                                                    ,
              array([array([0., 1.], dtype=float32), array([0., 1.], dtype=float32),
                     array([0., 1.], dtype=float32), array([0., 1.], dtype=float32)],
                    dtype=object)                                                    ,
           ...
           array([array([0., 1.], dtype=float32), array([0., 1.], dtype=float32),
                     array([0., 1.], dtype=float32), array([0., 1.], dtype=float32)],
                    dtype=object)                                                    ],
             dtype=object)                                                             ],
      dtype=object)

This is because the extension types are defined in C++, and the to_numpy() method on the default Python Extension Type wrapper isn't overridden. See daskms.experimental.arrow.extension_types.to_numpy for a possible implementation.

Two possible solutions exist:

Provide wrappers with richer features within Apache Arrow

The Arrow maintainers are aware of this issue:

And the following exploratory PRs suggest initial solutions:

Provide wrappers at the casa-arrow level

Provide a table wrapper that creates NumPy arrays directly from the Arrow column buffers, e.g.:

>>> import numpy as np
>>> AT.column("DATA").chunks[0].buffers()
 [None, None, None, None, <pyarrow.Buffer address=0x7fca84011000 size=2048 is_cpu=True is_mutable=True>]
>>> bufs = AT.column("DATA").chunks[0].buffers()
>>> # The last buffer holds the flat values; view them as complex64
>>> data = np.frombuffer(bufs[-1], np.complex64)
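
The array obtained this way is flat, so a wrapper would still need to restore the per-row shape. A minimal sketch of that step follows, using a bytes object as a stand-in for the pyarrow.Buffer (both expose the Python buffer protocol). The helper name buffer_to_complex and the (16, 4) channel/correlation shape are hypothetical, and the sketch assumes a fixed shape per row and no nulls.

```python
import numpy as np


def buffer_to_complex(buf, shape, offset=0):
    """View raw bytes as complex64 and reshape to (nrow, *shape) without copying.

    `offset` is in bytes, matching np.frombuffer.
    """
    flat = np.frombuffer(buf, dtype=np.complex64, offset=offset)
    return flat.reshape((-1,) + tuple(shape))


# Stand-in for the 2048-byte buffer above: 256 complex64 values
raw = (np.arange(256) * 1j).astype(np.complex64).tobytes()

# Hypothetical per-row shape of 16 channels x 4 correlations
data = buffer_to_complex(raw, shape=(16, 4))
print(data.shape, data.dtype)
```

A real wrapper would also have to account for the chunk's offset when the array has been sliced, since the buffer then covers more than the visible rows.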

Labels

enhancement (New feature or request), medium (medium priority), python (Python related issues)
