TypedBytesTableOutputFormat broken #3

@luk

Commit 9c287d7 broke TypedBytesTableOutputFormat completely.
Before that commit, it was not possible to write non-UTF-8 bytes to HBase.

The issue was discussed on the dumbo-user mailing list.

I pointed out that this could be resolved by changing the mapping of Hadoop streaming types to Python types in the typedbytes Python module to

  • read typedbytes bytes to regular python strings
  • read typedbytes strings to python unicode strings
  • write regular python strings to typedbytes bytes
  • write unicode python strings to typedbytes strings
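
To make the proposed mapping concrete, here is a minimal sketch, not the actual typedbytes module code. It assumes the typed bytes wire layout of a one-byte type code (0 for bytes, 7 for string) followed by a four-byte big-endian length and the payload. It is written for Python 3, where `bytes` stands in for the "regular python string" and `str` for the "unicode string" of the list above. The function names `write_value` and `read_value` are hypothetical.

```python
import struct

# Typed bytes type codes (assumed: 0 = bytes, 7 = UTF-8 string).
TYPE_BYTES = 0
TYPE_STRING = 7


def write_value(value):
    """Byte strings are written as typed bytes 'bytes', text as 'string'."""
    if isinstance(value, bytes):
        return struct.pack(">Bi", TYPE_BYTES, len(value)) + value
    if isinstance(value, str):
        encoded = value.encode("utf-8")
        return struct.pack(">Bi", TYPE_STRING, len(encoded)) + encoded
    raise TypeError("unsupported type: %r" % type(value))


def read_value(data):
    """Typed bytes 'bytes' come back as byte strings, 'string' as text."""
    code, length = struct.unpack(">Bi", data[:5])
    payload = data[5:5 + length]
    if code == TYPE_BYTES:
        return payload
    if code == TYPE_STRING:
        return payload.decode("utf-8")
    raise ValueError("unsupported type code: %d" % code)
```

With this symmetric mapping, arbitrary binary data such as `b"\xff\xfe"` round-trips unchanged, because it is never decoded as UTF-8.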

Klaas pointed out that this would degrade performance for dumbo client code that deals with text input: Hadoop streaming emits text input as typedbytes strings, so a lot of UTF-8 to Python unicode conversion overhead would be incurred.

He further pointed out that this issue could be resolved by changing the mapping in typedbytes to

  • read typedbytes bytes to regular python strings
  • read typedbytes strings to regular python strings
  • write regular python strings to typedbytes bytes
  • write unicode python strings to typedbytes strings

which would be less intuitive.
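
The asymmetry of this second mapping can be sketched as follows. This is an illustration under assumptions, not the typedbytes module itself: it assumes a wire layout of a one-byte type code (0 for bytes, 7 for string), a four-byte big-endian length, and the payload, and it uses Python 3, where `bytes` plays the role of the "regular python string" and `str` the role of the "unicode string". The names `write_value` and `read_value_raw` are hypothetical.

```python
import struct

# Typed bytes type codes (assumed: 0 = bytes, 7 = UTF-8 string).
TYPE_BYTES = 0
TYPE_STRING = 7


def write_value(value):
    """Byte strings are written as typed bytes 'bytes', text as 'string'."""
    if isinstance(value, bytes):
        return struct.pack(">Bi", TYPE_BYTES, len(value)) + value
    if isinstance(value, str):
        encoded = value.encode("utf-8")
        return struct.pack(">Bi", TYPE_STRING, len(encoded)) + encoded
    raise TypeError("unsupported type: %r" % type(value))


def read_value_raw(data):
    """Both 'bytes' and 'string' come back as undecoded byte strings."""
    code, length = struct.unpack(">Bi", data[:5])
    if code not in (TYPE_BYTES, TYPE_STRING):
        raise ValueError("unsupported type code: %d" % code)
    # No UTF-8 decoding happens here, which avoids the conversion overhead.
    return data[5:5 + length]
```

Reading skips the decoding cost, but a text value that is written does not come back as text: `read_value_raw(write_value("h\u00e9"))` returns the UTF-8 encoded byte string rather than the original text value, which is the unintuitive part.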
