Commit 9c287d7 broke TypedBytesTableOutputFormat completely.
Before that, it was not possible to write non-UTF-8 bytes to HBase.
The issue was discussed on the dumbo-user mailing list.
I pointed out that this could be resolved by changing the mapping of Hadoop Streaming types to Python types in the typedbytes Python module to:
- read typedbytes bytes to regular python strings
- read typedbytes strings to python unicode strings
- write regular python strings to typedbytes bytes
- write unicode python strings to typedbytes strings
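
The symmetric mapping in the list above can be sketched as follows. This is only an illustration, not the typedbytes module's actual code: the function names `write_value` and `read_value` are hypothetical, and the sketch is written for Python 3, where `bytes` stands in for the regular Python 2 string and `str` stands in for the Python 2 unicode string. It assumes the Hadoop typed bytes layout of a one-byte type code (0 for bytes, 7 for string) followed by a 4-byte big-endian length and the payload.

```python
import struct

TYPE_BYTES = 0   # typed bytes type code for raw bytes
TYPE_STRING = 7  # typed bytes type code for UTF-8 strings


def write_value(value):
    """Serialize one value as: type code, 4-byte big-endian length, payload.

    Hypothetical helper, not part of the typedbytes module.
    """
    if isinstance(value, bytes):
        # Regular (byte) string: written verbatim as typedbytes bytes.
        code, payload = TYPE_BYTES, value
    elif isinstance(value, str):
        # Unicode string: written UTF-8 encoded as a typedbytes string.
        code, payload = TYPE_STRING, value.encode("utf-8")
    else:
        raise TypeError("only bytes and str are handled in this sketch")
    return struct.pack(">bi", code, len(payload)) + payload


def read_value(data):
    """Deserialize one value produced by write_value (hypothetical helper)."""
    code, length = struct.unpack(">bi", data[:5])
    payload = data[5:5 + length]
    if code == TYPE_BYTES:
        # typedbytes bytes: returned as a byte string, no decoding.
        return payload
    if code == TYPE_STRING:
        # typedbytes string: decoded to a unicode string.
        return payload.decode("utf-8")
    raise ValueError("unsupported type code: %d" % code)
```

With this mapping, `read_value(write_value(x))` returns a value of the same type as `x`, and arbitrary non-UTF-8 byte strings survive the round trip because they are never decoded.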
Klaas pointed out that this would degrade performance for dumbo client code that deals with text input: Hadoop Streaming emits text input as typedbytes strings, so a lot of UTF-8 to Python unicode conversion overhead would be paid.
He further pointed out that this issue could be resolved by changing the mapping in typedbytes to:
- read typedbytes bytes to regular python strings
- read typedbytes strings to regular python strings
- write regular python strings to typedbytes bytes
- write unicode python strings to typedbytes strings
This would not be so intuitive, since a unicode string that is written out would be read back as a regular string.
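
To make the asymmetry concrete, here is a minimal sketch of a reader that follows the second mapping. It is an assumption-laden illustration rather than the typedbytes module's implementation: `read_value_raw` is a hypothetical name, Python 3 `bytes` stands in for the regular Python 2 string, and the record layout (one-byte type code, 4-byte big-endian length, payload) is assumed from the Hadoop typed bytes format.

```python
import struct

TYPE_BYTES = 0   # typed bytes type code for raw bytes
TYPE_STRING = 7  # typed bytes type code for UTF-8 strings


def read_value_raw(data):
    """Read one bytes or string record without any UTF-8 decoding.

    Hypothetical helper: both type codes yield the raw payload.
    """
    code, length = struct.unpack(">bi", data[:5])
    if code not in (TYPE_BYTES, TYPE_STRING):
        raise ValueError("unsupported type code: %d" % code)
    return data[5:5 + length]


# A typedbytes string record, as a writer would emit it for a unicode string.
payload = "caf\u00e9".encode("utf-8")
record = struct.pack(">bi", TYPE_STRING, len(payload)) + payload

# The reader hands back the UTF-8 encoded bytes, not the unicode string.
raw = read_value_raw(record)
```

The reader skips the decoding step, which is where the conversion overhead for text input would otherwise be paid, but client code that needs unicode text has to call `raw.decode("utf-8")` itself.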