UTF-16 and UTF-32 algorithms #11

First of all, I'd like to thank you for your library, which is a really great tool for anyone who wants to detect the encoding of a text.
I'd like to report a bug. The UTF-16 and UTF-32 encodings (as well as UCS-2 and UCS-4) are not recognized, even when the corresponding character set is the only character set used during detection. Here are some thoughts on how they could be detected.
# UCS-2

If a NUL character occurs during detection and it is not at the end of the buffer, then the text may be UCS-2. If two NUL characters occur at the end of the buffer (but not more than two NUL characters!), then it is almost certainly UCS-2.
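
To make the rule concrete, here is a rough Python sketch of it. The function name `ucs2_nul_hint` and the three labels it returns are made up for illustration and are not part of the library; a real prober would feed this into a confidence score rather than return a string.

```python
def ucs2_nul_hint(buf: bytes) -> str:
    """Hypothetical helper: return 'likely', 'possible' or 'unlikely' for UCS-2."""
    # Length of the run of NUL bytes at the very end of the buffer.
    trailing = len(buf) - len(buf.rstrip(b"\x00"))
    # Everything before that trailing run.
    body = buf[:len(buf) - trailing]
    if trailing == 2:
        return "likely"    # exactly two trailing NULs, as for a UCS-2 terminator
    if b"\x00" in body:
        return "possible"  # a NUL that is not at the end of the buffer
    return "unlikely"


print(ucs2_nul_hint("Hi".encode("utf-16-be") + b"\x00\x00"))
print(ucs2_nul_hint(b"Hi"))
```

One caveat I noticed while writing this: for little-endian ASCII text the high byte of the last character is itself NUL, so a terminated string ends in three NUL bytes, not two. The "exactly two" rule as stated fits big-endian data better.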
# UTF-16

First of all, if we see \xFF\xFE or \xFE\xFF at the beginning of the buffer, then increase the probability that the text is encoded as UTF-16LE or UTF-16BE respectively.
Otherwise the detection is the same as for UCS-2, except that UTF-16 uses surrogate pairs. If we see characters that look like surrogate pairs, then it is UTF-16. I'd suggest reporting UTF-16 everywhere we would report UCS-2, provided that the surrogate pairs are properly encoded.
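
Both checks could look roughly like this in Python. `utf16_bom` and `surrogates_ok` are hypothetical names, not existing library functions, and the sketch assumes the whole buffer is available at once (a high surrogate at the very end is treated as malformed).

```python
import struct


def utf16_bom(buf: bytes):
    """Hypothetical helper: name the UTF-16 variant suggested by a BOM, or None."""
    # FF FE is also the start of the UTF-32LE BOM (FF FE 00 00),
    # so this is a hint that raises the probability, not proof.
    if buf.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if buf.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    return None


def surrogates_ok(buf: bytes, little_endian: bool = True):
    """Hypothetical helper: return (well_formed, pair_count) for 16-bit units."""
    n = len(buf) // 2  # ignore a dangling odd byte
    fmt = ("<" if little_endian else ">") + f"{n}H"
    units = struct.unpack(fmt, buf[:n * 2])
    pairs = 0
    i = 0
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= n or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False, pairs
            pairs += 1
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            # A low surrogate with no high surrogate before it.
            return False, pairs
        else:
            i += 1
    return True, pairs


print(utf16_bom(b"\xfe\xff\x00h"))
print(surrogates_ok("a\U0001F600".encode("utf-16-le")))
```

A result of `(True, n)` with `n > 0` would be the case where UTF-16 is reported instead of UCS-2; `(False, ...)` means the surrogates are not properly paired.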
# UTF-32 and UCS-4

First of all, if we see \xFF\xFE\x00\x00 or \x00\x00\xFE\xFF at the beginning of the buffer, then increase the probability that the text is encoded as UTF-32LE or UTF-32BE respectively.
Then check for NUL characters. If three or four NUL characters occur in a row during text processing, then it is almost certainly UTF-32.
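
A minimal Python sketch of these two checks follows. The byte sequences FF FE 00 00 and 00 00 FE FF are the UTF-32LE and UTF-32BE byte order marks. The names `utf32_bom` and `has_utf32_nul_run` are hypothetical, and I am assuming that "three or four NUL characters" means a run of consecutive NUL bytes.

```python
import re


def utf32_bom(buf: bytes):
    """Hypothetical helper: name the UTF-32 variant suggested by a BOM, or None."""
    # Test this before any UTF-16 BOM check: FF FE 00 00 begins with FF FE.
    if buf.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if buf.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    return None


def has_utf32_nul_run(buf: bytes) -> bool:
    """Hypothetical helper: True if buf has a run of exactly 3 or 4 NUL bytes."""
    return any(3 <= len(m.group()) <= 4 for m in re.finditer(rb"\x00+", buf))


print(utf32_bom("\ufeffhi".encode("utf-32-le")))
print(has_utf32_nul_run("Hi".encode("utf-32-be")))
print(has_utf32_nul_run("Hi".encode("utf-16-le")))
```

ASCII text in UTF-32 gives runs of three NUL bytes in either byte order, while the same text in UTF-16 only gives single NUL bytes, which is what makes the run length a useful signal.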

Thank you very much again!

