Description
First of all, I'd like to thank you for your library, which is a really great tool for anyone who needs to detect the encoding of a text.
I'd like to report a bug. The UTF-16 and UTF-32 encodings (as well as UCS-2 and UCS-4) are not recognized, even when the corresponding character set is the only one present in the input. Here are some thoughts on how they could be detected.
UCS-2
If a NUL byte occurs during detection and it is not at the end of the buffer, the text may be UCS-2. If the buffer ends with exactly two NUL bytes (and no more than two!), it is almost certainly UCS-2.
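A minimal Python sketch of this NUL heuristic, read literally. The function name `ucs2_nul_hint` and the labels "likely", "possible", and "none" are hypothetical and are not part of the library's API.

```python
def ucs2_nul_hint(buf: bytes) -> str:
    """Classify a buffer using the NUL-byte heuristic (hypothetical helper).

    Returns "likely" if the buffer ends with exactly two NUL bytes,
    "possible" if a NUL byte appears before the trailing NULs,
    and "none" otherwise.
    """
    # Number of NUL bytes at the very end of the buffer.
    trailing = len(buf) - len(buf.rstrip(b"\x00"))
    # Everything before the trailing run of NULs.
    body = buf[: len(buf) - trailing]

    if trailing == 2:
        return "likely"    # looks like a 16-bit terminator
    if b"\x00" in body:
        return "possible"  # NUL somewhere inside the text
    return "none"


print(ucs2_nul_hint(b"\x00A\x00B\x00\x00"))  # big-endian "AB" plus terminator
print(ucs2_nul_hint(b"hello"))
```

One caveat with the literal reading: little-endian ASCII text followed by a 16-bit terminator ends with three NUL bytes (the high byte of the last character plus the terminator), so this sketch would only report "possible" for it.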
UTF-16
First of all, if we see \xFF\xFE or \xFE\xFF at the beginning of the buffer, increase the probability that the text is encoded as UTF-16LE or UTF-16BE respectively.
Otherwise the checks are the same as for UCS-2, except that UTF-16 uses surrogate pairs. If we see code units that look like surrogate pairs, then it is UTF-16. I'd suggest reporting UTF-16 everywhere we would otherwise report UCS-2, provided the surrogate pairs are properly encoded.
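The BOM check and the surrogate-pair check could be sketched roughly as follows. The helper names `utf16_bom` and `has_valid_surrogates` are assumptions for illustration, not functions of the library, and the sketch assumes the byte order is already known or guessed.

```python
import struct


def utf16_bom(buf: bytes):
    """Return the UTF-16 variant suggested by a leading BOM, or None."""
    if buf.startswith(b"\xff\xfe"):
        return "utf-16-le"
    if buf.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None


def has_valid_surrogates(buf: bytes, little_endian: bool) -> bool:
    """True if buf holds at least one surrogate pair and no unpaired surrogate."""
    n = len(buf) // 2  # ignore a dangling odd byte, if any
    fmt = ("<" if little_endian else ">") + f"{n}H"
    units = struct.unpack(fmt, buf[: n * 2])

    seen_pair = False
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                return False
            seen_pair = True
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:
            return False  # low surrogate without a preceding high surrogate
        else:
            i += 1
    return seen_pair


sample = "a\U0001F600".encode("utf-16-le")
print(utf16_bom(b"\xff\xfe" + sample), has_valid_surrogates(sample, True))
```

A buffer that passes `has_valid_surrogates` cannot be plain UCS-2, since UCS-2 has no surrogate pairs, which is why it is a reasonable signal for preferring UTF-16.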
UTF-32 and UCS-4
First of all, if we see \xFF\xFE\x00\x00 or \x00\x00\xFE\xFF at the beginning of the buffer, increase the probability that the text is encoded as UTF-32LE or UTF-32BE respectively.
Then check for NUL bytes. If three or four consecutive NUL bytes occur during text processing, then it is almost certainly UTF-32.
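A rough Python sketch of both UTF-32 checks. It assumes that "three or four NUL characters" means a run of consecutive NUL bytes. The names `utf32_bom`, `longest_nul_run`, `looks_like_utf32`, and the `min_run` parameter are hypothetical.

```python
def utf32_bom(buf: bytes):
    """Return the UTF-32 variant suggested by a leading BOM, or None."""
    if buf.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"
    if buf.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"
    return None


def longest_nul_run(buf: bytes) -> int:
    """Length of the longest run of consecutive NUL bytes in buf."""
    longest = current = 0
    for byte in buf:
        if byte == 0:
            current += 1
            longest = max(longest, current)
        else:
            current = 0
    return longest


def looks_like_utf32(buf: bytes, min_run: int = 3) -> bool:
    """Hint for UTF-32: a UTF-32 BOM, or a run of at least min_run NUL bytes."""
    return utf32_bom(buf) is not None or longest_nul_run(buf) >= min_run


print(looks_like_utf32("AB".encode("utf-32-be")))
print(looks_like_utf32("AB".encode("utf-16-le")))
```

Because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM (\xFF\xFE), a detector would need to test the four-byte BOMs before the two-byte ones.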
Thank you very much again!