UTF-16 and UTF-32 algorithms #11

@ghostmansd

First of all, I'd like to thank you for your library, which is a really amazing tool for anyone who needs to determine the encoding of a text.
I'd like to report a bug. UTF-16 and UTF-32 (as well as UCS-2 and UCS-4) are not recognized, even when the corresponding encoding is the only one used in the input. Here are some thoughts on how they could be detected.

UCS-2

If a NUL byte (\x00) occurs during detection and it is not at the end of the buffer, the text may be UCS-2. If exactly two NUL bytes occur at the end of the buffer (but no more than two!), it is almost certainly UCS-2.
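
A minimal sketch of this NUL check, assuming the "NUL characters" above are read as raw \x00 bytes. The function name `ucs2_hint` and its three labels are hypothetical and not part of the library:

```python
def ucs2_hint(buf: bytes) -> str:
    """Classify a buffer with the NUL-based heuristic described above.

    Returns "likely", "possible", or "unlikely". The name and the labels
    are hypothetical, chosen only for this illustration.
    """
    # Measure the run of NUL bytes at the very end of the buffer.
    stripped = buf.rstrip(b"\x00")
    trailing = len(buf) - len(stripped)

    # Exactly two trailing NULs look like a two-byte string terminator.
    if trailing == 2:
        return "likely"

    # A NUL anywhere before the trailing run hints at two-byte code units.
    if b"\x00" in stripped:
        return "possible"

    return "unlikely"


print(ucs2_hint(b"\x00A\x00B\x00\x00"))  # big-endian "AB" plus a terminator
print(ucs2_hint(b"A\x00B\x00"))          # little-endian "AB", no terminator
```

Taken literally, the rule is weaker for little-endian input: ASCII text followed by a terminator ends in three NUL bytes, so this sketch only reports "possible" for it.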

UTF-16

First of all, if we see \xFF\xFE or \xFE\xFF at the beginning of the buffer, increase the probability that the text is encoded as UTF-16LE or UTF-16BE respectively.
Otherwise the checks are the same as for UCS-2, except that UTF-16 uses surrogate pairs. If we see sequences that look like surrogate pairs, then it is UTF-16. I'd suggest reporting UTF-16 wherever we would report UCS-2, as long as the surrogate pairs are properly encoded.
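
Both checks can be sketched roughly as follows. The helper names `utf16_bom` and `count_surrogate_pairs` are hypothetical; the surrogate ranges (high 0xD800-0xDBFF, low 0xDC00-0xDFFF) are the ones defined by UTF-16 itself:

```python
from typing import Optional


def utf16_bom(buf: bytes) -> Optional[str]:
    """Return the byte order suggested by a leading UTF-16 BOM, if any."""
    if buf.startswith(b"\xff\xfe"):
        return "UTF-16LE"
    if buf.startswith(b"\xfe\xff"):
        return "UTF-16BE"
    return None


def count_surrogate_pairs(buf: bytes, little_endian: bool) -> Optional[int]:
    """Count well-formed surrogate pairs; return None on a lone surrogate."""
    order = "little" if little_endian else "big"
    usable = len(buf) - (len(buf) % 2)  # ignore a dangling odd byte
    units = [int.from_bytes(buf[i:i + 2], order) for i in range(0, usable, 2)]

    pairs = 0
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                pairs += 1
                i += 2
                continue
            return None
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate is malformed.
            return None
        i += 1
    return pairs
```

A result of `None` means the buffer is not valid UTF-16 in that byte order, `0` leaves UCS-2 and UTF-16 indistinguishable, and a positive count points to UTF-16 rather than UCS-2.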

UTF-32 and UCS-4

First of all, if we see \xFF\xFE\x00\x00 or \x00\x00\xFE\xFF at the beginning of the buffer, increase the probability that the text is encoded as UTF-32LE or UTF-32BE respectively.
Then check for NUL bytes. If three or four NUL bytes occur in a row during text processing, then it is almost certainly UTF-32.
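
A rough sketch of these two checks. The four-byte sequences above are the UTF-32 byte order marks, so the sketch reports UTF-32LE and UTF-32BE for them. It assumes that "three or four NUL characters" means a run of consecutive \x00 bytes, and the names `utf32_bom` and `has_long_nul_run` are hypothetical:

```python
import re
from typing import Optional

# Three or more consecutive NUL bytes (assumed reading of the rule above).
_NUL_RUN = re.compile(rb"\x00{3,}")


def utf32_bom(buf: bytes) -> Optional[str]:
    """Return the byte order suggested by a leading UTF-32 BOM, if any."""
    if buf.startswith(b"\xff\xfe\x00\x00"):
        return "UTF-32LE"
    if buf.startswith(b"\x00\x00\xfe\xff"):
        return "UTF-32BE"
    return None


def has_long_nul_run(buf: bytes) -> bool:
    """True if the buffer contains at least three NUL bytes in a row."""
    return _NUL_RUN.search(buf) is not None


sample = "AB".encode("utf-32-le")  # 41 00 00 00 42 00 00 00
print(utf32_bom(sample), has_long_nul_run(sample))
```

Because the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM (\xFF\xFE), a detector would need to test for the four-byte mark before the two-byte one.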

Thank you very much again!
