Math bug in latin1prober #3

@ablegrape

(dcramer/erikrose - first, thanks for taking an interest in chardet - good to see someone is rescuing this useful package from oblivion).

Just noticed a serious logic bug in latin1prober. At line 133, you'll see:

 confidence = (self._mFreqCounter[3] / total) - (self._mFreqCounter[1] * 20.0 / total)

The problem is that self._mFreqCounter[3] and total are both integers, so with integer division the first term is 0 whenever self._mFreqCounter[3] != total (and 1 only when the two are equal). Since the second term is never positive, confidence can never rise above 0 in any case where self._mFreqCounter[3] != total. In other words, even one "unlikely" character transition in a document of any length will produce a confidence of 0. This is certainly not the behavior of the original Mozilla code, which does all of this in floating point.
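A minimal sketch of the truncation, for illustration only. It is not the chardet source: `buggy_confidence` is a hypothetical helper, the four-slot counter layout (index 1 = "very unlikely", index 3 = "very likely") and the clamp of negative values to 0 are assumptions, and `//` is used to reproduce what `/` does for two integers under Python 2.

```python
def buggy_confidence(freq_counter):
    """Hypothetical stand-in for the line quoted above (not the real prober)."""
    total = sum(freq_counter)
    if total == 0:
        return 0.0
    # `//` emulates Python 2's integer `/`: the quotient is floored,
    # so it is 0 unless freq_counter[3] == total.
    confidence = (freq_counter[3] // total) - (freq_counter[1] * 20.0 / total)
    # Assumption: negative confidences are clamped to 0.
    return max(confidence, 0.0)


# 999 "very likely" transitions plus a single other one: the first term floors to 0.
print(buggy_confidence([0, 0, 1, 999]))   # 0.0
# Only a counter made up entirely of "very likely" transitions survives.
print(buggy_confidence([0, 0, 0, 1000]))  # 1.0
```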

An example document which shows this problem can be found at: http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML

A simple fix would be to change this line of code to:

 confidence = (self._mFreqCounter[3] - self._mFreqCounter[1] * 20.0) / total

which obviates the multiple divisions as well.
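As a rough check of the suggested form, here is a small standalone sketch. `fixed_confidence` is a hypothetical helper rather than the actual prober method, and the counter layout and the clamp at 0 are assumptions. Because `20.0` is a float, the numerator is promoted to float before the division, so the result is the same under Python 2 and Python 3.

```python
def fixed_confidence(freq_counter):
    """Hypothetical helper applying the single-division formula."""
    total = sum(freq_counter)
    if total == 0:
        return 0.0
    # The 20.0 factor makes the numerator a float, so `/` never truncates.
    confidence = (freq_counter[3] - freq_counter[1] * 20.0) / total
    # Assumption: negative confidences are clamped to 0.
    return max(confidence, 0.0)


# One non-"very likely" transition in 1000 now costs only a little confidence.
print(fixed_confidence([0, 0, 1, 999]))
# One "very unlikely" transition is penalised 20-fold but no longer zeroes the result.
print(fixed_confidence([0, 1, 0, 999]))
```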

As an aside, the "0.5" confidence multiplier (a few lines later in the code) is a wild guess (as the original authors note, this is a fairly hacky Latin-1 detector anyway). I've had better results with 0.8: for example, the document given in this report is incorrectly detected as 'iso-8859-2', which wreaks havoc on the accents, until the multiplier reaches about 0.8. Your mileage may vary.
