Math bug in latin1prober #3

(dcramer/erikrose - first, thanks for taking an interest in chardet - good to see someone is rescuing this useful package from oblivion).

Just noticed a serious logic bug in latin1prober. At line 133, you'll see:

```
 confidence = (self._mFreqCounter[3] / total) - (self._mFreqCounter[1] * 20.0 / total)
```

The problem is that `self._mFreqCounter[3]` and `total` are both integers, so under Python 2's integer division the first term truncates to 0 whenever `self._mFreqCounter[3] != total` (it is 1 only when the two are equal). Since the second term is only ever subtracted, the confidence can never rise above 0 in any of those cases. In other words, even one character transition that is not counted in `self._mFreqCounter[3]`, in a document of _any_ length, will produce a confidence of 0. This is certainly not the behavior of the original Mozilla code, which does this all in floating point.
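To make the truncation concrete, here is a minimal sketch written for Python 3, where `//` stands in for Python 2's integer `/`. The function name `truncating_confidence` and the four-element list standing in for `self._mFreqCounter` are made up for illustration, and `total` is assumed to be the sum of the counters; this is not the prober's actual code.

```python
def truncating_confidence(freq_counter):
    """Mimic the reported expression as Python 2 would evaluate it.

    freq_counter is a hypothetical stand-in for self._mFreqCounter;
    total is assumed to be the sum of all counters.
    """
    total = sum(freq_counter)
    if total == 0:
        return 0.0
    # `//` reproduces Python 2's int / int, which floors the quotient.
    first_term = freq_counter[3] // total
    # The second term is already a float because of the 20.0 literal.
    second_term = freq_counter[1] * 20.0 / total
    return first_term - second_term


# 999 of 1000 transitions land in counter 3, yet the first term is 0.
print(truncating_confidence([0, 0, 1, 999]))
# Only when every transition lands in counter 3 does the first term become 1.
print(truncating_confidence([0, 0, 0, 1000]))
```

The first call shows a document that is almost entirely counter-3 transitions still ending up with no confidence at all.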

An example document which shows this problem can be found at: http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML

A simple fix would be to change this line of code to:

```
 confidence = (self._mFreqCounter[3] - self._mFreqCounter[1] * 20.0) / total
```

which also collapses the two divisions into one.
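As a quick sanity check on the proposed expression, the sketch below evaluates it on made-up counts. The name `float_confidence` and the list standing in for `self._mFreqCounter` are hypothetical, and `total` is again assumed to be the sum of the counters. Because `20.0` is a float literal, the numerator is a float and the single division never truncates, on Python 2 or Python 3.

```python
def float_confidence(freq_counter):
    """Evaluate the proposed single-division expression.

    freq_counter is a hypothetical stand-in for self._mFreqCounter;
    total is assumed to be the sum of all counters.
    """
    total = sum(freq_counter)
    if total == 0:
        return 0.0
    # The 20.0 literal promotes the numerator to float before dividing.
    return (freq_counter[3] - freq_counter[1] * 20.0) / total


# One transition outside counter 3 barely moves the result.
print(float_confidence([0, 0, 1, 999]))
# One transition in counter 1 costs 20 counts' worth of confidence.
print(float_confidence([0, 1, 0, 999]))
```

The result can still go negative when counter 1 is large relative to counter 3, so any clamping done by the surrounding code is still needed.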

As an aside, the "0.5" confidence multiplier (a few lines later in the code) is a wild guess (as the original authors note, this is sort of a hacky Latin-1 detector anyway). I've had better experience with 0.8: for example, the document given in this report is incorrectly detected as 'iso-8859-2', which wreaks havoc on the accents, until the multiplier reaches about 0.8. Your mileage may vary.

