Description
(dcramer/erikrose - first, thanks for taking an interest in chardet - good to see someone is rescuing this useful package from oblivion).
Just noticed a serious logic bug in latin1prober. At line 133, you'll see:
confidence = (self._mFreqCounter[3] / total) - (self._mFreqCounter[1] * 20.0 / total)
The problem is that self._mFreqCounter[3] and total are both integers, so under Python 2 the first term is computed with integer division and is 0 whenever self._mFreqCounter[3] != total. The confidence therefore comes out as 0 in all of those cases. In other words, even one "unlikely" character transition in a document of any length will produce a confidence of 0. This is certainly not the behavior of the original Mozilla code, which does all of this in floating point.
An example document which shows this problem can be found at: http://www.lvo.com/GASTRONOMIE/VINS/VITI/VITI1F.HTML
A simple fix would be to change this line of code to:
confidence = (self._mFreqCounter[3] - self._mFreqCounter[1] * 20.0) / total
which obviates the multiple divisions as well.
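To make the difference concrete, here is a minimal sketch rather than the actual chardet code. The function names and sample counts are made up for illustration, and the zero-total guard and the clamping of negative values to 0 are assumptions about the surrounding prober code. The `//` operator stands in for Python 2's integer division of two ints, so the sketch behaves the same under Python 3.

```python
def confidence_buggy(freq_counter):
    """Mimics the reported line; `//` reproduces Python 2 int / int."""
    total = sum(freq_counter)
    if total == 0:  # assumed guard against division by zero
        return 0.0
    confidence = (freq_counter[3] // total) - (freq_counter[1] * 20.0 / total)
    return max(confidence, 0.0)  # assumed clamp of negative values


def confidence_fixed(freq_counter):
    """Proposed fix: one floating-point division over the whole numerator."""
    total = sum(freq_counter)
    if total == 0:  # assumed guard against division by zero
        return 0.0
    confidence = (freq_counter[3] - freq_counter[1] * 20.0) / total
    return max(confidence, 0.0)  # assumed clamp of negative values


# Hypothetical counts: 999 "likely" transitions and a single "unlikely" one.
counts = [0, 1, 0, 999]
print(confidence_buggy(counts))
print(confidence_fixed(counts))
```

With these counts the buggy version collapses to 0.0, while the fixed version stays close to 1.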
As an aside, the "0.5" confidence multiplier (a few lines later in the code) is a wild guess (as the original authors note, this is sort of a hacky Latin-1 detector anyway). I've had better experience with 0.8, though your mileage may vary. For example, the document given in this report is incorrectly detected as 'iso-8859-2', which wreaks havoc on the accents, until the multiplier is raised to about 0.8.