Description
I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library
My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252."
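A minimal sketch of that workaround, for illustration only. The helper name `prefer_cp1252` is hypothetical, and the sample bytes are made up to show why the substitution matters:

```python
def prefer_cp1252(detected_encoding):
    """Return cp1252 when the detector reported ISO-8859-1; otherwise pass the name through."""
    if detected_encoding is None:
        return None
    if detected_encoding.lower() == "iso-8859-1":
        return "cp1252"
    return detected_encoding


# Bytes 0x93 and 0x94 are curly quotes in cp1252 but C1 control characters in ISO-8859-1.
raw = b"\x93quoted\x94"
text = raw.decode(prefer_cp1252("ISO-8859-1"))
```

In real use the encoding name would come from `chardet.detect(raw)["encoding"]` rather than a hard-coded string.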
Basically, browsers don't honor a number of character encoding labels as written; they map them to other encodings instead. Browsers did this unofficially for a while, but it's now enshrined in the HTML5 spec:
http://dev.w3.org/html5/spec/single-page.html#character-encodings-0
Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that browsers actually use. This might make more sense as an option than as default functionality. I'm not sure, but I'd love to see this added.
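To make the "option rather than default" idea concrete, here is one rough shape it could take. This is only a sketch, not an existing chardet API: `BROWSER_OVERRIDES`, `normalize_result`, and the `browser_compatible` flag are all hypothetical names, and the table lists just a few of the overrides from the spec, written as Python codec names.

```python
# Hypothetical subset of the spec's override table (detected name -> Python codec name).
BROWSER_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "ascii": "cp1252",
    "iso-8859-9": "cp1254",
    "gb2312": "gbk",
}


def normalize_result(result, browser_compatible=False):
    """Return a copy of a detect()-style dict, applying browser overrides only when asked."""
    normalized = dict(result)
    encoding = normalized.get("encoding")
    if browser_compatible and encoding is not None:
        normalized["encoding"] = BROWSER_OVERRIDES.get(encoding.lower(), encoding)
    return normalized


detected = {"encoding": "ISO-8859-1", "confidence": 0.73}  # made-up example result
default_result = normalize_result(detected)
browser_result = normalize_result(detected, browser_compatible=True)
```

Keeping the flag off by default would leave existing callers unaffected, which is the main argument for an option over a change in default behavior.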
If this is a feature you'd accept, I'd be happy to put it together in a pull request, but I need guidance on what design you'd want.