Add html5 character encoding mappings #8

@mlissner

I had a frustrating issue recently when trying to use chardet to work with a web page: http://stackoverflow.com/questions/11588458/how-to-handle-encodings-using-python-requests-library

My solution was to write a bit of custom code that says, "Whenever chardet reports ISO-8859-1, instead use cp1252."
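A minimal sketch of that workaround, for illustration. The helper name `prefer_cp1252` is hypothetical and not part of chardet; it only post-processes the encoding name that a detector reports.

```python
def prefer_cp1252(detected):
    """Return "cp1252" when the detector reports ISO-8859-1.

    `detected` is the encoding name reported by a detector (or None
    when detection failed). Any other value is returned unchanged.
    """
    if detected is not None and detected.lower() == "iso-8859-1":
        return "cp1252"
    return detected


# Bytes 0x93/0x94 are curly quotes in cp1252 but control characters
# in ISO-8859-1, which is why the substitution matters for web pages.
raw = b"\x93quoted\x94"
text = raw.decode(prefer_cp1252("ISO-8859-1"))
```

In practice the argument would come from something like `chardet.detect(data)["encoding"]`.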

Basically, browsers don't honor a number of declared character encodings and map them to other ones instead. Browsers did this unofficially for a while, but it's now enshrined in the HTML5 spec:

http://dev.w3.org/html5/spec/single-page.html#character-encodings-0

Since most of the data that chardet is used on will be coming from the web, it makes sense for it to return the character encodings that browsers actually use. This might make sense as an option rather than as default functionality. I'm not sure, but I'd love to see this added.
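To make the "option rather than default" idea concrete, here is a rough sketch of an opt-in mapping. Everything in it is an assumption for discussion: the names `HTML5_OVERRIDES` and `web_encoding` and the `html5_overrides` flag are hypothetical, not existing chardet API, and the table is only a partial reading of the overrides in the spec, expressed as Python codec names.

```python
# Partial, illustrative table: reported name (lowercased) -> Python codec.
# Assumption: these follow the HTML5 encoding overrides; not exhaustive.
HTML5_OVERRIDES = {
    "iso-8859-1": "cp1252",
    "ascii": "cp1252",
    "us-ascii": "cp1252",
    "iso-8859-9": "cp1254",
    "gb2312": "gbk",
}


def web_encoding(detected, html5_overrides=False):
    """Optionally map a detected encoding to the one browsers use.

    With html5_overrides=False (the default) the detected name is
    returned untouched, so existing behaviour would not change.
    """
    if not html5_overrides or detected is None:
        return detected
    return HTML5_OVERRIDES.get(detected.lower(), detected)
```

Keeping the flag off by default would leave current callers unaffected, while `web_encoding("ISO-8859-1", html5_overrides=True)` would give the browser-style answer.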

If this feature would be accepted, I'd be happy to put together a pull request, but I'd need guidance on the preferred design.
