This repository was archived by the owner on Apr 26, 2020. It is now read-only.

UTF-8 issue when trying to create a DOM document #18

@vaso123

I fetched a page with cURL. Its charset is windows-1250, and its doctype is:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

I convert the encoding of my string, check it, and replace the meta charset in the string:

$html = str_replace('windows-1250', 'UTF-8', mb_convert_encoding($result, 'UTF-8'));
var_dump(mb_detect_encoding($html, "UTF-8, ASCII, ISO-8859-1, windows-1250"));
$Doc = \phpQuery::newDocumentHTML($html, 'UTF-8');
echo pq($Doc)->html();

All the non-ASCII characters come out garbled. var_dump says it's UTF-8, and the content type is "text/plain; charset=UTF-8".
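One thing that may be worth ruling out: `mb_convert_encoding($result, 'UTF-8')` is called without a source encoding, so mbstring assumes its internal encoding for the input rather than windows-1250. Below is a minimal sketch of a conversion that names the source charset explicitly. It assumes the iconv extension is available and that the local iconv build knows `WINDOWS-1250`. `toUtf8Html` is a hypothetical helper, not part of phpQuery, and the sample markup is made up for illustration.

```php
<?php
// Hypothetical helper (not part of phpQuery): convert fetched markup to UTF-8
// from an explicitly named source charset, then update the declared charset.
function toUtf8Html(string $raw, string $sourceCharset = 'WINDOWS-1250'): string
{
    // iconv() returns false if the input cannot be converted.
    $utf8 = iconv($sourceCharset, 'UTF-8', $raw);
    if ($utf8 === false) {
        throw new RuntimeException("Cannot convert from $sourceCharset to UTF-8");
    }
    // Keep the meta tag consistent with the new byte encoding
    // (case-insensitive, so "windows-1250" in the markup is matched too).
    return str_ireplace($sourceCharset, 'UTF-8', $utf8);
}

// Made-up sample: "\xE1" is "á" and "\x9A" is "š" in windows-1250.
$result = "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=windows-1250\">"
        . "<p>\xE1\x9A</p>";

$html = toUtf8Html($result);

// Checks validity as UTF-8 instead of guessing with mb_detect_encoding().
var_dump(mb_check_encoding($html, 'UTF-8'));
```

If the output of this step is already correct, the conversion can be excluded as the cause and the problem is more likely in how the document is loaded.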

When I var_dump($Doc), I see that the DOMDocument encoding and xmlEncoding values are null.

But if I use:

$Dom = new \DOMDocument();
$Dom->loadHTML($html);

and var_dump it, everything is fine and the characters are OK.
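For comparison, the plain DOMDocument path can be reduced to a self-contained check like the sketch below. It assumes the markup is already valid UTF-8 and declares its charset in a meta tag, and that this file itself is saved as UTF-8. The sample string and the `t` id are made up for illustration.

```php
<?php
// Made-up sample markup: already UTF-8, charset declared in a meta tag.
$html = '<html><head>'
      . '<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">'
      . '</head><body><p id="t">árvíztűrő</p></body></html>';

$dom = new DOMDocument();

// Collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$dom->loadHTML($html);
libxml_clear_errors();

// Read the paragraph text back and compare it with the original string.
$text = $dom->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === 'árvíztűrő');

// The encoding libxml recorded for the document, for comparison with
// the null value seen on the phpQuery document.
var_dump($dom->encoding);
```

Running the same comparison on the document returned by phpQuery::newDocumentHTML() should show whether the characters are lost while loading or only when the markup is printed.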

I've checked createDocumentWrapper, and the $contentType is OK.

If I set the static $debug to true, I get this:

`string 'Load markup for content type text/html;charset=utf-8' (length=52)

string 'Loading HTML, content type 'text/html;charset=utf-8'' (length=52)

string 'Full markup load (HTML):

' (length=275)

string 'DOC: UTF-8 REQ: UTF-8' (length=21)

string 'Full markup load (HTML), documentCreate('utf-8')' (length=48)

string 'Selecting document '52280a0c077ec7c5fb2f2350db12f22c' as default one' (length=68)`
