http://qs321.pair.com?node_id=1139792


in reply to Parsing of undecoded UTF-8 will give garbage when decoding entities

$page is the raw HTML document
Does $page consist of bytes or of characters? Bytes would be e.g. "\x{e2}\x{98}\x{ba}", which decodes as U+263A White Smiling Face in UTF-8, or "\x{fe}\x{ff}\x{26}\x{3a}", which is the same U+263A in UTF-16 with a BOM. Characters would be e.g. "\x{263A}", which is a U+263A White Smiling Face character and should be encoded before writing it anywhere. HTML::TokeParser seems to ask for the latter: it wants the HTML to be decoded to characters from bytes, whatever encoding those bytes were in. See also: perlunitut.
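To make the bytes-versus-characters distinction concrete, here is a minimal sketch using the core Encode module. The variable names are mine and purely illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two different byte strings that both represent U+263A.
my $utf8_bytes  = "\x{e2}\x{98}\x{ba}";        # 3 bytes of UTF-8
my $utf16_bytes = "\x{fe}\x{ff}\x{26}\x{3a}";  # big-endian BOM + 2 bytes

# Decoding turns bytes into characters; the BOM tells the
# 'UTF-16' decoder which byte order to use and is then dropped.
my $from_utf8  = decode('UTF-8',  $utf8_bytes);
my $from_utf16 = decode('UTF-16', $utf16_bytes);

printf "UTF-8:  %d bytes -> %d character, U+%04X\n",
    length $utf8_bytes, length $from_utf8, ord $from_utf8;
printf "UTF-16: %d bytes -> %d character, U+%04X\n",
    length $utf16_bytes, length $from_utf16, ord $from_utf16;

# Going the other way: encode characters before writing them out.
my $round_trip = encode('UTF-8', "\x{263A}");
```

Both decoded strings are the same one-character string, and that is the form a character-oriented parser expects to receive.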

Of course, this brings us to another problem: correctly determining the encoding of a byte stream. Sometimes that should be done by the HTML parser (when the charset is declared by a meta tag in HTML4/HTML5), sometimes by the HTTP client (when a proper Content-Type header is sent), and sometimes it just has to be guessed. And it's not impossible to misconfigure a webserver to serve Content-Type: text/html; charset=utf-8 with <meta charset="koi8-r"> in the HTML while the real encoding is UTF-16LE with a BOM.
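As a rough sketch of how those sources could be ranked, the hypothetical helper below checks a BOM first, then the HTTP header, then a meta declaration. That is the order HTML5 encoding sniffing uses, as far as I know. The name guess_charset and the regexes are my own simplifications, not part of any module, and UTF-32 BOMs are deliberately ignored.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a charset name, or nothing if unknown.
sub guess_charset {
    my ($content_type, $bytes) = @_;

    # 1. A byte order mark beats everything else.
    return 'UTF-8'    if $bytes =~ /\A\xEF\xBB\xBF/;
    return 'UTF-16LE' if $bytes =~ /\A\xFF\xFE/;
    return 'UTF-16BE' if $bytes =~ /\A\xFE\xFF/;

    # 2. Then the charset parameter of the Content-Type header.
    if (defined $content_type
        && $content_type =~ /;\s*charset\s*=\s*"?([\w.:-]+)/i) {
        return $1;
    }

    # 3. Then a <meta> declaration near the start of the document.
    my $head = substr($bytes, 0, 1024);
    if ($head =~ /<meta[^>]+charset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }

    return;    # nothing declared: the caller has to guess
}

# The misconfigured server from the paragraph above:
my $guess = guess_charset(
    'text/html; charset=utf-8',
    "\xFF\xFE" . "<\0m\0e\0t\0a\0",    # UTF-16LE bytes after the BOM
);
print "$guess\n";
```

With the misconfigured example, the BOM wins and the helper reports UTF-16LE despite the conflicting header.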

Re^2: Parsing of undecoded UTF-8 will give garbage when decoding entities
by itsscott (Sexton) on Aug 25, 2015 at 14:43 UTC

    Honestly, from what I've seen, no. The only 'extended' characters are a few 'smart' apostrophes and a copyright symbol. I determined this with BBEdit by opening the raw file and switching the encoding from utf-8 to latin1. Each page is pretty much identical, but that's a good point about the headers. I will investigate that next. It's so hard figuring out this kind of issue as it's the module barking and our code!

    Thanks for the input so far!

      Only ASCII characters (with ord <= 0x7f) are represented in UTF-8 the same way as in latin1 (as single bytes); anything above that, including a copyright symbol, is encoded differently. By the way, there is a module, IO::HTML, which can be used to determine the encoding of HTML files (seekable :raw streams only).
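To illustrate the point about single bytes, this small sketch encodes three characters both ways with the core Encode module. I use cp1252 rather than strict latin1 as an assumption, because smart apostrophes do not exist in ISO-8859-1, so a file that contains them is more likely cp1252.

```perl
use strict;
use warnings;
use Encode qw(encode);

# 'A', COPYRIGHT SIGN, RIGHT SINGLE QUOTATION MARK
my %bytes_for;
for my $char ("A", "\x{A9}", "\x{2019}") {
    for my $enc ('UTF-8', 'cp1252') {
        # Render the encoded bytes as space-separated hex.
        $bytes_for{$char}{$enc} =
            join ' ', map { sprintf '%02x', ord } split //, encode($enc, $char);
    }
    printf "U+%04X  UTF-8: %-8s  cp1252: %s\n",
        ord $char, $bytes_for{$char}{'UTF-8'}, $bytes_for{$char}{'cp1252'};
}
```

Only the ASCII letter comes out as the same byte in both encodings; the other two differ, which is why the choice of decoder matters even for a page that is "mostly ASCII".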

      If you are positive that your web pages consist only of ASCII and valid UTF-8, you can use HTML::TokeParser::->new( \ decode("UTF-8", $raw_html) ); with decode imported from Encode (or even utf8::decode($html); HTML::TokeParser::->new(\$html); note the reference, since a plain scalar is taken as a file name). But it's going to complain and/or produce mojibake (or at least U+FFFD REPLACEMENT CHARACTERs) if (when?) the crawler encounters latin1/cp1252/koi8/another non-ASCII encoding.
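One hedged way to soften that failure mode is to try a strict UTF-8 decode first and fall back to a single-byte encoding when it fails. The sketch below uses only the core Encode module. The function name decode_html_bytes and the choice of cp1252 as the fallback are my assumptions, not something the parser or the crawler prescribes.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: bytes in, characters out.
sub decode_html_bytes {
    my ($bytes) = @_;

    # FB_CROAK makes invalid UTF-8 die instead of yielding U+FFFD;
    # LEAVE_SRC keeps $bytes intact so the fallback can still use it.
    my $chars = eval {
        Encode::decode('UTF-8', $bytes,
            Encode::FB_CROAK | Encode::LEAVE_SRC);
    };
    return $chars if defined $chars;

    # Assumed fallback: treat the input as cp1252.
    return Encode::decode('cp1252', $bytes);
}

my $from_utf8   = decode_html_bytes("it\xe2\x80\x99s");  # valid UTF-8
my $from_cp1252 = decode_html_bytes("it\x92s");          # invalid UTF-8
```

Either result can then be handed to the parser by reference, as in HTML::TokeParser::->new(\$from_utf8). A cp1252 fallback will still turn KOI8-R or other encodings into mojibake, so this only covers the common Western European case.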