http://qs321.pair.com?node_id=938194

Ea has asked for the wisdom of the Perl Monks concerning the following question:

I'm parsing an XML document that has an acute accent acting as a right quote. It's char(180) (aka U+00B4) and the document encoding is UTF-8. When I run XML::Parser over it (or even the xml_pp tool), I get a "not well-formed (invalid token)" error.

I've naively tried adding use utf8; to the script, but I still get the error. I believe I could just tr/// that bad boy into something less problematic, but I was wondering if there was a lazier way, like a setting in XML::Parser that I can add to the handlers?

For the curious, I'm getting my output from LaTeXML, a set of Perl tools for converting LaTeX to XML. There might be some scope to process the output before I parse the XML, but I suspect that it'll look a.

thanks,

perl -e 'print qq(Just another Perl Hacker\n)' # where's the irony switch?

Replies are listed 'Best First'.
Re: XML invalid token
by Anonymous Monk on Nov 15, 2011 at 16:10 UTC
    It sounds like your XML has the ´ ACUTE ACCENT character encoded in ISO-8859-1 as 0xB4, even though it should be 0xC2 0xB4 in UTF-8. This usually happens when people produce XML by concatenating strings instead of using a proper XML library that is aware of encoding.

    Is my assumption correct? Then you have to fix the problem by preprocessing the input so that it becomes a standards-compliant XML stream. Either replace the byte as described above, or, if the problem really affects all bytes in the range 0x80 to 0xFF, simply change the encoding declaration in the XML prolog, e.g.:

    <?xml version="1.0" encoding="ISO-8859-1" ?>
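For the "replace the byte" route, here is a minimal sketch using only the core Encode module. It assumes that any input which fails a strict UTF-8 decode is really ISO-8859-1; that assumption is mine, not something the thread establishes. The helper name fix_xml_bytes and the sample document are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return UTF-8 bytes for an XML document whose prolog
# claims UTF-8 but whose body may actually be ISO-8859-1.
sub fix_xml_bytes {
    my ($bytes) = @_;

    # Strict decode: dies on byte sequences that are not valid UTF-8.
    # LEAVE_SRC keeps $bytes intact so it can be returned unchanged.
    my $ok = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC); 1 };
    return $bytes if $ok;

    # ASSUMPTION: anything that is not valid UTF-8 is Latin-1.
    return encode('UTF-8', decode('ISO-8859-1', $bytes));
}

# A lone 0xB4 byte is not valid UTF-8, so this sample gets re-encoded.
my $broken = qq{<?xml version="1.0" encoding="UTF-8"?>\n<q>it\xB4s</q>\n};
my $fixed  = fix_xml_bytes($broken);
```

The result can then be handed to the parser as a string. Because the bytes are converted to match the declaration, the prolog does not need to be edited.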

      That's the answer I was looking for. I changed the encoding from UTF-8 to ISO-8859-1 and the error disappeared. This gives me something to ask the LaTeXML folx: why are they producing XML documents that claim to be UTF-8 when they aren't?

      Many thanks, oh Nameless One!

    • Update - I even found that latexml has a --inputencoding=iso-8859-1 option to do just that. Now to figure out how to automatically detect a LaTeX file's encoding ...
    • Update the II - checking out Encode::Guess and Encode::Detect
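As a rough sketch of what the Encode::Guess route might look like: guess_encoding() returns an encoding object on success and a plain error string otherwise, so the result has to be checked with ref. The helper name guess_name is hypothetical, and ISO-8859-1 as the only extra suspect is an assumption for this example.

```perl
use strict;
use warnings;
use Encode::Guess;    # core module; exports guess_encoding()

# Hypothetical helper: return the guessed encoding name, or undef when
# the guess fails or is ambiguous.
sub guess_name {
    my ($bytes) = @_;

    # ASSUMPTION: ISO-8859-1 is the only candidate besides the defaults
    # (ASCII, UTF-8, and BOM-marked UTF-16/32).
    my $enc = guess_encoding($bytes, 'iso-8859-1');

    # An object means success; a string is an error message.
    return ref $enc ? $enc->name : undef;
}

my $name = guess_name("it\xB4s");    # lone 0xB4 rules out ASCII and UTF-8
```

One caveat: every byte sequence is valid ISO-8859-1, so input that is also valid UTF-8 with non-ASCII characters comes back as ambiguous and this helper returns undef. Trying a strict UTF-8 decode first and falling back to Latin-1 only on failure avoids that ambiguity.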

Re: XML invalid token
by Sinistral (Monsignor) on Nov 15, 2011 at 15:28 UTC