PerlMonks  

error parsing utf8 chars using XML DOM parser

by avih (Initiate)
on Nov 01, 2011 at 14:23 UTC ( #935130=perlquestion )

avih has asked for the wisdom of the Perl Monks concerning the following question:

Hello monks, I'm trying to parse the example XML below, which contains Latin characters, using the XML::DOM parser in the code that follows. The object returned by the parser contains gibberish instead of the Latin letters.

xml:

<?xml version="1.0" encoding="UTF-8"?>
<Name>IssuéTést</Name>

code:

use XML::DOM;
my $XmlParserObj = XML::DOM::Parser->new();
my $doc = $XmlParserObj->parsefile("in.xml");
my $str = $doc->toString();
print $str;

The output xml I get is:

<?xml version="1.0" encoding="UTF-8"?>
<Name>Issu&#14932;st</Name>

Any advice on how I can get the exact characters in the output, or at least correctly escaped ones? Thanks!

Replies are listed 'Best First'.
Re: error parsing utf8 chars using XML DOM parser
by Ninthwave (Chaplain) on Nov 01, 2011 at 15:07 UTC

    Try telling standard output that you are writing UTF-8:

    binmode(STDOUT, ":utf8");

    That worked on my kit. I ran:

    use XML::DOM;
    binmode(STDOUT, ":utf8");
    my $XmlParserObj = XML::DOM::Parser->new();
    my $doc = $XmlParserObj->parsefile("in.xml");
    my $str = $doc->toString();
    print $str;

    and the output included the characters.
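
A minimal sketch of what an output layer changes, using only core Perl and in-memory filehandles rather than XML::DOM. The helper name `bytes_written` is hypothetical, and the sample string stands in for the parsed element text; it shows the bytes that reach a handle with and without a UTF-8 encoding layer.

```perl
use strict;
use warnings;

# "IssuéTést" as a Perl character string (é is U+00E9).
# Assumption: this stands in for the text the parser hands back.
my $text = "Issu\x{E9}T\x{E9}st";

# Hypothetical helper: print $text to an in-memory handle opened with the
# given layer and return the bytes that were written, as hex.
sub bytes_written {
    my ($layer) = @_;
    my $buffer = '';
    open(my $fh, ">$layer", \$buffer)
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close($fh) or die "Cannot close in-memory handle: $!";
    return join ' ', map { sprintf '%02X', ord($_) } split //, $buffer;
}

my $raw_hex  = bytes_written(':raw');              # no encoding layer
my $utf8_hex = bytes_written(':encoding(UTF-8)');  # UTF-8 encoding layer

print "raw:   $raw_hex\n";
print "utf-8: $utf8_hex\n";
```

Without the layer each é goes out as the single byte E9, which a UTF-8 terminal cannot display; with the layer it goes out as the two bytes C3 A9.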

    "No matter where you go, there you are." BB
Re: error parsing utf8 chars using XML DOM parser
by ikegami (Pope) on Nov 01, 2011 at 20:14 UTC

    I don't get your behaviour.

    $ perl -MEncode -e'
       print encode "UTF-8",
          qq{<?xml version="1.0" encoding="UTF-8"?>\n} .
          qq{<Name>Issu\x{E9}T\x{E9}st</Name>\n};
    ' >in.xml

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";  # I have a UTF-8 terminal
       use XML::DOM;
       my $parser = XML::DOM::Parser->new();
       my $doc = $parser->parsefile("in.xml");
       print $doc->toString();
    '
    <?xml version="1.0" encoding="UTF-8"?>
    <Name>IssuéTést</Name>

    I tried mis-encoding the XML to see if I could get your behaviour, but I don't get your behaviour even then.

    $ perl -MEncode -e'
       print encode "iso-8859-1",   # Wrong!
          qq{<?xml version="1.0" encoding="UTF-8"?>\n} .
          qq{<Name>Issu\x{E9}T\x{E9}st</Name>\n};
    ' >in.xml

    $ perl -e'
       use open ":std", ":encoding(UTF-8)";  # I have a UTF-8 terminal
       use XML::DOM;
       my $parser = XML::DOM::Parser->new();
       my $doc = $parser->parsefile("in.xml");
       print $doc->toString();
    '

    not well-formed (invalid token) at line 2, column 10, byte 49 at .../XML/Parser.pm line 187
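
The experiment above depends on the parser rejecting bytes that are not UTF-8. A rough way to make the same check without the parser is to attempt a strict decode with the core Encode module. This is only a sketch: the helper name `is_valid_utf8` is hypothetical, and the two byte strings are hand-built stand-ins for the contents of in.xml.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: return 1 if the byte string is well-formed UTF-8,
# 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input. In that mode decode()
    # may modify its argument, so it is given the local copy in $bytes.
    return eval { decode('UTF-8', $bytes, Encode::FB_CROAK); 1 } ? 1 : 0;
}

# Assumed sample data, not read from a real file.
my $good = "<Name>Issu\xC3\xA9T\xC3\xA9st</Name>\n";  # é stored as UTF-8 (C3 A9)
my $bad  = "<Name>Issu\xE9T\xE9st</Name>\n";          # é stored as ISO-8859-1 (E9)

my $good_ok = is_valid_utf8($good);
my $bad_ok  = is_valid_utf8($bad);

print "UTF-8 bytes:      ", ($good_ok ? "valid" : "invalid"), "\n";
print "ISO-8859-1 bytes: ", ($bad_ok  ? "valid" : "invalid"), "\n";
```

To apply it to a real file, read the file with a `<:raw` layer so that the helper sees the undecoded bytes.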

    Either your file doesn't contain what you say it does, or there was a bug that's been fixed. Try upgrading XML::DOM and its dependencies. Versions I used:

    • XML::DOM 1.44
    • XML::RegExp 0.02
    • XML::Parser 2.41
    • XML::Parser::Expat 2.41
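
To compare an installation against the versions listed above, something like the following sketch could be used. The helper name `module_version` is hypothetical; it relies only on `require` and the universal `VERSION` method, and reports modules that cannot be loaded instead of dying.

```perl
use strict;
use warnings;

# Hypothetical helper: return the installed version of a module,
# 'unknown' if it loads but declares no version, or undef if it
# cannot be loaded at all.
sub module_version {
    my ($module) = @_;
    (my $file = "$module.pm") =~ s{::}{/}g;   # XML::DOM -> XML/DOM.pm
    return unless eval { require $file; 1 };
    return $module->VERSION // 'unknown';
}

for my $module (qw(XML::DOM XML::RegExp XML::Parser XML::Parser::Expat Encode)) {
    my $version = module_version($module);
    printf "%-20s %s\n", $module, defined $version ? $version : 'not installed';
}
```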
Re: error parsing utf8 chars using XML DOM parser
by Anonymous Monk on Nov 01, 2011 at 15:10 UTC

    You need to make sure your output is encoded as UTF-8 too; otherwise Perl writes the string out in its internal representation.

    If writing to a normal file, you can just specify utf8 encoding:

    open(my $fh, '>:utf8', 'out.xml') || die "Failed to open file";
    print $fh $str;
    close($fh) || die "Failed to close file";

    For STDOUT, specify that your default output files will be in UTF-8 and that this should apply to the STD* handles too:

    use XML::DOM;
    use open OUT => ':utf8';
    use open ':std';
    # as before...
    print $str;

    See "perldoc open" for some discussion.

      Thanks for the answers. Halfway there. If I get the string myself with getData on DOM::Node, it looks great, but I still get gibberish when printing the string XML::DOM produces.

      xml:

      <?xml version="1.0" encoding="utf-8"?>
      <Name>IssuéTést</Name>

      code:

      #!/usr/bin/perl -w
      use XML::DOM;
      use Encode;
      use open OUT => ":utf8";
      use open ":std";

      my $XmlParserObj = XML::DOM::Parser->new();
      open(IN,"<:utf8","in.xml");
      my @in = <IN>;
      my $inStr = join("",@in);
      #$inStr = encode("utf8",$inStr); # redundant if I use <:utf8 in open
      #$inStr = decode("utf8",$inStr); # make all tested strings get "?" instead of latin chars
      my $doc = $XmlParserObj->parse($inStr);
      my $value = $doc->getElementsByTagName("Name")->item(0)->getChildNodes()->item(0)->getData();
      my $str = $doc->toString();
      #binmode(STDOUT,":utf8"); # redundant
      print "is input utf8 ? ",Encode::is_utf8($inStr),"\n";
      print "Input:\n".$inStr;
      print "is value utf8 ? ",Encode::is_utf8($value),"\n";
      print "Value: ".$value."\n";
      print "is output utf8 ? ",Encode::is_utf8($str),"\n";
      print "Output:\n".$str;
      exit(0);

      output:

      is input utf8 ? 1
      Input:
      <?xml version="1.0" encoding="utf-8"?>
      <Name>IssuéTést</Name>
      is value utf8 ? 1
      Value: IssuéTést
      is output utf8 ? 1
      Output:
      <?xml version="1.0" encoding="utf-8"?>
      <Name>Issu&#14932;st</Name>

      Thanks again
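
When a string looks wrong on screen, it can help to inspect its characters independently of the terminal and of any output layer. `Encode::is_utf8` reports only Perl's internal flag, so a result of 1 does not show that the contents are what was intended. The sketch below uses a hypothetical helper, `code_points`, and hand-built sample strings rather than the output of XML::DOM.

```perl
use strict;
use warnings;

# Hypothetical helper: render each character of a string as U+XXXX.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# Assumed sample data.
my $decoded = "\x{E9}T\x{E9}";       # character string: é T é
my $encoded = "\xC3\xA9T\xC3\xA9";   # the same text as undecoded UTF-8 bytes

my $decoded_dump = code_points($decoded);
my $encoded_dump = code_points($encoded);

print "decoded: $decoded_dump\n";
print "encoded: $encoded_dump\n";
```

Applying the same helper to `$value` and `$str` in the script above would show whether the unexpected character reference is already present in the string returned by toString, before any output layer is involved.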

        Works for me. What version of XML::DOM do you have?

        Solved. Updating the modules and a little help from the Encode utilities solved it. Thanks guys.

Node Type: perlquestion [id://935130]
Approved by Corion
Front-paged by Corion