PerlMonks  

Re^5: XML:: DOM and Accented Characters

by Pickwick (Beadle)
on Aug 07, 2010 at 15:26 UTC


in reply to Re^4: XML:: DOM and Accented Characters
in thread XML:: DOM and Accented Characters

After running the code provided by almut I'm still not seeing C3 A9 as the hex code for the e-acute. TextPad is displaying an E9 code and NotePad++ EF BF BD.

According to Wikipedia, E9 is the windows-1252 encoding of your character. That would mean the Perl I/O layer is converting your parsed UTF-8 string into windows-1252 and ignoring the >:utf8 layer. Maybe you should post your complete code, the part where you parse and save the XML.
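
To make the three byte sequences mentioned above concrete, here is a minimal sketch using only the core Encode module. It is an illustration, not code from this thread: the helper `hex_bytes` and the variable names are made up for the example. It shows the bytes for e-acute in UTF-8 and in windows-1252, and what happens when a lone E9 byte is decoded as UTF-8 (it is invalid, so a lenient decoder substitutes U+FFFD, whose UTF-8 form is EF BF BD).

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Hypothetical helper: render a byte string as space-separated hex.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $e_acute = "\x{E9}";    # U+00E9, LATIN SMALL LETTER E WITH ACUTE

# The same character in the two encodings discussed in this thread.
my $as_utf8   = hex_bytes(encode('UTF-8',  $e_acute));
my $as_cp1252 = hex_bytes(encode('cp1252', $e_acute));

# A lone E9 byte is not valid UTF-8. With the default (lenient) check,
# decode() substitutes U+FFFD, the replacement character.
my $lone_byte   = "\xE9";
my $decoded     = decode('UTF-8', $lone_byte);
my $replacement = hex_bytes(encode('UTF-8', $decoded));

print "UTF-8:                        $as_utf8\n";
print "windows-1252:                 $as_cp1252\n";
print "E9 misread as UTF-8, re-saved: $replacement\n";
```

If the last line matches what an editor shows, that editor most likely read a single-byte E9 file as UTF-8 rather than the file containing those bytes to begin with.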

Re^6: XML:: DOM and Accented Characters
by freeflyer (Novice) on Aug 07, 2010 at 18:07 UTC

    Pickwick, here's the code prior to trying any of the suggestions made. It's a simple test XML with nothing but a couple of spaces and an e-acute.

    I've created a small test XML (accentTest.xml) to demonstrate what I am seeing:

    <?xml version="1.0" encoding="UTF-8"?>
    <TEST> é </TEST>

    And below is the perl code that reads this in and saves it back out as accentTestOutPut.xml.

    use XML::DOM;
    my $parser = new XML::DOM::Parser;
    my $doc = $parser->parsefile ("c:\\accentTest.xml");
    # Print doc file
    $doc->printToFile ("c:\\accentTestOutPut.xml");
    # Print to string
    print $doc->toString;
    # cleanup
    $doc->dispose;

      Pickwick, here's the code prior to trying any of the suggestions made.

      We don't need the code from before the suggestions, because we already know why that code can't work as expected. We need the code where you override the automatic encoding of the Perl I/O layer with >:utf8, because that code really should work.

      Give us your latest code; there surely is an error somewhere.
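
As a rough illustration of why the output layer matters, here is a minimal sketch that does not use XML::DOM at all. It writes a plain string containing e-acute to a temporary file twice, once through a handle with no encoding layer and once through `:encoding(UTF-8)`, then reads the raw bytes back. The helper `bytes_written` and the use of File::Temp are assumptions made for the example; how XML::DOM opens its own handle inside `printToFile` is not shown in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $text to a fresh temp file through the
# given PerlIO layer, then return the bytes on disk as hex.
sub bytes_written {
    my ($layer, $text) = @_;
    my ($tmp_fh, $path) = tempfile(UNLINK => 1);
    close $tmp_fh;

    open my $out, ">$layer", $path or die "Cannot write $path: $!";
    print {$out} $text;
    close $out or die "Cannot close $path: $!";

    # Read back with :raw so no decoding or newline translation happens.
    open my $in, '<:raw', $path or die "Cannot read $path: $!";
    local $/;
    my $bytes = <$in>;
    close $in;

    return join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
}

my $e_acute = "\x{E9}";    # character string holding U+00E9

# No encoding layer: the character goes out as a single byte.
my $no_layer = bytes_written(':raw', $e_acute);

# Explicit UTF-8 layer: the character goes out as two bytes.
my $utf8_layer = bytes_written(':encoding(UTF-8)', $e_acute);

print "no encoding layer: $no_layer\n";
print ":encoding(UTF-8):  $utf8_layer\n";
```

The point of the comparison is that the document has to be printed through the handle that carries the layer, and to the same file that is later inspected.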

        Hi, I've got 5 versions of code incorporating various suggestions made to me, none of which I can (yet) get to work on Windows. The last version I have tested on a Unix machine and it worked OK. Trying to open this Unix-created XML on Windows results in it opening OK.

        #!/bin/perl -w
        use XML::DOM;
        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');
        # Print doc file
        $doc->printToFile ("c:\\accentTestOutPut.xml");
        #re-open file in UTF-8 encoded filehandle
        open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
        $doc->print($fh);
        # cleanup
        $doc->dispose;

        #!/bin/perl -w
        use XML::DOM;
        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');
        # Print doc file
        $doc->printToFile ("c:\\accentTestOutPut.xml");
        #re-open file in UTF-8 encoded filehandle
        open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
        print $fh "\x{FEFF}"; # BOM
        $doc->print($fh);
        # cleanup
        $doc->dispose;

        #!/bin/perl -w
        use XML::DOM;
        use UTF8BOM;
        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');
        # Print doc file
        $doc->printToFile ("c:\\accentTestOutPut.xml");
        UTF8BOM->insert_into_file('c:\\accentTestOutPut.xml');
        # cleanup
        $doc->dispose;

        #!/bin/perl -w
        use XML::DOM;
        use Encode qw(encode_utf8);
        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');
        # Print doc file
        $doc->printToFile ("c:\\accentTestOutPut.xml");
        open my $fh, ">:utf8", "accentTestOutPut.xml" or die $!;
        encode_utf8($fh);
        $doc->print($fh);
        # cleanup
        $doc->dispose;

        #!/bin/perl -w
        use XML::DOM;
        use PerlIO::encoding;
        my $parser = new XML::DOM::Parser;
        my $doc = $parser->parsefile ("c:\\accentTest.xml", ProtocolEncoding => 'UTF-8');
        # Print doc file
        $doc->printToFile ("c:\\accentTestOutPut.xml");
        open my $fh, ">:encoding(UTF-8)", "accentTestOutPut.xml" or die $!;
        $doc->print($fh);
        # cleanup
        $doc->dispose;

        I have also discovered that changing the first line of the XML to <?xml version="1.0" encoding="windows-1252"?> (as suggested by ikegami) lets me open the file OK in Windows in all cases.
