http://qs321.pair.com?node_id=1194918

jfrm has asked for the wisdom of the Perl Monks concerning the following question:

I have been uploading an XML file to a service provider for a long time with the first line output as:
$cureq .= '<?xml version="1.0" encoding="latin1"?>'."\n";
$cureq .= # lots of other xml stuff
open (XML, ">$xmlfile") or return("Could not open $xmlfile");
print $cureq;
close XML or return("Could not close $xmlfile");
Now I have to change the encoding from latin1 to UTF-8 and having read around quite a lot now, I realise that I just don't get it. I have tried changing what I thought were the critical 2 lines viz:
$cureq .= '<?xml version="1.0" encoding="UTF-8"?>'."\n";
open (XML, '>:encoding(UTF-8)', $xmlfile) or return("Could not open $xmlfile");
This creates the file but my service provider now returns 'Invalid XML'. I just don't get it and what's more I cannot think of a way to debug it or investigate more deeply. Any clues for this poor padawan would be appreciated.

Re: Encoding horridness
by choroba (Cardinal) on Jul 12, 2017 at 13:07 UTC
    Unfortunately, you haven't provided enough information. How do you include the non-ASCII characters in the XML?

    The following creates a well-formed XML:

    #!/usr/bin/perl
    use warnings;
    use strict;
    use utf8;
     
    binmode STDOUT, ':encoding(UTF-8)';
    print "<?xml version='1.0' encoding='utf-8'?><áéíóůÿ/>";
    

    Note the utf8 which interprets the characters in the right way. If you're reading the characters from a file, you need to specify the :encoding(UTF-8) layer for it, as well. Etc.
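
A minimal sketch of the "reading from a file" case, assuming the input file is Latin-1 and the output must be UTF-8. The helper name `latin1_file_to_utf8_file`, the temporary file names and the sample text are made up for illustration; only the `:encoding(...)` layers come from the advice above.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

# Hypothetical helper: copy a Latin-1 text file to a UTF-8 text file,
# letting the PerlIO layers do the decoding and the encoding.
sub latin1_file_to_utf8_file {
    my ( $in_path, $out_path ) = @_;

    open my $in, '<:encoding(iso-8859-1)', $in_path
        or die "Cannot read $in_path: $!";
    open my $out, '>:encoding(UTF-8)', $out_path
        or die "Cannot write $out_path: $!";

    while ( my $line = <$in> ) {
        print {$out} $line;    # $line holds characters, not octets
    }

    close $in;
    close $out or die "Cannot close $out_path: $!";
    return;
}

# Demo with a throwaway directory (assumed sample data).
my $demo_dir = tempdir( CLEANUP => 1 );
my ( $demo_src, $demo_dst ) = ( "$demo_dir/in.txt", "$demo_dir/out.xml" );

open my $demo_raw, '>:raw', $demo_src or die "Cannot write $demo_src: $!";
print {$demo_raw} "caf\xE9\n";    # e-acute as the single Latin-1 octet E9
close $demo_raw or die "Cannot close $demo_src: $!";

latin1_file_to_utf8_file( $demo_src, $demo_dst );
```

If the layers are doing their job, the output file should contain the two octets C3 A9 where the input had E9.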

    ($q=q:Sq=~/;[c](.)(.)/;chr(-||-|5+lengthSq)`"S|oS2"`map{chr |+ord }map{substrSq`S_+|`|}3E|-|`7**2-3:)=~y+S|`+$1,++print+eval$q,q,a,

      A careless reader might see "utf8 interprets the characters in the right way" and get the idea that it's going to fix all their utf8 woes. To be clear, use utf8 only changes how perl reads your program source code -- probably just your string literals.

        The "the" in "the characters" means I referenced the characters in the element name. utf8 changes how Perl reads your program source, but it does more than string literals:
        #!/usr/bin/perl
        use warnings;
        use strict;
        use feature qw{ say };
        use utf8;

        my $á = 123;
        say $á;
Re: Encoding horridness
by Corion (Patriarch) on Jul 12, 2017 at 13:10 UTC

    You will also have to make sure that the data you are writing to the XML file has been read properly from your data source and has been properly decoded when reading it.

    Ideally you use Encode and decode all data when you read it into your program and encode it when writing it to your output. You have already taken care of encoding the output, but the input might not be valid UTF-8 or might not be recognized by Perl as such.

    Assuming that your input data is a file with bytes encoded in Latin-1, you could read/decode the data as

    use Encode qw( decode );
    while (<$fh>) { my $payload = decode('Latin-1', $_); }

    For database values, you have the additional fun of finding out as what kind of data/encoding your database actually stores the values.
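
The decode-on-input, encode-on-output idea could be sketched roughly like this. The helper `build_utf8_xml`, the `<name>` element and the sample value are assumptions for illustration, and the sketch deliberately skips XML escaping of `&` and `<`.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# Hypothetical helper: take a value that arrived as Latin-1 octets and
# return a small XML document as UTF-8 octets.
sub build_utf8_xml {
    my ($latin1_octets) = @_;

    # Input boundary: octets -> characters.
    my $text = decode( 'iso-8859-1', $latin1_octets );

    my $xml = qq{<?xml version="1.0" encoding="UTF-8"?>\n}
            . "<name>$text</name>\n";

    # Output boundary: characters -> octets.
    return encode( 'UTF-8', $xml );
}

my $xml_octets = build_utf8_xml("Andr\xE9");    # E9 is e-acute in Latin-1
```

Because the result is already encoded, it would be written to a filehandle opened with `:raw`, not one with an `:encoding(UTF-8)` layer, or it would be encoded twice.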

      Good advice to be sure. But since Latin-1 is a subset of Unicode, isn't decode('Latin-1', $_) pretty much a no-op?

        No, because high-bit characters/octets in Latin-1 encode differently as octets in UTF-8, and Perl doesn't know what to do with high-bit characters when writing them.

        The OP wants to move from Latin-1 to UTF-8. Latin-1 is not a subset of UTF-8.
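
To make the difference concrete, here is a small sketch that encodes the same character both ways and then tries to read the Latin-1 octets as UTF-8. The helper `hex_octets` and the sample character are only for illustration.

```perl
use strict;
use warnings;
use Encode qw( decode encode FB_CROAK LEAVE_SRC );

my $char = "\x{E9}";    # U+00E9, e with acute accent

my $as_latin1 = encode( 'iso-8859-1', $char );    # one octet
my $as_utf8   = encode( 'UTF-8',      $char );    # two octets

# Hypothetical helper: show the octets of a byte string in hex.
sub hex_octets {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord($_) } split //, $octets;
}

printf "Latin-1: %s\n", hex_octets($as_latin1);
printf "UTF-8:   %s\n", hex_octets($as_utf8);

# Strict decoding dies on malformed input instead of substituting U+FFFD.
my $latin1_is_valid_utf8
    = eval { decode( 'UTF-8', $as_latin1, FB_CROAK | LEAVE_SRC ); 1 } ? 1 : 0;
print "Latin-1 octets valid as UTF-8? ",
    ( $latin1_is_valid_utf8 ? "yes" : "no" ), "\n";
```

The same code point ends up as different octets, and the Latin-1 form is rejected by a strict UTF-8 decoder, which is one plausible way to end up with 'Invalid XML'.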

Re: Encoding horridness
by runrig (Abbot) on Jul 12, 2017 at 17:58 UTC
    You don't mention, but you have successfully parsed the file that you're generating with XML::LibXML, right?
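
A rough sketch of such a check, assuming XML::LibXML is installed from CPAN. The helper name `xml_parse_error` and the two sample documents are invented for illustration; `load_xml( string => ... )` is the parser entry point used.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: returns an empty string if the octets parse as XML,
# otherwise the parser's error message.
sub xml_parse_error {
    my ($xml_octets) = @_;
    require XML::LibXML;    # CPAN module, not part of core Perl
    my $doc = eval { XML::LibXML->load_xml( string => $xml_octets ) };
    return defined $doc ? '' : ( "$@" || 'unknown parse error' );
}

my $prolog = qq{<?xml version="1.0" encoding="UTF-8"?>\n};

# Characters encoded to UTF-8 octets: matches the declaration.
my $good_xml = encode( 'UTF-8', $prolog . "<name>Andr\x{E9}</name>" );

# A raw Latin-1 octet (E9) under a UTF-8 declaration: the suspected failure.
my $bad_xml = $prolog . "<name>Andr\xE9</name>";

if ( eval { require XML::LibXML; 1 } ) {
    for my $case ( [ good => $good_xml ], [ bad => $bad_xml ] ) {
        my ( $label, $octets ) = @$case;
        my $err = xml_parse_error($octets);
        print "$label: ", ( $err eq '' ? "parses\n" : "rejected\n" );
    }
}
else {
    print "XML::LibXML is not installed; skipping the check\n";
}
```

Running the generated file through the same helper (read with `:raw`) should show whether the problem is in the file itself or on the provider's side.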
      No, I haven't, but based on your suggestion and on others' comments (thanks to all), I will now do that. Incidentally, to answer questions from others: some of the data in the file is coming from a MySQL database, and having checked, some of the fields are in UTF and some are latin1, so maybe that is the problem (although I believe you are right: my service provider should give more feedback and I am going to badger them to do this). Other values are just coming from the script itself. I read that PERL internally uses UTF-8 format. So doesn't that mean that all data values, unless sourced direct from the database, are UTF-8 and therefore my latin1 encoded XML should never have worked? Or is it just that I was probably lucky as latin1 is 'almost' a subset of UTF-8?
        I read that PERL internally uses UTF-8 format.

        Where did you read that? Certainly not from perlunitut which says (my emphasis):

        Perl has an internal format, an encoding that it uses to encode text strings so it can store them in memory. All text strings are in this internal format. In fact, text strings are never in any other format!

        You shouldn't worry about what this format is, because conversion is automatically done when you decode or encode.
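
One way to see the point without peeking at the internal format: once octets are decoded, Perl counts characters, and the octet form only reappears when you encode. The sample word below is just an assumed example.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $utf8_octets = "na\xC3\xAFve";    # the word with i-diaeresis, as UTF-8 octets

# Decode once at the boundary; from here on $text is a text string.
my $text = decode( 'UTF-8', $utf8_octets );

my $octet_count = length $utf8_octets;    # counts octets
my $char_count  = length $text;           # counts characters

# Encoding picks the external form; the text string itself does not change.
my $latin1_octets = encode( 'iso-8859-1', $text );

print "octets in: $octet_count, characters: $char_count, ",
    "Latin-1 octets out: ", length($latin1_octets), "\n";
```

The program never needs to know how `$text` is stored in memory; it only names an encoding when decoding input and when encoding output.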

Re: Encoding horridness
by karlgoethebier (Abbot) on Jul 12, 2017 at 17:24 UTC

    Not sure. Hence i don't reply to the OP. I'll praise the lord if i ever fully understand this encoding stuff.

    Shouldn't something like use open IN => ":encoding(iso-8859-1)", OUT => ':utf8'; do the job?

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'Help

      In principle yes, but data could also come from the script (fun) or a database (more fun) or a web page (incredible fun). Reading from a file is the easiest way to acquire data provided that the file only contains one encoding of characters.
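
For the plain file-to-file case, a sketch of the `use open` approach might look like this. The file names and sample text are assumptions, and the sketch uses `:encoding(UTF-8)` for output where the post above used `:utf8`. It only covers filehandles opened inside the pragma's lexical scope, so data from a database or from the script itself is not touched.

```perl
use strict;
use warnings;
use File::Temp qw( tempdir );

my $tmp_dir = tempdir( CLEANUP => 1 );
my ( $latin1_path, $utf8_path ) = ( "$tmp_dir/latin1.txt", "$tmp_dir/utf8.txt" );

# Create Latin-1 input as raw octets (assumed sample data).
open my $seed, '>:raw', $latin1_path or die "Cannot write $latin1_path: $!";
print {$seed} "M\xFCnchen\n";    # FC is u-diaeresis in Latin-1
close $seed or die "Cannot close $latin1_path: $!";

{
    # Default layers for every open() in this block that names no layer.
    use open IN => ':encoding(iso-8859-1)', OUT => ':encoding(UTF-8)';

    open my $in,  '<', $latin1_path or die "Cannot read $latin1_path: $!";
    open my $out, '>', $utf8_path   or die "Cannot write $utf8_path: $!";

    while ( my $line = <$in> ) {
        print {$out} $line;
    }

    close $in;
    close $out or die "Cannot close $utf8_path: $!";
}
```

An explicit layer in the mode argument still wins over the pragma's defaults, which is why the seed file above can be written with `:raw`.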

        "In principle yes..."

        Just another question to Radio Yerevan.

        Thanks and best regards, Karl


Re: Encoding horridness
by Anonymous Monk on Jul 12, 2017 at 14:06 UTC
    We don't know what your service provider considers "valid xml." Just for fun, what happens if you try this?
    use Encode qw( encode XMLCREF );
    print XML encode('ascii', $cureq, XMLCREF);
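
A small sketch of what that does to a string, using assumed sample text. It uses `FB_XMLCREF` rather than the bare `XMLCREF` flag from the post above; the difference is that `FB_XMLCREF` also sets `LEAVE_SRC`, so the source string is left untouched.

```perl
use strict;
use warnings;
use Encode qw( encode FB_XMLCREF );

my $text = "caf\x{E9} \x{20AC}5";    # e-acute and the euro sign as characters

# Anything outside ASCII becomes an XML numeric character reference.
my $ascii_xml = encode( 'ascii', $text, FB_XMLCREF );

print "$ascii_xml\n";
```

The result is pure ASCII, so it is valid whether the declaration says latin1 or UTF-8, which makes it a handy way to test whether the provider's complaint is about encoding at all.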