http://qs321.pair.com?node_id=197119

Emanuel has asked for the wisdom of the Perl Monks concerning the following question:

Hello wise monks

I've run into a problem and haven't been able to find a solution to it for the past 2 days. Here's what it's all about:

o I have a UTF-8 encoded XML file
o I parse it and want to write some parts of it to a MySQL database, ISO-8859-1 encoded

Everything is working fine: I'm reading the XML in and creating a hash out of it with XML::Parser, and the data gets written to MySQL as well, but when I check the data in the table, it's UTF-8 again.

So I started playing with Text::Iconv, and came to this:

--- some stuff above ---
$parser->parsefile(shift @ARGV, ProtocolEncoding => "ISO-8859-1");
--- some stuff inbetween ---
my $converter = Text::Iconv->new("UTF-8", "ISO-8859-1");
while (my ($key, $value) = each(%attrs)) {
    push(@value_stack, { $key => $value });
    if ($program_hash{$current_filmid}{$key} eq '') {
        $program_hash{$current_filmid}{$key} = $converter->convert($value);
    }
    else {
        if ($key ne "EventId" && $key ne "KanalId") {
            $program_hash{$current_filmid}{"$key.$value"} = $converter->convert($value);
        }
    }
}
--- some stuff below ----


When I print the values out (e.g. print $converter->convert($value)."\n"; ) the output looks correct (i.e. ISO-8859-1 encoded), but when writing to the DB, it's UTF-8 again (meaning all special chars like öäüéàè etc. show up as weird chars like Ã| etc...).

I'm really going nuts here, and would appreciate any help provided for this.

If more of the source is needed just tell me.

Thanks in Advance
Emanuel

Replies are listed 'Best First'.
Re: XML::Parser Encoding (UTF-8 -> ISO-8859-1)
by grantm (Parson) on Sep 12, 2002 at 02:13 UTC

    The Perl-XML FAQ has a section on encodings.

    When you parse a file, the resulting data in the Perl variables will be UTF-8 encoded regardless of the source encoding. I'm not an expert on MySQL but I wouldn't have thought that the act of INSERTing into a table would result in characters being converted from UTF-8 to ISO-8859-1.

    With Perl 5.6.0 and later, you can convert a UTF-8 string to a Latin-1 string with the somewhat cryptic:

    use utf8; my $latin = pack("C*", unpack('U*', $utf));
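On Perl 5.8 and later, the core Encode module offers a less cryptic route to the same conversion. The following is only a minimal sketch under that assumption; the sample string and variable names are made up for illustration and are not taken from the original script:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Raw UTF-8 octets, as they might arrive from a file (hypothetical sample).
my $utf8_octets = "Fu\xC3\x9Fball";

# decode() turns the octets into a Perl character string ...
my $chars = decode('UTF-8', $utf8_octets);

# ... and encode() turns those characters into ISO-8859-1 octets.
my $latin1 = encode('ISO-8859-1', $chars);

printf "%d characters, %d Latin-1 bytes\n", length($chars), length($latin1);
```

Data coming out of XML::Parser is already a character string, so in that situation only the encode() step should be needed.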

    As jkahn said, it's not possible to map all Unicode characters to Latin-1. In particular, the 'smart quotes' characters from MS Office apps do not have Latin-1 equivalents. You could simply encode characters beyond 0x7f as numeric entities (if you're ultimately going to write them back out as XML or HTML) or you could replace troublesome characters with more generic equivalents. The FAQ has some code snippets for both options.
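The numeric-entity option can be sketched roughly as follows. This is not the FAQ's own snippet; the helper name to_numeric_entities and the sample text are assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: replace every character above 0x7f with a
# decimal numeric character reference such as &#223;.
sub to_numeric_entities {
    my ($text) = @_;
    $text =~ s/([^\x00-\x7f])/'&#' . ord($1) . ';'/ge;
    return $text;
}

# Sharp s (U+00DF) plus a pair of "smart quotes" (U+201C, U+201D).
my $sample = "Fu\x{DF}ball \x{201C}live\x{201D}";
print to_numeric_entities($sample), "\n";   # Fu&#223;ball &#8220;live&#8221;
```

The result is plain ASCII, so it survives any Latin-1 or UTF-8 storage unchanged, but it only makes sense if the data is later rendered as XML or HTML.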

      Thank you very, very, very much!

      This solved my headache, and now everything is inserted into the database correctly. My task for after some sleep is to dig through the FAQ at perl-xml and learn more about everything.

      You can't imagine how happy I am right now :)

      About the additional characters that won't fit into Latin-1: there won't be any occurrence of such characters. But I'm still going to read up on this, since it's possible that something like this might occur one day.

      Emanuel
Re: XML::Parser Encoding (UTF-8 -> ISO-8859-1)
by jkahn (Friar) on Sep 12, 2002 at 00:49 UTC
    I'm not an expert on Text::Iconv (I prefer Unicode::String, it seems to be more portable) but I know a little about Unicode encodings.

    A couple of notes, some of which may be relevant. Forgive me if you know all this -- I figured it might be useful to somebody who's looking for this kind of information, even if it doesn't necessarily help Emanuel:

    1. ISO-8859-1 doesn't have the richness to encode every possible character from UTF-8. Many Eastern European characters (not to mention South and East Asian characters) cannot be encoded in ISO-8859-1. There just aren't enough bits. Are there characters in your data outside the range (U+0000 .. U+00FF)?
    2. Perl's internals are in UTF-8, so the fact that print outputs correct-looking data may be because Text::Iconv is not doing its job correctly (or you're not using the feature the way it's intended). In other words, if your data is *still* in UTF-8, then it will probably print correctly.
    3. If you're using *nix or cygwin, then you probably have the od (octal dump) tool available, which I find indispensable for diagnosing codeset issues (editors like vi and emacs tend to operate at too high a level, because they try to interpret the encoding for you and things "look fine" even when they're in the wrong encoding). I use od -a all the time to figure out whether I've used encoding tools correctly.
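Points 1 and 3 can also be checked from inside Perl. The sketch below is only an illustration; the helper names fits_latin1 and hex_bytes are hypothetical, and the byte strings are hand-written samples:

```perl
use strict;
use warnings;

# Hypothetical helper: true if every character is in U+0000 .. U+00FF.
sub fits_latin1 {
    my ($chars) = @_;
    return $chars !~ /[^\x{00}-\x{FF}]/;
}

# Hypothetical helper: hex dump of a byte string, one pair per byte.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

print fits_latin1("Fu\x{DF}ball")           ? "fits\n" : "does not fit\n";
print fits_latin1("\x{201C}quoted\x{201D}") ? "fits\n" : "does not fit\n";

print hex_bytes("\xDF"), "\n";       # df     (Latin-1 encoding of U+00DF)
print hex_bytes("\xC3\x9F"), "\n";   # c3 9f  (UTF-8 encoding of U+00DF)
```

A single byte df for the sharp s means Latin-1; the two-byte sequence c3 9f means the data is still UTF-8.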

    HTH, jkahn

      Hello, thanks for the reply

      It has to be ISO-8859-1, or Latin-1. I don't have any Asian Characters in it.

      The second point you mention could be true, I guess. I think I'll have to do more digging into that. I hope I'll find something this way; I thought I was doing it the right way, but I might easily be wrong.

      I've been using hexdump and converting to octal by hand; I didn't know about od. You never stop learning.

      Thanks for your reply, I'll get working on Text::Iconv.

      Emanuel
      Edit:
      Here's some sample output from a test I quickly hacked in:

      Before Conversion: Live Fußball: Bundesliga, 3. Spieltag
      After Conversion: Live Fußball: Bundesliga, 3. Spieltag

      I dumped it to a file and checked it with od and hexdump, and it looks correct. Still, in the DB it appears as in Before Conversion.

Re: XML::Parser Encoding (UTF-8 -> ISO-8859-1)
by ash (Monk) on Sep 12, 2002 at 09:20 UTC
    use bytes and Text::Iconv made my day better!

    -- 
    ash/asksh <ask@unixmonks.net>

Re: XML::Parser Encoding (UTF-8 -> ISO-8859-1)
by bart (Canon) on Sep 12, 2002 at 20:15 UTC
    Annoying, isn't it? Perl has its own idea of whether your data should be treated as UTF8 or as 8-bit encoding, and it may not agree with your idea. If you concat a plain 8-bit string with a string marked as UTF-8, Perl will convert the 8-bit string to UTF-8 as well, whether you want it or not.

    The trickery is all connected to the UTF8 flag, which is a bit flag attached to each string. For the data that XML::Parser returns, that UTF8 flag is set.

    To get rid of this behaviour, clear the UTF8 flag. One way to do it is like this:

    sub de_utf8 { use bytes; return "$_[0]"; }
    This way, the resulting string will be the same byte data as the original string, but with the UTF8 flag cleared. So now, the string should stay in ISO-Latin-1.
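It is worth checking which bytes remain once the flag is gone. The sketch below uses the core utf8:: functions rather than the bytes pragma shown above, and the sample string is made up. It assumes a character string with the UTF8 flag set, similar to what XML::Parser hands back:

```perl
use strict;
use warnings;

# U+00DF (sharp s) in a character string; upgrade() forces the UTF8 flag on.
my $chars = "Fu\x{DF}ball";
utf8::upgrade($chars);

# downgrade(): same characters, stored as Latin-1 bytes, flag cleared.
my $latin1 = $chars;
utf8::downgrade($latin1);

# encode(): characters replaced by their UTF-8 octets, flag cleared.
my $utf8_bytes = $chars;
utf8::encode($utf8_bytes);

# Helper for display only: hex pairs for each byte.
sub hex_of { join ' ', map { sprintf '%02x', $_ } unpack 'C*', $_[0] }

printf "flag set on original: %s\n", utf8::is_utf8($chars) ? 'yes' : 'no';
printf "after downgrade: %s\n", hex_of($latin1);       # 46 75 df 62 61 6c 6c
printf "after encode:    %s\n", hex_of($utf8_bytes);   # 46 75 c3 9f 62 61 6c 6c
```

Both results have the flag cleared, but only the downgraded copy holds Latin-1 bytes. utf8::downgrade dies if the string contains characters above U+00FF unless a true second argument is passed.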
Re: XML::Parser Encoding (UTF-8 -> ISO-8859-1)
by Stegalex (Chaplain) on Sep 14, 2002 at 12:26 UTC
    MIME::Decoder worked for me.
    Something like:
    my $decoder = new MIME::Decoder 'quoted-printable' or warn "\nunsupported\n";
    open (TEMP, ">/tmp/$$") or warn "\ncan't open /tmp/$$ for writing";
    $decoder->decode(\*MSG, \*TEMP);
    ~~~~~~~~~~~~~~~
    I like chicken.