Beefy Boxes and Bandwidth Generously Provided by pair Networks
Perl Monk, Perl Meditation
 
PerlMonks  

Re: A UTF8 round trip with MySQL

by Joost (Canon)
on Jun 13, 2007 at 02:22 UTC ( [id://620863]=note: print w/replies, xml ) Need Help??


in reply to A UTF8 round trip with MySQL

DBD::mysql 4.004 ( and higher with mysql_enable_utf8 will (or, should) correctly detect any retrieved utf-8 columns (i.e. it will set the utf8 flag on them). That means that retrieved strings will just do the right thing.

update: really, don't use any lower version for this - 4.003 and 4.004 contain some important utf8 fixes

In other words, you should probably not use encode_utf8() on query results. Especially since that discards the utf8 flag.

INSERTing data into utf-8 columns is trickier. As far as I know you still need to make sure the data is really utf-8 encoded. You can use utf8::upgrade() for that.

Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.

Replies are listed 'Best First'.
Re^2: A UTF8 round trip with MySQL
by clinton (Priest) on Jun 13, 2007 at 08:26 UTC
    you should probably not use encode_utf8() on query results

    Query results are data coming into Perl, so if anything you would use decode or decode_utf8, but since DBD::mysql version 4, with mysql_enable_utf8, the driver handles this automatically. My use of encode is solely for OUTPUTting the results to the console.

    Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.

    I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no?

    My understanding is that you make sure that Perl understands what your string contains (eg use decode_utf8 to convert input into Perl's internal string format) - this may or may not set the UTF8 flag depending on what characters the string contains. The driver will then handle the conversion to UTF-8 bytes for storage in the database.

    Clint
      My use of encode is solely for OUTPUTting the results to the console.

      I noticed that. I just wanted to make it clear to the readers that DBD::mysql does the right thing when retrieving utf-8 data, and the programmer doesn't need to do anything special - provided (s)he makes sure utf-8 marked strings are handled correctly on output.

      My problem is with more with using encode_utf8() to output utf-8 text to handles. It's a subtle issue, but since encode_utf8 returns unmarked octets it must be treated as binary data; the string can not safely be used as a text string. For one thing, appending a utf-8 marked string or an 8bit latin-1 string to an unmarked utf-8 string causes (possibly irreversible) mangling.

      If you're working with Unicode text, it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers; that way you won't have to worry about which encoding the strings are in while you're working with it.

      Do not use methods (like encode_utf8()) that convert to utf-8 but don't set the utf-8 flag for this purpose, since if/when this issue in DBD::mysql gets fixed, those methods will not work correctly.

      I don't understand what issue you're referring to here. The issue of handling UTF-8 with MySQL has already been fixed, no?

      As far as I know (i haven't tested 4.005 yet) trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters).

      There's a fairly recent bug-report on that on rt.cpan.org and it seems that the issue might get fixed so you won't have to manually encode the input strings - dbd-mysql will then do the right thing automatically. (note: rt is often unresponsive - if that link doesn't work, try again a bit later).

      A prerequisite for fixing that bug is that DBD-mysql knows what encoding the input strings are actually in, to prevent it from doing the 8bit -> utf-8 transformation twice (right now it blindly assumes they are utf-8). But the only way to tell is to check the utf-8 flag, which encode_utf8() does not set. utf8::upgrade() does more or less the same thing as encode_utf8(), but might be a bit more efficient since it doesn't need to create a new string when the input string is 7bit ASCII (upgrade works in-place) and it set the utf8 flag correctly.

      If dbd-mysql would work correctly, utf-8 strings marked as utf-8 will work, and 8bit strings will work too. Unmarked utf-8 strings won't work. Currently only valid utf-8 encoded strings work, regardless of the utf-8 mark. In other words, make sure your strings are correctly marked AND utf-8 encoded. utf8::upgrade() does exactly that.

        it's almost always better to have all utf-8 encoded strings marked and use the :utf8 IO layers

        Good point

        trying to insert a $string into a utf-8 column will not work correctly if the $string is in the default 8-bit encoding with the high bit set (for instance, when $string is in Latin-1 with accented characters)

        I think I understand - is this what you mean:

        • If we have : $string = 'latin-1 áccented striñg';
        • The string is stored internally with bytes > 128, but without UTF-8 flag turned on, but Perl still understands this string.
        • DBD::mysql does not recognise this as UTF-8 (because missing UTF-8 flag, so accented characters are stripped.
        • utf8::upgrade($string) turns on the flag
        • DBD::mysql recognises this as UTF-8 and stores it correctly

        Would using $string = Encode::decode_utf8($string) $string = Encode::decode('iso-8859-1',$string) also work in this case?

        Clint

        Update - corrected decode

Re^2: A UTF8 round trip with MySQL
by Anonymous Monk on Jul 25, 2013 at 13:34 UTC
    I am using dbd::mysql 4.020, and no single utf8 problem seems to be fixed (yes, I have mysql_enable_utf8 on).

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://620863]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others scrutinizing the Monastery: (4)
As of 2024-04-25 06:38 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found