PerlMonks  

Re^5: A UTF8 round trip with MySQL

by Joost (Canon)
on Jun 13, 2007 at 11:28 UTC ( [id://620917]=note )


in reply to Re^4: A UTF8 round trip with MySQL
in thread A UTF8 round trip with MySQL

The string is stored internally with bytes > 128, but without UTF-8 flag turned on, but Perl still understands this string.

Yes, because it's stored in the default 8bit encoding, probably Latin-1. This is assuming you're not using the utf8 pragma, and your script file really is in the default 8bit encoding.
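
A minimal sketch of that situation, assuming a plain byte string with one character above 127. The word 'café' is only an illustration, and the `\xE9` escape is used so the result does not depend on how the script file is saved:

```perl
use strict;
use warnings;

# 'café' written with an escape, so the file encoding does not matter here.
my $string = "caf\xE9";

my $flag_on   = utf8::is_utf8($string) ? 1 : 0;   # is the internal UTF8 flag set?
my $length    = length($string);                  # characters as Perl counts them
my $last_code = ord(substr($string, -1));         # code point of the last character

printf "flag=%d length=%d last=U+%04X\n", $flag_on, $length, $last_code;
# prints: flag=0 length=4 last=U+00E9
```

Perl treats the one byte 0xE9 as the one character U+00E9, which is why it "still understands" the string even though the flag is off.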

DBD::mysql does not recognise this as UTF-8 (because missing UTF-8 flag), so accented characters are stripped.
No, DBD::mysql will -currently- assume the string is UTF-8 anyway, but since it's actually Latin-1, the MySQL database will (in my experience) truncate the string at the first accented character. In other words, that value in the database will end up as "latin-1 ".

utf8::upgrade($string) turns on the flag
And it converts the string to utf8 first. At that point you're guaranteed that the internal encoding of $string is really utf-8. utf8::upgrade() is a no-op if the string already is flagged as utf-8, so you can always safely use it when your strings are correctly marked.
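
As a rough illustration of the above (the sample string is an assumption, not taken from the thread): upgrading changes only the internal representation, so the string still compares equal to the original, and a second call changes nothing.

```perl
use strict;
use warnings;

my $original = "caf\xE9";      # Latin-1 byte string, UTF8 flag off
my $upgraded = $original;

# Converts the internal representation in place and returns the number of
# octets needed to hold the string as UTF-8.
my $octets = utf8::upgrade($upgraded);

# The string is already upgraded, so this call does not change it.
my $again = utf8::upgrade($upgraded);

printf "flag=%d same=%d length=%d octets=%d\n",
    (utf8::is_utf8($upgraded) ? 1 : 0),
    ($upgraded eq $original   ? 1 : 0),
    length($upgraded),
    $octets;
# prints: flag=1 same=1 length=4 octets=5
```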
Would using $string = Encode::decode_utf8($string) also work in this case?
No, because the string isn't in utf8 but in the default 8bit encoding.
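
To make the difference concrete, here is a small sketch (the sample string is an assumption) that decodes the same Latin-1 octets once as UTF-8 and once as ISO-8859-1. With Encode's default CHECK behaviour, the malformed byte is replaced by U+FFFD rather than raising an error:

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

my $latin1_bytes = "caf\xE9!";   # 'café!' as Latin-1 octets

# \xE9 followed by '!' is not a valid UTF-8 sequence, so the accented
# character is lost and a replacement character appears instead.
my $wrong = decode_utf8($latin1_bytes);

# Decoding with the encoding the data really is in keeps the character.
my $right = decode('iso-8859-1', $latin1_bytes);

printf "wrong has U+FFFD: %s\n", (index($wrong, "\x{FFFD}") >= 0 ? 'yes' : 'no');
printf "right has U+00E9: %s\n", (index($right, "\x{E9}")   >= 0 ? 'yes' : 'no');
```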

Re^6: A UTF8 round trip with MySQL
by clinton (Priest) on Jun 13, 2007 at 11:50 UTC
    No, because the string isn't in utf8 but in the default 8bit encoding.
    Sorry, that should have been $string = Encode::decode('iso-8859-1',$string)

    From the Encode docs:

    the UTF8 flag for $string is on unless $octets entirely consists of ASCII data (or EBCDIC on EBCDIC machines).

    However, that wouldn't work for $string = 'ρα'; without a preceding use utf8; because it would, by default, be stored internally as Latin-1, and here you would need to utf8::upgrade($string).

    From the perlunicode docs:

    By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in *ISO 8859-1 (Latin-1)*, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.

    Have I got this right?

    Clint
      If your script is in latin-1, decode('iso-8859-1',$string) will work too. As far as I know decode() will always upgrade to utf8 (or ascii, which is a byte-compatible subset of utf-8)

      If $string = 'ρα'; is a literal in a utf-8 encoded script, you should use the utf8 pragma to set the utf-8 markers correctly on literals. And then decode('iso-8859-1') probably won't work correctly on it. But utf8::upgrade() will still work.

      By default, there is a fundamental asymmetry in Perl's unicode model: implicit upgrading from byte strings to Unicode strings assumes that they were encoded in *ISO 8859-1 (Latin-1)*, but Unicode strings are downgraded with UTF-8 encoding. This happens because the first 256 codepoints in Unicode happens to agree with Latin-1.
      I don't know what "Unicode strings are downgraded with UTF-8 encoding" means. Also the line below that paragraph in perlunicode says

      If you wish to interpret byte strings as UTF-8 instead, use the "encoding" pragma: use encoding 'utf8';

      Don't believe it. You should use utf8; instead. use encoding 'utf8' is broken.
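
The effect of use utf8; on literals can be sketched as follows, assuming the script file itself is saved as UTF-8 (that is an assumption about your editor, not something Perl checks). The variable names are only illustrative:

```perl
use strict;
use warnings;

# Assumes this file is saved as UTF-8.
my $as_bytes = 'ρα';            # no pragma: the literal is the four UTF-8 octets
my $as_chars;
{
    use utf8;                   # lexically scoped: literals are read as UTF-8
    $as_chars = 'ρα';           # two characters, UTF8 flag on
}

my $decoded = $as_bytes;
utf8::decode($decoded);         # interpret the octets as UTF-8, in place

printf "bytes: length=%d flag=%d\n", length($as_bytes), (utf8::is_utf8($as_bytes) ? 1 : 0);
printf "chars: length=%d flag=%d\n", length($as_chars), (utf8::is_utf8($as_chars) ? 1 : 0);
printf "decoded matches: %s\n", ($decoded eq $as_chars ? 'yes' : 'no');
```

Without the pragma, the literal holds the raw octets, so utf8::upgrade() on it would give four Latin-1 characters rather than two Greek ones.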
