Re^2: A UTF8 round trip with MySQL

Re^3: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 20:45 UTC
"It seems at odds with the docs for the open function and perlopentut, both of which give examples using it"

Ah, more documentation needs updates then! I'll look into it; thanks for the pointers.

binmode in perlfunc, in the current development tree, already has the following change:

    -To mark FILEHANDLE as UTF-8, use C<:utf8>.
    +To mark FILEHANDLE as UTF-8, use C<:utf8>.  This will fail on invalid
    +UTF-8 sequences; C<:encoding(UTF-8)> is a safer (but slightly less
    +efficient) choice.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
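As an illustration of the layer recommended in that patch, here is a minimal sketch that reads through `:encoding(UTF-8)`. The sample byte strings and the helper name `read_line_checked` are made up for this example; an in-memory filehandle stands in for a real file. The exact warning text and the replacement used for invalid bytes depend on the Perl and Encode versions, so the sketch only counts the warnings.

```perl
use strict;
use warnings;

# Hypothetical sample data: the UTF-8 bytes for "cafe" with an acute
# accent on the e, and a line of stray continuation bytes that is not
# valid UTF-8.
my $good_bytes = "caf\xC3\xA9\n";
my $bad_bytes  = "\x81\x81\n";

# Read one line from an in-memory file through the validating layer.
# (read_line_checked is an illustrative name, not a library function.)
sub read_line_checked {
    my ($bytes) = @_;
    open my $fh, '<:encoding(UTF-8)', \$bytes
        or die "Cannot open in-memory file: $!";
    my $line = <$fh>;
    close $fh;
    return $line;
}

my $good = read_line_checked($good_bytes);
chomp $good;
printf "%d characters, last is U+%04X\n",
    length($good), ord(substr($good, -1));

# Collect warnings instead of printing them, to see what the layer reports
# for the invalid input.
my @warnings;
my $bad = do {
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    read_line_checked($bad_bytes);
};
printf "invalid input produced %d warning(s)\n", scalar @warnings;
```

With the default settings the layer reports invalid bytes as warnings rather than handing a malformed string to the program, which is the "safer" behaviour the patch refers to.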
Re^4: A UTF8 round trip with MySQL by Anonymous Monk on Jul 30, 2013 at 13:28 UTC
I am not sure what could be safer than failing on invalid data - if invalid data is encountered, failing would be better than e.g. guessing and silently corrupting data.
Re^3: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 20:34 UTC
Using "<:utf8" has worked fine for me so far. However, Juerd does know about this stuff. I'd be interested to know the risks involved.

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^4: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 20:52 UTC
"I'd be interested to know the risks involved."

The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal.

Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

The following change is in current blead:

    --- perl-current/pod/perldiag.pod 2007-01-02 19:17:01.000000000 +0100
    +++ mijn/pod/perldiag.pod 2007-03-03 18:12:23.000000000 +0100
    @@ -2263,12 +2263,19 @@
     =item Malformed UTF-8 character (%s)

    -(S utf8) (F) Perl detected something that didn't comply with UTF-8
    -encoding rules.
    +(S utf8) (F) Perl detected a string that didn't comply with UTF-8
    +encoding rules, even though it had the UTF8 flag on.

    -One possible cause is that you read in data that you thought to be in
    -UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
    -possibility is careless use of utf8::upgrade().
    +One possible cause is that you set the UTF8 flag yourself for data that
    +you thought to be in UTF-8 but it wasn't (it was for example legacy
    +8-bit data). To guard against this, you can use Encode::decode_utf8.
    +
    +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
    +sequences are handled gracefully, but if you use C<:utf8>, the flag is
    +set without validating the data, possibly resulting in this error
    +message.
    +
    +See also L<Encode/"Handling Malformed Data">.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
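The guard mentioned in the perldiag patch (decoding with Encode instead of setting the flag yourself) might look roughly like the sketch below. It uses `Encode::decode` with the strict `'UTF-8'` encoding name and `FB_CROAK` rather than `decode_utf8` itself; that choice, the helper name `decode_strict`, and the sample strings are assumptions made for this illustration.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Decode a byte string as strict UTF-8. Returns undef (with the reason
# in $@) instead of passing malformed data on to the rest of the program.
# decode_strict is an illustrative name, not part of Encode.
sub decode_strict {
    my ($bytes) = @_;    # work on a copy; decode may modify its input
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK) };
    return $text;
}

my $ok  = decode_strict("caf\xC3\xA9");    # valid UTF-8 bytes
my $bad = decode_strict("caf\xE9");        # Latin-1 byte, not valid UTF-8

print defined $ok
    ? "valid input: " . length($ok) . " characters\n"
    : "valid input rejected\n";
print defined $bad
    ? "invalid input accepted\n"
    : "invalid input rejected\n";
```

Failing at the point of decoding keeps the check in one place, so later code (regexes, library calls) only ever sees strings that were validated on the way in.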
Re^5: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 21:01 UTC
I now get the 'unchecked input' part. And I can sort of understand issues with tainting.

About the C code: you're talking about C code that tries to interpret the invalid utf-8, right? Because C's basic string operations don't look at the encoding, so they are just as (un)safe when you send them a non-utf8 marked string with miscellaneous binary data in it.

Update: about the (removed) line: "Another possibility is careless use of utf8::upgrade()." That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

"What should it profit a man, if he should win a flame war, yet lose his cool?"
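For reference, a small sketch of what utf8::upgrade() does to a string that starts out without the UTF8 flag: the internal representation changes, the characters do not. The sample string and variable names are made up for this example.

```perl
use strict;
use warnings;

# A Latin-1 string: four characters, the last one is U+00E9.
my $latin1   = "caf\xE9";
my $upgraded = $latin1;

# Convert the internal representation to UTF-8 and turn the flag on.
utf8::upgrade($upgraded);

printf "flag before: %s, flag after: %s\n",
    (utf8::is_utf8($latin1)   ? "on" : "off"),
    (utf8::is_utf8($upgraded) ? "on" : "off");
printf "same characters: %s, length %d\n",
    ($upgraded eq $latin1 ? "yes" : "no"), length($upgraded);
```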
Re^6: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 21:12 UTC
Re^5: A UTF8 round trip with MySQL by mje (Curate) on Mar 31, 2009 at 10:24 UTC
I realise the quoted text is from blead but are you saying the :utf8 IO layer in earlier perls (say 5.8.8 for example) just sets the utf-8 flag without checking the encoding? If so then I don't understand the following in 5.8.8

    od -x x.data
    0000000 8181 8282 8383 000a

    use strict;
    use warnings;
    my $fh;
    open ($fh, "<:utf8", "x.data");
    my $img = '';
    while (<$fh>) {$img .= $_;}

produces 1 utf8 "\x81" does not map to Unicode at invalid_utf8.pl line 8, <$fh> line 1. but changing the io layer to :encoding(UTF8) seems to make no difference other than reporting that same error 6 times, one for each byte.

