Re^4: A UTF8 round trip with MySQL

I'd be interested to know the risks involved.

The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal. Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

The following change is in current blead:

--- perl-current/pod/perldiag.pod       2007-01-02 19:17:01.000000000 +0100
+++ mijn/pod/perldiag.pod       2007-03-03 18:12:23.000000000 +0100
@@ -2263,12 +2263,19 @@

 =item Malformed UTF-8 character (%s)

-(S utf8) (F) Perl detected something that didn't comply with UTF-8
-encoding rules.
+(S utf8) (F) Perl detected a string that didn't comply with UTF-8
+encoding rules, even though it had the UTF8 flag on.

-One possible cause is that you read in data that you thought to be in
-UTF-8 but it wasn't (it was for example legacy 8-bit data).  Another
-possibility is careless use of utf8::upgrade().
+One possible cause is that you set the UTF8 flag yourself for data that
+you thought to be in UTF-8 but it wasn't (it was for example legacy
+8-bit data). To guard against this, you can use Encode::decode_utf8.
+
+If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
+sequences are handled gracefully, but if you use C<:utf8>, the flag is
+set without validating the data, possibly resulting in this error
+message.
+
+See also L<Encode/"Handling Malformed Data">.
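A minimal sketch of the guard the patched text recommends, assuming you decode with Encode's strict 'UTF-8' encoding and the FB_CROAK check value so that malformed input dies instead of being substituted. The helper name decode_or_undef is made up for this illustration; it is not part of the patch or of Encode.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper (not from the patch): decode bytes strictly and
# return undef rather than a flagged-but-malformed string.
sub decode_or_undef {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; eval traps that.
    my $text = eval { decode('UTF-8', $bytes, FB_CROAK | LEAVE_SRC) };
    return $text;    # undef if decode() died
}

my $good = decode_or_undef("caf\xC3\xA9");    # valid UTF-8 bytes
my $bad  = decode_or_undef("\x81\x81\x82");   # lone continuation bytes

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

The point of failing at the boundary is that the rest of the program never sees a string whose UTF8 flag promises something the buffer does not deliver.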

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

Re^5: A UTF8 round trip with MySQL by Joost (Canon) on Jun 13, 2007 at 21:01 UTC
I now get the 'unchecked input' part. And I can sort of understand issues with tainting.

About the C code: you're talking about C code that tries to interpret the invalid utf-8, right? Because C's basic string operations don't look at the encoding, so they are just as (un)safe when you send them a non-utf8 marked string with miscellaneous binary data in it.

Update: about the (removed) line: "Another possibility is careless use of utf8::upgrade()." That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

"What should it profit a man, if he should win a flame war, yet lose his cool?"
Re^6: A UTF8 round trip with MySQL by Juerd (Abbot) on Jun 13, 2007 at 21:12 UTC
About the C code: you're talking about C code that tries to interpret the invalid utf-8, right?

Yes. I was specifically (but implicitly) referring to XS code, C code catered for Perl interaction. The UTF8 flag is interpreted as a promise that the buffer will be valid UTF8. Of course, it would be better to use Perl's macros for UTF8 handling, but that doesn't work if you're calling a library function that doesn't do SVs but does require valid UTF-8.

about the (removed) line: "Another possibility is careless use of utf8::upgrade()." That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

Exactly. The original author probably confused utf8::upgrade with Encode::_utf8_on.

Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }
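To make the difference between the two calls concrete, here is a small sketch. It assumes a one-byte Latin-1 string as the starting point and uses utf8::valid only to inspect the internal buffer; the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode ();

my $latin1 = "\xE9";          # one character, U+00E9, stored as a single byte

my $upgraded = $latin1;
utf8::upgrade($upgraded);     # re-encodes the buffer, then sets the flag
my $same           = $upgraded eq $latin1    ? 1 : 0;  # still the same character string
my $valid_upgraded = utf8::valid($upgraded)  ? 1 : 0;  # buffer is well-formed

my $forced = $latin1;
Encode::_utf8_on($forced);    # only flips the flag; the buffer is still the lone byte E9
my $valid_forced = utf8::valid($forced) ? 1 : 0;       # truncated sequence, so not valid

print "upgrade: same=$same valid=$valid_upgraded\n";
print "_utf8_on: valid=$valid_forced\n";
```

utf8::upgrade changes the representation without changing the string, while Encode::_utf8_on changes the interpretation without touching the bytes, which is how the broken promise comes about.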
Re^5: A UTF8 round trip with MySQL by mje (Curate) on Mar 31, 2009 at 10:24 UTC
I realise the quoted text is from blead but are you saying the :utf8 IO layer in earlier perls (say 5.8.8 for example) just sets the utf-8 flag without checking the encoding? If so then I don't understand the following in 5.8.8

    od -x x.data
    0000000 8181 8282 8383 000a

    use strict;
    use warnings;
    my $fh;
    open ($fh, "<:utf8", "x.data");
    my $img = '';
    while (<$fh>) {$img .= $_;}

produces 1

    utf8 "\x81" does not map to Unicode at invalid_utf8.pl line 8, <$fh> line 1.

but changing the io layer to :encoding(UTF8) seems to make no difference other than reporting that same error 6 times, one for each byte.