PerlMonks  

Re^2: A UTF8 round trip with MySQL

by clinton (Priest)
on Jun 13, 2007 at 20:13 UTC


in reply to Re: A UTF8 round trip with MySQL
in thread A UTF8 round trip with MySQL

Thanks Juerd

The :utf8 layer should not be used on input filehandles. Use :encoding(UTF-8) instead.

Why do you say this? It seems at odds with the docs for the open function and perlopentut, both of which give examples using it:

open(my $fh, "<:utf8", $fn);

thanks

Clint

Re^3: A UTF8 round trip with MySQL
by Juerd (Abbot) on Jun 13, 2007 at 20:45 UTC

    It seems at odds with the docs for the open function and perlopentut, both of which give examples using it

    Ah, more documentation needs updates then! I'll look into it; thanks for the pointers.

    binmode in perlfunc, in the current development tree, already has the following change:

    -To mark FILEHANDLE as UTF-8, use C<:utf8>.
    +To mark FILEHANDLE as UTF-8, use C<:utf8>.  This will fail on invalid
    +UTF-8 sequences; C<:encoding(UTF-8)> is a safer (but slightly less
    +efficient) choice.
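As a minimal sketch of the safer choice named in that doc change, the script below writes a few known-good UTF-8 bytes to a scratch file and reads them back through :encoding(UTF-8). The scratch file and the sample string are assumptions made for illustration; they are not taken from the thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Write known-good UTF-8 bytes to a scratch file (e-acute is 0xC3 0xA9).
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;                          # raw bytes, no encoding layer
print {$out} "caf\xC3\xA9\n";
close $out or die "close: $!";

# Read the file back through the validating layer rather than plain :utf8.
open(my $in, '<:encoding(UTF-8)', $path) or die "open $path: $!";
my $line = <$in>;
close $in;
chomp $line;

# The layer decoded five bytes into four characters.
printf "%d characters\n", length $line;
```

The same layer string can also be passed to binmode on a handle that is already open.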

    Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

      I am not sure what could be safer than failing on invalid data - if invalid data is encountered, failing would be better than e.g. guessing and silently corrupting data.
Re^3: A UTF8 round trip with MySQL
by Joost (Canon) on Jun 13, 2007 at 20:34 UTC

      I'd be interested to know the risks involved.

      The most obvious risk involved is that your program can halt if you have malformed internal data. The error "Malformed UTF-8 character" is fatal. Less obvious risks include security bugs because things may be interpreted differently at different levels: something may pass an untainting regex, but still be unsafe in a library call. This is because there is no single standard way of dealing with malformed byte sequences. With naive (yet common) C code it can even lead to data corruption.

      The following change is in current blead:

      --- perl-current/pod/perldiag.pod 2007-01-02 19:17:01.000000000 +0100
      +++ mijn/pod/perldiag.pod 2007-03-03 18:12:23.000000000 +0100
      @@ -2263,12 +2263,19 @@

       =item Malformed UTF-8 character (%s)

      -(S utf8) (F) Perl detected something that didn't comply with UTF-8
      -encoding rules.
      +(S utf8) (F) Perl detected a string that didn't comply with UTF-8
      +encoding rules, even though it had the UTF8 flag on.

      -One possible cause is that you read in data that you thought to be in
      -UTF-8 but it wasn't (it was for example legacy 8-bit data). Another
      -possibility is careless use of utf8::upgrade().
      +One possible cause is that you set the UTF8 flag yourself for data that
      +you thought to be in UTF-8 but it wasn't (it was for example legacy
      +8-bit data). To guard against this, you can use Encode::decode_utf8.
      +
      +If you use the C<:encoding(UTF-8)> PerlIO layer for input, invalid byte
      +sequences are handled gracefully, but if you use C<:utf8>, the flag is
      +set without validating the data, possibly resulting in this error
      +message.
      +
      +See also L<Encode/"Handling Malformed Data">.
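One way to guard against malformed input when the bytes are already in memory is to decode them explicitly and reject anything invalid. This is a rough sketch, not the approach the patch prescribes: it uses Encode::decode with the FB_CROAK check rather than decode_utf8, and decode_or_undef is a hypothetical helper name invented for this example.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: returns the decoded string, or undef when the
# bytes are not valid UTF-8.
sub decode_or_undef {
    my ($bytes) = @_;
    # FB_CROAK makes decode() die on malformed input; the eval turns that
    # into undef. LEAVE_SRC stops decode() from modifying $bytes.
    my $check = Encode::FB_CROAK() | Encode::LEAVE_SRC();
    return eval { Encode::decode('UTF-8', $bytes, $check) };
}

my $good = decode_or_undef("caf\xC3\xA9");    # valid: e-acute as two bytes
my $bad  = decode_or_undef("\x81\x81\x82");   # lone continuation bytes

print defined $good ? "good: decoded\n" : "good: rejected\n";
print defined $bad  ? "bad: decoded\n"  : "bad: rejected\n";
```

Returning undef keeps the choice with the caller: halt, log and skip, or fall back to a legacy 8-bit encoding.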

      Juerd # { site => 'juerd.nl', do_not_use => 'spamtrap', perl6_server => 'feather' }

        I now get the 'unchecked input' part. And I can sort of understand issues with tainting. About the C code: you're talking about C code that tries to interpret the invalid utf-8, right? Because C's basic string operations don't look at the encoding, so they are just as (un)safe when you send them a non-utf8 marked string with miscellaneous binary data in it.

        update: about the (removed) line: "Another possibility is careless use of utf8::upgrade()."

        That's removed because utf8::upgrade() is always safe (if you start out with valid utf-8 flags), right?

        I realise the quoted text is from blead but are you saying the :utf8 IO layer in earlier perls (say 5.8.8 for example) just sets the utf-8 flag without checking the encoding? If so then I don't understand the following in 5.8.8

        od -x x.data
        0000000 8181 8282 8383 000a

        use strict;
        use warnings;
        my $fh;
        open ($fh, "<:utf8", "x.data");
        my $img = '';
        while (<$fh>) {$img .= $_;}

        produces 1 utf8 "\x81" does not map to Unicode at invalid_utf8.pl line 8, <$fh> line 1.

        but changing the io layer to :encoding(UTF8) seems to make no difference other than reporting that same error 6 times, one for each byte.
