http://qs321.pair.com?node_id=838883


in reply to Re: DWIM with non ASCII characters
in thread DWIM with non ASCII characters

Decode everything that comes from the outside. Encode everything that leaves your program. use utf8;

Why use utf8;? As I understand the documentation, its purpose is to enable the source code to be in UTF-8 (so you can do e.g. my $ñ = 'foo'; where 'ñ' is not a single byte). It even says "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8".

I thought the preferred way to decode/encode the program's input/output was by using Encode.

--
 David Serrano
 (Please treat my english text just like Perl code, i.e. feel free to notify me of any syntax, grammar, style and/or spelling errors. Thank you!).

Re^3: DWIM with non ASCII characters
by moritz (Cardinal) on May 07, 2010 at 07:58 UTC
    Why use utf8;? As I understand the documentation, its purpose is to enable the source code to be in UTF-8

    Yes, that way you avoid concatenating decoded and non-decoded strings.

    Of course it requires your script to be actually stored in UTF-8. But since the more general solution (use encoding $your_encoding) is severely broken (with respect to AUTOLOAD, thread safety and other issues), that's currently the only sane way to store non-ASCII Perl programs.
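    To illustrate what goes wrong when decoded and non-decoded strings are concatenated, here is a minimal sketch. The variable names are made up for the example, and the non-ASCII literal is written with byte escapes so the result does not depend on how the file itself is saved:

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Bytes as they would sit in a UTF-8 source file without "use utf8;"
my $undecoded = "\xc3\xb1";                    # two bytes: the UTF-8 encoding of U+00F1

# A properly decoded character string, e.g. one read through an IO layer
my $decoded = decode('UTF-8', "\xc3\xb1");     # one character: U+00F1

# Concatenating the two yields a string that is part characters, part raw bytes
my $mixed = $decoded . $undecoded;

# Encoding for output treats every element as a character,
# so the undecoded bytes get encoded a second time
my $out = encode('UTF-8', $mixed);

printf "length of mixed string: %d\n", length $mixed;
printf "output bytes: %s\n",
    join ' ', map { sprintf '%02x', ord($_) } split //, $out;
```

    The mixed string has three elements instead of two, and the output contains c3 b1 followed by c3 83 c2 b1, i.e. the second ñ comes out double-encoded.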

    As for the rest, I can only agree with what ikegami wrote; using IO layers is much more convenient than using encode() and decode() on every IO operation. More importantly, since there are fewer spots where you have to care about encoding, the probability of forgetting it somewhere (and getting Mojibake in response) is much lower.

    Perl 6 - links to (nearly) everything that is Perl 6.
Re^3: DWIM with non ASCII characters
by ikegami (Patriarch) on May 07, 2010 at 07:48 UTC

    More importantly, use utf8; allows you to do

    my $foo = 'ñ';

    So far, I've stuck to ASCII in my sources, so use utf8; wouldn't do anything for me.

    I thought the preferred way to decode/encode the program's input/output was by using Encode.

    No way. Why encode and decode everything yourself when you can let PerlIO do it? At least, that's the way I see it.
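    A minimal sketch of letting PerlIO do the work, assuming the data is UTF-8. In-memory filehandles stand in for real files here so the example is self-contained; the variable names are only illustrative:

```perl
use strict;
use warnings;

# Stand-in for an external file: the UTF-8 bytes of "ñu" plus a newline
my $input_bytes = "\xc3\xb1u\n";

# Decode on the way in: the layer is declared once, at open time
open my $in, '<:encoding(UTF-8)', \$input_bytes or die "open: $!";
my $line = <$in>;
close $in;
chomp $line;

# Inside the program we now work with characters
my $upper = uc $line;

# Encode on the way out: again declared once, at open time
my $output_bytes = '';
open my $out, '>:encoding(UTF-8)', \$output_bytes or die "open: $!";
print {$out} $upper;
close $out;

printf "characters read: %d\n", length $line;
printf "bytes written:   %d\n", length $output_bytes;
```

    No call to encode() or decode() appears anywhere; the only places that mention the encoding are the two open calls.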

      More importantly, use utf8; allows you to do
      my $foo = 'ñ';

      Hmm, then I must have configured something in my system, since I can do that without use'ing utf8:

      $ xxd ñ.pl
      0000000: 7072 696e 7420 27c3 b127 0a              print '..'.
      $ env -i /usr/bin/perl -Mstrict -wl ñ.pl
      ñ


        This only works because you have a UTF-8 terminal, but haven't told Perl about it.  In other words, Perl is treating the UTF-8 encoded byte sequence in the source code - which represents the Unicode char U+00F1 (ñ) - as two separate bytes, and passes them on as is (i.e. UTF-8 encoded) to the terminal, which consequently displays the character correctly.

        Internally, however, Perl doesn't have a character string, so you cannot properly match, etc.:

        #!/usr/local/bin/perl -l
        use strict;
        use warnings;
        use Encode;

        my $bytes = 'ñ';  # UTF-8 encoded source (c3 b1 = ñ)
                          # displays as two latin1 chars here (c3 = Ã, b1 = ±),
                          # because PM doesn't handle UTF-8
        my $chars = decode('UTF-8', $bytes);

        print '$bytes eq \x{f1} ? ', $bytes eq "\x{f1}" ? "match":"no match";
        print '$chars eq \x{f1} ? ', $chars eq "\x{f1}" ? "match":"no match";

        print '$bytes: ', $bytes;
        print '$chars: ', $chars;

        binmode STDOUT, "utf8";

        print '$bytes (STDOUT is UTF-8): ', $bytes;
        print '$chars (STDOUT is UTF-8): ', $chars;

        The string comparison outputs:

        $bytes eq \x{f1} ? no match
        $chars eq \x{f1} ? match

        and the byte/char values print as (in a UTF-8 terminal):

        $bytes: ñ
        $chars:
        $bytes (STDOUT is UTF-8): Ã±
        $chars (STDOUT is UTF-8): ñ

        Note that as soon as you tell Perl that your terminal is UTF-8 (with binmode), the byte string stops printing correctly, because Perl is now converting the two byte/latin1 chars c3 and b1 to the respective UTF-8 sequences c3 83 and c2 b1, which display as two separate characters...

        That example demonstrates the use of an optimisation: You skipped specifying use utf8; by also skipping encoding. In the common case, it won't work. You'll find the length of the string is wrong. In turn, that means you'll have problems with regex, etc.
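        A small sketch of the length and regex problem. The byte escapes stand in for what a 'ñ' literal holds in a UTF-8 source file without use utf8;, and decode() stands in for what the same literal would hold with it; the variable names are just for illustration:

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bytes = "\xc3\xb1";                  # 'ñ' as seen without "use utf8;"
my $chars = decode('UTF-8', $bytes);     # 'ñ' as seen with "use utf8;"

my $bytes_len = length $bytes;
my $chars_len = length $chars;

# /^.$/ matches a string of exactly one character
my $bytes_single = $bytes =~ /^.$/ ? 'yes' : 'no';
my $chars_single = $chars =~ /^.$/ ? 'yes' : 'no';

print "length(bytes) = $bytes_len, single char: $bytes_single\n";
print "length(chars) = $chars_len, single char: $chars_single\n";
```

        The undecoded string reports a length of 2 and fails the single-character match, while the decoded one reports 1 and matches.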