Re^2: DWIM with non ASCII characters

in reply to Re: DWIM with non ASCII characters
in thread DWIM with non ASCII characters

Decode everything that comes from the outside. Encode everything that leaves your program. use utf8;

Why use utf8;? As I understand the documentation, its purpose is to enable the source code to be in UTF-8 (so you can do e.g. my $ñ = 'foo'; where 'ñ' is not a single byte). It even says "Do not use this pragma for anything else than telling Perl that your script is written in UTF-8".

I thought the preferred way to decode/encode the program's input/output was by using Encode.

--
David Serrano
(Please treat my english text just like Perl code, i.e. feel free to notify me of any syntax, grammar, style and/or spelling errors. Thank you!).


Re^3: DWIM with non ASCII characters by moritz (Cardinal) on May 07, 2010 at 07:58 UTC
> Why use utf8;? As I understand the documentation, its purpose is to enable the source code to be in UTF-8

Yes, that way you avoid concatenating decoded and non-decoded strings. Of course it requires your script to actually be stored in UTF-8. But since the more general solution (use encoding $your_encoding) is severely broken (with respect to AUTOLOAD, thread safety and other issues), that's currently the only sane way to store non-ASCII Perl programs.

As for the rest, I can only agree with what ikegami wrote; using IO layers is much more convenient than using encode() and decode() on every IO operation. More importantly, since there are fewer spots where you have to care about encoding, the probability of forgetting it somewhere (and getting mojibake in response) is much lower.

Perl 6 - links to (nearly) everything that is Perl 6.
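To illustrate the point about mixing decoded and non-decoded strings, here is a minimal sketch. The variable names are made up for this example, and the non-ASCII data is written with `\x` escapes so the result does not depend on how the file itself is saved.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical example data: the same letter (U+00F1), once left as raw
# UTF-8 bytes and once decoded into a character string.
my $raw     = "\xc3\xb1";                    # two bytes, never decoded
my $decoded = decode('UTF-8', "\xc3\xb1");   # one character

# Perl cannot tell that $raw holds encoded data, so it treats each byte
# as a Latin-1 character when the two strings are joined.
my $mixed = $decoded . $raw;

# Encoding on the way out therefore encodes the raw bytes a second time.
my $out = encode('UTF-8', $mixed);
my $hex = join ' ', map { sprintf '%02x', ord } split //, $out;

print "characters in mixed string: ", length($mixed), "\n";   # 3
print "bytes written: $hex\n";   # c3 b1 c3 83 c2 b1
```

The first two bytes are the correctly encoded character; the remaining four are the double-encoded copy, which shows up as mojibake on a UTF-8 terminal.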
Re^3: DWIM with non ASCII characters by ikegami (Patriarch) on May 07, 2010 at 07:48 UTC
More importantly, `use utf8;` allows you to do `my $foo = 'ñ';`. So far, I've stuck to ASCII in my sources, so `use utf8;` wouldn't do anything for me.

> I thought the preferred way to decode/encode the program's input/output was by using Encode.

No way. Why encode and decode everything yourself when you can let PerlIO do it? At least, that's the way I see it.
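As a rough sketch of the difference between the two approaches, the following compares a manual `encode`/`decode` round trip with declaring the encoding once on the handle. It uses in-memory filehandles and invented variable names purely so the example is self-contained; with real files or STDIN/STDOUT the layer would go on those handles instead.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $text = "ma\x{f1}ana";   # character string, 6 characters

# Manual approach: every write needs encode(), every read needs decode().
my $manual_bytes = encode('UTF-8', $text);
my $manual_chars = decode('UTF-8', $manual_bytes);

# PerlIO approach: state the encoding once when opening the handle.
my $buffer = '';
open my $out, '>:encoding(UTF-8)', \$buffer or die "open for write: $!";
print {$out} $text;
close $out;

open my $in, '<:encoding(UTF-8)', \$buffer or die "open for read: $!";
my $layer_chars = <$in>;
close $in;

printf "bytes written via layer: %d\n", length $buffer;
printf "same bytes as encode():  %s\n", $buffer eq $manual_bytes ? 'yes' : 'no';
printf "layer round trip ok:     %s\n", $layer_chars eq $text ? 'yes' : 'no';
```

Both routes produce the same bytes; the layer just removes the per-operation calls that are easy to forget.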
Re^4: DWIM with non ASCII characters by Hue-Bond (Priest) on May 07, 2010 at 08:23 UTC
> More importantly, `use utf8;` allows you to do `my $foo = 'ñ';`

Hmm, then I must have configured something in my system, since I can do that without use'ing utf8:

```
$ xxd ñ.pl
0000000: 7072 696e 7420 27c3 b127 0a            print '..'.
$ env -i /usr/bin/perl -Mstrict -wl ñ.pl
ñ
```
Re^5: DWIM with non ASCII characters by almut (Canon) on May 07, 2010 at 15:10 UTC
This only works because you have a UTF-8 terminal, but haven't told Perl about it. In other words, Perl is treating the UTF-8 encoded byte sequence in the source code - which represents the Unicode char `U+00F1 (ñ)` - as two separate bytes, and passes them on as is (i.e. UTF-8 encoded) to the terminal, which consequently displays the character correctly. Internally, however, Perl doesn't have a character string, so you cannot properly match, etc.:

```perl
#!/usr/local/bin/perl -l

use strict;
use warnings;
use Encode;

my $bytes = 'Ã±';   # UTF-8 encoded source (c3 b1 = ñ)
                    # displays as two latin1 chars here (c3 = Ã, b1 = ±),
                    # because PM doesn't handle UTF-8

my $chars = decode('UTF-8', $bytes);

print '$bytes eq \x{f1} ? ', $bytes eq "\x{f1}" ? "match":"no match";
print '$chars eq \x{f1} ? ', $chars eq "\x{f1}" ? "match":"no match";

print '$bytes: ', $bytes;
print '$chars: ', $chars;

binmode STDOUT, "utf8";

print '$bytes (STDOUT is UTF-8): ', $bytes;
print '$chars (STDOUT is UTF-8): ', $chars;
```

The string comparison outputs:

```
$bytes eq \x{f1} ? no match
$chars eq \x{f1} ? match
```

and the byte/char values print as (in a UTF-8 terminal):

```
$bytes: ñ
$chars: 
$bytes (STDOUT is UTF-8): Ã±
$chars (STDOUT is UTF-8): ñ
```

Note that as soon as you tell Perl that your terminal is UTF-8 (with `binmode`), the byte string stops printing correctly, because Perl is now converting the two byte/latin1 chars `c3` and `b1` to the respective UTF-8 sequences `c3 83` and `c2 b1`, which display as two separate characters...
Re^6: DWIM with non ASCII characters by Hue-Bond (Priest) on May 07, 2010 at 21:06 UTC
Re^5: DWIM with non ASCII characters by ikegami (Patriarch) on May 07, 2010 at 16:09 UTC
That example demonstrates the use of an optimisation: You skipped specifying `use utf8;` by also skipping encoding. In the common case, it won't work. You'll find the length of the string is wrong. In turn, that means you'll have problems with regexes, etc.
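The length and regex problems can be seen in a small sketch like the one below. It assumes the file is saved as UTF-8, and relies on `use utf8;` being lexically scoped so both behaviours fit in one script; the variable names are only for illustration.

```perl
use strict;
use warnings;

# Assumption: this file is stored as UTF-8, so the literal 'ñ' below is
# the two bytes c3 b1 on disk.
my ($len_bytes, $len_chars, $match_bytes, $match_chars);

{
    no utf8;    # the default: source literals are taken as raw bytes
    my $s = 'ñ';
    $len_bytes   = length $s;
    $match_bytes = $s =~ /^.$/ ? 'yes' : 'no';
}
{
    use utf8;   # source literals are decoded from UTF-8
    my $s = 'ñ';
    $len_chars   = length $s;
    $match_chars = $s =~ /^.$/ ? 'yes' : 'no';
}

print "without use utf8: length $len_bytes, matches /^.\$/: $match_bytes\n";
print "with use utf8:    length $len_chars, matches /^.\$/: $match_chars\n";
```

Without the pragma the string has length 2 and fails the single-character match; with it the length is 1 and the match succeeds.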
