Re: The Queensrÿche Situation

by aitap (Curate)
on Oct 19, 2014 at 19:00 UTC ([id://1104335])


in reply to The Queensrÿche Situation

You neither used binmode to apply an I/O layer that encodes the Unicode characters you print to STDOUT, nor encoded them manually. When Perl encounters characters where it expects bytes (in any I/O), it applies a heuristic to translate the former to the latter. Usually this means that a string whose characters all fit in latin1 gets (silently!) written as latin1 bytes, and a string containing anything wider is written as utf8 (with a warning):

$ perl -w -Mutf8 -E'say "ы"; say "ÿ";'
Wide character in say at -e line 1.
ы
�
(my terminal is utf-8)

And when you use utf8, Perl decodes the utf8 string literals in your source into characters for you. Encode::decode does the same for byte strings at run time.

Does adding binmode STDOUT, ":utf8"; fix your problem? You can also use :encoding(...) IOLayers to encode into other encodings.
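
To see what a layer actually does to the bytes, here is a minimal sketch. It assumes an in-memory filehandle stands in for STDOUT, and bytes_via_layer is a hypothetical helper written for this illustration (it is not part of any module). The character is written with an escape so the result does not depend on how the script file is saved.

```perl
use strict;
use warnings;

# Hypothetical helper: print $text to an in-memory filehandle opened
# with the given layer and return the bytes written, as hex.
sub bytes_via_layer {
    my ($layer, $text) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $y = "\x{ff}";    # U+00FF, LATIN SMALL LETTER Y WITH DIAERESIS

print bytes_via_layer(':encoding(UTF-8)',      $y), "\n";   # c3 bf
print bytes_via_layer(':encoding(iso-8859-1)', $y), "\n";   # ff
print bytes_via_layer(':encoding(UTF-16LE)',   $y), "\n";   # ff 00
```

The same one character comes out as different byte sequences depending on the layer, which is why the layer has to match what the terminal expects.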

Re^2: The Queensrÿche Situation
by Rodster001 (Pilgrim) on Oct 19, 2014 at 19:24 UTC
    Yes! That fixes the printing problem in my terminal. And this makes complete sense now. Thank you for clearing this up!

    One problem remains that I still don't quite understand.

    #!/usr/bin/perl
    use strict;
    use Encode;
    use Text::Unaccent::PurePerl;

    binmode STDOUT, ":utf8";

    use utf8;
    my $string = "Queensrÿche";
    no utf8;

    chars($string);
    (Encode::is_utf8($string)) ? print "this is utf8\n" : print "this is NOT utf8\n";
    print "$string\n";
    print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
    exit;

    sub chars {
        my $k = shift;
        my @chars = split("",$k);
        foreach (@chars) {
            my $dec = ord($_);
            my $chr = chr(ord($_));
            my $q = qquote($_);
            print "\t$dec\t$chr\t$q\n";
        }
    }

    sub qquote {
        local($_) = shift;
        s/([\\\"\@\$])/\\$1/g;
        my $bytes;
        { use bytes; $bytes = length }
        s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
        return $_;
    }

    Why does that produce this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        255     ÿ       \x{ff}
        99      c       c
        104     h       h
        101     e       e
    this is utf8
    Queensrÿche
    unaccented: Queensryche

    Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:

        81      Q       Q
        117     u       u
        101     e       e
        101     e       e
        110     n       n
        115     s       s
        114     r       r
        195     -       \x{c3}
        191     -       \x{bf}
        99      c       c
        104     h       h
        101     e       e

      When you work with Unicode strings, you should see character codes (code points, such as 255 for ÿ or 1099 for ы), not byte sequences, because Perl encapsulates the encoding for you. For example:

      use utf8;
      binmode STDOUT, ":utf8";
      my $string = "Queensrÿche ы";
      printf "%x\t%s\n", ord($_), $_ for split "", $string;
      __END__
      51      Q
      75      u
      65      e
      65      e
      6e      n
      73      s
      72      r
      ff      ÿ
      63      c
      68      h
      65      e
      20       
      44b     ы
      

      If you need to work with utf-8 bytes, encode them back:

      use utf8;
      use Encode 'encode';
      binmode STDOUT, ":utf8";
      my $string = "Queensrÿche ы";
      printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
      __END__
      51      Q
      75      u
      65      e
      65      e
      6e      n
      73      s
      72      r
      c3      Ã
      bf      ¿
      63      c
      68      h
      65      e
      20       
      d1      Ñ
      8b
      
      But there would be no point in using both utf8 and Encode in this case: you would decode the literal only to encode it straight back.
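
A short round-trip sketch may make the bytes/characters distinction concrete. It assumes the input really is valid UTF-8, and the bytes are written with escapes so the example does not depend on the encoding of the script file; the variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The UTF-8 bytes of "Queensr" + U+00FF + "che", spelled out as escapes.
my $bytes = "Queensr\xc3\xbfche";

my $chars = decode('UTF-8', $bytes);   # bytes      -> characters
my $again = encode('UTF-8', $chars);   # characters -> bytes

printf "byte string length:      %d\n", length $bytes;    # 12
printf "character string length: %d\n", length $chars;    # 11
printf "code point at index 7:   0x%x\n", ord(substr $chars, 7, 1);   # 0xff
print  "round trip gives back the original bytes\n" if $again eq $bytes;
```

After decoding, the two bytes c3 bf count as a single character with code 0xff, which matches the 255 in the first listing above.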

        Ok, this is all falling into place for me now. Thank you.
      "Yes! That fixes the printing problem in my terminal!"

      That's nice. But just to add a little bit of confusion, please see this:

      A One-Liner prints it out as expected:

      karl$ perl -e 'print qq(Queensrÿche\n)'
      Queensrÿche

      But please see what happens when I put the stuff into a script (in the same terminal session):

      #!/usr/bin/env perl
      use strict;
      use warnings;

      binmode STDOUT, ":utf8";

      my $string = qq(Queensrÿche);
      print qq($string\n);

      my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
      print qq($y_with_trema\n);

      $string = qq(Queensr) . $y_with_trema . qq(che);
      print qq($string\n);

      __END__

      karls-mac-mini:monks karl$ ./roadster001.pl
      Queensrÿche
      ÿ
      Queensrÿche

      Seems like things are getting weird. I wonder when I will ever understand this crap.

      N.B.: I came in a bit late and didn't read all the posts yet.

      Best regards, Karl

      «The Crux of the Biscuit is the Apostrophe»

        I can actually answer this now :) Take a look below, hopefully that will clear it up.
        #!/usr/bin/env perl
        use strict;
        use warnings;
        use Encode;

        binmode STDOUT, ":utf8";

        my $string = qq(Queensrÿche);
        print qq($string\n);
        Encode::is_utf8($string) ? print " - is utf8\n" : print " - is not utf8\n";

        use utf8;
        $string = qq(Queensrÿche);
        no utf8;
        print qq($string\n);
        Encode::is_utf8($string) ? print " - is utf8\n" : print " - is not utf8\n";

        Output:

        Queensrÿche
         - is not utf8
        Queensrÿche
         - is utf8

      I figured it out, sort of. The first is actually extended ascii, i.e. Latin-1 (255 maps to "ÿ"): http://www.ascii-code.com

      So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8) for example:

      Decimal Char    escaped
      81      Q       Q
      117     u       u
      101     e       e
      101     e       e
      110     n       n
      115     s       s
      114     r       r
      195     -       \x{c3}
      191     -       \x{bf}
      99      c       c
      104     h       h
      101     e       e

      It is now printing on my terminal like this:
      QueensrÃ¿che
      This makes sense, in a way, now because 195 maps to "Ã" and 191 maps to "¿". So, now my question is, why isn't this mapping using a utf-8 table (instead of ascii)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
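
One way to see what is probably going on, as a sketch rather than a definitive diagnosis: it assumes the string being printed still holds the raw utf-8 bytes (195, 191) and that STDOUT has the :utf8 layer applied. The layer then treats each byte as a character of its own and encodes it again. utf8_layer_output is a hypothetical helper that uses an in-memory handle in place of the terminal.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: print $text through a :utf8 layer into an
# in-memory handle and return the bytes written, as hex.
sub utf8_layer_output {
    my ($text) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $buffer;
}

my $bytes = "\xc3\xbf";                 # utf-8 bytes of U+00FF, not decoded
my $chars = decode('UTF-8', $bytes);    # one character, U+00FF

print utf8_layer_output($bytes), "\n";  # c3 83 c2 bf  (displays as "Ã¿")
print utf8_layer_output($chars), "\n";  # c3 bf        (displays as "ÿ")
```

If this matches your case, decoding the input once (or reading it through an :encoding(UTF-8) layer) before printing should avoid the double encoding.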
