#!/usr/bin/perl
use strict;
use Encode;
use Text::Unaccent::PurePerl;
binmode STDOUT, ":utf8";
use utf8;
my $string = "Queensrÿche";
no utf8;
chars($string);
(Encode::is_utf8($string))? print "this is utf8\n" : print "this is NOT utf8\n";
print "$string\n";
print "unaccented: " . Text::Unaccent::PurePerl::unac_string($string) . "\n";
exit;
sub chars {
my $k = shift;
my @chars = split("",$k);
foreach (@chars) {
my $dec = ord($_);
my $chr = chr(ord($_));
my $q = qquote($_);
print "\t$dec\t$chr\t$q\n";
}
}
sub qquote {
local($_) = shift;
s/([\\\"\@\$])/\\$1/g;
my $bytes; { use bytes; $bytes = length }
s/([[:^ascii:]])/'\x{'.sprintf("%x",ord($1)).'}'/ge if $bytes > length;
return $_;
}
Why does that produce this:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
255 ÿ \x{ff}
99 c c
104 h h
101 e e
this is utf8
Queensrÿche
unaccented: Queensryche
Is that actually valid utf-8? Shouldn't the ÿ be two bytes (decimal 195 191)? Like this:
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 - \x{c3}
191 - \x{bf}
99 c c
104 h h
101 e e
When you work with Unicode, you should get character code points (possibly greater than 255), not byte sequences, because Perl hides the encoding from you. For example:
use utf8;
binmode STDOUT, ":utf8";
my $string = "Queensrÿche ы";
printf "%x\t%s\n", ord($_), $_ for split "", $string;
__END__
51 Q
75 u
65 e
65 e
6e n
73 s
72 r
ff ÿ
63 c
68 h
65 e
20
44b ы
If you need to work with utf-8 bytes, encode them back:
use utf8;
use Encode 'encode';
binmode STDOUT, ":utf8";
my $string = "Queensrÿche ы";
printf "%x\t%s\n", ord($_), $_ for split "", encode utf8 => $string;
__END__
51 Q
75 u
65 e
65 e
6e n
73 s
72 r
c3 Ã
bf ¿
63 c
68 h
65 e
20
d1 Ñ
8b
But there would be no point in using utf8 and Encode in this case.
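To make the bytes-versus-characters distinction concrete, here is a minimal sketch of the round trip. It assumes the input is the UTF-8 byte sequence for "Queensrÿche", written with hex escapes so the example does not depend on how the script file itself is saved. The variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# UTF-8 bytes for "Queensrÿche" (ÿ is the two bytes C3 BF).
my $bytes = "Queensr\xC3\xBFche";

# decode() turns the byte sequence into a string of characters.
my $chars = decode('UTF-8', $bytes);

# encode() turns the characters back into a byte sequence.
my $round_trip = encode('UTF-8', $chars);

printf "bytes:      length %d\n", length $bytes;                   # 12
printf "characters: length %d\n", length $chars;                   # 11
printf "8th character has ord %d\n", ord substr($chars, 7, 1);     # 255
print  "round trip gives the original bytes back\n" if $round_trip eq $bytes;
```

So 255 and 195 191 are both "right": the first is the code point of the character, the second is how that character is stored as UTF-8 bytes.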
Ok, this is all falling into place for me now. Thank you.
"Yes! That fixes the printing problem in my terminal!"
That's nice. But just to add a little bit of confusion, please see this:
A One-Liner prints it out as expected:
karl$ perl -e 'print qq(Queensrÿche\n)'
Queensrÿche
But please see what happens when I put the stuff into a script (in the same terminal session):
#!/usr/bin/env perl
use strict;
use warnings;
binmode STDOUT, ":utf8";
my $string = qq(Queensrÿche);
print qq($string\n);
my $y_with_trema = qq(\N{LATIN SMALL LETTER Y WITH DIAERESIS});
print qq($y_with_trema\n);
$string = qq(Queensr) . $y_with_trema . qq(che);
print qq($string\n);
__END__
karls-mac-mini:monks karl$ ./roadster001.pl
QueensrÃ¿che
ÿ
Queensrÿche
Seems like things are getting weird. I wonder when I will ever understand this crap.
N.B.: I came in a bit late and didn't read all the posts yet.
Best regards, Karl
«The Crux of the Biscuit is the Apostrophe»
I can actually answer this now :) Take a look below, hopefully that will clear it up.
#!/usr/bin/env perl
use strict;
use warnings;
use Encode;
binmode STDOUT, ":utf8";
my $string = qq(Queensrÿche);
print qq($string\n);
Encode::is_utf8($string)? print " - is utf8\n" : print " - is not utf8\n";
use utf8;
$string = qq(Queensrÿche);
no utf8;
print qq($string\n);
Encode::is_utf8($string)? print " - is utf8\n" : print " - is not utf8\n";
Output:
QueensrÃ¿che
- is not utf8
Queensrÿche
- is utf8
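The same effect can be reproduced without relying on how the source file is saved. This is only a sketch: it assumes the literal holds the UTF-8 bytes of "Queensrÿche" (written here with hex escapes), and it uses an explicit decode() to stand in for what "use utf8" does to string literals. The names $raw and $text are made up for the example.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Encode qw(decode);

binmode STDOUT, ':encoding(UTF-8)';

# The bytes a UTF-8 source file holds for the literal when "use utf8" is off.
my $raw = "Queensr\xC3\xBFche";

# Decoding the bytes gives the character string that "use utf8" would produce.
my $text = decode('UTF-8', $raw);

for my $s ($raw, $text) {
    printf "%s  length=%d  flag=%s\n",
        $s, length $s, utf8::is_utf8($s) ? 'on' : 'off';
}
```

The undecoded string has 12 elements and prints garbled through the output layer, because each byte is treated as a character of its own. The decoded string has 11 characters and prints as intended.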
I figured it out, sort of. The first is actually Latin-1 (ISO-8859-1, sometimes called "extended ASCII"), where 255 maps to "ÿ": http://www.ascii-code.com
So, when I take the string "Queensrÿche" (which IS actually encoded as utf-8) for example:
Decimal Char escaped
81 Q Q
117 u u
101 e e
101 e e
110 n n
115 s s
114 r r
195 - \x{c3}
191 - \x{bf}
99 c c
104 h h
101 e e
It is now printing on my terminal like this:
QueensrÃ¿che
This makes sense, in a way, now because 195 maps to "Ã" and 191 maps to "¿". So, now my question is, why isn't this mapping using a utf-8 table (instead of Latin-1)? Encode thinks the string is utf-8 (which I assume means the utf-8 flag is on).
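One plausible reading (an assumption, since the thread does not show how the string was read in) is that the two bytes were never decoded, so Perl treats them as the two characters U+00C3 and U+00BF, and the output layer then encodes each of them again. A small sketch of that double encoding follows; hex_bytes is a hypothetical helper written just for this example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: show each element of a string as two hex digits.
sub hex_bytes {
    my ($s) = @_;
    return join ' ', map { sprintf('%02x', ord $_) } split //, $s;
}

my $raw = "\xC3\xBF";   # UTF-8 bytes of ÿ, not yet decoded

# Encoding the undecoded bytes treats C3 and BF as characters: mojibake.
my $double = encode('UTF-8', $raw);

# Decoding first, then encoding on the way out, gives the original bytes.
my $correct = encode('UTF-8', decode('UTF-8', $raw));

print "undecoded, then encoded: ", hex_bytes($double),  "\n";   # c3 83 c2 bf
print "decoded, then encoded:   ", hex_bytes($correct), "\n";   # c3 bf
```

The flag reported by Encode::is_utf8 only describes how Perl stores the string internally. It does not say whether the contents were ever decoded from UTF-8, which is why a flagged string can still hold 195 and 191 as separate characters.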