RCH has asked for the wisdom of the Perl Monks concerning the following question:
Dear PerlMonks
I am trying to make myself a table of Unicode (UTF-8) characters
with their decimal, hex, binary and byte equivalents
Here is what a bit of it should look like
A. Ð Ñ Ò Ó ...
B. 208 209 210 211 ...
C. d0 d1 d2 d3 ...
D. 11010000 11010001 11010010 11010011 ...
E. c3 90 c3 91 c3 92 c3 93 ...
F. 11000011 11000011 11000011 11000011 ...
G. 10010000 10010001 10010010 10010011 ...
I know how to make rows A, B, C and D [1].
How do I generate lines E, F and G in Perl?
RichardH
[1] Using sprintf in a loop:
A: "%s", chr($n);
B: "%d",$n;
C: "%x",$n;
D: "%b",$n;
Re: How to print the actual bytes of UTF-8 characters ?
by choroba (Cardinal) on Feb 06, 2014 at 15:01 UTC
Using a variable as a file to handle the encodings:
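#!/usr/bin/perl
use warnings;
use strict;
use utf8;                              # the source contains literal UTF-8 characters
binmode STDOUT, ':encoding(UTF-8)';    # print characters, not Latin-1 bytes
for my $char (qw(Ð Ñ Ò Ó)) {
    my $n = ord $char;
    # Write the character to an in-memory file through a UTF-8 layer.
    open my $BYTE, '>:utf8', \ my $bytes or die $!;
    print {$BYTE} $char;
    close $BYTE;                       # flush before reading $bytes
    # The two UTF-8 bytes are listed twice: once for %x %x, once for %b %b.
    printf "%s\t%s\t%x\t%b\t%x %x\t %b %b\n",
        $char, $n, $n, $n, (unpack('CC', $bytes)) x 2;
}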
The pivoting of the table left as an exercise to the reader.
Magic!
Could you explain how it works?
I had tried a simple-minded unpack('C', $char) but it gave me the wrong answer.
There are two things that I don't understand in your unpack solution:
(1) what are the contents of $bytes, and
(2) what is the function of the backslash "\" in
open my $BYTE, '>:utf8', \ my $bytes;
?
\ is the reference operator. Instead of using a file, I open the variable for output (see FILEHANDLE, MODE, REFERENCE in open). I set its encoding to UTF-8 and print the character to it. $bytes now contains the two bytes of the character as encoded in UTF-8.
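To make the contents of $bytes concrete, here is a minimal sketch of the same in-memory filehandle trick. It is an illustration rather than choroba's exact code: it uses the ':encoding(UTF-8)' layer and an explicit close, and writes the character as "\x{D0}" so that the example does not depend on how the source file is saved.

```perl
use strict;
use warnings;

my $char = "\x{D0}";    # the single character Ð (U+00D0)

# Open an in-memory "file" backed by $bytes, with a UTF-8 output layer.
open my $fh, '>:encoding(UTF-8)', \my $bytes
    or die "Cannot open in-memory file: $!";
print {$fh} $char;
close $fh;              # make sure everything has been written to $bytes

printf "characters in \$char: %d\n", length $char;     # 1
printf "bytes in \$bytes: %d\n",     length $bytes;    # 2
printf "byte values: %s\n",
    join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;    # c3 90
```

So $char holds one character, while $bytes holds the two octets that UTF-8 uses to store it.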
Why
open my $BYTE, '>:utf8', \my $bytes; print {$BYTE} $char;
when you can simply write
utf8::encode(my $bytes = $char);
?
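A short sketch of the utf8::encode approach, assuming the character is written as "\x{D0}" and using the illustrative variable name $octets. Note that utf8::encode works in place, which is why the character is copied into a new variable first.

```perl
use strict;
use warnings;

my $char = "\x{D0}";    # the single character Ð (U+00D0)

# utf8::encode() converts its argument in place, so encode a copy.
utf8::encode(my $octets = $char);

print join(' ', map { sprintf '%02x', $_ } unpack 'C*', $octets), "\n";    # c3 90
```

$char itself is left untouched and still contains one character.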
Re: How to print the actual bytes of UTF-8 characters ?
by atcroft (Abbot) on Feb 06, 2014 at 16:19 UTC
For simplicity, I would deal with the integer values. (The following was adapted from a one-liner, and made heavy use of Tom Christiansen's Unicode articles on perl.com.)
(I don't work with Unicode often, but an error was thrown when I tried using 0xD800 as a character. I seem to remember there are some ranges that may not be defined, so adding anything above 0xD7FF and formatting changes are left as exercises for the reader.)
Hope that helps.
Update: 2014-02-06
This article shed light on why 0xD800-0xDFFF are considered invalid. The code above was updated to skip said range.
Update: 2014-02-06
Remove left-over debug print(); add eval() around print to catch invalid Unicode points.
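The script itself is not reproduced here, so the following is only a sketch of the idea described in the updates: work from integer code points and skip the surrogate range 0xD800-0xDFFF. The helper name utf8_octets_for is hypothetical, and strict encode('UTF-8', ...) from Encode is an assumption about how the bytes might be produced.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper (not from the original post): returns the UTF-8 octets
# for a code point, or an empty list for surrogates.
sub utf8_octets_for {
    my ($cp) = @_;
    return () if $cp >= 0xD800 && $cp <= 0xDFFF;    # surrogates are not characters
    return unpack 'C*', encode('UTF-8', chr $cp);
}

# Probe both edges of the surrogate range.
for my $cp (0xD7FF, 0xD800, 0xDFFF, 0xE000) {
    my @octets = utf8_octets_for($cp);
    printf "U+%04X: %s\n", $cp,
        @octets ? join(' ', map { sprintf '%02x', $_ } @octets) : '(skipped)';
}
```

U+D7FF and U+E000 each encode to three bytes, while the two surrogate code points in between are skipped.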
Re: How to print the actual bytes of UTF-8 characters ?
by andal (Hermit) on Feb 07, 2014 at 08:37 UTC
Your question is somewhat confusing. What is the decimal value of a "Unicode (UTF-8) character"? The Unicode standard assigns a number (a code point, from 0 to 0x10FFFF) to each character. The UTF-8 encoding represents that number as a sequence of bytes that is backward compatible with ASCII-only text. So, do you want the decimal value from the Unicode standard, or the decimal value of the sequence of bytes from the UTF-8 encoding? The latter does not really make sense, since the number of bytes can be anywhere from 1 to 4, and then you'd have to decide how to create a "decimal" from them.
Assuming that you want the codes assigned by the Unicode standard and the bytes used to represent those codes as a UTF-8 sequence, and also assuming that you start from the codes, the following can be used:
use utf8;
use Encode;
my $code = 208; # the Unicode code point, expressed as decimal
my $char = chr($code); # convert to internal perl character
my $utf8_octets = encode("UTF-8", $char); # get sequence of bytes in UTF-8
printf("Decimal: %d, Hex: %x, Bits: %b\n", $code, $code, $code);
print "UTF-8 hex: ", unpack("H*", $utf8_octets), "\n";
print "UTF-8 bits: ", unpack("B*", $utf8_octets), "\n";
This is the same as what choroba offered, just done a different way, without file handles and redirection.
If you want to go from characters to codes, then you get $code via ord($char).
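For the opposite direction, here is a minimal sketch that starts from raw UTF-8 bytes (as they might arrive from a file or socket), decodes them into a character, and then takes ord. The starting bytes "\xC3\x90" are just an example value.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

my $octets = "\xC3\x90";                  # UTF-8 bytes of one character
my $char   = decode('UTF-8', $octets);    # bytes -> internal perl character
my $code   = ord $char;                   # character -> code point

printf "Decimal: %d, Hex: %x, Bits: %b\n", $code, $code, $code;
# Decimal: 208, Hex: d0, Bits: 11010000

# Encoding the character again gives back the original bytes.
print "round trip ok\n" if encode('UTF-8', $char) eq $octets;
```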
Re: How to print the actual bytes of UTF-8 characters ?
by ikegami (Patriarch) on Feb 07, 2014 at 21:06 UTC
use utf8; # Source code is encoded using UTF-8.
use open ':std', ':locale'; # Decode inputs and encode outputs.
use strict;
use warnings;
use feature qw( say );
my @chars;
for my $char (qw( Ð Ñ Ò Ó )) {
my $cp = ord($char); # Or unpack 'W'
my $utf8 = $char;
utf8::encode($utf8);
my @utf8 = unpack('C*', $utf8);
push @chars, [ $char, $cp, $utf8, @utf8 ];
}
Then it's just a question of displaying correctly.
my $last = 0;
for (@chars) {
$last = $#$_ if $last < $#$_;
}
say join ' ', map { sprintf '%-8s', $_->[0] } @chars;
say join ' ', map { sprintf '%-8d', $_->[1] } @chars;
say join ' ', map { sprintf '%-8s', sprintf '%02x', $_->[1] } @chars;
say join ' ', map { sprintf '%08b', $_->[1] } @chars;
say join ' ', map { sprintf '%-8s', sprintf '%*v02x', ' ', $_->[2] } @chars;
for my $i (3..$last) {
    say join ' ', map { defined($_->[$i]) ? sprintf '%08b', $_->[$i] : (' 'x8) } @chars;
}
Notes:
- The binary of the code point could take up to 21 characters, but only 8 are available.
- The hex of the UTF-8 bytes could take up to 11 chars, but only 8 are available.
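One possible way around the fixed width of 8, sketched under the assumption that a row's cells are already available as strings: compute the widest cell and pad every cell to that width with sprintf's '*' width argument. The @cells values are illustrative hex strings for U+00D0, U+20AC and U+1F600.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Hex of the UTF-8 bytes for U+00D0, U+20AC and U+1F600 (example data).
my @cells = ('c3 90', 'e2 82 ac', 'f0 9f 98 80');

# The widest cell decides the column width (11 for a four-byte sequence).
my $width = max map { length } @cells;

# '%-*s' takes the width from the argument list and left-justifies the cell.
my $row = join ' ', map { sprintf '%-*s', $width, $_ } @cells;
print "$row\n";
```

The same width would then have to be used for every row so that the columns stay aligned.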
Re: How to print the actual bytes of UTF-8 characters ?
by Jim (Curate) on Feb 07, 2014 at 19:06 UTC
I've always found unpack() and bit manipulation confusing. Here's my variation on the theme that uses ord() and sprintf() instead of unpack(). This script takes advantage of the fact that Unicode::UCD::charinfo() returns undef for unassigned code points and non-characters.
Jim
Update: Here's a revised version of the script that handles surrogate code points more appropriately. And for comparison, I've used unpack('C*', ...). ☺
#!perl
use strict;
use warnings;
use v5.12;
use Encode qw( encode_utf8 );
use English qw( -no_match_vars );
use Unicode::UCD qw( charinfo );
binmode STDOUT, ':encoding(UTF-8)';
# Include a Unicode byte order mark in the output...
print "\x{FEFF}";
local $OUTPUT_AUTOFLUSH = 1;
local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_FIELD_SEPARATOR = "\t";
CODE:
for my $code (0x000000 .. 0x10FFFF) {
# Look up the code point in the Unicode Character Database...
my $charinfo = charinfo($code);
# Skip unassigned code points and non-characters...
next CODE unless defined $charinfo;
my $codepoint = sprintf 'U+%06X', $code;
my $character = chr $code;
my $name = $charinfo->{'name'};
my $category = $charinfo->{'category'};
my $block = $charinfo->{'block'};
my $script = $charinfo->{'script'};
my @utf8_octets
= unpack 'C*', encode_utf8($character);
my $utf8_hex_string
= join ' ', map { sprintf '%02X', $ARG } @utf8_octets;
my $utf8_bin_string
= join ' ', map { sprintf '%08b', $ARG } @utf8_octets;
# Don't try to print unprintable or private use characters...
if ($category =~ m/^C[cfos]$/) {
$character = '';
# Don't falsely represent surrogates as valid UTF-8...
if ($category eq 'Cs') {
$utf8_hex_string = $utf8_bin_string = '';
}
}
print $character,
$code,
$codepoint,
$utf8_hex_string,
$utf8_bin_string,
$name,
$category,
$block,
$script;
}
exit 0;
Another update: I removed this…
# Don't complain about surrogates...
no warnings qw( surrogate );
…from the script because I realized it's not doing anything. I'm already skipping trying to print surrogates later in the script, so suppressing warnings about them isn't necessary.
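The skipping logic in the script relies on charinfo() returning undef for code points it should ignore. A small sketch of that behaviour on its own, using U+00D0 as an assigned character and U+FFFF (a non-character) as the example that gets skipped:

```perl
use strict;
use warnings;
use Unicode::UCD qw( charinfo );

for my $code (0x00D0, 0xFFFF) {
    my $info = charinfo($code);
    if (defined $info) {
        # Assigned character: the hash carries name, category, block, etc.
        printf "U+%04X %s (%s)\n", $code, $info->{name}, $info->{category};
    }
    else {
        # Unassigned code point or non-character: nothing to look up.
        printf "U+%04X skipped\n", $code;
    }
}
```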
Re: How to print the actual bytes of UTF-8 characters ?
by pajout (Curate) on Feb 07, 2014 at 13:09 UTC
I think you need something like this:
#!/usr/bin/perl
use utf8;
my $str = 'Ð Ñ Ò Ó';
print $str."\n";
foreach my $ch (split('', $str)) {
print ord($ch)."\n";
}
use bytes;
print "bytes\n";
foreach my $ch (split('', $str)) {
printf("%x %b\n", ord($ch), ord($ch));
}