How to print the actual bytes of UTF-8 characters ?

RCH has asked for the wisdom of the Perl Monks concerning the following question:

Re: How to print the actual bytes of UTF-8 characters ? by choroba (Cardinal) on Feb 06, 2014 at 15:01 UTC
Using a variable as a file to handle the encodings:
`#!/usr/bin/perl
use warnings;
use strict;
use utf8;
for my $char (qw(Ð Ñ Ò Ó)) {
    my $n = ord $char;
    open my $BYTE, '>:utf8', \ my $bytes;
    print {$BYTE} $char;
    printf "%s\t%s\t%x\t%b\t%x %x\t %b %b\n", $char, $n, $n, $n, (unpack('CC', $bytes)) x 2;
}`
The pivoting of the table left as an exercise to the reader.
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
Re^2: How to print the actual bytes of UTF-8 characters ? by RCH (Sexton) on Feb 06, 2014 at 16:18 UTC
Magic! Could you explain how it works? I had tried a simple-minded unpack('C', $char) but it gave me the wrong answer. There are two things that I don't understand in your unpack solution: (1) what are the contents of $bytes, and (2) what is the function of the backslash "\" in `open my $BYTE, '>:utf8', \ my $bytes;`?
Re^3: How to print the actual bytes of UTF-8 characters ? by choroba (Cardinal) on Feb 06, 2014 at 17:16 UTC
`\` is the reference operator. Instead of using a file, I open the variable for output (see FILEHANDLE, MODE, REFERENCE in open). I set its encoding to UTF-8 and print the character to it. $bytes now contains the two bytes of the character as encoded in UTF-8.
لսႽ† ᥲᥒ⚪⟊Ⴙᘓᖇ Ꮅᘓᖇ⎱ Ⴙᥲ𝇋ƙᘓᖇ
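To make the in-memory filehandle idea concrete, here is a minimal sketch. The helper name `utf8_octets_via_handle` is hypothetical (it is not from the thread), and it uses the stricter `:encoding(UTF-8)` layer rather than `:utf8`, which is a substitution on my part. It also shows why `unpack('C', $char)` looked wrong: on the character string it yields the code point, not the encoded bytes.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): return the UTF-8 octets of a
# character string by printing it to an in-memory filehandle.
sub utf8_octets_via_handle {
    my ($text) = @_;
    my $buffer = '';
    # \$buffer is a reference to a scalar; open() treats it as a "file".
    open my $fh, '>:encoding(UTF-8)', \$buffer
        or die "Cannot open in-memory handle: $!";
    print {$fh} $text;
    close $fh or die "Cannot close in-memory handle: $!";
    return unpack 'C*', $buffer;    # one integer per encoded byte
}

my $char = "\x{D0}";    # U+00D0 LATIN CAPITAL LETTER ETH

# unpack on the character itself gives the code point...
printf "code point: %d\n", unpack 'C', $char;

# ...while the buffer behind the handle holds the encoded bytes.
printf "octets:     %s\n",
    join ' ', map { sprintf '%02x', $_ } utf8_octets_via_handle($char);
```

Closing the handle before reading `$buffer` matters here, because an encoding layer may buffer its output.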
Re^4: How to print the actual bytes of UTF-8 characters ? by RCH (Sexton) on Feb 06, 2014 at 17:49 UTC
Re^2: How to print the actual bytes of UTF-8 characters ? by ikegami (Patriarch) on Feb 07, 2014 at 21:14 UTC
`open my $BYTE, '>:utf8', \my $bytes; print {$BYTE} $char;`? `utf8::encode(my $bytes = $char);`!
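As a rough illustration of that shortcut, the sketch below encodes a copy of the string in place with `utf8::encode` and checks that `Encode::encode` produces the same bytes. The helper names `octets_inplace` and `octets_encode` and the sample code points are assumptions chosen for the example.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: copy the string, then encode the copy in place.
# utf8::encode modifies its argument, so the caller's string is untouched.
sub octets_inplace {
    my ($text) = @_;
    my $bytes = $text;
    utf8::encode($bytes);
    return $bytes;
}

# Hypothetical helper: the same result via the Encode module.
sub octets_encode {
    my ($text) = @_;
    return encode('UTF-8', $text);
}

for my $cp (0xD0, 0x20AC, 0x1F600) {
    my $char = chr $cp;
    my $x    = octets_inplace($char);
    my $y    = octets_encode($char);
    printf "U+%04X  %-11s  %s\n", $cp,
        join(' ', map { sprintf '%02x', $_ } unpack 'C*', $x),
        ($x eq $y ? 'same' : 'differ');
}
```

The in-place form avoids loading a module; the `Encode` form reads more explicitly when other encodings are also in play.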
Re: How to print the actual bytes of UTF-8 characters ? by atcroft (Abbot) on Feb 06, 2014 at 16:19 UTC
For simplicity, I would deal with the integer values. (The following was adapted from a one-liner, and made heavy use of Tom Christiansen's Unicode articles on perl.com.)
[Code collapsed in the original post: 824 bytes]
(I don't work with Unicode often, but there was an error thrown when I tried using 0xD800 as a character. I seem to remember there are some ranges that may not be defined, so adding anything above 0xD799 and formatting changes are left as exercises for the reader.)
Hope that helps.
Update: 2014-02-06 This article shed light on why 0xD800-0xDFFF are considered invalid. The code above was updated to skip said range.
Update: 2014-02-06 Remove left-over debug print(); add eval() around print to catch invalid Unicode points.
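The collapsed code is not available here, so the following is only a sketch of what skipping the surrogate range could look like, not a reconstruction of that script. The helpers `is_surrogate` and `utf8_hex` and the small demonstration range are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: true for the UTF-16 surrogate range U+D800..U+DFFF,
# which is reserved and does not encode characters of its own.
sub is_surrogate {
    my ($cp) = @_;
    return $cp >= 0xD800 && $cp <= 0xDFFF;
}

# Hypothetical helper: UTF-8 bytes of one code point as a hex string.
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = chr $cp;
    utf8::encode($bytes);
    return join ' ', map { sprintf '%02X', $_ } unpack 'C*', $bytes;
}

# Walk a small range that straddles the surrogates; only the code points
# on either side of the gap are printed.
for my $cp (0xD7FE .. 0xE001) {
    next if is_surrogate($cp);
    printf "U+%04X  %s\n", $cp, utf8_hex($cp);
}
```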
Re: How to print the actual bytes of UTF-8 characters ? by andal (Hermit) on Feb 07, 2014 at 08:37 UTC
Your question is somewhat confusing. What is the decimal value of a "Unicode (UTF-8) character"? The Unicode standard assigns a number (a code point, at most 0x10FFFF, so it fits in 21 bits) to each character. The UTF-8 encoding represents that number as a sequence of bytes in a way that is backward compatible with ASCII-only text. So, do you want the decimal value from the Unicode standard, or the decimal value of the sequence of bytes from the UTF-8 encoding? The latter does not really make sense, since the sequence can be anywhere from 1 to 4 bytes long and you would have to decide how to build a "decimal" from it. I'll assume that you want the codes assigned by the Unicode standard and the bytes used to represent those codes as a UTF-8 sequence, and that you start from the codes. Then the following can be used:
`use utf8;
use Encode;
my $code = 208;                            # the Unicode code point expressed as decimal
my $char = chr($code);                     # convert to internal Perl character
my $utf8_octets = encode("UTF-8", $char);  # get the sequence of bytes in UTF-8
printf("Decimal: %d, Hex: %x, Bits: %b\n", $code, $code, $code);
print "UTF-8 hex: ", unpack("H*", $utf8_octets), "\n";
print "UTF-8 bits: ", unpack("B*", $utf8_octets), "\n";`
This is the same as what choroba has offered, just done a different way, without file handles and redirection. If you want to go from characters to codes, then you get $code via ord($char).
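To show how a code point maps onto the byte sequence, here is a sketch that builds the UTF-8 bytes with shifts and masks and compares them with what `utf8::encode` produces. The helper `utf8_octets_by_hand` is hypothetical and written for illustration only; real code should use the built-in encoders.

```perl
use strict;
use warnings;

# Hypothetical helper: encode one code point to UTF-8 octets by hand.
# The lead byte carries the length marker; continuation bytes are 10xxxxxx.
sub utf8_octets_by_hand {
    my ($cp) = @_;
    die sprintf("U+%X is outside the Unicode range\n", $cp) if $cp > 0x10FFFF;
    if ($cp < 0x80) {       # 0xxxxxxx
        return ($cp);
    }
    if ($cp < 0x800) {      # 110xxxxx 10xxxxxx
        return (0xC0 | ($cp >> 6), 0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {    # 1110xxxx 10xxxxxx 10xxxxxx
        return (0xE0 | ($cp >> 12),
                0x80 | (($cp >> 6) & 0x3F),
                0x80 | ($cp & 0x3F));
    }
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return (0xF0 | ($cp >> 18),
            0x80 | (($cp >> 12) & 0x3F),
            0x80 | (($cp >> 6) & 0x3F),
            0x80 | ($cp & 0x3F));
}

for my $cp (0x41, 0xD0, 0x20AC, 0x1F600) {
    my @by_hand = utf8_octets_by_hand($cp);
    my $bytes   = chr $cp;
    utf8::encode($bytes);
    my @builtin = unpack 'C*', $bytes;
    printf "U+%04X  %-11s  %s\n", $cp,
        join(' ', map { sprintf '%02x', $_ } @by_hand),
        ("@by_hand" eq "@builtin" ? 'matches utf8::encode' : 'MISMATCH');
}
```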
Re: How to print the actual bytes of UTF-8 characters ? by ikegami (Patriarch) on Feb 07, 2014 at 21:06 UTC
Use builtin `utf8::encode` or core `Encode::encode_utf8` to get the UTF-8 encoding.
`use utf8;                   # Source code is encoded using UTF-8.
use open ':std', ':locale'; # Decode inputs and encode outputs.
use strict;
use warnings;
use feature qw( say );
my @chars;
for my $char (qw( Ð Ñ Ò Ó )) {
    my $cp = ord($char);    # Or unpack 'W'
    my $utf8 = $char;
    utf8::encode($utf8);
    my @utf8 = unpack('C*', $utf8);
    push @chars, [ $char, $cp, $utf8, @utf8 ];
}`
Then it's just a question of displaying correctly.
`my $last = 0;
for (@chars) {
    $last = $#$_ if $last < $#$_;
}
say join ' ', map { sprintf '%-8s', $_->[0] } @chars;
say join ' ', map { sprintf '%-8d', $_->[1] } @chars;
say join ' ', map { sprintf '%-8s', sprintf '%02x', $_->[1] } @chars;
say join ' ', map { sprintf '%08b', $_->[1] } @chars;
say join ' ', map { sprintf '%-8s', sprintf '%*v02x', ' ', $_->[2] } @chars;
for my $i (3..$last) {
    say join ' ', map { defined($_->[$i]) ? sprintf '%08b', $_->[$i] : (' 'x8) } @chars;
}`
Notes: The binary of the code point could take up to 21 characters, but only 8 are available. The hex of the UTF-8 bytes could take up to 11 chars, but only 8 are available.
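One way around the fixed 8-character columns mentioned in the notes is to measure the cells first and pad to the widest one. This is only a sketch under that assumption; the helpers `utf8_hex` and `column_width` and the sample code points are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw( max );

# Hypothetical helper: UTF-8 bytes of one code point as a hex string.
sub utf8_hex {
    my ($cp) = @_;
    my $bytes = chr $cp;
    utf8::encode($bytes);
    return join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
}

# Hypothetical helper: width of the widest cell.
sub column_width {
    my (@cells) = @_;
    return max map { length } @cells;
}

my @cps   = (0x41, 0xD0, 0x20AC, 0x1F600);
my @hex   = map { utf8_hex($_) } @cps;
my @bin   = map { sprintf '%b', $_ } @cps;
my $width = column_width(@hex, @bin);

# '%-*s' takes the field width from the argument list.
print join('  ', map { sprintf '%-*s', $width, $_ } @hex), "\n";
print join('  ', map { sprintf '%-*s', $width, $_ } @bin), "\n";
```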
Re: How to print the actual bytes of UTF-8 characters ? by Jim (Curate) on Feb 07, 2014 at 19:06 UTC
I've always found `unpack()` and bit manipulation confusing. Here's my variation on the theme that uses `ord()` and `sprintf()` instead of `unpack()`. This script takes advantage of the fact that `Unicode::UCD::charinfo()` returns `undef` for unassigned code points and non-characters.
[Code collapsed in the original post: 2 kB]
Jim
Update: Here's a revised version of the script that handles surrogate code points more appropriately. And for comparison, I've used `unpack('C*', ...)`. ☺
`#!perl
use strict;
use warnings;
use v5.12;
use Encode qw( encode_utf8 );
use English qw( -no_match_vars );
use Unicode::UCD qw( charinfo );
binmode STDOUT, ':encoding(UTF-8)';
# Include a Unicode byte order mark in the output...
print "\x{FEFF}";
local $OUTPUT_AUTOFLUSH = 1;
local $OUTPUT_RECORD_SEPARATOR = "\n";
local $OUTPUT_FIELD_SEPARATOR = "\t";
CODE:
for my $code (0x000000 .. 0x10FFFF) {
    # Look up the code point in the Unicode Character Database...
    my $charinfo = charinfo($code);
    # Skip unassigned code points and non-characters...
    next CODE unless defined $charinfo;
    my $codepoint = sprintf 'U+%06X', $code;
    my $character = chr $code;
    my $name = $charinfo->{'name'};
    my $category = $charinfo->{'category'};
    my $block = $charinfo->{'block'};
    my $script = $charinfo->{'script'};
    my @utf8_octets = unpack 'C*', encode_utf8($character);
    my $utf8_hex_string = join ' ', map { sprintf '%02X', $ARG } @utf8_octets;
    my $utf8_bin_string = join ' ', map { sprintf '%08b', $ARG } @utf8_octets;
    # Don't try to print unprintable or private use characters...
    if ($category =~ m/^C[cfos]$/) {
        $character = '';
        # Don't falsely represent surrogates as valid UTF-8...
        if ($category eq 'Cs') {
            $utf8_hex_string = $utf8_bin_string = '';
        }
    }
    print $character, $code, $codepoint, $utf8_hex_string, $utf8_bin_string, $name, $category, $block, $script;
}
exit 0;`
Another update: I removed this…
`# Don't complain about surrogates...
no warnings qw( surrogate );`
…from the script because I realized it's not doing anything. I'm already skipping trying to print surrogates later in the script, so suppressing warnings about them isn't necessary.
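For a quick look at what `charinfo()` hands back for a single code point, here is a small sketch. The helper `describe_code_point` and the sample code points are hypothetical; the hash keys `name` and `category` are the same ones the script above reads.

```perl
use strict;
use warnings;
use Unicode::UCD qw( charinfo );

# Hypothetical helper: one-line summary of a code point, or undef when
# charinfo() has no entry for it.
sub describe_code_point {
    my ($code) = @_;
    my $info = charinfo($code);
    return unless defined $info;
    return sprintf 'U+%04X %s (%s)', $code, $info->{name}, $info->{category};
}

for my $code (0x41, 0xD0, 0x20AC) {
    my $summary = describe_code_point($code);
    if (defined $summary) {
        print "$summary\n";
    }
    else {
        printf "U+%04X has no entry\n", $code;
    }
}
```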
Re: How to print the actual bytes of UTF-8 characters ? by pajout (Curate) on Feb 07, 2014 at 13:09 UTC
I think you need something like this:
`#!/usr/bin/perl
use utf8;
my $str = 'Ð Ñ Ò Ó';
print $str."\n";
foreach my $ch (split('', $str)) {
    print ord($ch)."\n";
}
use bytes;
print "bytes\n";
foreach my $ch (split('', $str)) {
    printf("%x %b\n", ord($ch), ord($ch));
}`

