http://qs321.pair.com?node_id=11121926


in reply to Losing Bits with Pack/Unpack

If I ignore your text and look only at your code, it appears you are trying to repack every 2.5 8-bit characters (20 bits, or five hex digits) as one 20-bit code point, in order to reduce the number of 'characters' in your string. That would be valid in general only if every 20-bit value were a valid code point, and not all are (the surrogate range U+D800 to U+DFFF, for example). Your example does work once you get the details right: your twelve-character string is stored as a buffer of five code points.
use strict;
use warnings;

my $text = 'Hello World!';
my $hex_text = unpack 'H*', $text;               # bytes -> hex digits

my @code_points;
while (length $hex_text) {                       # test length: a lone '0' is false in Perl
    my $hex_num = substr($hex_text, 0, 5, '');   # take (and remove) up to 5 digits
    push @code_points, hex(sprintf '%05s', $hex_num);
}

my $buffer = pack '(U)*', @code_points;          # one character per 20-bit group

my @_code_points = unpack('(U)*', $buffer);      # back to numeric code points
my $_hex_text = sprintf '%X' x scalar(@_code_points), @_code_points;
my $_text = pack 'H*', $_hex_text;               # hex digits -> bytes
print $_text;
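The arithmetic behind "five code points" can be checked on its own. This is only a small sketch of the grouping step, not a replacement for the code above: twelve bytes give 24 hex digits, which split into four full groups of five digits plus one short group of four.

```perl
use strict;
use warnings;

my $text   = 'Hello World!';
my $hex    = unpack 'H*', $text;      # two hex digits per byte
my @groups = unpack '(a5)*', $hex;    # five hex digits = 20 bits per group

printf "%d bytes -> %d hex digits -> %d groups\n",
    length($text), length($hex), scalar @groups;
print join(' ', @groups), "\n";       # the last group may be shorter than five digits
```

In general the number of groups is the hex length divided by five, rounded up.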

UPDATE - Added improved code (with testing)

use strict;
use warnings;
use Encode qw(decode);
use Test::More tests=>2;

my $text = 'Hello World!';

my $buffer =
    pack '(U)*',            # Convert to Unicode
    map {hex($_)}           # Convert to decimal
    unpack '(a5)*',         # Groups of 5
    unpack 'H*', $text;     # Convert to hex

my $num_uni_chars = length(decode('UTF-8', $buffer));
# Expected count is ceil(hex digits / 5), i.e. ceil(2 * length / 5)
is( $num_uni_chars, int((2 * length($text) + 4) / 5), 'Number of Unicode characters');

# Note: %X prints no leading zeros, so a group that starts with a 0 digit can lose it
my $_text =
    pack 'H*',                        # Convert pairs of hex to ascii
    sprintf '%X' x $num_uni_chars,    # Convert to hex and join
    unpack('(U)*', $buffer);          # Decimal code points
is($_text, $text, 'Restored text');

OUTPUT:

1..2
ok 1 - Number of Unicode characters
ok 2 - Restored text
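One detail to watch if you use this on arbitrary input: `%X` prints no leading zeros, so a five-digit group such as `06300` would come back as `6300` and shift every digit after it. Below is a rough sketch of one way around that, assuming the input is a byte string and that you can store its length alongside the code points. The helper names `to_code_points` and `from_code_points` are mine, purely for illustration, and the sketch works on the list of integers only (the `pack`/`unpack` with `U` step is left out).

```perl
use strict;
use warnings;

# Hypothetical helper: byte string -> list of 20-bit integers.
sub to_code_points {
    my ($bytes) = @_;
    my $hex = unpack 'H*', $bytes;
    # Right-pad with zeros so every group is exactly five hex digits.
    $hex .= '0' x ((5 - length($hex) % 5) % 5);
    return map { hex($_) } unpack '(a5)*', $hex;
}

# Hypothetical helper: the inverse. Needs the original byte count
# so the padding digits can be stripped again.
sub from_code_points {
    my ($byte_len, @code_points) = @_;
    # Fixed-width %05x keeps the leading zeros that a bare %X would drop.
    my $hex = join '', map { sprintf '%05x', $_ } @code_points;
    return pack 'H*', substr($hex, 0, 2 * $byte_len);
}

for my $sample ('Hello World!', 'ab0c') {
    my @code_points = to_code_points($sample);
    my $restored    = from_code_points(length($sample), @code_points);
    printf "%-12s -> %d code points, round trip %s\n",
        $sample, scalar(@code_points), $restored eq $sample ? 'ok' : 'FAILED';
}
```

The string 'ab0c' is included because its second group begins with a 0 digit, which is the case the fixed-width format is meant to cover.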
Bill