Losing Bits with Pack/Unpack

o0lit3 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to pack bytes into as few unicode characters as possible. Consider the Twitter 140 character limit, which allows byte-heavy unicode characters. I believe I can squeeze 3 8-bit ascii characters into a single 20-bit unicode character, which I am currently trying via the following:

$\="\n";
$_='Hello World!'; print;
$_=pack'(U)*',map{hex}unpack'(H5)*'; print;
$_=pack'(H5)*',map{sprintf"%05X",$_}unpack'(U)*'; print;

Hello World!
񈙖񬛲񗛷񬙂
He`lo Wopld

Note that when I attempt to pack'(H5)*' the same data I map/unpacked, (almost*) every third character is garbled, a symptom of dealing with an odd 5 hex characters, I'm sure. What is the appropriate way to do this without losing bits?

Packing/Unpacking with the normal 16-bits works as expected:

$\="\n";
$_='Hello World!'; print;
$_=pack'(U)*',map{hex}unpack'(H4)*'; print;
$_=pack'(H4)*',map{sprintf"%04X",$_}unpack'(U)*'; print;

Hello World!
䡥汬漠坯牬搡
Hello World!

...and even allows 'n*' packing, since n is a 16-bit unsigned short:

$\="\n";
$_='Hello World!'; print;
$_=pack'(U)*',map{hex}unpack'(H4)*'; print;
$_=pack'n*',unpack'(U)*'; print

Hello World!
䡥汬漠坯牬搡
Hello World!

Thanks for the help, and Cheers!


Replies are listed 'Best First'.
Re: Losing Bits with Pack/Unpack by afoken (Chancellor) on Sep 17, 2020 at 07:14 UTC
"I believe I can squeeze 3 8-bit ascii characters into a single 20-bit unicode character"

No, you can't. ASCII is 7 bit, not 8 bit. Unicode defines code points from 0 to 0x10FFFF, i.e. 0x110000 code points. You need at least 21 bits for that (log₂(0x110000) = 20.087...), not 20 bits. Depending on the selected Unicode Transformation Format, you need up to 32 bits to encode those code points (see UTF-8 and UTF-16). Especially note that not all 32-bit combinations are valid Unicode.

Three 7-bit characters need 21 bits, not 20 bits. Three 8-bit characters need 24 bits, not 20 bits.

If you want to store more bits in a limited storage area than that storage area allows, you need compression, either lossy or lossless. Just shifting bits around won't help.

Alexander
--
Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
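The bit counts in this reply can be checked with a few lines of Perl. This is only an illustrative sketch; the variable names are made up for the example.

```perl
use strict;
use warnings;

# Number of Unicode code points: 0 .. 0x10FFFF inclusive.
my $code_points = 0x10FFFF + 1;

# Bits needed to address every code point: log2(0x110000), about 20.087.
my $bits_needed = log($code_points) / log(2);

# Bits carried by three 7-bit or three 8-bit characters.
my $three_ascii = 3 * 7;
my $three_bytes = 3 * 8;

printf "code points: %d (%.3f bits)\n", $code_points, $bits_needed;
printf "three 7-bit characters: %d bits\n", $three_ascii;
printf "three 8-bit characters: %d bits\n", $three_bytes;
```

Since 2**20 is smaller than 0x110000 but 2**21 is larger, a single code point can carry a little more than 20 bits and strictly fewer than 21.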
Re: Losing Bits with Pack/Unpack by Eily (Monsignor) on Sep 17, 2020 at 07:56 UTC
Well, others have already explained why what you're trying to do is a bad idea. Maybe if you told us why you're trying to do something like that we might be able to help find a better solution though.

But, for your information, one reason why this just won't work is that pack is always byte aligned. That's why `(H5)*` changes every third char. H is half a byte/char, so H2 reads a full one, H4 reads two, and H5 reads two and a half. But the next iteration aligns itself on the next char and simply ignores the half byte that was unread. So you're just losing data, not shifting any bits.
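A small sketch of the byte alignment described here, using the string from the original question. The variable names are illustrative only.

```perl
use strict;
use warnings;

my $text = 'Hello World!';

# Each H5 group consumes three whole bytes but returns only five hex digits.
my @groups = unpack '(H5)*', $text;
print "@groups\n";          # 48656 6c6f2 576f7 6c642

# The full hex dump shows which digit of every third byte went missing.
my $full_hex = unpack 'H*', $text;
print "$full_hex\n";        # 48656c6c6f20576f726c6421

# Packing the groups again fills the missing half byte with a zero nybble.
my $round_trip = pack '(H5)*', @groups;
print "$round_trip\n";      # He`lo Wopld (plus a trailing space)
```

Comparing the two hex lines shows that the sixth digit of every three-byte group is skipped, which is why 'l' (0x6c) comes back as '`' (0x60) and '!' (0x21) as a space (0x20).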
Re^2: Losing Bits with Pack/Unpack by o0lit3 (Friar) on Sep 17, 2020 at 11:55 UTC
Is "unpack" byte aligned also? If I have not lost any data with the unpack'(H5)*' step, I am wondering how one would go about reversing the process.
Re^3: Losing Bits with Pack/Unpack by Eily (Monsignor) on Sep 17, 2020 at 12:42 UTC
Yes, unpack is also byte aligned. unpack moves exactly like pack does given the same pattern, except it reads instead of writes.

You did lose data with `H5`, that's why some chars were changed. One way to see that is to try "hehlo" or "hemlo" instead of hello. You'll see that you get the same result, because four bits in the byte are replaced by 0000 when pack was only given 5 hex digits for 3 bytes.
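The "hehlo"/"hemlo" experiment can be run directly. A minimal sketch, with `%seen` as a made-up helper for counting distinct results:

```perl
use strict;
use warnings;

# 'l' (0x6C), 'h' (0x68) and 'm' (0x6D) share the high nybble 6, so the
# dropped low nybble is the only thing that tells them apart.
my %seen;
for my $word (qw(hello hehlo hemlo)) {
    my $groups = join ' ', unpack '(H5)*', $word;
    print "$word => $groups\n";
    $seen{$groups}++;
}
print 'distinct encodings: ', scalar(keys %seen), "\n";
```

All three words produce the same groups, so nothing downstream can recover which one was the input.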
Re: Losing Bits with Pack/Unpack by haukex (Archbishop) on Sep 17, 2020 at 07:19 UTC
"I am trying to pack bytes into as few unicode characters as possible. Consider the Twitter 140 character limit, which allows byte-heavy unicode characters."

I don't think it's as straightforward as this. Consider that there are lots of unassigned Unicode code points, different kinds of whitespace, nonprintables, and so on, all of which may or may not be removed/replaced/folded depending on where you're posting this text. This means you'd have to look at the character properties, select only those characters that are likely to work correctly, and make a mapping.

"I believe I can squeeze 3 8-bit ascii characters into a single 20-bit unicode character"

Well, to nitpick, ASCII is 7 bits, and 3*7 is 21, while Unicode code points only range from 0 to 0x10FFFF (1114111), which is not enough room for 2^21 == 2097152 values. If you exclude ASCII control characters, for example, you'd also be excluding Tab, CR, and LF. afoken went into more details.
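Looking at character properties could start from something like the sketch below. The function name `is_candidate` and the choice of categories to reject are assumptions made for illustration; which characters actually survive on a given site would have to be tested there.

```perl
use strict;
use warnings;

# Hypothetical filter: reject the whole "Other" category (controls, format
# characters, surrogates, private use, unassigned) and all separators.
sub is_candidate {
    my ($cp) = @_;
    return 0 if $cp > 0x10FFFF;                  # outside Unicode
    return 0 if $cp >= 0xD800 && $cp <= 0xDFFF;  # surrogates, checked numerically
    return chr($cp) =~ /\A[^\p{C}\p{Z}]\z/ ? 1 : 0;
}

for my $cp (0x41, 0x20, 0x0A, 0xD800, 0xFFFE, 0x4865) {
    printf "U+%04X %s\n", $cp, is_candidate($cp) ? 'keep' : 'skip';
}
```

Collecting every code point that passes such a filter into an array would give the lookup table for the mapping mentioned above.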
Re^2: Losing Bits with Pack/Unpack by ikegami (Patriarch) on Sep 18, 2020 at 08:54 UTC
ASCII has 30 control characters and four whitespace characters (SPACE, TAB, CR and LF). If you forgo support for control characters, TAB and CR (but keep space and LF), you end up with 0x60 characters. This isn't a power of 2 (which would help make things simple and very efficient), but it's still a nice number (3/4 of 2^7).

That would require an address space of 0x60^3 = 884,736 (0xD_8000) code points. That's a fair bit smaller than the 1,114,112 (0x11_0000) code points Unicode supports.

Of those, some are best avoided. I would avoid at least the following:

- High surrogates (1024)
- Low surrogates (1024)
- Non-characters (66)
- U+FFFD
- Control characters, which includes U+FEFF (226)

That's only 2,341 and we have a buffer of 229,376. Golden!

Mapping the 3 ASCII characters (with the limitations mentioned above) onto only "safe" characters won't be nice and easy, but it is doable.
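The first half of that mapping, turning three characters from the 0x60-symbol alphabet into one number below 0x60^3, might look like the sketch below. The names `triple_to_index` and `index_to_triple` are made up for this example, and the harder step of mapping each index onto a "safe" code point is deliberately left out.

```perl
use strict;
use warnings;

# Alphabet of 0x60 symbols: LF plus SPACE and the printable characters.
my @alphabet = ("\n", map { chr($_) } 0x20 .. 0x7E);
my %index_of;
@index_of{@alphabet} = 0 .. $#alphabet;

# Three symbols become one number in 0 .. 0x60**3 - 1 (base-0x60 digits).
sub triple_to_index {
    my ($x, $y, $z) = map { $index_of{$_} } @_;
    return ($x * 0x60 + $y) * 0x60 + $z;
}

# Inverse: peel off the three base-0x60 digits again.
sub index_to_triple {
    my ($n) = @_;
    my $z = $n % 0x60;
    $n = int($n / 0x60);
    my $y = $n % 0x60;
    my $x = int($n / 0x60);
    return join '', @alphabet[$x, $y, $z];
}

my $index = triple_to_index(split //, 'Hel');
printf "index %d of %d, spare code points: %d\n",
    $index, 0x60**3, 0x110000 - 0x60**3;
print index_to_triple($index), "\n";
```

Input whose length is not a multiple of three would need padding, and characters outside the alphabet are not handled here.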
Re: Losing Bits with Pack/Unpack by kikuchiyo (Hermit) on Sep 17, 2020 at 10:43 UTC
Are you trying to maximize the amount of data you can stuff into a single tweet by abusing Twitter's arbitrarily defined character limit rules? If yes, there are schemes for that: see e.g. https://github.com/qntm/base2048
Re^2: Losing Bits with Pack/Unpack by o0lit3 (Friar) on Sep 17, 2020 at 14:34 UTC
Yeah, this is the idea. I see this JavaScript solution uses a lookup array and bitwise operations to construct an index into that array. It seems like this could be rewritten in Perl with similar bitwise ops, but there is no "simple" solution using only pack and unpack, because those functions are byte aligned. Thanks for the help!
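The bitwise part of such a rewrite could be sketched as follows: a bit accumulator that cuts a byte string into fixed-width integers and joins them back. This is not the base2048 code; `bytes_to_chunks` and `chunks_to_bytes` are hypothetical names, the original length is passed explicitly to keep the padding logic trivial, and the lookup table that turns each integer into an acceptable character is not shown.

```perl
use strict;
use warnings;

# Split a byte string into integers of $width bits each; the last chunk is
# padded with zero bits on the right.
sub bytes_to_chunks {
    my ($bytes, $width) = @_;
    my ($acc, $nbits, @chunks) = (0, 0);
    for my $byte (unpack 'C*', $bytes) {
        $acc = ($acc << 8) | $byte;
        $nbits += 8;
        while ($nbits >= $width) {
            $nbits -= $width;
            push @chunks, ($acc >> $nbits) & ((1 << $width) - 1);
            $acc &= (1 << $nbits) - 1;      # keep only the unconsumed bits
        }
    }
    push @chunks, $acc << ($width - $nbits) if $nbits;
    return @chunks;
}

# Reverse the process; $length is the original byte count.
sub chunks_to_bytes {
    my ($width, $length, @chunks) = @_;
    my ($acc, $nbits, $bytes) = (0, 0, '');
    for my $chunk (@chunks) {
        $acc = ($acc << $width) | $chunk;
        $nbits += $width;
        while ($nbits >= 8) {
            $nbits -= 8;
            $bytes .= chr(($acc >> $nbits) & 0xFF);
            $acc &= (1 << $nbits) - 1;
        }
    }
    return substr $bytes, 0, $length;
}

my $text   = 'Hello World!';
my @chunks = bytes_to_chunks($text, 20);
print join(' ', map { sprintf '%05X', $_ } @chunks), "\n";
print chunks_to_bytes(20, length $text, @chunks), "\n";
```

With a width of 20 no bits are lost, but some of the resulting values fall on surrogates or other unusable code points, which is why a lookup table of acceptable characters is still needed.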
Re^3: Losing Bits with Pack/Unpack by kikuchiyo (Hermit) on Sep 17, 2020 at 15:51 UTC
I doubt there are any simple solutions with just pack and unpack. Most of the complexity of the base2048 implementation I've linked comes from paying attention to Twitter's rules about what it considers a "text" or "CJK" or "sendable" character. Perl's comprehensive Unicode support (character classes in regexes etc.) might be useful, if you were to rewrite the thing in it.
Re: Losing Bits with Pack/Unpack by BillKSmith (Monsignor) on Sep 18, 2020 at 15:27 UTC
If I ignore your text and look at your code, it appears you are trying to reformat each 2.5 8-bit characters as one 20-bit code point in order to reduce the number of 'characters' in your string. This would be valid if all 20-bit code points were valid. Your example does work when you get the details right. Your twelve character string is stored as a buffer of five code points.

use strict;
use warnings;
my $text='Hello World!';
my $hex_text = unpack 'H*', $text;
my @code_points;
while ($hex_text) {
    my $hex_num = substr($hex_text, 0, 5, '');
    push @code_points, hex(sprintf '%05s',$hex_num);
}
my $buffer = pack '(U)*', @code_points;

my @_code_points = unpack('(U)*', $buffer);
my $_hex_text = sprintf '%X' x scalar(@_code_points), @_code_points;
my $_text = pack 'H*', $_hex_text;
print $_text;

UPDATE - Added improved code (with testing)

use strict;
use warnings;
use Encode qw(decode);
use Test::More tests=>2;
my $text='Hello World!';
my $buffer =
    pack '(U)*',              # Convert to Unicode
    map {hex($_)}             # Convert to decimal
    unpack '(a5)*',           # Groups of 5
    unpack 'H*', $text;       # Convert to hex

my $num_uni_chars = length(decode('UTF-8', $buffer));
is( $num_uni_chars, int(length($text)/2.5 + .5),
    'Number of Unicode characters');

my $_text =
    pack 'H*',                        # Convert pairs of hex to ascii
    sprintf '%X' x $num_uni_chars,    # Convert to hex and join
    unpack('(U)*', $buffer);          # Decimal code points

is($_text, $text, 'Restored text');

OUTPUT:

1..2
ok 1 - Number of Unicode characters
ok 2 - Restored text

Bill
Re: Losing Bits with Pack/Unpack by Anonymous Monk on Sep 17, 2020 at 01:33 UTC
heh, twitter limit is 280 for non-CJK languages since November 2017.
Re: Losing Bits with Pack/Unpack by Anonymous Monk on Sep 17, 2020 at 01:18 UTC
Hi

As a first step, I'd make it crystal clear what your input/output is supposed to be. In other words, don't print characters, none of this `搡` stuff
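One way to follow this advice is to print code point numbers instead of the characters themselves. A minimal sketch, with `code_points` as a made-up helper name:

```perl
use strict;
use warnings;

# Render a character string as space-separated code points.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord($_) } split //, $string;
}

# The first three characters of the 16-bit example: "He", "ll", "o ".
my $packed = join '', map { chr($_) } 0x4865, 0x6C6C, 0x6F20;
print code_points($packed), "\n";   # U+4865 U+6C6C U+6F20
```

Output in this form survives copy and paste and does not depend on fonts or terminal encoding.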

