http://qs321.pair.com?node_id=11121852

o0lit3 has asked for the wisdom of the Perl Monks concerning the following question:

I am trying to pack bytes into as few unicode characters as possible. Consider the Twitter 140 character limit, which allows byte-heavy unicode characters. I believe I can squeeze 3 8-bit ascii characters into a single 20-bit unicode character, which I am currently trying via the following:

$\="\n"; $_='Hello World!'; print; $_=pack'(U)*',map{hex}unpack'(H5)*'; print; $_=pack'(H5)*',map{sprintf"%05X",$_}unpack'(U)*'; print;
Hello World!
񈙖񬛲񗛷񬙂
He`lo Wopld

Note that when I attempt to pack'(H5)*' the same data I map/unpacked, (almost*) every third character is garbled, a symptom of dealing with an odd 5 hex characters, I'm sure. What is the appropriate way to do this without losing bits?

Packing/Unpacking with the normal 16-bits works as expected:

$\="\n"; $_='Hello World!'; print; $_=pack'(U)*',map{hex}unpack'(H4)*'; print; $_=pack'(H4)*',map{sprintf"%04X",$_}unpack'(U)*'; print;
Hello World!
䡥汬漠坯牬搡
Hello World!

...and even allows 'n*' packing, since n is a 16-bit unsigned short:

$\="\n"; $_='Hello World!'; print; $_=pack'(U)*',map{hex}unpack'(H4)*'; print; $_=pack'n*',unpack'(U)*'; print
Hello World!
䡥汬漠坯牬搡
Hello World!
Thanks for the help, and Cheers!

Re: Losing Bits with Pack/Unpack
by afoken (Chancellor) on Sep 17, 2020 at 07:14 UTC
    I believe I can squeeze 3 8-bit ascii characters into a single 20-bit unicode character

    No, you can't.

    • ASCII is 7 bit, not 8 bit.
    • Unicode defines code points from 0 to 0x10FFFF, i.e. 0x110000 code points. You need at least 21 bits for that (log2(0x110000) = 20.087...), not 20 bits. Depending on the selected Unicode Transformation Format, you need up to 32 bits to encode those code points (see UTF-8 and UTF-16). Especially note that not all 32-bit combinations are valid Unicode.
    • Three 7-bit characters need 21 bits, not 20 bits.
    • Three 8-bit characters need 24 bits, not 20 bits.
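The arithmetic behind these points can be checked directly. A minimal sketch (the variable names are illustrative and not from the thread):

```perl
use strict;
use warnings;
use POSIX qw(ceil);

# Unicode code points run from 0 to 0x10FFFF inclusive.
my $code_points = 0x10FFFF + 1;                          # 0x110000
my $bits_needed = ceil( log($code_points) / log(2) );    # log2 via natural logs

printf "Unicode code points:         %d (needs %d bits)\n", $code_points, $bits_needed;
printf "Three 7-bit characters:      %d bits\n", 3 * 7;
printf "Three 8-bit characters:      %d bits\n", 3 * 8;
printf "Values available in 20 bits: %d\n", 2**20;
```

Both 21 and 24 exceed 20, so three characters cannot be stored in a 20-bit value without dropping bits.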

    If you want to store more bits in a limited storage area than that storage area allows, you need compression, either lossy or lossless. Just shifting bits around won't help.

    Alexander

    --
    Today I will gladly share my knowledge and experience, for there are no sweeter words than "I told you so". ;-)
Re: Losing Bits with Pack/Unpack
by Eily (Monsignor) on Sep 17, 2020 at 07:56 UTC

    Well others have already explained why what you're trying to do is a bad idea. Maybe if you told us why you're trying to do something like that we might be able to help find a better solution though.

    But, for your information, one reason why this just won't work is that pack is always byte aligned. That's why (H5)* changes every third char. H is half a byte/char, so H2 reads a full one, H4 reads two, and H5 reads two and a half. But the next iteration aligns itself on the next char and simply ignores the half byte that was unread. So you're just losing data, not shifting any bits.

      Is "unpack" byte aligned also? If I have not lost any data with the unpack'(H5)*' step, I am wondering how one would go about reversing the process.

        Yes, unpack is also byte aligned. unpack moves exactly like pack does given the same pattern, except it reads instead of writes. You did lose data with H5, that's why some chars were changed. One way to see that is to try "hehlo" or "hemlo" instead of hello. You'll see that you get the same result, because the low four bits of every third byte are replaced by 0000 when pack is given only 5 hex digits for 3 bytes.
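The data loss described here can be seen without any Unicode involved. A small sketch, assuming nothing beyond core unpack (the helper name h5_groups is made up for illustration):

```perl
use strict;
use warnings;

# H5 reads five hex digits (two and a half bytes); the next group then
# starts on the following byte boundary, so the low nibble of every
# third byte is never read.
sub h5_groups {
    my ($text) = @_;
    return join ' ', unpack '(H5)*', $text;
}

for my $text (qw(Hello! Hehlo! Hemlo!)) {
    printf "%-6s => %s\n", $text, h5_groups($text);
}
```

All three inputs give the groups 48656 6c6f2: the third byte (l, h or m) contributes only its high nibble 6, so the three strings become indistinguishable after the unpack step.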

Re: Losing Bits with Pack/Unpack
by haukex (Archbishop) on Sep 17, 2020 at 07:19 UTC
    I am trying to pack bytes into as few unicode characters as possible. Consider the Twitter 140 character limit, which allows byte-heavy unicode characters.

    I don't think it's as straightforward as this. Consider that there are lots of unassigned Unicode code points, different kinds of whitespace, nonprintables, and so on, all of which may or may not be removed/replaced/folded depending on where you're posting this text. This means you'd have to look at the character properties, select only those characters that are likely to work correctly, and make a mapping.
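As a rough sketch of what "look at the character properties" could mean in Perl, the following filters code points by General Category. The choice of categories and the name is_candidate are assumptions for illustration; what a given site actually strips or folds would still have to be tested against that site.

```perl
use strict;
use warnings;

# Hypothetical filter: reject controls (Cc), format characters (Cf),
# surrogates (Cs), private use (Co), unassigned code points and
# noncharacters (Cn), and all separators (Z: spaces, line and
# paragraph separators).
sub is_candidate {
    my ($cp) = @_;
    return chr($cp) !~ /[\p{Cc}\p{Cf}\p{Cs}\p{Co}\p{Cn}\p{Z}]/;
}

my $count = 0;
for my $cp ( 0 .. 0x10FFFF ) {
    $count++ if is_candidate($cp);
}
print "Candidate code points in this perl's Unicode tables: $count\n";
```

The count depends on the Unicode version bundled with the perl in use, so no expected figure is given here.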

    I believe I can squeeze 3 8-bit ascii characters into a single 20-bit unicode character

    Well, to nitpick, ASCII is 7 bits, and 3*7 is 21, while Unicode code points range from 0 to 0x10FFFF (1114111), which is too few to hold all 2^21==2097152 possible 21-bit values. If you exclude ASCII control characters, for example, you'd also be excluding Tab, CR, and LF. afoken went into more details.

      ASCII has 30 control characters and four whitespace characters (SPACE, TAB, CR and LF). If you forgo support for control characters, TAB and CR (but keep space and LF), you end up with 0x60 characters. This isn't a power of 2 (which would help make things simple and very efficient), but it's still a nice number (3/4 of 2^7).

      That would require an address space of 0x60^3 = 884,736 (0xD_8000) code points. That's a fair bit smaller than the 1,114,112 (0x11_0000) code points Unicode supports.

      Of those, some are best avoided. I would avoid at least the following:

      • High surrogates (1024)
      • Low surrogates (1024)
      • Non-characters (66)
      • U+FFFD
      • Control characters, which includes U+FEFF (226)

      That's only 2,341 and we have a buffer of 229,376. Golden!
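The budget above can be re-derived in a few lines. This sketch only repeats the figures quoted in this reply; the hash keys are labels, not an authoritative classification.

```perl
use strict;
use warnings;
use List::Util qw(sum);

my $alphabet  = 0x60;              # space, LF and the 94 printable characters
my $needed    = $alphabet**3;      # code points needed for three characters
my $available = 0x110000;          # code points Unicode supports

# Counts of code points to avoid, as listed above.
my %avoid = (
    'high surrogates'    => 1024,
    'low surrogates'     => 1024,
    'non-characters'     => 66,
    'U+FFFD'             => 1,
    'control characters' => 226,
);
my $avoided = sum values %avoid;
my $spare   = $available - $needed;

printf "needed:  %d (0x%X)\n", $needed, $needed;
printf "spare:   %d\n", $spare;
printf "avoided: %d\n", $avoided;
```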

      Mapping the 3 ASCII characters (with the limitations mentioned above) onto only "safe" characters won't be nice and easy, but it is doable.
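One possible shape for such a mapping is sketched below. It treats three characters as a base-0x60 number and then skips over excluded ranges when turning that number into a code point. The exclusion list is deliberately incomplete and purely illustrative (it omits most format characters and every unassigned code point), and all sub names are made up for this example.

```perl
use strict;
use warnings;

# Assumed 96-character alphabet: LF plus the printable range 0x20..0x7E.
my @alphabet = ( "\n", map { chr } 0x20 .. 0x7E );
my $base     = @alphabet;
my %digit_of;
@digit_of{@alphabet} = 0 .. $#alphabet;

# Illustrative exclusions: sorted, non-overlapping [first, last] ranges.
my @excluded = (
    [ 0x0000, 0x001F ],    # C0 controls
    [ 0x007F, 0x009F ],    # DEL and C1 controls
    [ 0xD800, 0xDFFF ],    # surrogates
    [ 0xFDD0, 0xFDEF ],    # non-characters
    [ 0xFEFF, 0xFEFF ],    # byte order mark
    [ 0xFFFD, 0xFFFF ],    # replacement character, two non-characters
    map { [ $_ * 0x10000 + 0xFFFE, $_ * 0x10000 + 0xFFFF ] } 1 .. 16,
);

sub triple_to_index {
    my ($triple) = @_;
    die "need exactly 3 characters\n" unless length($triple) == 3;
    my $index = 0;
    for my $char ( split //, $triple ) {
        die "unsupported character\n" unless exists $digit_of{$char};
        $index = $index * $base + $digit_of{$char};
    }
    return $index;
}

sub index_to_triple {
    my ($index) = @_;
    my @chars;
    for ( 1 .. 3 ) {
        unshift @chars, $alphabet[ $index % $base ];
        $index = int( $index / $base );
    }
    return join '', @chars;
}

# Return the Nth code point that is not inside an excluded range.
sub index_to_code_point {
    my ($cp) = @_;
    for my $range (@excluded) {
        last if $cp < $range->[0];
        $cp += $range->[1] - $range->[0] + 1;
    }
    return $cp;
}

# Inverse of index_to_code_point, for code points outside every range.
sub code_point_to_index {
    my ($cp) = @_;
    my $index = $cp;
    for my $range (@excluded) {
        last if $range->[0] > $cp;
        $index -= $range->[1] - $range->[0] + 1;
    }
    return $index;
}

my $cp = index_to_code_point( triple_to_index('Hel') );
printf "'Hel' -> U+%05X -> '%s'\n", $cp,
    index_to_triple( code_point_to_index($cp) );
```

Many of the resulting code points are unassigned, so a practical version would build its table from character properties rather than from a handful of hard-coded ranges.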

Re: Losing Bits with Pack/Unpack
by kikuchiyo (Hermit) on Sep 17, 2020 at 10:43 UTC

    Are you trying to maximize the amount of data you can stuff into a single tweet by abusing Twitter's arbitrarily defined character limit rules?

    If yes, there are schemes for that: see e.g. https://github.com/qntm/base2048

      Yeah, this is the idea. I see this JavaScript solution uses a lookup array and bitwise operations to construct an index to look up in the array... Seems like this could be rewritten in Perl with similar bitwise ops, but there is no "simple" solution using only pack and unpack, since those functions are byte aligned. Thanks for the help!
        I doubt there are any simple solutions with just pack and unpack. Most of the complexity of the base2048 implementation I've linked comes from paying attention to Twitter's rules about what it considers a "text" or "CJK" or "sendable" character. Perl's comprehensive Unicode support (character classes in regexes etc.) might be useful, if you were to rewrite the thing in it.
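For illustration only, here is a sketch of the general lookup-table idea in Perl: cut the input into 11-bit chunks and use each chunk as an index into a 2048-entry alphabet. This is not the base2048 algorithm or its table. The alphabet (2048 consecutive code points from U+4E00) is a placeholder, and a CJK range like this one would defeat the purpose on Twitter; the padding scheme (a marker bit followed by zeros) is likewise an assumption made to keep the example short.

```perl
use strict;
use warnings;

# Placeholder alphabet of 2048 code points; a real scheme would use a
# curated table of characters that the target site accepts.
my @alphabet = map { chr } 0x4E00 .. 0x4E00 + 2047;
my %index_of;
@index_of{@alphabet} = 0 .. $#alphabet;

sub encode_2048 {
    my ($bytes) = @_;
    my $bits = unpack( 'B*', $bytes ) . '1';           # marker bit ends the payload
    $bits .= '0' x ( ( 11 - length($bits) % 11 ) % 11 );    # pad to 11-bit boundary
    return join '', map { $alphabet[ oct "0b$_" ] } unpack '(a11)*', $bits;
}

sub decode_2048 {
    my ($text) = @_;
    my $bits = '';
    for my $char ( split //, $text ) {
        die "character not in alphabet\n" unless exists $index_of{$char};
        $bits .= sprintf '%011b', $index_of{$char};
    }
    $bits =~ s/10*\z//;                                # strip marker and padding
    return pack 'B*', $bits;
}

my $text   = 'Hello World!';
my $packed = encode_2048($text);
printf "%d bytes -> %d characters, round trip %s\n",
    length($text), length($packed),
    decode_2048($packed) eq $text ? 'ok' : 'FAILED';
```

Twelve bytes are 96 bits, plus one marker bit, which pads out to 99 bits, i.e. nine 11-bit characters.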
Re: Losing Bits with Pack/Unpack
by BillKSmith (Monsignor) on Sep 18, 2020 at 15:27 UTC
    If I ignore your text and look at your code, it appears you are trying to reformat each 2.5 8-bit characters as one 20-bit code point in order to reduce the number of 'characters' in your string. This would be valid if all 20-bit code points were valid. Your example does work when you get the details right. Your twelve character string is stored as a buffer of five code points.
    use strict;
    use warnings;

    my $text = 'Hello World!';

    # Hex-encode the text, then peel off five hex digits (20 bits) at a time.
    my $hex_text = unpack 'H*', $text;
    my @code_points;
    while ($hex_text) {
        my $hex_num = substr($hex_text, 0, 5, '');
        push @code_points, hex(sprintf '%05s', $hex_num);
    }
    my $buffer = pack '(U)*', @code_points;

    # Reverse the process.
    my @_code_points = unpack('(U)*', $buffer);
    my $_hex_text = sprintf '%X' x scalar(@_code_points), @_code_points;
    my $_text = pack 'H*', $_hex_text;
    print $_text;

    UPDATE - Added improved code (with testing)

    use strict;
    use warnings;
    use Encode qw(decode);
    use Test::More tests=>2;
    my $text='Hello World!';

    my $buffer
        = pack '(U)*',            # Convert to Unicode
          map {hex($_)}           # Convert to decimal
          unpack '(a5)*',         # Groups of 5
          unpack 'H*', $text;     # Convert to hex

    my $num_uni_chars = length(decode('UTF-8', $buffer));
    is( $num_uni_chars, int(length($text)/2.5 + .5), 'Number of Unicode characters');

    my $_text
        = pack 'H*',                        # Convert pairs of hex to ascii
          sprintf '%X' x $num_uni_chars,    # Convert to hex and join
          unpack('(U)*', $buffer);          # Decimal code points
    is($_text, $text, 'Restored text');

    OUTPUT:

    1..2
    ok 1 - Number of Unicode characters
    ok 2 - Restored text
    Bill
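One detail worth testing with other inputs: '%X' does not zero-pad, so a 5-digit group that begins with a 0 comes back shorter and every later digit shifts. The sketch below compares the unpadded reversal with a padded one, working on the plain group values (no pack 'U' involved). The sub names are invented for this example, and the padded variant assumes the original byte length is carried alongside the data so the width of the final group is known.

```perl
use strict;
use warnings;

# Split the hex form of $text into 5-digit groups and return their values.
sub to_groups {
    my ($text) = @_;
    return map { hex } unpack( '(a5)*', unpack( 'H*', $text ) );
}

# Unpadded '%X': leading zero digits of a group disappear.
sub from_groups_unpadded {
    my (@groups) = @_;
    return pack 'H*', join '', map { sprintf '%X', $_ } @groups;
}

# Padded variant: every group is 5 digits wide except possibly the last,
# whose width follows from the original byte length.
sub from_groups_padded {
    my ( $byte_len, @groups ) = @_;
    return '' unless @groups;
    my $last  = pop @groups;
    my $width = ( 2 * $byte_len ) % 5 || 5;
    my $hex   = join '', map { sprintf '%05x', $_ } @groups;
    $hex .= sprintf '%0*x', $width, $last;
    return pack 'H*', $hex;
}

for my $text ( 'Hello World!', 'AB CD' ) {
    my @groups = to_groups($text);
    printf "%-12s  unpadded: %-3s  padded: %s\n", $text,
        ( from_groups_unpadded(@groups) eq $text ? 'ok' : 'BAD' ),
        ( from_groups_padded( length($text), @groups ) eq $text ? 'ok' : 'BAD' );
}
```

'Hello World!' happens to have no group starting with a zero digit, which is why it survives the unpadded version; 'AB CD' splits into 41422 and 04344 and does not.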
Re: Losing Bits with Pack/Unpack
by Anonymous Monk on Sep 17, 2020 at 01:33 UTC
    heh, twitter limit is 280 for non-CJK languages since November 2017.
Re: Losing Bits with Pack/Unpack
by Anonymous Monk on Sep 17, 2020 at 01:18 UTC

    Hi

    As a first step, I'd make it crystal clear what your input/output is supposed to be. In other words, don't print characters, none of this 搡 搡 stuff