Beefy Boxes and Bandwidth Generously Provided by pair Networks
Come for the quick hacks, stay for the epiphanies.
 
PerlMonks  

3-byte representation

by gerleu (Novice)
on Oct 12, 2011 at 15:49 UTC ( [id://931045]=perlquestion: print w/replies, xml ) Need Help??

gerleu has asked for the wisdom of the Perl Monks concerning the following question:

Hello Dear PerlMonks, I've signed integers which a must save in a file in the most compact format because, latter, I must read those values back to retrieve those signed integers... I know that I can use pack and unpack using C4 as format, but in this case I increase the storage space by 25% ! My signed integer values fit perfectly in only 24 bits. Do you have any simple or difficult or perlmonked solution because I was unable to find it ? Do you think that converting my signed integers to unsigned (thus all positive values) integers (because I know what will be the greatest negative integer) can help ? Thanks in advance for your help (or prayers otherwise) ! Germain

Replies are listed 'Best First'.
Re: 3-byte representation (Simplified!)
by BrowserUk (Patriarch) on Oct 12, 2011 at 16:31 UTC

    Update: Simplified the packing by removing a redundant step.

    Update2: eliminated the map from the packing.

    Non-trivial:

    #! perl -slw use strict; use Data::Dump qw[ pp ]; use Math::Random::MT qw[ rand ]; my @sint24s = map{ -2**23 + int( rand 2**24 ) } 1 .. 20; ## removed duplicated my $packed -- Thanks to AnonmalousMonk my $packed = join '', unpack '(a3x)*', pack 'l*', @sint24s; my @unpacked = map { unpack 'l', $_ . chr( vec( $_, 23, 1 ) ? 255 : 0 ); } unpack '(a3)*', $packed; print "$sint24s[ $_ ] ;; $unpacked[ $_ ]" for 0 .. $#sint24s; __END__ C:\test>sint24 4243386 ;; 4243386 4809369 ;; 4809369 -888567 ;; -888567 -7576685 ;; -7576685 1987080 ;; 1987080 -2170022 ;; -2170022 -1135866 ;; -1135866 1924446 ;; 1924446 6348263 ;; 6348263 1911716 ;; 1911716 -1791354 ;; -1791354 -8343943 ;; -8343943 -6224088 ;; -6224088 3919567 ;; 3919567 -1176382 ;; -1176382 6288012 ;; 6288012 -5569609 ;; -5569609 -5363232 ;; -5363232 -1344267 ;; -1344267 3649155 ;; 3649155

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      my $packed = join '', unpack '(a3x)*', pack 'l*', @sint24s;

      Rather than the above,
          my $packed = pack '(l<X)*', @array;
      is more concise, probably faster (but this latter not checked).

      A shame there seems to be no way to get rid of the final  map on the unpacking side.

      >perl -wMstrict -le "use Test::More 'no_plan'; use List::Util qw(shuffle); ;; sub rand_s24 { return map { int(rand 2**24) - 2**23 } 1 .. $_[0]; } ;; use constant MIN_S24 => -(2**23); use constant MAX_S24 => -MIN_S24() - 1; use constant VALUES => (0, 1, -1, MIN_S24, MAX_S24); use constant N_VALUES => scalar(@{[ VALUES ]}); use constant MAX_LEN => 25; ;; for my $pass (1 .. 200) { for my $len (1 .. MAX_LEN) { my @to_pack = (rand_s24($len - N_VALUES), VALUES)[-$len .. -1]; @to_pack = shuffle @to_pack; $len == @to_pack or die qq{bad len: array to pack}; ;; my $packed = pack '(l<X)*', @to_pack; length($packed) == 3 * $len or die qq{bad len: packed string}; ;; my @unpacked = map { unpack('l<', qq{\x00$_})/256 } unpack '(a3)*', $packed; $len == @unpacked or die qq{bad len: unpacked array}; ;; is_deeply \@unpacked, \@to_pack, sprintf('pass %d: len %d', $pass, $len); } } " ok 1 - pass 1: len 1 ok 2 - pass 1: len 2 ... (4996 lines elided) ... ok 4999 - pass 200: len 24 ok 5000 - pass 200: len 25 1..5000
        my $packed = pack '(l<X)*', @array; is more concise, probably faster

        That's clever, but does have the limitation that you cannot produce a big-endian stream.

        Mind, I don't know if there is any hardware that accepts 24-bit BE values.

        A shame there seems to be no way to get rid of the final map on the unpacking side.

        Agreed. This isn't the first time that I've wished that pack would allow the insertion of values from the template.

        I've also wished for a more generic version of vec that allowed arbitrary numbers of bits, rather than just powers of 2.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
Re: 3-byte representation
by ikegami (Patriarch) on Oct 12, 2011 at 17:50 UTC

    If you want big-endian byte order:

    pack: $s = substr(pack('l>', $n), 1); unpack: $n = unpack('l>', "$s\0")/256;

    If you want little-endian byte order:

    pack: $s = substr(pack('l<', $n), 0, 3); unpack: $n = unpack('l<', "\0$s")/256;

    PS — Going from 3 to 4 increases the space needed by 33%, not 25%.

    Update: Slightly simpler solutions.
    Update: The /256 got dropped when I posted my solution. Fixed.

      Hm?

      Big-endian:

      sub pack24{ substr( pack('l>', $_[0]), 1) };; sub unpack24{ unpack('l>', "$_[0]\0") };; print "$_: ", unpack24( pack24( $_ ) ) for ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 );; -8388608: -2147483648 -8388607: -2147483392 -2: -512 -1: -256 0: 0 1: 256 2: 512 8388606: 2147483136 8388607: 2147483392

      Little-endian:

      sub pack24{ substr( pack('l<', $_[0]), 1) };; sub unpack24{ unpack('l<', "$_[0]\0") };; print "$_: ", unpack24( pack24( $_ ) ) for ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 );; -8388608: 16744448 -8388607: 16744448 -2: 16777215 -1: 16777215 0: 0 1: 0 2: 0 8388606: 32767 8388607: 32767

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Thanks, a "/256" got accidentally dropped when I posted my code. Fixed.

      I tested my solutions before posting them.

      Really? Really, really? Hm?

      #! perl -slw use strict; sub pack24be{ substr( pack('l>', $_[0]), 1) } sub unpack24be{ unpack('l>', "$_[0]\0") >> 8 } sub pack24le{ substr( pack('l<', $_[0] ), 0, 3); } sub unpack24le{ unpack( 'l<', "\0$_[0]" ) >> 8 } print( "be: $_: ", unpack24be( pack24be( $_ ) ) ), print( "le: $_: ", unpack24le( pack24le( $_ ) ) ) for ( -8388608, -8388607, -2, -1, 0, 1, 2, 8388606, 8388607 ) __END__ C:\test>junk101 (version 2 ) be: -8388608: -2147483648 le: -8388608: -2147483648 be: -8388607: -2147483392 le: -8388607: -2147483392 be: -2: -512 le: -2: -512 be: -1: -256 le: -1: -256 be: 0: 0 le: 0: 0 be: 1: 256 le: 1: 256 be: 2: 512 le: 2: 512 be: 8388606: 2147483136 le: 8388606: 2147483136 be: 8388607: 2147483392 le: 8388607: 2147483392 C:\test>junk101 (version 3) be: -8388608: 72057594029539328 le: -8388608: 72057594029539328 be: -8388607: 72057594029539329 le: -8388607: 72057594029539329 be: -2: 72057594037927934 le: -2: 72057594037927934 be: -1: 72057594037927935 le: -1: 72057594037927935 be: 0: 0 le: 0: 0 be: 1: 1 le: 1: 1 be: 2: 2 le: 2: 2 be: 8388606: 8388606 le: 8388606: 8388606 be: 8388607: 8388607 le: 8388607: 8388607

      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
      Update: The >> 8 got dropped when I posted my solution. Fixed.

      Silently becomes

      Update: The /256 got dropped when I posted my solution. Fixed.

      Hm. "Got dropped" huh!

        Thank you for all your gentle answers, but if I run this:
        #!/usr/bin/perl print "Content-type: text/html\n\n"; srand(); $fil="ch"; open(OUT, '>>'.$fil); for($i=1;$i<10001;$i++){ $j=int(rand(20000))+440000; $k=substr(pack('l>',$j),1); print OUT $k; } close(OUT); print "DONE !";
        it produces one file with a variable length, but always greater than 30000 characters (for exemple 30045) ! So I suppose something is wrong in the pack line coding...
        It was /256. Just misremembered cause I changed comps in the middle.
Re: 3-byte representation
by cavac (Parson) on Oct 12, 2011 at 16:38 UTC
    Edit: Ooops, my code only works for unsigned integers. I'll leave it here anyway, might at least come in handy for someone else at a later time... Ok, Here's a simple example of how i did it as a test:
    #!/usr/bin/perl use strict; use warnings; my $val = 70000; print "Saving Value: $val\n"; # ENCODE my @bytes; for(my $i = 0; $i < 3; $i++) { $bytes[$i] = $val % 256; if($val) { $val = int($val / 256); } } my $savebytes = pack("ccc", @bytes); # DECODE my @newbytes = unpack("ccc", $savebytes); my $newval = 0; while(scalar @newbytes) { my $byte = pop @newbytes; $newval = $newval * 256; $newval += $byte; } print "Retrieved Value is: $newval\n";

    Here's the output:
    Saving Value: 70000 Retrieved Value is: 70000
    I'm pretty sure there is a better way, but i'm not an expert on all this math stuff... Edit: If you have a lot of numbers to encode/decode, you might want to consider putting that stuff into a C module or use Inline::C for speed.
    Don't use '#ff0000':
    use Acme::AutoColor; my $redcolor = RED();
    All colors subject to change without notice.
Re: 3-byte representation
by TomDLux (Vicar) on Oct 18, 2011 at 15:10 UTC

    How many numbers do you have in a file? How many files? How many files can you store on a $65 2-TB drive?

    The basic premise of Unix is to store data as text, if at all possible. This makes it simple to process it using utilities you hadn't considered when the file was created.

    Your numbers fit in 24 bits, so +/- 8,366,608 ... in fact possibly a smaller range, since you suggest adding a constant to shift the numbers to all-positive. If the numbers are evenly distributed, storing as text requires a separator plus 1-7 digits, plus a possible minus sign. That works out to an average of 5 bytes per number, if you're using ASCII. If there is any asymmetry to make small values more likely than large ones, it might be better than 5 bytes. For a small loss you are now able to feed your files through grep, dc, tr, sed, awk, perl.

    As Occam said: Entia non sunt multiplicanda praeter necessitatem.

      Hello TomDLux and thank for your answer ! I've several thousands couples of numbers (in fact latitudes and longitudes, at first with 4 decimal precision. but converted to integers during the creation of a file). Physical storage place is not an issue, only memory space if the fastest parsing solution is memory intensive, because many different files will be processed by many different users at the same time......

        Considering a process is allocated megabytes of memory when it runs, worrying about a few thousand kilobytes of wasted space is not really productive.

        Wasting memory or wasting CPU resources is never good, of course, but correctness is the first priority. Once you have a solution that works correctly, tighten up memory use, IF it presents a problem in the number of instances you can run; tighten up the algorithmns involved, IF there's a problem with run time.

        No problem? Go on to the next task.

        As Occam said: Entia non sunt multiplicanda praeter necessitatem.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://931045]
Approved by Corion
Front-paged by Old_Gray_Bear
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (6)
As of 2024-04-16 20:40 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found