Beefy Boxes and Bandwidth Generously Provided by pair Networks
XP is just a number
 
PerlMonks  

Run length encode a bit vector

by Anonymous Monk
on Jan 05, 2012 at 06:57 UTC ( [id://946331]=perlquestion: print w/replies, xml ) Need Help??

Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have some large (multiple million bits) bit vectors, and I'd like to run length encode them to save storage.

Given the input: my $bv = pack 'N', 0x01020304; I'd like to get out an array

-1, 9, -1, 6, -2, 8, -1, 5

Where negative counts represent set bits and positive counts clear bits.

Searches for Perl and run length encoding all seem to turn up the rosetta stone thing of RLE strings: '"wwwbbbw" => 'w3b3w1' using regex.

I know I need to use vec(), but I keep getting lost in the state changes and end up with huge nest of ifs and garbage output.

Any help appreciated.

Replies are listed 'Best First'.
Re: Run length encode a bit vector
by GrandFather (Saint) on Jan 05, 2012 at 07:45 UTC

    Show us what you have tried and what it generates for your sample data.

    Note that if you need speed for this task Perl probably isn't what you should use.

    True laziness is hard work
    A reply falls below the community's threshold of quality. You may see it by logging in.
Re: Run length encode a bit vector
by johngg (Canon) on Jan 05, 2012 at 10:59 UTC

    I'm not sure your expected output is making sense as I think you have probably mis-counted the number of zeros in the second term because there is actually a run of eight zeros there.

    $ perl -E 'say unpack q{b*}, pack q{N}, 0x01020304;' 10000000010000001100000000100000

    As well as that, you seem to be parsing your data with the low-order bit first in each byte even though you have packed in network order. I would expect you bit vector to look like this

    $ perl -E 'say unpack q{B*}, pack q{N}, 0x01020304;' 00000001000000100000001100000100

    with expected output

    7, -1, 6, -1, 7, -2, 5, -1, 2

    Perhaps you could clarify your requirement.

    Cheers,

    JohnGG

      -- I think you have probably mis-counted the number of zeros in the second term because there is actually a run of eight zeros there.

      Yes. I miscounted.

      -- As well as that, you seem to be parsing your data with the low-order bit first in each byte even though you have packed in network order.

      Actually irrelevant as it is just sample data. The actual vector to be compressed is built with vec(), so 'b' is the right template.

        Actually irrelevant as it is just sample data.

        Can you please provide an example of relevant data and the relevant output you expect from that data?

        The actual vector to be compressed is built with vec()...

        Can you say why vec must be used? pack/unpack seem better suited.

Re: Run length encode a bit vector
by TJPride (Pilgrim) on Jan 05, 2012 at 07:57 UTC
    Not sure if this helps or not:

    use strict; use warnings; my $bytes = pack 'N', 0x01020304; my $bits = unpack('B32', $bytes); my @counts; while ($bits =~ /(0+|1+)/g) { push @counts, length($1) * (substr($1, 0, 1) == 1 ? 1 : -1); } print "$bits\n@counts";

    Output:

    00000001000000100000001100000100 -7 1 -6 1 -7 2 -5 1 -2

      The parent could use B* instead of B32.

      Methods that rely on unpack('B*', $bytes) can even handle arbitrary long packed string if they are packed in "network" (big endian) order.

      my $bytes = pack 'N*', 0x01020304, 0x05060708; my $bits = unpack 'B*', $bytes; ...

      (By the way, the == 1 is superfluous in the parent's code.)

        Odd, I would have thought that 0 as a string would still trigger true, and that in this context the pattern match would return 0 as a string. But I tested just now and you're correct, it works the same way without the ==.
Re: Run length encode a bit vector
by AnomalousMonk (Archbishop) on Jan 05, 2012 at 14:04 UTC

    My $0.02. I've used the bit-ordering implied in the 'irrelevant' OP.

    >perl -wMstrict -le "my $bits = unpack 'b*', pack 'N*', 0x01020304, 0x11121314; print qq{'$bits'}; ;; $bits =~ s{ ((.) \2*) } { ($2 eq '1' ? '-' : '') . length($1) . ',' }xmsge; print qq{'$bits'}; ;; my @n = map 0+$_, split /,/, $bits; print qq{@n}; " '1000000001000000110000000010000010001000010010001100100000101000' '-1,8,-1,6,-2,8,-1,5,-1,3,-1,4,-1,2,-1,3,-2,2,-1,5,-1,1,-1,3,' -1 8 -1 6 -2 8 -1 5 -1 3 -1 4 -1 2 -1 3 -2 2 -1 5 -1 1 -1 3

    Update: Maybe better:

    >perl -wMstrict -le "my $bits = unpack 'b*', pack 'N*', 0x01020304, 0x11121314; print qq{'$bits'}; ;; my @n = map { my $l = length; chop eq '1' ? -$l : $l; } $bits =~ m{ 0+ | 1+ }xmsg ; print qq{@n}; " '1000000001000000110000000010000010001000010010001100100000101000' -1 8 -1 6 -2 8 -1 5 -1 3 -1 4 -1 2 -1 3 -2 2 -1 5 -1 1 -1 3
Re: Run length encode a bit vector
by BrowserUk (Patriarch) on Jan 05, 2012 at 16:02 UTC

    sub rle { my $rVec = shift; my $bits = length( $$rVec ) *8; my $i = vec( $$rVec, 0, 1 ); my @counts; my $p = 1; while( $p < $bits ) { push @counts, 1; ++$counts[ -1 ] while $p < $bits and vec( $$rVec, $p++, 1 ) == + $i; $counts[ -1 ] *= -1 if $i; $i ^= 1; } return \@counts; } ;; $v = pack 'N', 0x01020304;; print unpack 'b*', $v;; 10000000010000001100000000100000 print @{ rle( \$v ) };; -1 8 -1 6 -2 8 -1 5

    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

    The start of some sanity?

Re: Run length encode a bit vector
by johngg (Canon) on Jan 05, 2012 at 23:46 UTC

    Because you are compressing sequences of bits rather than bytes I am not convinced that you will see a saving in space. It will depend on whether you have longish sequences of contiguous ones or zeros. If there are many short runs or singletons, your array of counts is likely to take more space than the vector. Unless I'm misunderstanding something, you need to have runs of more than eight ones or zeros before you break even if representing each count as a signed character; it gets worse if having to use shorts or longs.

    Packing the array as signed characters would cater for up to 256128 contiguous ones or 255127 contiguous zeros. The following code uses a regex against a string representation of the vector but a pure vec solution as suggested by BrowserUk would be better. However, it illustrates my point.

    use strict; use warnings; use 5.010; my $vector = pack q{N*}, 0x01020304, 0x11121314; my $bitStr = unpack q{b*}, $vector; say $bitStr; say q{=} x 20; my $rcSign = do { my $val = $bitStr =~ m{^1} ? 1 : -1; sub { $val *= -1 } }; my @posns = ( 0 ); my @lengths; while ( $bitStr =~ m{(?x) (?: (?<=0)(?=1) | (?<=1)(?=0) | (?=\z) )}g ) { push @posns, pos $bitStr; push @lengths, ( $posns[ -1 ] - shift @posns ) * $rcSign->(); } my $rle = pack q{c*}, @lengths; say q{Length of vector - }, length $vector; say q{Length of packed RLE array - }, length $rle; say q{=} x 20; say qq{@lengths}; my @restored = unpack q{c*}, $rle; say qq{@restored}; say q{=} x 20;

    The output.

    1000000001000000110000000010000010001000010010001100100000101000 ==================== Length of vector - 8 Length of packed RLE array - 24 ==================== -1 8 -1 6 -2 8 -1 5 -1 3 -1 4 -1 2 -1 3 -2 2 -1 5 -1 1 -1 3 -1 8 -1 6 -2 8 -1 5 -1 3 -1 4 -1 2 -1 3 -2 2 -1 5 -1 1 -1 3 ====================

    As I say, I may be misunderstanding something or your vector might consist of very long contiguous runs. I'd be interested to know whether you will see a benefit using RLE.

    Update: Corrected the senior moment with maximum -ve and +ve values that can be held in a byte.

    Cheers,

    JohnGG

      -- It will depend on whether you have longish sequences of contiguous ones or zeros.

      The one set (of 25 sets) of indexes that I've analyzed, consists of 88 x 31MB vectors. They vary between 86% and 98% sparse (by zero bytes rather than bits).

      The largest 0 runs range between 8 and 12 million bits. The largest 1 run is 67 bits. By packing the run counts as 0/1 pairs into 32-bit words, 24-bits for the 0 runs and 8-bits for the one run, I can reduce the size by more 2/3rds and am still able to perform boolean operations with decompressing first.

      For the underlying principles see http://crd.lbl.gov/~kewu/ps/LBNL-49626.pdf.

Re: Run length encode a bit vector
by sundialsvc4 (Abbot) on Jan 06, 2012 at 14:12 UTC

    What happens if you use zip or tar or bzip2 to compress the files right now?   These “off the shelf” algorithms use compression techniques such as these, and more, and .. who knows .. might they be sufficient as-is for this task?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://946331]
Approved by lidden
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others goofing around in the Monastery: (6)
As of 2024-03-29 09:09 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found