http://qs321.pair.com?node_id=732843

In a reply to Netflix (or on handling large amounts of data efficiently in perl), I suggested that Judy::1/Judy1(3) was a sparse bit vector which might be interesting as a replacement for vec(). Perl's built-in vec can also be used as a bit vector but it's not sparse - all bits between the zeroth to the highest ever set are instantiated inside a perl string. This is only convenient if your bit vector isn't too big in ram.

Example code

A simple perl bit vector

use constant MEGABYTE => 2 ** 20; my $vector = ''; # 5,000 bits. Set every 100th bit between 15 million and 20 million. for ( 15_000_000 .. 20_000_000 ) { if ( ! ( $_ % 100 ) ) { vec( $vector, $_, 1 ) = 1; } } printf "%0.1fM\n", length( $vector ) / MEGABYTE;

A simple Judy::1 bit vector

use constant KILOBYTE => 2 ** 10; use Judy::1 qw( Set MemUsed ); my $vec; # 5,000 bits. Set every 100th bit between 15 million and 20 million. for ( 15_000_000 .. 20_000_000 ) { if ( ! ( $_ % 100 ) ) { Set( $judy, $_ ); } } printf "%0.1fK\n", MemUsed( $judy ) / KILOBYTE;

Memory "benchmarks"

Densityvec()Judy::1
Tests using the entire range 0..20,000,000
Every bit2.3M618K
Densityvec()Judy::1
Tests using range 15,000,000..20,000,000
Every bit2.3M162K
Every 10th bit2.3M772K
Every 100th bit2.3M158K
Every 1000th bit2.3M65K

⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊

Replies are listed 'Best First'.
Re: Compact and sparse bit vector
by demerphq (Chancellor) on Jan 03, 2009 at 13:35 UTC

    The only thing is a lot of the time people use vec() specifically because they want the string to be a bit-vector. They dont want to use it as an array of <8 bit units of information so much, as a way to individually control bits in a string with a reasonably syntax.

    Although having said that, the problem with vec() in this form is that it doesnt use native byte order, so you cant normally pass it directly into a C routine. So i guess your proposal has some merit...

    ---
    $world=~s/war/peace/g

      Actually, the main use for bitvectors in the NetFlix challenge. is that you can do very fast relational math.

      So performing the select to find all the people who have rated all the same movies (and the movie we are trying to generate a rating for), that we already have a rating for by the person we are trying to generate a rating for, is just a tight loop over the dataset performing bitstring boolean operations which are very fast (in perl).

      Doing the same operations (AND/OR/XOR/etc. of every bit of one bitstring against every bit of another bitstring for every customer in the DB) using Judy arrays, would not benefit at all from any of their special properties (cache line locality), and be extremely slow.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.

        Heh, I kind zeroed in on the expression "was a sparse bit vector which might be interesting as a replacement for vec()", and didnt even notice the netflix reference. :-)

        ---
        $world=~s/war/peace/g

        I've never really found myself doing vector operations of bitmaps against each other. I didn't notice that being a feature of the linked Netflix node. I do occasionally find it convenient to have a sparse bitmask. Usually I just use a hash of the stringified integer.

        ⠤⠤ ⠙⠊⠕⠞⠁⠇⠑⠧⠊