http://qs321.pair.com?node_id=11107846

Amendil has asked for the wisdom of the Perl Monks concerning the following question:

Hello Perl Monks,

I'm working on a TSV file in which one column is a comma-separated list of keywords (28 unique values). I'd like to compute the Jaccard index (size of the intersection divided by size of the union) between these keyword lists. To do so efficiently, I'd like to represent each list of keywords as a bit array.
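To make the idea concrete, here is a minimal sketch of the approach, not a definitive implementation. It assumes each keyword gets a fixed bit position in a plain integer mask. The vocabulary (`red`, `green`, ...) and the helper names `keywords_to_mask`, `popcount` and `jaccard` are hypothetical, made up for illustration. The bit count here is taken by formatting the integer with `sprintf '%b'` and counting the `1` characters, which is a different technique from an `unpack` checksum.

```perl
use strict;
use warnings;

# Hypothetical keyword vocabulary; the real file has 28 distinct values.
my @vocabulary = qw(red green blue yellow);

# Map each keyword to its own bit position (red => 0, green => 1, ...).
my %bit_for;
@bit_for{@vocabulary} = (0 .. $#vocabulary);

# Turn a comma-separated keyword list into an integer bitmask.
sub keywords_to_mask {
    my ($csv) = @_;
    my $mask = 0;
    for my $keyword (split /\s*,\s*/, $csv) {
        die "unknown keyword '$keyword'" unless exists $bit_for{$keyword};
        $mask |= 1 << $bit_for{$keyword};
    }
    return $mask;
}

# Count set bits by rendering the integer in binary and counting '1' characters.
sub popcount {
    my ($mask) = @_;
    my $binary = sprintf '%b', $mask;
    return ($binary =~ tr/1//);
}

# Jaccard index of two masks; defined here as 0 when both sets are empty.
sub jaccard {
    my ($x, $y) = @_;
    my $union = popcount($x | $y);
    return 0 unless $union;
    return popcount($x & $y) / $union;
}

my $first  = keywords_to_mask('red,green');    # bits 0 and 1 set: 3
my $second = keywords_to_mask('green,blue');   # bits 1 and 2 set: 6
printf "Jaccard index: %.3f\n", jaccard($first, $second);   # prints 0.333
```

With 28 keywords the masks fit comfortably in a native integer, so the numeric `&` and `|` operators are enough for this sketch.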

I've read a few articles on PerlMonks and Stack Overflow, but so far I feel I'm missing something completely obvious.

Here is what I wrote:

use common::sense;
my $a = '';
my $b = '';
$a += 1 << 0;
$a += 1 << 1;
$b += 1 << 1;
$b += 1 << 2;
my $i = $a & $b;
my $u = $a | $b;
my $i_cnt = unpack '%32b*', $i;
my $u_cnt = unpack '%32b*', $u;
printf "a is %#032b %d\n", $a, $a;
printf "b is %#032b %d\n", $b, $b;
printf "intersection is %#032b %d\n", $i, $i;
printf "union is %#032b %d\n", $u, $u;
say "set bit count in intersection: $i_cnt";
say "set bit count in union: $u_cnt";

Actual result:

a is 0b000000000000000000000000000011 3
b is 0b000000000000000000000000000110 6
intersection is 0b000000000000000000000000000010 2
union is 0b000000000000000000000000000111 7
set bit count in intersection: 3
set bit count in union: 5

Expected result:

a is 0b000000000000000000000000000011 3
b is 0b000000000000000000000000000110 6
intersection is 0b000000000000000000000000000010 2
union is 0b000000000000000000000000000111 7
set bit count in intersection: 1
set bit count in union: 3