http://qs321.pair.com?node_id=11107862


in reply to bit array comparison

So, I had some fun with this one. There are two things you can optimize for here: space, and time. Let's look at both.

Space

You seem to want to optimize for space, so let's get that out of the way first with Devel::Size:

@set1 is an array of three keywords.

Sizes: \@set1: 112 bytes, 364 bytes total set_to_bits(@set1): 24 bytes, 24 bytes total set_to_string(@set1): 44 bytes, 44 bytes total

This size doesn't matter unless you are storing the list of keywords for all rows at once, rather than processing one row at a time, or at least just storing the Jaccard similarity (which as you know is a real number 0 < n < 1).

Time

I've considered three options: bitwise operations, string operations, and CPAN module Set::Similarity::Jaccard. Of those, the results are interesting:

Performance: Rate $jac->sim sim only jaccard_string jaccard_bits $jac->sim 63480/s -- -21% -62% -80% sim only 79868/s 26% -- -53% -75% jaccard_hash 168300/s 165% 111% -- -47% jaccard_bits 318317/s 401% 299% 89% --

"sim only" uses pre-computed arguments to $jac->similarity, so that number is pretty much the maximum speed one could expect from Set::Similarity::Jaccard. "$jac->sim" computes the string representation of both sets every time, so it's a little slower.

"jaccard_hash" and "jaccard_bits" are my own implementations which work directly with hashes or bitwise representations of each set, respectively.

Whether any of this is worth it or not is a question I can't answer for you. This code could still be optimized further, but I've already had all the fun I have time for today. :-)

Full code (~100 lines) is below the <readmore>. Error checking is up to you.

use 5.026; use strict; use warnings; use Benchmark qw/cmpthese/; use Devel::Size qw/size total_size/; use Set::Similarity::Jaccard; my @all = qw< alpha bravo charlie delta echo foxtrot >; my %mask; $mask{ $all[$_] } = 2 << $_ for 0..$#all; my %first; $first{$_} = substr $_, 0, 1 for @all; # For set_to_string my @set1 = qw< alpha charlie delta >; my @set2 = qw< alpha delta echo foxtrot >; # Convert a list to bits. e.g. qw<alpha charlie echo foxtrot> : 0b1010 +11 sub set_to_bits { my $bits = 0; $bits |= $mask{$_} for @_; $bits; } # Convert a list to a string representation. Note we can't just work w +ith the # full keywords here, as Set::Similarity::Jaccard would look at the ke +ywords # character-wise, which would give an incorrect result. # e.g.: qw<alpha charlie echo> : 'ace' sub set_to_string { join '', map { $first{$_} } @_; } # Compute Jaccard similarity of two sets, using hashes only sub jaccard_hash { my ($set1, $set2) = @_; my %set1 = map { $_ => 1 } @$set1; my %set2 = map { $_ => 1 } @$set2; my $int_count = 0; $int_count += 1 for grep { $set1{$_} && $set2{$_} } keys %set1; my $uni_count = 0; $uni_count += 1 for grep { $set1{$_} || $set2{$_} } @all; $uni_count ? $int_count/$uni_count : 1; } # Compute Jaccard similarity of two sets by converting to bits first ( +inlined) sub jaccard_bits { my ($set1, $set2) = @_; my ($bits1, $bits2) = (0,0); # Inline set_to_bits for speed $bits1 |= $mask{$_} for @$set1; $bits2 |= $mask{$_} for @$set2; # Calculate intersection and union my $int = $bits1 & $bits2; my $uni = $bits1 | $bits2; # Compute Hamming weights of $int and $uni to get # of bits set (i +nlined) my $ic = $int; $ic -= (($ic >> 1) & 0x55555555); $ic = ($ic & 0x33333333) + (($ic >> 2) & 0x33333333); $ic = ((($ic + ($ic >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24; my $uc = $uni; $uc -= (($uc >> 1) & 0x55555555); $uc = ($uc & 0x33333333) + (($uc >> 2) & 0x33333333); $uc = ((($uc + ($uc >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24; $uc ? $ic/$uc : 1; } # Verification my $jac = Set::Similarity::Jaccard->new; my $sim = $jac->similarity(set_to_string(@set1), set_to_string(@set2)) +; say "CPAN Jaccard: $sim"; say "Bits Jaccard: " . jaccard_bits(\@set1, \@set2); say "Hash Jaccard: " . jaccard_hash(\@set1, \@set2); printf "\nSizes:\n"; printf " %20s: %4d bytes, %4d bytes total\n", $_, eval "size($_)", eval "total_size($_)" for qw< \@set1 set_to_bits(@set1) set_to_string(@set1) >; printf "\nPerformance:\n"; my $str1 = set_to_string(@set1); my $str2 = set_to_string(@set2); my $r; cmpthese(-5, { 'similarity only' => sub { $r = Set::Similarity::Jaccard->new->similarity($str1, $str2) }, '$jac->similarity' => sub { $r = Set::Similarity::Jaccard->new->similarity(set_to_string(@set1) +, set_to_string(@set2) +) }, 'jaccard_bits' => sub { $r = jaccard_bits(\@set1, \@set2) }, 'jaccard_hash' => sub { $r = jaccard_hash(\@set1, \@set2) }, });
use strict; use warnings; omitted for brevity.