So, I had some fun with this one. There are two things you can optimize for here: space, and time. Let's look at both.
Space
You seem to want to optimize for space, so let's get that out of the way first with Devel::Size:
@set1 is an array of three keywords.
Sizes:
\@set1: 112 bytes, 364 bytes total
set_to_bits(@set1): 24 bytes, 24 bytes total
set_to_string(@set1): 44 bytes, 44 bytes total
This size doesn't matter unless you are storing the list of keywords for all rows at once, rather than processing one row at a time, or at least just storing the Jaccard similarity (which as you know is a real number 0 < n < 1).
Time
I've considered three options: bitwise operations, string operations, and CPAN module Set::Similarity::Jaccard. Of those, the results are interesting:
Performance:
Rate $jac->sim sim only jaccard_string jaccard_bits
$jac->sim 63480/s -- -21% -62% -80%
sim only 79868/s 26% -- -53% -75%
jaccard_hash 168300/s 165% 111% -- -47%
jaccard_bits 318317/s 401% 299% 89% --
"sim only" uses pre-computed arguments to $jac->similarity, so that number is pretty much the maximum speed one could expect from Set::Similarity::Jaccard. "$jac->sim" computes the string representation of both sets every time, so it's a little slower.
"jaccard_hash" and "jaccard_bits" are my own implementations which work directly with hashes or bitwise representations of each set, respectively.
Whether any of this is worth it or not is a question I can't answer for you. This code could still be optimized further, but I've already had all the fun I have time for today. :-)
Full code (~100 lines) is below the <readmore>. Error checking is up to you.
use 5.026;
use strict;
use warnings;
use Benchmark qw/cmpthese/;
use Devel::Size qw/size total_size/;
use Set::Similarity::Jaccard;
my @all = qw< alpha bravo charlie delta echo foxtrot >;
my %mask; $mask{ $all[$_] } = 2 << $_ for 0..$#all;
my %first; $first{$_} = substr $_, 0, 1 for @all; # For set_to_string
my @set1 = qw< alpha charlie delta >;
my @set2 = qw< alpha delta echo foxtrot >;
# Convert a list to bits. e.g. qw<alpha charlie echo foxtrot> : 0b1010
+11
sub set_to_bits {
my $bits = 0;
$bits |= $mask{$_} for @_;
$bits;
}
# Convert a list to a string representation. Note we can't just work w
+ith the
# full keywords here, as Set::Similarity::Jaccard would look at the ke
+ywords
# character-wise, which would give an incorrect result.
# e.g.: qw<alpha charlie echo> : 'ace'
sub set_to_string {
join '', map { $first{$_} } @_;
}
# Compute Jaccard similarity of two sets, using hashes only
sub jaccard_hash {
my ($set1, $set2) = @_;
my %set1 = map { $_ => 1 } @$set1;
my %set2 = map { $_ => 1 } @$set2;
my $int_count = 0;
$int_count += 1 for grep { $set1{$_} && $set2{$_} } keys %set1;
my $uni_count = 0;
$uni_count += 1 for grep { $set1{$_} || $set2{$_} } @all;
$uni_count ? $int_count/$uni_count : 1;
}
# Compute Jaccard similarity of two sets by converting to bits first (
+inlined)
sub jaccard_bits {
my ($set1, $set2) = @_;
my ($bits1, $bits2) = (0,0);
# Inline set_to_bits for speed
$bits1 |= $mask{$_} for @$set1;
$bits2 |= $mask{$_} for @$set2;
# Calculate intersection and union
my $int = $bits1 & $bits2;
my $uni = $bits1 | $bits2;
# Compute Hamming weights of $int and $uni to get # of bits set (i
+nlined)
my $ic = $int;
$ic -= (($ic >> 1) & 0x55555555);
$ic = ($ic & 0x33333333) + (($ic >> 2) & 0x33333333);
$ic = ((($ic + ($ic >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24;
my $uc = $uni;
$uc -= (($uc >> 1) & 0x55555555);
$uc = ($uc & 0x33333333) + (($uc >> 2) & 0x33333333);
$uc = ((($uc + ($uc >> 4)) & 0x0f0f0f0f) * 0x01010101) >> 24;
$uc ? $ic/$uc : 1;
}
# Verification
my $jac = Set::Similarity::Jaccard->new;
my $sim = $jac->similarity(set_to_string(@set1), set_to_string(@set2))
+;
say "CPAN Jaccard: $sim";
say "Bits Jaccard: " . jaccard_bits(\@set1, \@set2);
say "Hash Jaccard: " . jaccard_hash(\@set1, \@set2);
printf "\nSizes:\n";
printf " %20s: %4d bytes, %4d bytes total\n",
$_, eval "size($_)", eval "total_size($_)"
for qw< \@set1 set_to_bits(@set1) set_to_string(@set1) >;
printf "\nPerformance:\n";
my $str1 = set_to_string(@set1);
my $str2 = set_to_string(@set2);
my $r;
cmpthese(-5, {
'similarity only' => sub { $r =
Set::Similarity::Jaccard->new->similarity($str1, $str2) },
'$jac->similarity' => sub { $r =
Set::Similarity::Jaccard->new->similarity(set_to_string(@set1)
+,
set_to_string(@set2)
+) },
'jaccard_bits' => sub { $r = jaccard_bits(\@set1, \@set2) },
'jaccard_hash' => sub { $r = jaccard_hash(\@set1, \@set2) },
});
use strict; use warnings; omitted for brevity.