If I do it when I want a similarity I have to compare against every single set in my database every time, which seems a little suboptimal.
The beauty of using bitstrings is that the comparisons are very fast. For example, the following compares each of 15,000 bitstrings against the sample input in less that 1/10th of a second:
#! perl -slw
use strict;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000;
use Time::HiRes qw[ time ];
sub comp {
my( $a, $b ) = @_;
my $total = unpack( '%32b*', $a | $b );
return ( unpack( '%32b*', $a & $b ) / $total )
- ( unpack( '%32b*', $a ^ $b ) / $total );
}
our $N //= 20;
my @sets = map pack( 'Q', int rand 2**64 ), 1 .. $N;
our $S //= '0101010101010101010101010101010101010101010101010101010101
+010101';
$S = pack 'b*', $S;
our $T //= 0.8;
my $start = time;
print 'Input: ', unpack 'b*', $S;
for ( 0 .. $#sets ) {
my $score = comp( $sets[ $_ ], $S );
if( $score > $T ) {
printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ];
}
}
printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $
+start;
__END__
C:\test>996530 -N=15000 -S=1111111111111111111111111111111100000000000
+000000000000000000000 -T=0.25
Input: 111111111111111111111111111111110000000000000000000000000000
+0000
0.280000: 111111111111111111111111111111111111111111111111100000000000
+0001
0.254902: 111111111111111111111111111111111111111111111111100000000010
+0001
0.254902: 111111111111111111111111111111111111111111111111100000000000
+1001
0.254902: 111111111111111111111111111111111111111111111111101000000000
+0001
0.254902: 111111111111111111111111111111111111111111111111100000000000
+1001
0.254902: 111111111111111111111111111111111111111111111111100000000100
+0001
Comparing 1 against 15000 sets, took: 0.069085 seconds
The bitstrings used can present a corpus of up to 64 phrases -- just for convenience -- but even if you multiply the size of the corpus by 10: #! perl -slw
use strict;
use Data::Dump qw[ pp ]; $Data::Dump::WIDTH = 1000;
use Time::HiRes qw[ time ];
sub comp {
my( $a, $b ) = @_;
my $total = unpack( '%32b*', $a | $b );
return ( unpack( '%32b*', $a & $b ) / $total )
- ( unpack( '%32b*', $a ^ $b ) / $total );
}
our $N //= 20;
my @sets = map{ pack( 'Q', int rand 2**64 ) x 10 } 1 .. $N;
our $S //= '0101010101010101010101010101010101010101010101010101010101
+010101';
$S = pack( 'b*', $S ) x 10;
our $T //= 0.8;
my $start = time;
print 'input: ', unpack 'b*', $S;
for ( 0 .. $#sets ) {
my $score = comp( $sets[ $_ ], $S );
if( $score > $T ) {
printf "%8.6f: %s\n", $score, unpack 'b*', $sets[ $_ ];
}
}
printf "Comparing 1 against $N sets, took: %.6f seconds\n", time() - $
+start;
__END__
C:\test>996530 -N=15000 -T=0.1
input: 010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
010101010101010101010101010101010101010101010101010101010101
+01010101010101010101
0.103448:
111111111111111111111111111111111111111111111111101010111010
+10111111111111111111
111111111111111111111111111111111010101110101011111111111111
+11111111111111111111
111111111111111110101011101010111111111111111111111111111111
+11111111111111111111
101010111010101111111111111111111111111111111111111111111111
+11111010101110101011
111111111111111111111111111111111111111111111111101010111010
+10111111111111111111
111111111111111111111111111111111010101110101011111111111111
+11111111111111111111
111111111111111110101011101010111111111111111111111111111111
+11111111111111111111
101010111010101111111111111111111111111111111111111111111111
+11111010101110101011
0.103448:
111111111111111111111111111111111111111111111111110111010101
+01011111111111111111
111111111111111111111111111111111101110101010101111111111111
+11111111111111111111
111111111111111111011101010101011111111111111111111111111111
+11111111111111111111
110111010101010111111111111111111111111111111111111111111111
+11111101110101010101
111111111111111111111111111111111111111111111111110111010101
+01011111111111111111
111111111111111111111111111111111101110101010101111111111111
+11111111111111111111
111111111111111111011101010101011111111111111111111111111111
+11111111111111111111
110111010101010111111111111111111111111111111111111111111111
+11111101110101010101
0.122807:
111111111111111111111111111111111111111111111111110101010101
+01011111111111111111
111111111111111111111111111111111101010101010101111111111111
+11111111111111111111
111111111111111111010101010101011111111111111111111111111111
+11111111111111111111
110101010101010111111111111111111111111111111111111111111111
+11111101010101010101
111111111111111111111111111111111111111111111111110101010101
+01011111111111111111
111111111111111111111111111111111101010101010101111111111111
+11111111111111111111
111111111111111111010101010101011111111111111111111111111111
+11111111111111111111
110101010101010111111111111111111111111111111111111111111111
+11111101010101010101
Comparing 1 against 15000 sets, took: 0.069599 seconds
The time required barely changes: 0.065085 .v. 0.069599, despite doing 10 times the amount of work.
Used with imagination, bitstrings are the best kept secret of of data processing.
With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP Neil Armstrong
|