Re: Fuzzy Searching: Optimizing Algorithm Selection

by BrowserUk (Patriarch)
on Nov 11, 2004 at 02:29 UTC


in reply to Fuzzy Searching: Optimizing Algorithm Selection

If you XOR two strings together,

print join '', unpack 'C*',
    'AGGCGGAAGCACCCAACAGCAACAG'
  ^ 'ACACGCAAAAACCGAAGAGAAAGCG';

0460040062000400400200420

The resultant string will be the null character (chr( 0 )) in each position where the two strings match, and some other char where they do not.

If you count the non-null bytes, it gives you a score for the fuzziness of the match.

print $_ = grep $_, unpack 'C*',
    'AGGCGGAAGCACCCAACAGCAACAG'
  ^ 'ACACGCAAAAACCGAAGAGAAAGCG';

10

To avoid having to process the large file of sequences multiple times, it is better to process all of the 25-ers against each sequence in turn. By using substr to step down the sequence, you can perform the XOR against each of the 25-ers for each position of the sequence.

Putting that all together, I got:

#! perl -slw
use strict;
use bytes;

$| = 1;

our $FUZZY ||= 10;

open FUZ, '< :raw', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";

my %fuz;
$fuz{ $_ } = '' while chomp( $_ = <FUZ> );

close FUZ;

print "Loaded ${ \scalar keys %fuz } 25-ers";

open SEQ, '< :raw', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

while( my $seq = <SEQ> ) {
    chomp $seq;

    for my $offset ( 0 .. length( $seq ) - 25 ) {
        printf "\rProcessing sequence %5d offset %05d", $., $offset;

        for my $fuz ( keys %fuz ) {
            my $m = grep $_, unpack 'C*', $fuz ^ substr( $seq, $offset, 25 );

            if( $m <= $FUZZY ) {
                ## This stores the lineno/offset/fuzziness where each 25-er
                ## matched, in a compact form for further processing; sorting etc.
                $fuz{ $fuz } .= pack 'nnn', $., $offset, $m;

                ## Or just print out the data to a file.
                # print "\nMatched '$fuz' -v- '",
                #       substr( $seq, $offset, 25 ),
                #       "' in line: $. @ $offset with fuzziness: ", $m;
            }
        }
    }
}

close SEQ;
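Because of the -s switch on the shebang line, the $FUZZY threshold can be overridden from the command line, switches before file names (hypothetical file names, for illustration only):

perl fuzzy.pl -FUZZY=8 25ers.txt sequences.txt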

This process is currently running 500,000 25-ers against a file of 30,000 sequences averaging 1,000 characters (500 to 1,500 each), all randomly generated, at the approximate rate of 3.5 seconds per offset, whilst recording the lineno/offset/fuzziness for all matches where the fuzziness is 10 or less. The advantage may be that this process gives you an accurate fuzziness factor for every 25-er matched against every position, in a single pass.

I don't have any feel for how this compares to using the regex engine to perform the matching, but I think it may run quicker.

If you could indicate how many hundreds of thousands of 25-ers you are playing with, plus the average length of the 30,000 sequences, it would be possible to calculate an approximate time for the process, which you could then compare with other approaches.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"Think for yourself!" - Abigail
"Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Replies are listed 'Best First'.
Re^2: Fuzzy Searching: Optimizing Algorithm Selection
by tachyon (Chancellor) on Nov 11, 2004 at 03:46 UTC

    At 3.5 seconds per offset it will complete in roughly 3 years ;-) Nonetheless XOR works quickly in Perl. A small optimisation is to take the substr outside the inner keys loop as you are doing it 499,999 more times than required per offset.
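    A minimal sketch of that hoist (toy data standing in for the input files; names follow the parent node):

        use strict;
        use warnings;

        my %fuz = ( 'AGGCGGAAGCACCCAACAGCAACAG' => '' );
        my $seq = 'ACACGCAAAAACCGAAGAGAAAGCGAGGCGGAAGCACCCAACAGCAACAG';

        for my $offset ( 0 .. length( $seq ) - 25 ) {
            my $window = substr( $seq, $offset, 25 );   # hoisted: once per offset
            for my $key ( keys %fuz ) {                 # inner loop now does only the XOR
                my $m = grep $_, unpack 'C*', $key ^ $window;
                print "offset $offset, fuzziness $m\n" if $m <= 10;
            }
        }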

    Real sample data would be nice. The benefits from indexing depend on the alphabet size and the number of find strings.

    cheers

    tachyon

      Update: Indeed. The two optimisations reduce the 500,000-comparison time from ~3.5 seconds to ~0.5 of a second. That reduces the projected overall runtime of my test scenario from 3+ years to under half a year. Worth doing :)

      Yes. I performed the same calculation based upon my choice of 500,000 25-ers; hence my request for clarification.

      There are several ways to speed up the processing. Using

      ( substr( $seq, $offset, 25 ) ^ $key25 ) =~ tr[\0][\0];

      to do the counting, rather than

      grep $_, unpack 'C*', ...

      is (probably) another.
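      For comparison, both counting approaches on the toy strings from the root node (the tr version counts matching NULs in place, so the fuzziness is 25 minus that count):

          use strict;
          use warnings;

          my $x = 'AGGCGGAAGCACCCAACAGCAACAG';
          my $y = 'ACACGCAAAAACCGAAGAGAAAGCG';

          # grep builds and filters a 25-element list per comparison...
          my $viaGrep = grep $_, unpack 'C*', $x ^ $y;

          # ...whereas tr counts the NUL (matching) bytes in place.
          my $viaTr = 25 - ( ( $x ^ $y ) =~ tr[\0][\0] );

          print "$viaGrep $viaTr\n";   # both give 10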

      I'm just waiting for a limited benchmark I have running to complete before trying several things.

      I had thought that by avoiding actually copying each 25-char string out of the sequence I might save some time/memory, but now that you've pointed it out, I realise that I can create an LVALUE ref to the substring and avoid 500,000 calls to substr for each inner loop. Thanks.
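      A minimal sketch of the LVALUE-ref idea (toy data; the ref is taken once per offset, and $$window then reads the current window with no further substr calls inside the keys loop):

          use strict;
          use warnings;

          my @keys = ( 'AGGCGGAAGCACCCAACAGCAACAG' );
          my $seq  = 'ACACGCAAAAACCGAAGAGAAAGCGAGGCGGAAGCACCCAACAGCAACAG';

          for my $offset ( 0 .. length( $seq ) - 25 ) {
              my $window = \substr( $seq, $offset, 25 );   # one substr per offset
              for my $key ( @keys ) {
                  my $fuz = 25 - ( ( $key ^ $$window ) =~ tr[\0][\0] );
                  print "offset $offset, fuzziness $fuz\n" if $fuz <= 10;
              }
          }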


      Examine what is said, not who speaks.
      "Efficiency is intelligent laziness." -David Dunham
      "Think for yourself!" - Abigail
      "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
Re^2: Fuzzy Searching: Optimizing Algorithm Selection
by kscaldef (Pilgrim) on Nov 11, 2004 at 03:28 UTC
    There are proprietary libraries (in C) that do this sort of thing in a couple of milliseconds per query string. The basic idea is that you use your dictionary of sequences to construct a finite automaton which either accepts or rejects your query string. This does require some cleverness to do fuzzy matching, and I'm afraid I don't know enough of the details to help you out there.

      A DFA implementation is the next step up from a trie. In fact, a trie can be viewed as an alternative representation of a DFA anchored at the string start.
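      As an illustration of the exact-matching case only, a minimal hash-of-hashes trie (toy 5-char keys; fuzzy acceptance would need a more elaborate automaton, which is the cleverness kscaldef alludes to):

          use strict;
          use warnings;

          my @keys = qw( AGGCG ACACG CCGAA );

          # Build the trie: one hash level per character; 'END' marks acceptance.
          my %trie;
          for my $key ( @keys ) {
              my $node = \%trie;
              $node = $node->{ $_ } ||= {} for split //, $key;
              $node->{ END } = $key;
          }

          # Run the "automaton" from every start position of the query string.
          my $seq = 'TTACACGTT';
          for my $start ( 0 .. length( $seq ) - 5 ) {
              my $node = \%trie;
              for my $char ( split //, substr( $seq, $start, 5 ) ) {
                  $node = $node->{ $char } or last;
              }
              print "Matched '$node->{ END }' at offset $start\n"
                  if $node and $node->{ END };
          }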

      ---
      demerphq

Re^2: Fuzzy Searching: Optimizing Algorithm (A few improvements).
by BrowserUk (Patriarch) on Dec 04, 2004 at 08:56 UTC

    In case Itatsumaki should ever come back, here's a somewhat improved implementation of my XOR algorithm. The original projection of 1.2 years runtime to process 100,000 x 1,000 x 30,000 is now reduced to 7.6 days:

    Update (2004/12/08): Updated code to correct an error that manifested itself when the sequence length was not an exact multiple of the key length (as noted below by demerphq).

    #! perl -slw
    use strict;
    use bytes;

    our $FUZZY ||= 2;

    open KEYS, '<', $ARGV[ 0 ] or die "$ARGV[ 0 ] : $!";
    my @keys = <KEYS>;
    close KEYS;
    chomp @keys;
    warn "Loaded ${ \scalar @keys } keys";

    open SEQ, '<', $ARGV[ 1 ] or die "$ARGV[ 1 ] : $!";

    my( $masked, $pos );
    my $totalLen = 0;
    my $count = 0;

    while( my $seq = <SEQ> ) {
        chomp $seq;
        my $seqLen = length $seq;
        $totalLen += $seqLen;

        for my $key ( @keys ) {
            my $keyLen = length $key;

            ## Repeat the key to cover the whole sequence, so one XOR tests
            ## every key-aligned position at this rotation.
            my $mask    = $key x ( int( $seqLen / $keyLen ) + 1 );
            my $maskLen = length $mask;

            ## Any match within $FUZZY mismatches must contain a run of at
            ## least this many NULs (pigeonhole), so index() for that first.
            my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) );
            my $minZlen  = length $minZeros;

            for my $offset1 ( 0 .. $keyLen - 1 ) {
                $masked = $mask ^ substr( $seq, $offset1, $maskLen );
                $pos = 0;

                while( $pos = 1 + index $masked, $minZeros, $pos ) {
                    $pos--;
                    my $offset2 = $pos - ( $pos % $keyLen );
                    last unless $offset1 + $offset2 + $keyLen <= $seqLen;

                    my $fuz = $keyLen
                        - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0][\0] );

                    if( $fuz <= $FUZZY ) {
                        printf "\tFuzzy matched key:'$key' -v- '%s' in line:"
                             . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n",
                             substr( $seq, $offset1 + $offset2, $keyLen ),
                             $., $offset1 + $offset2, $offset1, $offset2, $fuz;
                    }
                    $pos = $offset2 + $keyLen;
                }
            }
        }
    }

    warn "\n\nProcessed $. sequences";
    warn "Average length: ", $totalLen / $.;
    close SEQ;

    A couple of runs (on single sequences, for timing purposes) on data comparable to (produced by the same code as) other timings published elsewhere:

    [ 6:57:56.46] P:\test\demerphq> ..\406836-3 Fuzz-Words-W0025-S100000-WC100000-SC0001.fuzz Fuzz-Strings-W0025-S100000-WC100000-SC0001.fuzz
    Loaded 100000 keys at P:\test\406836-3.pl line 12.
    seq:00001 (100000)
      1 @ 69364 ( 14+ 69350) with fuzziness: 0
      1 @ 24886 ( 11+ 24875) with fuzziness: 0
      1 @ 40056 ( 6+ 40050) with fuzziness: 0
      1 @ 68870 ( 20+ 68850) with fuzziness: 0
      1 @ 3264 ( 14+ 3250) with fuzziness: 0
      1 @ 8744 ( 19+ 8725) with fuzziness: 0
      1 @ 7493 ( 18+ 7475) with fuzziness: 0
      1 @ 28209 ( 9+ 28200) with fuzziness: 0
      1 @ 91337 ( 12+ 91325) with fuzziness: 0
      1 @ 63018 ( 18+ 63000) with fuzziness: 0
      1 @ 61025 ( 0+ 61025) with fuzziness: 0
      1 @ 32114 ( 14+ 32100) with fuzziness: 0
      1 @ 30461 ( 11+ 30450) with fuzziness: 0
      1 @ 59174 ( 24+ 59150) with fuzziness: 0
      1 @ 74084 ( 9+ 74075) with fuzziness: 0
      1 @ 58322 ( 22+ 58300) with fuzziness: 0
      1 @ 78465 ( 15+ 78450) with fuzziness: 0
      1 @ 56190 ( 15+ 56175) with fuzziness: 0
      1 @ 14968 ( 18+ 14950) with fuzziness: 0
      1 @ 31986 ( 11+ 31975) with fuzziness: 0
      1 @ 60748 ( 23+ 60725) with fuzziness: 0
      1 @ 93369 ( 19+ 93350) with fuzziness: 0
      1 @ 6242 ( 17+ 6225) with fuzziness: 0
      1 @ 15282 ( 7+ 15275) with fuzziness: 0
      1 @ 13293 ( 18+ 13275) with fuzziness: 0
    Processed 1 sequences at P:\test\406836-3.pl line 57, <SEQ> line 1.
    Average length: 100000 at P:\test\406836-3.pl line 58, <SEQ> line 1.
    [ 7:28:22.37] P:\test\demerphq>

    [ 8:36:32.71] P:\test\demerphq> ..\406836-3 Fuzz-Words-W0025-S1000-WC100000-SC0010.fuzz Fuzz-Strings-W0025-S1000-WC100000-SC0010.fuzz
    Loaded 100000 keys at P:\test\406836-3.pl line 12.
    seq:00001 (01000)
      1 @ 94 ( 19+ 75) with fuzziness: 0
      1 @ 692 ( 17+ 675) with fuzziness: 0
      1 @ 326 ( 1+ 325) with fuzziness: 0
      1 @ 35 ( 10+ 25) with fuzziness: 0
      1 @ 826 ( 1+ 825) with fuzziness: 0
      1 @ 598 ( 23+ 575) with fuzziness: 0
      1 @ 860 ( 10+ 850) with fuzziness: 0
      1 @ 489 ( 14+ 475) with fuzziness: 0
      1 @ 370 ( 20+ 350) with fuzziness: 0
      1 @ 745 ( 20+ 725) with fuzziness: 0
      1 @ 297 ( 22+ 275) with fuzziness: 0
      1 @ 415 ( 15+ 400) with fuzziness: 0
      1 @ 119 ( 19+ 100) with fuzziness: 0
      1 @ 957 ( 7+ 950) with fuzziness: 0
      1 @ 646 ( 21+ 625) with fuzziness: 0
      1 @ 779 ( 4+ 775) with fuzziness: 0
    Processed 1 sequences at P:\test\406836-3.pl line 57, <SEQ> line 1.
    Average length: 1000 at P:\test\406836-3.pl line 58, <SEQ> line 1.
    [ 8:36:54.42] P:\test\demerphq>

    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      As far as I can tell this code fails the "A" x 10/"A" x 11 test that you challenged my code with in a different thread. I was able to fix the bug by changing the if ($fuz <= $FUZZY) test to the following:

      if( $fuz <= $FUZZY and $offset1 + $offset2 + $keyLen <= $seqLen ) {

      I have put together a test harness and framework for developing Fuzzy::Matcher implementations, which I will post in a new thread (when I get sufficient time), as IMO this one has become too polluted with acrimony to be worth continuing. Your code, as massaged to fit into this framework (along with the baseclass), is in the following readmore. Note that the test harness monitors memory utilization post-prepare(), which is why the default prepare() is the way it is (to reduce memory overhead).

      ---
      demerphq

        ... this code fails the "A" x 10/"A" x 11 test...

        Yes. As I pointed out to ysth offline a couple of days ago, if the sequence lengths are not an exact multiple of the key length, then the code produces some extra, erroneous matches.

        I've updated the post and code above to note and correct that.

        The math you do to correct this is the same as I had, but your positioning of it (in the if condition) is not. Once the math determines that the end of the sequence has been reached, there is no point in allowing the loop to continue: it simply produces further bad matches and wastes time. The conditional last statement in the corrected code above does this.

        A few further comments re the test harness.

        1. If you intend to compare my code against other algorithms that rely upon all the keys in any given test having the same length, as your original does, and as the version I've seen of ysth's extremely clever XOR/hash-index hybrid does, then for a fair comparison I should supply you with an updated version of mine that also relies upon this assumption.

          As coded above, the algorithm recalculates various pieces of information derived from the length of the keys on each iteration of the key-processing loop. If the keys are of invariant length, and the other algorithms make this assumption, then that code need only be executed once, rather than scalar @keys (here 100,000) times, and so should be lifted out of the for my $key ( @keys ) loop.

        2. I applaud that you are avoiding producing a big array of lots of little arrays by stacking the recorded information (offset/fuzziness/key matched) sequentially.

          However, returning the results as a list, rather than by reference, seems a retrograde step, as it implies replicating the results array at least once, with all the memory-allocation and time costs that implies; from my brief perusal of the test harness, those costs will be incorporated into the timings.

          Probably pretty even for all algorithms on any given run, but imposing a disproportionate overhead on the quicker algorithms in any high-density tests.

          Whilst I have had the same results as ysth reported, namely that matches are sparsely distributed in these randomly generated sequences, in a generic fuzzy-match application unrelated to the OP's problem this may not be the case. So any comparisons should include high-density datasets as well.

          My algorithm does particularly well in this regard relative to others, so this is me watching out for number one here, but I think a generic solution should cater for this scenario.

        3. Couldn't you record the index of the key, rather than its value?
        4. Wouldn't either packing or stringifying the 3 recorded values reduce the overhead of the results array by two-thirds and make for easier comparison? (A sketch of points 1, 3 and 4 follows this list.)
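        A minimal sketch of points 1, 3 and 4 combined (illustrative names and toy data, not the posted code): values derived from the key length are computed once, assuming fixed-length keys, and each match is stored as a packed record of key index, offset and fuzziness:

            use strict;
            use warnings;

            our $FUZZY = 2;
            my @keys = ( 'AGGCGGAAGCACCCAACAGCAACAG' );    # toy key list
            my $seq  = 'TTAGGCGGAAGCACCCAACAGCAACAGTT';    # toy sequence

            my $keyLen  = length $keys[ 0 ];   # hoisted: all keys one length
            my $results = '';                  # packed records, not an AoA

            for my $i ( 0 .. $#keys ) {
                for my $offset ( 0 .. length( $seq ) - $keyLen ) {
                    my $fuz = $keyLen
                        - ( ( $keys[ $i ] ^ substr( $seq, $offset, $keyLen ) )
                            =~ tr[\0][\0] );
                    $results .= pack 'Nnn', $i, $offset, $fuz if $fuz <= $FUZZY;
                }
            }

            # Each record unpacks to ( keyIndex, offset, fuzziness ).
            print join( ' ', unpack '(Nnn)*', $results ), "\n";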

        I agree that further investigations of this should be done in a new thread, divorced from association with some parts of this one. I'll let you start it, and I'll post a version of the latest incarnation of my code there, adapted to your test harness, once I see what changes, if any, occur in the test harness that I need to adapt to.


        Examine what is said, not who speaks.
        "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
        "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon
