Beefy Boxes and Bandwidth Generously Provided by pair Networks
Syntactic Confectionery Delight
 
PerlMonks  

Re^3: Fuzzy Searching: Optimizing Algorithm ( A few improvements).

by demerphq (Chancellor)
on Dec 08, 2004 at 18:14 UTC ( [id://413284]=note: print w/replies, xml ) Need Help??


in reply to Re^2: Fuzzy Searching: Optimizing Algorithm ( A few improvements).
in thread Fuzzy Searching: Optimizing Algorithm Selection

As far as I can tell this code fails the "A" x 10/"A" x 11 test that you challenged my code with in a different thread. I was able to fix the bug by changing the if ($fuz <= $FUZZY) test to the following:

if( $fuz <= $FUZZY and $offset1+$offset2+$keyLen<=$seqLen) {

I have put together a test harness and framework for developing Fuzzy::Matcher implementations which I will post in a new thread (when i get sufficient time) as IMO this one has become too polluted with acrimony to be worth continuing. Your code as massaged to fit into this framework (along with the baseclass) is in the following readmore. Note that the test harness monitors memory utilization post-prepare() which is why the default prepare() is the way it is (to reduce memory overhead).

Fuzzy::Matcher::Xor2

package Fuzzy::Matcher::Xor2; use strict; use warnings; use Fuzzy::Matcher; use vars qw/$VERSION @ISA/; @ISA=qw(Fuzzy::Matcher); $VERSION=0.01; # Original implementation by [BrowserUk] # bugfixes and modified to fit Fuzzy::Matcher interface by [demerphq] sub fuzz_search { my ($self,$seq)=@_; use bytes; my $FUZZY = $self->{fuzz}; my $seqLen = length $seq; my $keyLen = $self->{strlen}; my ($masked,$pos); my @matches; for my $key ( @{$self->{str_array}} ) { my $mask = $key x ( int( $seqLen / $keyLen ) + 1 ); my $maskLen = length $mask; my $minZeros = chr( 0 ) x int( $keyLen / ( $FUZZY + 1 ) ); my $minZlen = length $minZeros; for my $offset1 ( 0 .. $keyLen-1 ) { $masked = $mask ^ substr( $seq, $offset1, $maskLen ); $pos = 0; while( $pos = 1+index $masked, $minZeros, $pos ) { $pos--; my $offset2 = $pos - ($pos % $keyLen ); my $fuz = $keyLen - ( substr( $masked, $offset2, $keyLen ) =~ tr[\0][\0] ); if( $fuz <= $FUZZY and $offset1+$offset2+$keyLen<=$seqLen) + { push @matches,$offset1 + $offset2,$fuz,$key; #printf "\tFuzzy matched key:'$key' -v- '%s' in line: +" # . "%2d @ %6d (%6d+%6d) with fuzziness: %d\n +", # substr( $seq, $offset1 + $offset2, $keyLen ), # $., , $offset1, $offset2, $fuz; } $pos = $offset2 + $keyLen; } } } return @matches; } 1;

Base Class for Fuzzy::Matcher

package Fuzzy::Matcher; use strict; use warnings; use Carp qw(croak confess); use vars qw/$VERSION/; $VERSION=0.01; # This is a base class for fuzzy matchers to inherit. # Its where stuff that will be common to all matchers # is located. It also defines the interface that all # matchers will have to follow. # # Usage Sample: # # my $matcher=$class->new(2,25); # $matcher->fuzz_store($_) for @words; # $matcher->prepare() # for my $strings (@strings) { # my @res=$matcher->fuzz_search($strings); # } # # ---------------------------------------------------- # Constructor CLASS->new($fuzz,$strlen) # # Takes the amount of fuzz to use for matching # and the length of the strings to be matched. # # Should not be overriden. # sub new { my $class = shift; my $fuzz = shift; my $strlen = shift; my $self=bless { fuzz => $fuzz||0, strlen => $strlen, },$class; croak "Failed build!" unless $self; $self->_init(@_); return $self; } # # $obj->_init() # # This is a hook for subclass to override without # having to override the default object creation # process. It is called in void context before the # object is returned to the user with any args # remaining after the default ($fuzz,$strlen) # # By default it is a No-Op. # sub _init { } # # $obj->fuzz_store($string) # # Store a string into the object for fuzzy matching # later. # # Default behaviour is to build a hash of stored strings # for dupe checking and a corresponding array of strings. # The array is named fuzz_strings and the hash is named # str_hash. # # sub fuzz_store { my ($self,$str)=@_; push @{$self->{str_array}},$str unless $self->{str_hash}{$str}++; } # # $obj->prepare($string) # # If necessary a subclass may define this sub so # that any actions that need to occur after # adding the words but before search starts. # # By default it deletes the str_hash entry from the object to # preserve memory. # sub prepare { my ($self,$str)=@_; delete $self->{str_hash}; } # # $obj->fuzz_search($string) # # Search a string for results and return # a list of matches. The list will be # of triples so that the first match returns: # my ($match_ofs,$chars_diff,$string_matched)= # $obj->fuzz_search($string) # # Must be overriden # sub fuzz_search { confess((caller(0))[3],"() method must be overriden in ". ref($_[0])); } 1;
---
demerphq

Replies are listed 'Best First'.
Re^4: Fuzzy Searching: Optimizing Algorithm ( A few improvements).
by BrowserUk (Patriarch) on Dec 08, 2004 at 23:46 UTC
    ... this code fails the "A" x 10/"A" x 11 test...

    Yes. As I pointed out to ysth offline a couple of days ago, if the sequences are not an exact multiple of the key length, then the code produces some extra, erroneous matches.

    I've update the post and code above to note and correct that.

    The math you do to correct this is the same as I had, but your positioning of it (in the if condition) is not. Once the math determines that the end of the sequence has been reached, there is no point in allowing the loop to continue. It simply produces further bad matches and wastes time. The conditional last statement in the corrected code above does this.

    A few further comments re the test harness.

    1. If you intend comparing my code against other algorithms that rely upon all the keys in any given test having the same length, as your original does, and the version I've seen of ysth's extremely clever Xor/hash index hybrid does, then for a fair comparison, I should supply you with an updated version of mine that also relies upon this assumption.

      As coded above, the algorithm recalculates various pieces of information derived from the length of the keys, on each iteration of the key-processing loop. If the keys will be invarient length, and the other algorithms make this assumption, then that code need only be executed once, rather 100,000 scalar @keys times, and so should be lifted out of the for my $keys ( @keys )... loop.

    2. I applaud that you are avoiding producing a big array of lots of little arrays, by stacking the recorded information (offset/fuzziness/keymatched) sequentially.

      However, returning the results as a list, rather than by reference, seems a retrograde step as this implies replication of the results array at least once, with all the memory allocation and timecosts that implies, which from my breif purusal of the test harness, will be incorporated into the timings.

      Probably pretty even for all algorithms on any given run, but implying a disproportionate overhead on the quicker algorithms on any high density tests.

      Whilst I have had the same results from randomly generated sequences as ysth reported: that matches are sparsely populated in these randomly generated sequences. In a generic fuzzy-match application unrelated to the OPs problem, this may not be the case. So any comparisons should also include high density datasets also.

      My algorithm does particularly well in this regard relative to others, so this is me watching out for number one here, but I think a generic solution should cater for this scenario..

    3. Couldn't you record the index of the key, rather than it's value?
    4. Wouldn't either packing or stringifying the 3 records values reduce the overhead of the results array by 2/3rds and make for easy comparison?

    I agree that further investigations of this should be done in a new thread ,divorced from association with some parts of this one. I'll let you start it and I'll post an, adapted to your testharness, version of the latest incarnation of my code there, once I see what if any changes occur in the test harness that I need to adapt it to.


    Examine what is said, not who speaks.
    "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
    "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
    "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

      The math you do to correct this is the same as I had, but your positioning of it (in the if condition) is not.

      I wasn't able to spend much time in analysing the optimal solution to the bug. Sorry.

      If you intend comparing my code against other algorithms that rely upon all the keys in any given test having the same length, as your original does, and the version I've seen of ysth's extremely clever Xor/hash index hybrid does, then for a fair comparison, I should supply you with an updated version of mine that also relies upon this assumption.

      Feel free to provide an updated version. If I havent started the new thread by the time you are ready to post then do it here. Also, while actually ysths solution and my current solution can both be modified to handle variable length string sets I dont think either of us have the time to actually put the changes into place so I would prefer that we stick to fixed length words. However if you want to post a version supporting variable length words then please treat the size provided to the constructor as the MINIMUM size of any string to be stored and I suppose youll get bonus points or something in the comparison. :-)

      If the keys will be invarient length, and the other algorithms make this assumption, then that code need only be executed once, rather 100,000 scalar @keys times, and so should be lifted out of the for my $keys ( @keys )... loop.

      I think this could either be done in an overloaded prepare() method if it only needs to be done once and should be done before the searching starts but after all the words to match are loaded. Since this is an OO framework you should be able to work out some form of caching pretty easily.

      However, returning the results as a list, rather than by reference, seems a retrograde step as this implies replication of the results array at least once, with all the memory allocation and timecosts that implies, which from my breif purusal of the test harness, will be incorporated into the timings.

      Ive changed the test framework a fair bit I think so wait for that. Regarding the list return I would have thought that since each routine would have basically the same overhead that it wouldnt matter that much. Nevertheless I'll try to switch things over if I have time. Howabout we say that fuzz_search should do a wantarray ? @results : \@results so that if I get the time to change the harness and the running implementations i have whatever you might be working on will take advantage of it?

      Couldn't you record the index of the key, rather than it's value? Wouldn't either packing or stringifying the 3 records values reduce the overhead of the results array by 2/3rds and make for easy comparison?

      I was actually thinking of using packed records as the return, so that each record would be a pack "NNA*",$string but for various reasons I think its not appropriate for these tests. And yes it does make it a lot easier for results comparison. But ive got that in the test harness and not in the object. :-)

      You mentioned returning an index instead of a string which i avoided just for simplicty sakes. I think that the only way that would be workable would be to add an accessor method to the object for converting the number to a string. Which was one reason I avoided it. However if you are ok with that provision then I don't see why not. The base class can be updated to just do a lookup into the str_array, and then your solutions dont need an overloaded method for it at all. Doing the same to mine and ysths might be moderately harder but still doable. Ill work it in. (So the list will be triplets of longs) If we decide we want to use packed results for high density tests we can switch to pack "NNN". (BTW I say 'N' because it has the useful property of sorting lexicographically.)

      I'm hoping I can get the tme to start the new thread this evening. Till then hopefully you have enough info to slot into what i publish when i do.

      ---
      demerphq

        Also, while actually ysths solution and my current solution can both be modified to handle variable length string sets I dont think either of us have the time to actually put the changes into place so I would prefer that we stick to fixed length words.
        FWIW, my solution will lose efficiency with mixed length words.

        I would actually like to see us come up with a module providing all three methods, possibly including variants supporting fixed vs. mixed length (assuming there's at least something that each has as an advantage, of which I'm not sure as regards mine). I'm kind of expecting BrowserUk's to best handle high amounts of fuzz.

        Re: packing the return values; I seriously doubt a pack in pure perl is going to be a net win over just returning multiple elements, so I think that would benefit only the XS solution. It's also a lousy interface for a perl module to use. As far as returning an index goes, I suppose that's possible, but as is my algorithm has no need to keep the array around.

        However if you want to post a version supporting variable length words...

        You misunderstand (or more likely, I wasn't clear), every version I've posted so far quite happily accepts mixed length keys (strings). My point was that if we're getting into hard core performance testing, and I'm up against code that only handles fixed length keys, then I can gain a little by following suit.

        I'll wait and see what the final test framework, and you guys stuff, looks like before going any further.


        Examine what is said, not who speaks.        The end of an era!
        "But you should never overestimate the ingenuity of the sceptics to come up with a counter-argument." -Myles Allen
        "Think for yourself!" - Abigail        "Time is a poor substitute for thought"--theorbtwo         "Efficiency is intelligent laziness." -David Dunham
        "Memory, processor, disk in that order on the hardware side. Algorithm, algorithm, algorithm on the code side." - tachyon

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: note [id://413284]
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others pondering the Monastery: (2)
As of 2024-04-20 03:18 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found