comment on

The best approach may simply be a straight string scan fail fast approach. You can get this far more optimized than the RE engine could manage. It should certainly be in C for speed.

Breaking the dictionary into all 25 char long patterns serves no useful purpose - it simply increases the disk I/O by a factor of 25x. Ideally you would want the entire dictionary in memory anyway so you totally avoid any disk I/O bottlenecks.

The type of indexing you may or may not be able to do depends on the data. You can form a useful index based on the fact, that as you note, even an off by 2 match must have at least 7 consecutive matching chars - in fact I beleive it is impossible to have less than 8 as 25-2 = 23 Thus the worst case min lengths possible are 7+8+8=23 so you must have at least 2 8+ char stings that match.

If you only have 2 possible values in your strings the probability of randonly finding 8 consecutive chars in a given order is 2^8 or 1/256, If you are talking 4 chars ie CGAT then it is 1/65536. This should dramatically shrink the search space as you only need to scan 17 chars upstream and downstream of this sequence. Each search string can be broken into 18 8 char possibles. The problem with this may be (assuming CGAT alphabet) and 100 search strings you will have nearly 1800 unique index strings giving an average spacing in the search string of 65536/1800 ~ 36 chars. Now we have to search 17+8+17 = 42 chars so in a nutshell you still probably have to scan the whole search string every time. The advantage is limited to having to perform the scan on a limited subset of the find strings. There is of course a price for the hashing and lookup. If the alphabet (A) is larger than 4 chars then you should see very significant benefits due the to relative scarcity of A^8 substrings. I would quite possibly be worth precomputing the hash values for all the 8 char strings in your find an search spaces to speed your many to many search.

By way of the worst case scenario here is a brute force fail fast

my @find = map { [ split // ] }qw ( ABCDE FGHJJ KLMMM );
my $find_len = 5;
my $fuzzy    = 2;

while( my $search = <DATA>) {
    chomp $search;
    # this should be in C with pointers for speed, not messing with pe
+rl arrays
    my $search = [split //, $search];
    for my $i ( 0..@$search-$find_len ) {
        FIND:
        for my $find ( @find ) {
            my $misses = 0;
            for my $j ( 0..$find_len-1 ) {
                $misses++ if $search->[$i+$j] ne $find->[$j];
                next FIND if $misses > $fuzzy;
            }
            print "Str $. Match ($misses) miss for @$find at $i\n";
        }
    }
}  


__DATA__
ABCDEFGHIJKLMNOPQRSTUVWXYZ
AAABBCDEFGHIJKLMN
[download]

cheers

tachyon

In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by tachyon
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Are you posting in the right place? Check out Where do I post X? to know for sure.
Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
<code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
Want more info? How to link or How to display code and escape characters are good places to start.


go ahead... be a heretic
	PerlMonks