http://qs321.pair.com?node_id=406836

Itatsumaki has asked for the wisdom of the Perl Monks concerning the following question:

Hello,

I'm well aware that fuzzy matching (e.g. Levenshtein distance and String::Approx) has been discussed here several times. I need to do a very large quantity of fuzzy matching, and I'm looking for some theoretical suggestions on how to start developing. This is meant to be design-time optimization, not premature optimization, I hope.

The basic problem:
Matching a 25-letter string against a dictionary of about 30 thousand variable-length sequences. I need to find all:

  * exact matches
  * fuzzy-by-one matches (one mismatched letter)
  * fuzzy-by-two matches (two mismatched letters)

within the dictionary.

Performance matters because I have a library of several hundred thousand 25-letter strings, and because I'll be repeating this analysis for all pairwise combinations of about 10 search and 20 target dictionaries.

Three basic approaches come to mind for the fuzzy-matching:

  1. Dynamically construct 25 separate regexes, one for each possible mismatch position. For instance:

        my @letters = split //, $string;
        my @regexes;
        for my $i (0 .. $#letters) {        # regex $i tolerates a mismatch at position $i
            for my $j (0 .. $#letters) {
                $regexes[$i] .= ($i == $j) ? '.' : quotemeta $letters[$j];
            }
        }

     I'm sure that can be tuned substantially; I'm just trying to give the general idea of what I had in mind. (A usage sketch follows this list.)
  2. On the other hand, because I know that my search sequence will always be 25 letters long, I know that every fuzzy-by-one match must contain a 12-letter exact match, and every fuzzy-by-two match must contain at least an 8-letter exact match, so I could index the large library for all possible 12-mers. I would then break each search string into its 14 possible 12-mers and extend on either side looking for fuzzy matches as above. The advantage of this approach is that I would only be searching regions of the library that have a theoretical chance of matching my string. (See the sketch after this list.)
  3. Finally, since I have large libraries of both search strings (25 letters) and target strings (variable length), it might be advantageous to just parse out all the possible 25-letter substrings from the target library once. Once I have this index, identifying exact or fuzzy matches would be trivial. (See the sketch after this list.)
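
For option #1, here is a rough usage sketch of what I have in mind (untested): combine the 25 position-specific regexes into a single alternation and scan each target sequence with it. The @targets array is just a placeholder for the dictionary.

    my $alt      = join '|', @regexes;
    my $fuzzy_re = qr/$alt/;
    for my $seq (@targets) {
        while ($seq =~ /$fuzzy_re/g) {
            my $start = pos($seq) - 25;     # pos() sits at the end of the 25-letter match
            print "fuzzy hit at offset $start\n";
            pos($seq) = $start + 1;         # back up so overlapping hits aren't skipped
        }
    }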
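For option #2, here is a rough sketch of the 12-mer index (untested; substitution-only mismatches, and %kmer_index, count_mismatches(), @targets, and @queries are placeholder names, with toy data only so the snippet runs). One caveat: a 12-mer seed is only guaranteed to find fuzzy-by-one matches; guaranteeing every fuzzy-by-two match would need 8-mer seeds, per the pigeonhole argument above.

    use strict;
    use warnings;

    my @targets = ('ACGTACGTACGTACGTACGTACGTACGTACGT');    # toy dictionary
    my @queries = ('ACGTACGTACGTACGTACGTACGTA');           # one 25-letter search string

    my $K = 12;

    # index every 12-mer of every target: 12-mer => [ [seq_id, offset], ... ]
    my %kmer_index;
    for my $id (0 .. $#targets) {
        my $seq = $targets[$id];
        for my $off (0 .. length($seq) - $K) {
            push @{ $kmer_index{ substr($seq, $off, $K) } }, [ $id, $off ];
        }
    }

    # Hamming distance between equal-length strings: XORing them leaves
    # a NUL byte wherever they agree, so count the non-NUL bytes
    sub count_mismatches {
        my ($x, $y) = @_;
        return ( ($x ^ $y) =~ tr/\0//c );
    }

    for my $query (@queries) {
        my %seen;
        for my $qoff (0 .. 25 - $K) {       # the 14 possible 12-mers of the query
            my $seed = substr($query, $qoff, $K);
            for my $hit (@{ $kmer_index{$seed} || [] }) {
                my ($id, $off) = @$hit;
                my $start = $off - $qoff;   # where the full 25-mer would sit
                next if $start < 0 or $start + 25 > length $targets[$id];
                next if $seen{"$id:$start"}++;    # check each candidate only once
                my $d = count_mismatches($query, substr($targets[$id], $start, 25));
                print "seq $id, offset $start: $d mismatch(es)\n" if $d <= 2;
            }
        }
    }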
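And for option #3, the pre-parsed index could be as simple as a hash of every 25-letter window (again untested; %window_index is a placeholder name). An exact match then costs a single hash lookup per query, and the fuzzy regexes from option #1 could be run over the hash keys instead of the raw sequences.

    # index every 25-letter window of the targets once
    my %window_index;                   # 25-mer => [ [seq_id, offset], ... ]
    for my $id (0 .. $#targets) {
        my $seq = $targets[$id];
        for my $off (0 .. length($seq) - 25) {
            push @{ $window_index{ substr($seq, $off, 25) } }, [ $id, $off ];
        }
    }

    # exact matches for a 25-letter $query
    my @exact_hits = @{ $window_index{$query} || [] };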

So you can see why I'm looking for suggestions now -- only the first of these three approaches works on single strings. The performance of the other two won't scale with the search and target library sizes in any trivial way, and I'm not too sure how to figure out which one will be optimal for any given situation.

Any suggestions on how to determine which might be optimal for a given situation? And are there other (potentially faster) approaches I might have missed?

One note: I *am* aware that BLAST can do option #2 above -- right now I'm trying to determine which option will be fastest to run, not which is easiest to implement.

-Tats