Re: Fuzzy Searching: Optimizing Algorithm Selection

Update:Please dont forget to note the sentence at the end which points out that I got muddled and was thinking of only when the string was matched at the start. Sorry.

I might approach the problem by first converting my search string into all of the possible other strings that I will consider to be a fuzzy match. Then build a trie to represent this structure. Then use the trie to search the dictionaries. (You may want to code the Trie in Inline::C.) This will outperform regexes in my experience (Trie lookup is pretty well optimal, O(KeyLen)~O(1),). If i had to do these searches a lot and the size of my alphabet was reasonable, i would look at producing fingerprints of my dictionary (ie, all words with the same letters in them (regardless of counts) would have the same fingerprint"). Assuming you have a fast way to get the list associated with a given fingerprint you should have cut your search space vastly as a single XOR against the fingerprint will tell you if a match is possible. Even if the alphabet was small (like ACTG) you could use tuples for your fingerprinting algorithm, so for instance assuming you use an integer for the fingerprint you could assign one bit to each possible combination.

I could probably put together an implementation that worked on a base 10 alphabet, (ie Numeric Digits) and searched a 30 million record dictionary for matches if you provided a routine to generate the match list. Unfortunately i couldnt show you the code, but the results may tell you if the techniques were appropriate.

Incidentally afaict the upper time of this approach would be 25 * N where N is the number of words in your dictionary. Assuming you fingerprint then the time will be much lower than that. Thus the algorithm would be O(N). Which im pretty sure is faster than anything else posted here.

An alternative approach would be to store all of your dictionaries in one great big trie and then reverse the process, looking up each fuzzy word in the trie and recording the words matched. This would only be suitable if the density of the data stored in the dictionaries was high and/or memory was abundant, but would have the advantage of being almost instantaneous to search. Assuming that 1 word had 300 fuzzy equivelents the structure should be searchable in 300*25 timesteps regardless of dictionary size.

Gah, I jsut realized that the above only applies if you need to match from the front. If you need to match from any position in the dictionary words then you need to go beyond a trie.

---
demerphq

Comment on Re: Fuzzy Searching: Optimizing Algorithm Selection


Problems? Is your data what you think it is?
	PerlMonks