Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Update:Please dont forget to note the sentence at the end which points out that I got muddled and was thinking of only when the string was matched at the start. Sorry.

I might approach the problem by first converting my search string into all of the possible other strings that I will consider to be a fuzzy match. Then build a trie to represent this structure. Then use the trie to search the dictionaries. (You may want to code the Trie in Inline::C.) This will outperform regexes in my experience (Trie lookup is pretty well optimal, O(KeyLen)~O(1),). If i had to do these searches a lot and the size of my alphabet was reasonable, i would look at producing fingerprints of my dictionary (ie, all words with the same letters in them (regardless of counts) would have the same fingerprint"). Assuming you have a fast way to get the list associated with a given fingerprint you should have cut your search space vastly as a single XOR against the fingerprint will tell you if a match is possible. Even if the alphabet was small (like ACTG) you could use tuples for your fingerprinting algorithm, so for instance assuming you use an integer for the fingerprint you could assign one bit to each possible combination.

I could probably put together an implementation that worked on a base 10 alphabet, (ie Numeric Digits) and searched a 30 million record dictionary for matches if you provided a routine to generate the match list. Unfortunately i couldnt show you the code, but the results may tell you if the techniques were appropriate.

Incidentally afaict the upper time of this approach would be 25 * N where N is the number of words in your dictionary. Assuming you fingerprint then the time will be much lower than that. Thus the algorithm would be O(N). Which im pretty sure is faster than anything else posted here.

An alternative approach would be to store all of your dictionaries in one great big trie and then reverse the process, looking up each fuzzy word in the trie and recording the words matched. This would only be suitable if the density of the data stored in the dictionaries was high and/or memory was abundant, but would have the advantage of being almost instantaneous to search. Assuming that 1 word had 300 fuzzy equivelents the structure should be searchable in 300*25 timesteps regardless of dictionary size.

Gah, I jsut realized that the above only applies if you need to match from the front. If you need to match from any position in the dictionary words then you need to go beyond a trie.

---
demerphq


In reply to Re: Fuzzy Searching: Optimizing Algorithm Selection by demerphq
in thread Fuzzy Searching: Optimizing Algorithm Selection by Itatsumaki

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others meditating upon the Monastery: (4)
As of 2024-04-19 21:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found