PerlMonks  

Well, that depends upon how you define "best match." The first step in tackling any problem is defining it clearly. By best match, do you mean the most letters in common? If so, you could break each string's letters into a hash whose values are the number of occurrences of each letter, then do a foreach loop over the keys and keep a count of the differences. Of course, this doesn't take the order of the letters into account: it would find "notnilC" as matching "Clinton."
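Here is a minimal sketch of that hash approach, just to make the idea concrete. The sub names letter_counts and letter_difference are made up for illustration, and ignoring case and non-letters is my assumption, not a requirement.

```perl
use strict;
use warnings;

# Count how often each letter occurs in a string (case-insensitive,
# non-letters ignored; both choices are assumptions of this sketch).
sub letter_counts {
    my ($string) = @_;
    my %count;
    $count{$_}++ for grep { /[a-z]/ } split //, lc $string;
    return \%count;
}

# Sum, over every letter seen in either string, the difference in counts.
# A result of 0 means the strings use exactly the same letters.
sub letter_difference {
    my ($left, $right) = @_;
    my ($l, $r) = (letter_counts($left), letter_counts($right));
    my %seen = (%$l, %$r);
    my $diff = 0;
    $diff += abs(($l->{$_} // 0) - ($r->{$_} // 0)) for keys %seen;
    return $diff;
}

print letter_difference('Clinton', 'Clintin'), "\n";   # prints 2
print letter_difference('Clinton', 'notnilC'), "\n";   # prints 0
```

The second call shows the weakness mentioned above: a reversed string scores as a perfect match because order is never examined.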

If you do take the hash approach, you might want to consider letter frequency. Since the letter 's' is more frequent than the letter 'q', does that mean that 'said' is a closer match to 'laid' than 'qaid'? (yes, that's a word)
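One possible way to fold letter frequency in, purely as a sketch: weight each mismatched letter by how rare it is in some reference word list, so that a mismatch on a rare letter costs more. The weighting formula (1 minus relative frequency), the sub names, and the tiny word list are all assumptions of mine; whether rare letters should count more or less is exactly the design question you would have to answer.

```perl
use strict;
use warnings;

# Count how often each letter occurs in a string (case-insensitive).
sub letter_counts {
    my ($string) = @_;
    my %count;
    $count{$_}++ for grep { /[a-z]/ } split //, lc $string;
    return \%count;
}

# Hypothetical weighting: 1 minus the letter's share of all letters in
# the reference words, so rare letters get weights close to 1.
sub letter_weights {
    my (@words) = @_;
    my (%count, %weight);
    my $total = 0;
    for my $word (@words) {
        my $c = letter_counts($word);
        for my $ch (keys %$c) {
            $count{$ch} += $c->{$ch};
            $total      += $c->{$ch};
        }
    }
    $weight{$_} = 1 - $count{$_} / $total for keys %count;
    return \%weight;
}

# Like a plain count difference, but each letter's difference is scaled
# by its weight (letters missing from the weight table count as 1).
sub weighted_difference {
    my ($left, $right, $weights) = @_;
    my ($l, $r) = (letter_counts($left), letter_counts($right));
    my %seen = (%$l, %$r);
    my $diff = 0;
    for my $ch (keys %seen) {
        my $delta = abs(($l->{$ch} // 0) - ($r->{$ch} // 0));
        $diff += $delta * ($weights->{$ch} // 1);
    }
    return $diff;
}

my $weights = letter_weights(qw(said laid lull sell qaid));
printf "said vs laid: %.2f\n", weighted_difference('said', 'laid', $weights);
printf "said vs qaid: %.2f\n", weighted_difference('said', 'qaid', $weights);
```

With this particular word list, 'l' is common and 'q' is rare, so 'laid' scores closer to 'said' than 'qaid' does. A real application would want weights drawn from a much larger corpus.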

Also, did you know that String::Approx lets you adjust the number of "edits" it allows? For example, to catch words that differ by at most two edits, you can specify:

use String::Approx 'amatch';
my @catches = amatch("plugh", ['2'], @inputs);
You could set the number of edits to 1 and, if that doesn't return a list to examine, just keep increasing the number of edits until you get something.
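The "keep increasing the edits" loop might look something like the sketch below. To keep it runnable without any CPAN modules, the matcher used in the demo is a hand-rolled whole-string Levenshtein distance, which is a stand-in and not what String::Approx does internally (amatch looks for the pattern approximately anywhere inside each input). The names closest_matches, edit_distance and distance_matcher are hypothetical.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain Levenshtein distance between two whole strings (stand-in matcher).
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Matcher with the calling shape (pattern, edits, inputs). To use
# String::Approx instead, swap in:
#   sub { my ($p, $k, @in) = @_; amatch($p, [$k], @in) }
sub distance_matcher {
    my ($pattern, $edits, @inputs) = @_;
    return grep { edit_distance($pattern, $_) <= $edits } @inputs;
}

# Start at 1 edit and widen until something matches or the limit is hit.
# Returns the edit count that succeeded followed by the matches,
# or an empty list if nothing matched.
sub closest_matches {
    my ($matcher, $pattern, $max_edits, @inputs) = @_;
    for my $edits (1 .. $max_edits) {
        my @catches = $matcher->($pattern, $edits, @inputs);
        return ($edits, @catches) if @catches;
    }
    return;
}

my ($edits, @catches) =
    closest_matches(\&distance_matcher, 'plugh', 3, qw(plover xyzzy plunge));
print "matched at $edits edit(s): @catches\n" if @catches;
```

You will want some upper limit on the number of edits, since past a certain point everything matches everything.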

You may also want to check out Text::Soundex, which encodes words into four-character strings that represent what they "sound" like. Then you can compare the shorter strings. I don't know how reliable this is, and it's only for the English language.
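To show what those four-character codes look like, here is a small pure-Perl sketch of the usual American Soundex rules. It is my own illustration, not the Text::Soundex source, and the sub name soundex_code is made up; in practice you would just call the module's soundex function.

```perl
use strict;
use warnings;

# Digit codes for consonants; vowels, y, h and w have no code.
my %SOUNDEX_CODE;
@SOUNDEX_CODE{qw(b f p v)}         = (1) x 4;
@SOUNDEX_CODE{qw(c g j k q s x z)} = (2) x 8;
@SOUNDEX_CODE{qw(d t)}             = (3) x 2;
$SOUNDEX_CODE{l}                   = 4;
@SOUNDEX_CODE{qw(m n)}             = (5) x 2;
$SOUNDEX_CODE{r}                   = 6;

sub soundex_code {
    my ($word) = @_;
    my @letters = grep { /[a-z]/ } split //, lc $word;
    return '' unless @letters;

    my $first = shift @letters;
    my $out   = uc $first;                  # the first letter is kept as-is
    my $prev  = $SOUNDEX_CODE{$first} // 0;
    for my $ch (@letters) {
        next if $ch eq 'h' || $ch eq 'w';   # h and w do not break a run
        my $code = $SOUNDEX_CODE{$ch} // 0; # vowels reset the run
        $out .= $code if $code && $code != $prev;
        $prev = $code;
    }
    return substr($out . '000', 0, 4);      # pad or truncate to 4 characters
}

print soundex_code($_), " $_\n" for qw(Robert Rupert Clinton Clintin);
```

Robert and Rupert both come out as R163, and Clinton and Clintin both as C453, which is the kind of match you would then compare on.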

A final option to consider is Text::Metaphone, which does phonetic encoding of words. You could then check to see if words sound the same (yeah, I know, this is a longshot). I do not know if it works for languages other than English.

Since you have a "fuzzy" problem, there is not going to be a simple solution, and you will have quite a time working with this, I'm sure. However, it might make a nifty module for CPAN when finished.

Cheers,
Ovid


In reply to (Ovid) Re: Fuzzy Strings by Ovid
in thread Fuzzy Strings by orthanc
