Text::Levenshtein would give you a numerical way of comparing two strings, but you would have to compare the new string against every one of the test strings each time, and it isn't quick.
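
To make that cost concrete, here is a rough sketch of the brute-force scan. The edit distance is written out by hand so the snippet needs only core Perl; Text::Levenshtein's distance() would play the same role. The names edit_distance and closest_phrase and the sample phrases are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Plain dynamic-programming edit distance (insert, delete, substitute
# each cost 1). Hand-rolled here only to avoid a non-core dependency.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length $t);
    for my $i (1 .. length $s) {
        my @cur = ($i);
        for my $j (1 .. length $t) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min($prev[$j] + 1, $cur[$j - 1] + 1, $prev[$j - 1] + $cost);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Brute force: one distance computation per stored phrase, for every
# new phrase. This is the part that gets slow as the DB grows.
sub closest_phrase {
    my ($new, @phrases) = @_;
    my ($best, $best_dist);
    for my $p (@phrases) {
        my $d = edit_distance(lc $new, lc $p);
        ($best, $best_dist) = ($p, $d)
            if !defined $best_dist || $d < $best_dist;
    }
    return ($best, $best_dist);
}

# Hypothetical sample data
my @db_phrases = ('the cat sat on the mat', 'a dog barked loudly');
my ($match, $dist) = closest_phrase('the cat sat on a mat', @db_phrases);
print "closest: '$match' (distance $dist)\n";
```

Each distance call is proportional to the product of the two string lengths, and you pay that once per stored phrase for every new phrase.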

Probably the best way would be to create an inverted index of the words (or preferably the stems) in the DB phrases, and then look up each word (or stem) of the new phrase in this index. This gives you a count of the words each DB phrase has in common with the new phrase. Sort those counts highest first and you have the most likely candidates for your further examination. I don't know of a module that does all of this, but parts of it (the inversion, stemming etc.) could be done with various modules.
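
A minimal sketch of that idea, under some loud assumptions: the "stemmer" just lowercases and strips a trailing "s" (a real stemming module would do far better), the phrases are held in memory rather than in a DB, and the names stems, build_index and candidates are invented for the example.

```perl
use strict;
use warnings;

# Crude stand-in for a real stemmer: lowercase, split into words,
# strip a trailing "s" from longer words, and drop duplicates.
sub stems {
    my ($phrase) = @_;
    my %seen;
    return grep { !$seen{$_}++ }
           map  { my $w = $_; $w =~ s/s$// if length $w > 3; $w }
           (lc($phrase) =~ /[a-z0-9]+/g);
}

# Inverted index: stem => { phrase id => 1 }
sub build_index {
    my (@phrases) = @_;
    my %index;
    for my $id (0 .. $#phrases) {
        $index{$_}{$id} = 1 for stems($phrases[$id]);
    }
    return \%index;
}

# Count the stems each stored phrase shares with the new phrase and
# return [id, count] pairs, highest count first.
sub candidates {
    my ($index, $new_phrase) = @_;
    my %hits;
    for my $stem (stems($new_phrase)) {
        $hits{$_}++ for keys %{ $index->{$stem} || {} };
    }
    return map  { [ $_, $hits{$_} ] }
           sort { $hits{$b} <=> $hits{$a} || $a <=> $b } keys %hits;
}

# Hypothetical sample data
my @db_phrases = (
    'The cat sat on the mat',
    'Dogs chase cats in the park',
    'Stock markets fell sharply',
);
my $index = build_index(@db_phrases);
for my $c (candidates($index, 'a cat chases dogs')) {
    my ($id, $count) = @$c;
    print "$count shared: $db_phrases[$id]\n";
}
```

Only phrases that share at least one stem with the new phrase are ever touched, so the expensive comparison (Levenshtein or whatever else you choose) can be kept for the top few candidates.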

Sounds like a fun project. Good luck :)

Examine what is said, not who speaks.

"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.

In reply to Re: calculate matching words/sentence by BrowserUk
in thread calculate matching words/sentence by anocelot

	PerlMonks