in thread String Comparison & Equivalence Challenge
But the explanation is good, and there are plenty more articles on the web.
The basic idea is simple:
For each search term, such as God, you calculate tf(God) for every other "document" and multiply it by the globally precalculated idf(God) of your "corpus".
tf-idf(term, doc) = tf(term, doc) * idf(term, corpus)
God is a very frequent term, hence its idf will be low. Gomorrah is far less frequent, hence its idf will be high, near 1. A document with no mention of God will have tf(God) = 0, so that term adds nothing to its score.
Here:
- Doc = verse
- Corpus = Bible
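The rank of a verse is then the sum over all search terms (note that the sub needs an underscore, since tf-idf would parse as a subtraction):
$rank += tf_idf($_) foreach @terms;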
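To make the idea concrete, here is a minimal sketch of the whole ranking on a toy corpus. Everything in it is an assumption for illustration: the four "verses" are made-up placeholder strings rather than actual Bible text, the sub names (tokens, tf, idf, tf_idf, rank) are hypothetical, tf is taken as the relative frequency within the verse, and idf uses the normalised variant log(N/df) / log(N), which is just one of several common choices and happens to stay between 0 and 1.

```perl
use strict;
use warnings;

# Toy "corpus": each entry plays the role of one verse (document).
# Made-up placeholder strings, not real Bible text.
my @verses = (
    'god created the heaven and the earth',
    'god said let there be light',
    'fire fell on sodom and gomorrah',
    'and god saw the light',
);

# Split a string into lowercase word tokens.
sub tokens {
    my ($text) = @_;
    return lc($text) =~ /\w+/g;
}

# Document frequency: in how many verses does each term occur?
my %df;
for my $verse (@verses) {
    my %seen;
    $df{$_}++ for grep { !$seen{$_}++ } tokens($verse);
}

# Normalised idf in [0, 1]: log(N / df) / log(N).
# Assumption: one common variant; terms unknown to the corpus get 0.
sub idf {
    my ($term) = @_;
    my $n = @verses;
    return 0 unless $df{$term};
    return log( $n / $df{$term} ) / log($n);
}

# Term frequency: share of the verse's tokens that equal the term.
sub tf {
    my ($term, $verse) = @_;
    my @tok = tokens($verse);
    return 0 unless @tok;
    return scalar( grep { $_ eq $term } @tok ) / @tok;
}

sub tf_idf {
    my ($term, $verse) = @_;
    return tf($term, $verse) * idf($term);
}

# Rank of a verse = sum of tf-idf over all search terms.
sub rank {
    my ($verse, @terms) = @_;
    my $rank = 0;
    $rank += tf_idf($_, $verse) foreach @terms;
    return $rank;
}

my @terms  = qw(god gomorrah);
my @ranked = sort { rank($b, @terms) <=> rank($a, @terms) } @verses;
printf "%.3f  %s\n", rank($_, @terms), $_ for @ranked;
```

In this toy corpus "god" appears in three of the four verses, so its idf is low, while "gomorrah" appears in only one and gets the maximum idf of 1. The single verse mentioning it therefore ranks first. For a real corpus you would precompute %df once and probably add stop-word filtering or stemming.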
Tf-idf is a cornerstone of NLP; the majority of search engines use it in some form.
The model is simple, robust and will lead quickly to good results. But you may need to adjust it to your needs for better results.
Cheers Rolf
(addicted to the Perl Programming Language :)