in reply to Re^4: String Comparison & Equivalence Challenge
in thread String Comparison & Equivalence Challenge

It only looks complicated because the Wikipedia article lists multiple options for both tf and idf in order to adjust for different use cases.

But the explanation is good, and there are plenty more articles on the web.

The basic idea is simple:

For each search term like God, you calculate tf(God) for every "document" and multiply it by the globally precalculated idf(God) of your "corpus".

tf-idf(term, doc) = tf(term, doc) * idf(term, corpus)

God is a very frequent term, hence its idf will be low. Gomorrah is far less frequent, hence its idf will be high (near 1 if you pick an idf variant scaled to the range 0 to 1). A document with no mention of God will have tf(God) = 0, so its tf-idf for God is 0 as well.
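To make that concrete, here is a minimal sketch, not a definitive implementation. It assumes one particular pair of variants from the many the article lists: tf as the relative frequency of the term within the document, and idf as log(N / df), which is not bounded by 1. The toy corpus and the sub names tf, idf and tf_idf are made up for illustration.

```perl
use strict;
use warnings;

# Toy corpus, invented for illustration: each "document" is a list of
# already lowercased words.
my %corpus = (
    doc1 => [qw(god created the heaven and the earth)],
    doc2 => [qw(god destroyed sodom and gomorrah)],
    doc3 => [qw(the god of abraham)],
    doc4 => [qw(the earth was without form)],
);

# tf: relative frequency of the term in one document (one variant of many).
sub tf {
    my ($term, $words) = @_;
    my $hits = grep { $_ eq $term } @$words;
    return @$words ? $hits / @$words : 0;
}

# idf: log(N / df), where df is the number of documents containing the term.
# A term that occurs nowhere gets 0 here (an assumption of this sketch).
sub idf {
    my ($term, $corpus) = @_;
    my $n  = keys %$corpus;
    my $df = grep { tf($term, $_) > 0 } values %$corpus;
    return $df ? log($n / $df) : 0;
}

# tf-idf(term, doc) = tf(term, doc) * idf(term, corpus)
sub tf_idf {
    my ($term, $doc, $corpus) = @_;
    return tf($term, $corpus->{$doc}) * idf($term, $corpus);
}

printf "idf(%s) = %.3f\n", $_, idf($_, \%corpus) for qw(god gomorrah);
printf "tf-idf(god, %s) = %.3f\n", $_, tf_idf('god', $_, \%corpus)
    for sort keys %corpus;
```

With this corpus god occurs in three of four documents and gomorrah in only one, so idf(gomorrah) comes out larger than idf(god), and doc4, which never mentions god, gets a tf-idf of 0 for it.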


A ranking function will combine the tf-idf values for all relevant terms, e.g. most trivially by summation:

$rank += tf_idf($_, $doc) for @terms;
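Spelled out as a runnable sketch, ranking by summation could look like the following. Everything here is an assumption for illustration: the toy corpus, the sub name rank_doc, tf as relative frequency and idf as log(N / df). The point is only that idf is precalculated once per corpus, while tf is looked up per document.

```perl
use strict;
use warnings;

# Toy corpus, invented for illustration.
my %corpus = (
    doc1 => [qw(god created the heaven and the earth)],
    doc2 => [qw(god destroyed sodom and gomorrah)],
    doc3 => [qw(the god of abraham)],
    doc4 => [qw(the earth was without form)],
);

# Per-document term counts, and per-term document frequency (df).
my (%count, %df);
for my $doc (keys %corpus) {
    $count{$doc}{$_}++ for @{ $corpus{$doc} };
    $df{$_}++ for keys %{ $count{$doc} };
}
my $n_docs = keys %corpus;

# idf is precalculated once per term for the whole corpus.
my %idf;
$idf{$_} = log($n_docs / $df{$_}) for keys %df;

# Sum tf-idf over all query terms; terms unknown to the corpus add nothing.
sub rank_doc {
    my ($doc, @terms) = @_;
    my $len  = @{ $corpus{$doc} };
    my $rank = 0;
    for my $term (@terms) {
        my $tf = ($count{$doc}{$term} // 0) / $len;
        $rank += $tf * ($idf{$term} // 0);
    }
    return $rank;
}

my @query = qw(god gomorrah);
my %score;
$score{$_} = rank_doc($_, @query) for keys %corpus;

# Best match first; ties are broken by document name.
for my $doc (sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score) {
    printf "%-5s %.3f\n", $doc, $score{$doc};
}
```

doc2 comes out on top because it is the only document containing the rare term gomorrah, and doc4 scores 0 because it contains neither query term.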

Tf-idf is a cornerstone of NLP, and the majority of search engines use it in some form.

The model is simple and robust, and it quickly leads to good results. But you may need to adjust it to your needs for better results.

Cheers Rolf
(addicted to the Perl Programming Language :)