Re^5: String Comparison & Equivalence Challenge (tf-idf)
by LanX (Sage) on Mar 14, 2021 at 16:10 UTC
It only looks complicated because the Wikipedia article lists multiple variants of both tf and idf, each tuned for a different use case.
But the explanation is good, and there are plenty more articles on the web.
The basic idea is simple:
For each search term, say God, you calculate tf(God) for every "document" and multiply it by idf(God), which is precalculated once for your whole "corpus".
tf-idf(term, doc) = tf(term, doc) * idf(term, corpus)
God is a very frequent term, hence its idf will be low. Gomorrah is far less frequent, hence its idf will be high (near 1 if you use an idf variant scaled to the 0..1 range). A document with no mention of God will have tf(God) = 0, and therefore a tf-idf of 0 for that term.
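To make the God/Gomorrah example concrete, here is a minimal sketch of one possible tf and idf. Everything in it is my assumption for illustration: the four toy "documents", the helper names tokens, tf and idf, and the choice of variants (tf as relative frequency, idf as plain log(N / n_t)). With the plain logarithm idf is not capped at 1, so only the ordering (God low, Gomorrah high) carries over directly.

```perl
use strict;
use warnings;

# Toy corpus: each "document" is just a string (made-up example data).
my %corpus = (
    doc1 => 'god created the heaven and the earth',
    doc2 => 'god destroyed sodom and gomorrah',
    doc3 => 'the earth was without form',
    doc4 => 'and god said let there be light',
);

# Split a document into lowercase word tokens.
sub tokens {
    my ($doc) = @_;
    return grep { length $_ } split /\W+/, lc $doc;
}

# tf: share of the document's tokens that are equal to $term.
sub tf {
    my ($term, $doc) = @_;
    my @tok = tokens($doc);
    return 0 unless @tok;
    my $hits = grep { $_ eq $term } @tok;
    return $hits / @tok;
}

# idf: log(N / n_t), where N is the number of documents and
# n_t the number of documents that contain $term.
sub idf {
    my ($term, $corpus) = @_;
    my $n   = keys %$corpus;
    my $n_t = grep { tf($term, $_) > 0 } values %$corpus;
    return 0 unless $n_t;    # unknown term: no contribution
    return log($n / $n_t);
}

printf "idf(%s) = %.4f\n", $_, idf($_, \%corpus) for qw(god gomorrah);
printf "tf(god, doc3) = %.4f\n", tf('god', $corpus{doc3});
```

In a real application you would compute idf once per corpus and cache it, rather than rescanning all documents on every call as this sketch does.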
Summing over all search terms gives the rank of a document:
$rank += tf_idf($_, $doc) foreach @terms;
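The ranking step could be sketched roughly like this. It is only an illustration under my own assumptions: the toy corpus, the two-term query, the hash names and the tf_idf helper are all made up, tf is relative frequency and idf is plain log(N / n_t). The idf values are precalculated once for the corpus, as described above.

```perl
use strict;
use warnings;

# Toy corpus (made-up example data).
my %corpus = (
    doc1 => 'god created the heaven and the earth',
    doc2 => 'god destroyed sodom and gomorrah',
    doc3 => 'the earth was without form',
    doc4 => 'and god said let there be light',
);

# One pass: term counts and length per document, document frequency per term.
my (%count, %length, %df);
for my $name (keys %corpus) {
    my @tok = grep { length $_ } split /\W+/, lc $corpus{$name};
    $length{$name} = @tok;
    $count{$name}{$_}++ for @tok;
    $df{$_}++ for keys %{ $count{$name} };
}

# Precalculate idf once for the whole corpus.
my $n_docs = keys %corpus;
my %idf;
$idf{$_} = log($n_docs / $df{$_}) for keys %df;

# tf-idf(term, doc) = tf(term, doc) * idf(term, corpus)
sub tf_idf {
    my ($term, $name) = @_;
    return 0 unless $length{$name} && $idf{$term};
    return ($count{$name}{$term} // 0) / $length{$name} * $idf{$term};
}

# Rank every document by the sum of tf-idf over all search terms.
my @terms = qw(god gomorrah);
my %rank;
for my $name (keys %corpus) {
    $rank{$name} += tf_idf($_, $name) foreach @terms;
}

# Best match first; ties are broken by document name.
my @ranked = sort { $rank{$b} <=> $rank{$a} || $a cmp $b } keys %rank;
printf "%-5s %.4f\n", $_, $rank{$_} for @ranked;
```

With this toy data doc2 comes out on top because it is the only document containing the rare term, while doc3 mentions neither term and scores 0.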
Tf-idf is a cornerstone of NLP, and many search engines use it or one of its variants.
The model is simple, robust and will lead quickly to good results. But you may need to adjust it to your needs for better results.