note
LanX
It only looks complicated because the wp-article lists multiple options for both <C>tf</C> and <C>idf</C> in order to adjust for different use cases.<P>
But the explanation is good and there are plenty more articles on the web. <P>
The basic idea is simple:<P>
For each search term like <C>God</C> you calculate <C>tf(God)</C> for each "document" and multiply it by the globally precalculated <C>idf(God)</C> of your "corpus". <P>
<C>Tf-idf (term,doc) = tf (term,doc) * idf (term,corpus)</C><P>
<C>God</C> is a very frequent term, hence its idf will be low.
<C>Gomorrah</C> is far less frequent, hence its idf will be high. A document with no mention of God will have
<C>tf(God) = 0</C>, so its tf-idf for that term is 0 as well. <P>
Here:
<UL>
<LI> Doc = verse
<LI> Corpus = Bible
</UL>
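A minimal sketch of the calculation described above, under some assumptions of mine: the verses are made-up toy text, <C>tf</C> is the raw count divided by the verse length, and <C>idf</C> is the plain <C>log(N/df)</C> variant. The names <C>tf</C>, <C>tf_idf</C>, <C>%idf</C> and <C>@docs</C> are only illustrative; the wp-article lists other weightings you may prefer. <P>

```perl
use strict;
use warnings;

# Toy corpus: each "document" is one verse (made-up text, for illustration only).
my @verses = (
    'god created the heaven and the earth',
    'and god said let there be light',
    'the lord rained fire upon sodom and gomorrah',
    'and god saw the light',
);

# Split each verse into lowercase terms.
my @docs = map { [ split ' ', lc $_ ] } @verses;

# Document frequency: in how many verses does each term occur at least once?
my %df;
for my $doc (@docs) {
    my %seen;
    $df{$_}++ for grep { !$seen{$_}++ } @$doc;
}

# idf(term, corpus) = log( N / df(term) ), precalculated once for the whole corpus.
my $n   = @docs;
my %idf = map { ( $_ => log( $n / $df{$_} ) ) } keys %df;

# tf(term, doc) = occurrences of term / number of terms in doc (one variant of many).
sub tf {
    my ( $term, $doc ) = @_;
    return 0 unless @$doc;
    my $count = grep { $_ eq $term } @$doc;
    return $count / @$doc;
}

# tf-idf(term, doc) = tf(term, doc) * idf(term, corpus); unknown terms score 0.
sub tf_idf {
    my ( $term, $doc ) = @_;
    return tf( $term, $doc ) * ( $idf{$term} // 0 );
}

printf "idf(%s) = %.3f\n", $_, $idf{$_} for qw(god gomorrah);
printf "tf-idf(god, verse %d) = %.3f\n", $_, tf_idf( 'god', $docs[$_] ) for 0 .. $#docs;
```

In this toy corpus the frequent term gets the lower idf, and the verse that never mentions God gets a tf-idf of 0 for it. <P>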
A <I>ranking function</I> will combine the tf-idf for all relevant terms, e.g. most trivially by summation <P>
<C>$rank += tf_idf($_, $doc) foreach @terms</C><P>
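The summation above could be fleshed out roughly like this. It is only a sketch: the verse ids, the toy text and the helper names <C>tf_idf</C> and <C>rank</C> are hypothetical, and the weighting is the plain <C>count/length * log(N/df)</C> variant. <P>

```perl
use strict;
use warnings;

# Hypothetical toy data: verse id => list of terms.
my %doc = (
    v1 => [qw(god created the heaven and the earth)],
    v2 => [qw(the lord rained fire upon sodom and gomorrah)],
    v3 => [qw(and god saw the light)],
);

# Document frequency and idf per term (plain log(N/df) variant).
my %df;
for my $terms ( values %doc ) {
    my %seen;
    $df{$_}++ for grep { !$seen{$_}++ } @$terms;
}
my $n   = keys %doc;
my %idf = map { ( $_ => log( $n / $df{$_} ) ) } keys %df;

# tf-idf of one term in one verse; unknown terms and empty verses score 0.
sub tf_idf {
    my ( $term, $terms ) = @_;
    return 0 unless @$terms;
    my $count = grep { $_ eq $term } @$terms;
    return $count / @$terms * ( $idf{$term} // 0 );
}

# Most trivial ranking function: sum the tf-idf over all query terms.
sub rank {
    my ( $terms, @query ) = @_;
    my $rank = 0;
    $rank += tf_idf( $_, $terms ) foreach @query;
    return $rank;
}

my @query = qw(god gomorrah);
my %score = map { ( $_ => rank( $doc{$_}, @query ) ) } keys %doc;

# Best match first; ties are broken by verse id.
for my $id ( sort { $score{$b} <=> $score{$a} || $a cmp $b } keys %score ) {
    printf "%s %.3f\n", $id, $score{$id};
}
```

With this data the verse mentioning the rare term <C>gomorrah</C> should come out on top, ahead of the two verses that only match the more common <C>god</C>. <P>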
Tf-idf is a cornerstone of [wp://NLP]; the majority of search engines use it.<P>
The model is simple and robust, and it will quickly lead to good results. But you may need to adjust it to your needs for better results. <P>
<div class="pmsig"><div class="pmsig-708738">
<!--nowiki--><p>Cheers Rolf<br>
<sub>(addicted to the Perl Programming Language :)
<br> <i> [id://1153804|Wikisyntax for the Monastery]</i>
</sub>
<!--nowiki-->
</div></div><!-- Wiki2Monks {"version":1.16} -->
11129602
11129609