
Re^5: String Comparison & Equivalence Challenge (tf-idf)

by LanX (Sage)
on Mar 14, 2021 at 16:10 UTC (#11129613)

in reply to Re^4: String Comparison & Equivalence Challenge
in thread String Comparison & Equivalence Challenge

It only looks complicated because the Wikipedia article lists multiple options for both tf and idf in order to adjust for different use cases.

But the explanation is good, and there are plenty more articles on the web.

The basic idea is simple:

For each search term like God, you calculate tf(God) for each "document" and multiply it by the globally precalculated idf(God) of your "corpus".

tf-idf(term, doc) = tf(term, doc) * idf(term, corpus)

God is a very frequent term, hence its idf will be low. Gomorrah is far less frequent, hence its idf will be high (near 1 in variants that scale idf to the 0..1 range). A document with no mention of God will have tf(God) = 0.


  • Docs = verse
  • Corpus = bible
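
To make the two factors concrete, here is a minimal sketch, assuming the plainest variants: raw counts for tf and log(N/df) for idf. With this variant idf is not capped at 1, it only grows as a term gets rarer. The four "verses" are made up for illustration, and the sub names tf and idf are just my choice.

```perl
use strict;
use warnings;

# Toy corpus: each verse is one "document" (texts invented for illustration).
my @verses = (
    'god created the heaven and the earth',
    'god said let there be light',
    'fire fell upon sodom and gomorrah',
    'god saw the light',
);

# Raw term frequency: how often $term occurs in one document.
sub tf {
    my ($term, $doc) = @_;
    return scalar( grep { $_ eq $term } split ' ', $doc );
}

# Inverse document frequency: log(N / number of documents containing $term).
# Returns 0 for a term that occurs nowhere, to avoid dividing by zero.
sub idf {
    my ($term, @corpus) = @_;
    my $df = grep { tf($term, $_) > 0 } @corpus;
    return $df ? log( scalar(@corpus) / $df ) : 0;
}

# The frequent term gets the lower idf, the rare term the higher one.
printf "idf(%s) = %.3f\n", $_, idf($_, @verses) for qw(god gomorrah);

# tf-idf of "god" per verse; a verse without "god" scores 0.
printf "tf-idf(god, verse %d) = %.3f\n",
    $_, tf('god', $verses[$_]) * idf('god', @verses)
    for 0 .. $#verses;
```

In real code you would compute idf once per term for the whole corpus and cache it, instead of rescanning all verses on every call.
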
A ranking function will combine the tf-idf values for all relevant terms, e.g. most trivially by summation:

$rank += tf_idf($_) foreach @terms;
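
Spelled out as a runnable sketch, under the same simple assumptions (raw counts for tf, log(N/df) for idf, summation as the ranking function). The toy verses and the names @tf, %idf and rank are hypothetical, picked only for this example.

```perl
use strict;
use warnings;

# Toy corpus: each verse is one "document" (texts invented for illustration).
my @verses = (
    'god created the heaven and the earth',
    'god said let there be light',
    'fire fell upon sodom and gomorrah',
    'god saw the light',
);

# Count words per verse once: $tf[$i]{$word} is the raw term frequency.
my @tf;
for my $i (0 .. $#verses) {
    $tf[$i]{$_}++ for split ' ', $verses[$i];
}

# Document frequency: in how many verses each word occurs.
my %df;
for my $counts (@tf) {
    $df{$_}++ for keys %$counts;
}

# Globally precalculated idf for the whole corpus: log(N / df).
my %idf;
$idf{$_} = log( scalar(@verses) / $df{$_} ) for keys %df;

# Rank of verse $i: sum of tf-idf over all search terms.
# Terms missing from the verse or from the corpus contribute 0.
sub rank {
    my ($i, @terms) = @_;
    my $rank = 0;
    $rank += ($tf[$i]{$_} // 0) * ($idf{$_} // 0) foreach @terms;
    return $rank;
}

# List the verses, best match for the query first.
my @terms = qw(sodom gomorrah);
for my $i (sort { rank($b, @terms) <=> rank($a, @terms) } 0 .. $#verses) {
    printf "%.3f  %s\n", rank($i, @terms), $verses[$i];
}
```

Calling rank() inside the sort block recomputes scores on every comparison; for a corpus the size of the bible you would compute each score once and sort the cached values.
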

Tf-idf is a cornerstone of NLP; the majority of search engines use it.

The model is simple, robust and will lead quickly to good results. But you may need to adjust it to your needs for better results.

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery
