davido wrote:

"Oh, I thought that part was made obvious in the documentation of the source code:

"Now we want to find a "score" for this paragraph, finding the best set of keywords which "apply" to it. We favour keyword sets which have a large number of matches (obviously a paragraph is better if it matches "a" and "c" than if it just matches "a") and with multi-word keywords. (A paragraph which matches "fresh cheese sandwiches" en bloc is worth picking out, even if it has no other matches.)"

It seems the intent is to find out how powerful the keyword is within a given paragraph. More matches means a better fit, more relevancy."

That's where I have trouble understanding. How does the number of words in the paragraph have anything to do with the quality of the match? It seems to me that the documentation and implied intent don't match the code. If you think it's correct, can you explain what it does in different words?
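To make my reading of the documentation concrete, here is a minimal sketch of the scoring I would expect from that description. It is not Text::Context's actual code: `score_paragraph` is a hypothetical helper, and the "one point per keyword word" weighting is my own assumption about what "favour multi-word keywords" might mean.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of Text::Context: scores one paragraph
# against a list of keywords, following the documented intent only.
sub score_paragraph {
    my ($paragraph, @keywords) = @_;
    my $score = 0;
    for my $keyword (@keywords) {
        # Case-insensitive literal match of the whole keyword phrase.
        next unless $paragraph =~ /\b\Q$keyword\E\b/i;
        # Assumed weighting: one point per word in the keyword, so a
        # multi-word phrase matched en bloc outweighs a single word.
        my @words = split ' ', $keyword;
        $score += scalar @words;
    }
    return $score;
}

my $para = "We sell fresh cheese sandwiches and hot coffee every day.";
print score_paragraph($para, "fresh cheese sandwiches"), "\n";  # prints 3
print score_paragraph($para, "coffee", "tea"), "\n";            # prints 1
```

Note that nothing in this sketch depends on how many words the paragraph itself contains, which is exactly the part of the real code I can't reconcile with the documentation.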

If speed is of concern, profile and find where the bottleneck is.

Indeed, but it's correctness rather than performance that concerns me, though the performance is what got me started investigating. I posted a summary of my NYTProf results to its RT queue a few days ago.


In reply to Re^4: Text::Context or alternatives? by Dave Howorth
in thread Text::Context or alternatives? by Dave Howorth
