Dave Howorth has asked for the wisdom of the Perl Monks concerning the following question:

I started using Text::Context and, after experiencing some performance oddities, started reading the code. I'm having some trouble understanding it, so I'm wondering whether there's another package that people recommend, or whether anybody who uses it can help me understand how it works.

For example, I'm looking at score_para() and wondering about:

$word_score += 1 + ($content =~ tr/ / /) if $content =~ /\b\Q$word\E\b/i;

It seems to add the number of words in the paragraph to the score, for a reason I don't understand, and it seems to recalculate that number repeatedly, which looks completely unnecessary. (Similarly, it recomputes permute_keywords().)

Can any monks enlighten me?

Re: Text::Context or alternatives?
by davido (Cardinal) on Nov 08, 2011 at 17:12 UTC

    This seems to provide more context (no pun intended) for the code snippet you showed:

    for my $word (@{ $self->{keywords} }) {
        my $word_score = 0;
        $word_score += 1 + ($content =~ tr/ / /) if $content =~ /\b\Q$word\E\b/i;
        $matches{$word} = $word_score;
    }

    That seems to be iterating over the list of keywords, and calculating a score per keyword.

    The same might be accomplished more efficiently if the algorithm iterated over the words in $content rather than over the keywords, checking whether each word in $content matches a keyword from the hash and, if so, applying the tr/// count.


    Dave

      It looks to me as if $content =~ tr/ / / could be calculated once outside the loop, and be kept in a variable.

      Still, I find the number of blanks a rather dubious metric. (What about all the other whitespace characters? Does it make sense for two blanks in a row to count double?)
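A minimal sketch of both points above, assuming a made-up paragraph and keyword list (none of this is taken from Text::Context itself). It computes the space count once before the loop, and it shows how tr/ / / can disagree with a count of whitespace-separated tokens when the text contains tabs or doubled blanks.

```perl
use strict;
use warnings;

# Hypothetical sample data; neither the paragraph nor the keywords come from the module.
my $content  = "fresh cheese sandwiches\tsold  here  daily";
my @keywords = ('cheese', 'fresh cheese sandwiches', 'ham');

# tr/ / / counts literal space characters only: each doubled blank counts
# twice and the tab is not counted at all.
my $space_count = ($content =~ tr/ / /);

# Counting runs of non-whitespace treats tabs and repeated blanks alike.
my $token_count = () = $content =~ /\S+/g;

# Compute the paragraph statistic once, then reuse it for every keyword.
my $para_score = 1 + $space_count;
my %matches;
for my $word (@keywords) {
    $matches{$word} = $content =~ /\b\Q$word\E\b/i ? $para_score : 0;
}

print "spaces: $space_count, tokens: $token_count\n";
printf "%-25s %d\n", $_, $matches{$_} for sort keys %matches;
```

With this input, "1 + spaces" and the token count differ, which is the kind of discrepancy the whitespace objection is about.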

        Yes, I too think it could be calculated outside the loop. But what merit does any statistic of the paragraph have as a score for a match?

      Yes, I didn't provide the context because I suppose monks will have their own ideas about how much is relevant.

      My question is rather: what does the number of words in the paragraph (i.e. 1 + the tr///) have to do with a meaningful score?

      It has now occurred to me that perhaps that should read:

      ($word =~ tr/ / /)
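To illustrate what that change would do, here is a sketch of the loop with $word substituted for $content in the tr/// count. This is only the hypothesis raised above, not confirmed behaviour of the module, and the paragraph and keywords are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical paragraph and keywords, for illustration only.
my $content  = "We sell fresh cheese sandwiches and cold drinks.";
my @keywords = ('cheese', 'fresh cheese sandwiches', 'ham');

my %matches;
for my $word (@keywords) {
    my $word_score = 0;
    # Variant under discussion: count the words in the keyword, not in the paragraph.
    $word_score += 1 + ($word =~ tr/ / /) if $content =~ /\b\Q$word\E\b/i;
    $matches{$word} = $word_score;
}

printf "%-25s %d\n", $_, $matches{$_} for sort keys %matches;
```

Under this variant a matching keyword scores its own word count, so a three-word phrase outranks a single word and a keyword that does not match scores zero.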

        Oh, I thought that part was made obvious in the documentation of the source code:

        "Now we want to find a "score" for this paragraph, finding the best set of keywords which "apply" to it. We favour keyword sets which have a large number of matches (obviously a paragraph is better if it matches "a" and "c" than if it just matches "a") and with multi-word keywords. (A paragraph which matches "fresh cheese sandwiches" en bloc is worth picking out, even if it has no other matches.)"

        It seems the intent is to find out how powerful the keyword is within a given paragraph. More matches means a better fit, more relevancy.

        And on second thought, there's really nothing to be gained by turning the algorithm on its side. It's utilizing Perl's strengths already.

        If speed is of concern, profile and find where the bottleneck is. Tom Duff (of Duff's Device) said this:

        "If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles."

        Step one: figure out where the trouble really is (profile). Step two: try to devise a better algorithm for that particular segment of code. Step three (if step two fails): remove cycles. That may be easier said than done, and until a profile shows that this particular loop is the problem, we can't be sure it is.
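Once a profiler (Devel::NYTProf is a common choice) has pointed at a specific routine, the core Benchmark module can compare candidate rewrites. The sketch below is a micro-benchmark, not a profile, and rests on assumptions: the input data is invented, and score_recount and score_hoisted are hypothetical stand-ins for the quoted loop, not the module's own functions.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical inputs; a real test should use paragraphs and keywords from your own data.
my $content  = join ' ', ('fresh cheese sandwiches and cold drinks') x 200;
my @keywords = ('cheese', 'fresh cheese sandwiches', 'ham', 'drinks');

# Recount the spaces for every matching keyword, as in the quoted loop.
sub score_recount {
    my %matches;
    for my $word (@keywords) {
        my $word_score = 0;
        $word_score += 1 + ($content =~ tr/ / /) if $content =~ /\b\Q$word\E\b/i;
        $matches{$word} = $word_score;
    }
    return \%matches;
}

# Count the spaces once and reuse the value.
sub score_hoisted {
    my $para_score = 1 + ($content =~ tr/ / /);
    my %matches;
    for my $word (@keywords) {
        $matches{$word} = $content =~ /\b\Q$word\E\b/i ? $para_score : 0;
    }
    return \%matches;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    recount => \&score_recount,
    hoisted => \&score_hoisted,
});
```

If the two rows come out close together, the repeated tr/// is not where the time goes and the regex matching or some other part of the module deserves the attention instead.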

        The source code for the module itself gives a clue immediately following that loop:

        #XXX : Possible optimization: Give up if there are no matches


        Dave