in reply to Re: Re: Context search term highlighting - Perl is too slow
in thread Context search term highlighting - Perl is too slow

That's really hard, though.

I figured it must be, or you would have done it already. However, by parsing these words at request time you are moving work that is normally done ahead of time (cached, essentially) into the request handling. It makes sense that you would pay a performance penalty for that.

First, swish keeps track of word position for phrase matches. But all sorts of things will bump the position counter: special chars, some HTML tags, and so on.

What I had in mind was keeping a character index into the original documents, not a word index.
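
For concreteness, here is a minimal sketch of what such a character index might look like (the names are illustrative, not swish internals): record each word's offset and length while tokenizing, then pull context straight out of the original text with substr at request time.

    # Build a character index while tokenizing: one [word, offset, length]
    # entry per word, so the document never needs re-parsing.
    my @index;
    while ( $doc =~ /(\w+)/g ) {
        push @index, [ lc $1, $-[1], $+[1] - $-[1] ];
    }

    # At request time, pull context for hit $i straight from the source:
    my ( $word, $off, $len ) = @{ $index[$i] };
    my $start   = $off > 40 ? $off - 40 : 0;
    my $context = substr( $doc, $start, ( $off - $start ) + $len + 40 );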

Right about /o in the regexp. See my comments (and I guess confusion) in my example code...

Sorry, I don't see it. If you need help with /o, there are some very good regex folks on here. I also liked the discussion in the Perl Cookbook about this.


Re: Re: Re: Re: Context search term highlighting - Perl is too slow
by moseley (Acolyte) on Dec 21, 2001 at 05:54 UTC
    Just to follow up.

    The actual parsing of the source into a structure that can be searched for phrase matches should be cached. I've played with that a little using Storable. It's a bit of a problem with Swish, since all we have is a swish index file and the script to do the highlighting. Swish stores the text of the document gzipped in its index, and since swish is written in C it would take a specialized function to get a Perl data structure stored in the index. Doable, but not really general purpose.
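
    For what it's worth, a minimal sketch of that Storable caching (the file name scheme and the parse_document() helper are made up for illustration):

        use Storable qw(store retrieve);

        # Reuse the cached parse if it is newer than the source document.
        my $cache = "$doc_file.parsed";           # illustrative cache location
        my $parsed;
        if ( -e $cache && -M $cache < -M $doc_file ) {
            $parsed = retrieve($cache);           # cheap: thaw the saved structure
        }
        else {
            $parsed = parse_document($doc_file);  # the expensive step
            store( $parsed, $cache );
        }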

    But the real problem is actually finding the phrases. If you look back at the profile output you can see that the parsing takes time, but it's not much compared to the work of finding the matches. Splitting the text up could probably be optimized a little by using a stream parser and assuming that in many cases only part of the doc needs to be parsed (if only displaying, say, the first five matches).
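
    Something along these lines, assuming $re holds the compiled search-term pattern: scan with a global match and bail out early, so the tail of the document is never examined.

        # Collect offset/length pairs for the first five hits only.
        my $wanted = 5;
        my @hits;
        while ( $doc =~ /$re/g ) {
            push @hits, [ $-[0], $+[0] - $-[0] ];
            last if @hits >= $wanted;   # the rest of the doc is never scanned
        }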

    I like the idea of storing the character offset in the index. We'd need to store the offset and length of the original word, since the indexed word might be different (stemming, for example).

    It will eat RAM during indexing, though. Indexing my /usr/doc gives 270K unique "words" but 20M individual word positions. At, say, five bytes to store the offset and length, we just ate 100MB of RAM.

    I also liked the idea of doing some general "qualification" test as dws suggested. I'm just not sure how to apply that in this case.

    Oh, and about /o: I meant the complete code example I posted on my machine. It basically checks $ENV{MOD_PERL} to decide whether to use /o or not.
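
    A guess at what such a check might look like (the posted code isn't visible here). Under mod_perl, /o would freeze the first request's pattern into the regex for the life of the child process, so you only want it in a one-shot script:

        # $pattern is assumed to hold the current search-term pattern.
        my $matcher = $ENV{MOD_PERL}
            ? sub { $_[0] =~ /$pattern/  }    # recompile per call: safe across requests
            : sub { $_[0] =~ /$pattern/o };   # compile once: fine for a one-shot CGI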

    I've never really understood why compiled regexps can't fall out of scope if you want them to. I'm sure there's a way, but I haven't thought about it for a while. It's fun tracking down those /o problems when first running something under mod_perl!
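
    (For what it's worth, qr// does give you that: the compiled pattern lives in an ordinary scalar, so it can be freed when the variable goes out of scope, whereas a /o pattern stays compiled for the life of the process. A minimal sketch, with $term and $text assumed:)

        {
            my $re = qr/\b\Q$term\E\b/i;      # compiled once, right here
            print "hit\n" if $text =~ $re;
        }   # $re falls out of scope and the compiled pattern can be freed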

    Thanks for everyone's time, and Happy New Year.

      I still can't see the code example you referred to, but if you want an alternative to /o that can be used under mod_perl, take a look at one of merlyn's columns here.
        The URL was:
        http://hank.org/modules/PhraseTest.pm

        Thanks for the merlyn article reference. /o isn't an issue for my testing, which is not under mod_perl, but I still need to revisit /o and qr// -- and more specifically, using qr// in combination with /o.
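
        That combination has a well-known trap: with /o, whatever qr// is interpolated on the first execution gets locked in, and later patterns are silently ignored. A tiny demonstration:

            for my $word (qw(foo bar)) {
                my $re = qr/\Q$word\E/;
                print "$word matched\n" if $word =~ /$re/o;
            }
            # Prints only "foo matched": /o froze the first pattern,
            # so "bar" is tested against qr/foo/ and fails.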

        I really do agree that my nested loops are killing my speed. But I can't imagine how to avoid processing one word at a time.
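
        One common way to get rid of the inner loop (a suggestion, not something from the thread) is to fold all the search terms into a single alternation and let the regex engine walk the text once:

            my @terms = qw(apple banana cherry);            # illustrative terms
            my $alt   = join '|', map { quotemeta } @terms;
            my $re    = qr/\b($alt)\b/i;

            ( my $marked = $doc ) =~ s{$re}{<b>$1</b>}g;    # every hit in one pass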

        Someday, if I have that thing I used to call free time, I'll try a different approach from within Swish, in C.

        This would be an easier discussion over beer and pool, Perrin. When will you be out West again?