in reply to Re: Context search term highlighting - Perl is too slow
in thread Context search term highlighting - Perl is too slow

Perrin writes:
But ultimately, my advice would be to change the Swish-e index so that it can tell you not only what document the word is in but where in the document it was found. Then you can avoid doing this expensive parsing at request time.

Perrin,

That's really hard, though. First, swish keeps track of word position for phrase matches, but all sorts of things bump the position counter: special characters, some HTML tags, and so on. Matching swish-e's position data against what I could parse would be hard; it's hard enough just matching up the text. So if swish told me to highlight word 243, I'd be lucky to know which word that was.

The other problem is the volume of data that might come back for a wildcard search like s*: tens of thousands of word positions for a few hundred results.

But probably my solution, if possible, is to have swish store the source document and, with each word, its character offset. Then for each word hit, return the character offsets. Argh. I can see where phrases would be tough, too.
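To illustrate why stored offsets would help, here's a minimal sketch, assuming a hypothetical index that hands back, for each hit, a list of [character offset, length] pairs into the original source text. Highlighting then becomes plain substr() work, with no re-parsing of the document at request time:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# $hits is a hypothetical structure: [ [offset, length], ... ]
# pairs assumed to come from the index, not from re-parsing.
sub highlight_by_offset {
    my ($text, $hits) = @_;
    my $out    = '';
    my $cursor = 0;
    for my $hit ( sort { $a->[0] <=> $b->[0] } @$hits ) {
        my ($off, $len) = @$hit;
        $out .= substr($text, $cursor, $off - $cursor);
        $out .= '<b>' . substr($text, $off, $len) . '</b>';
        $cursor = $off + $len;
    }
    $out .= substr($text, $cursor);
    return $out;
}

my $doc = "the quick brown fox";
print highlight_by_offset($doc, [ [4, 5], [10, 5] ]), "\n";
# -> the <b>quick</b> <b>brown</b> fox
```

Phrase matches would still need care (adjacent offsets would have to be merged into one span), but the expensive word-by-word parse goes away.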

Right about /o in the regexp. See my comments (and I guess confusion) in my example code...

thanks,


Re: Re: Re: Context search term highlighting - Perl is too slow
by perrin (Chancellor) on Dec 21, 2001 at 00:06 UTC
    That's really hard, though.

    I figured it must be, or you would have done it already. However, by parsing these words at request time you are moving something that is intentionally done ahead (cached, basically) into the request handling. It makes sense that you would pay a performance penalty for that.

    First, swish keeps track of word position for phrase matches. But, all sorts of things will bump the position counter, special chars, some html tags, and so on.

    What I had in mind was keeping a character index into the original documents, not a word index.

    Right about /o in the regexp. See my comments (and I guess confusion) in my example code...

    Sorry, I don't see it. If you need help with /o, there are some very good regex folks on here. I also liked the discussion in the Perl Cookbook about this.

      Just to follow up.

      The actual parsing of the source into a structure that can be searched for phrase matches should be cached. I've played with that a little using Storable. It's a bit of a problem with Swish, since all we have is a swish index file and the script to do the highlighting. Swish stores the text of the document gzipped in its index, and since swish is written in C, it would take a specialized function to get a Perl data structure stored in the index. Doable, but not really general-purpose.

      But the real problem is actually finding the phrases. If you look back at the profile output you can see that the parsing takes time, but it's not much compared to the work of finding the matches. Splitting the text up could probably be optimized a little by using a stream parser and, in many cases, by parsing only part of the document (if only displaying, say, the first five matches).
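The "stop after five matches" idea can be sketched like this (a minimal example with an assumed plain-text $doc and a made-up pattern, not the actual highlighting code): because /g matching is incremental, bailing out of the loop means the regex engine never scans the tail of a large document.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $doc     = "spam and spat and spar and spay and span and spas and spam";
my $pattern = qr/\bspa\w\b/;    # stand-in for the real search terms

# Collect only the first five match spans; `last` stops the scan
# as soon as we have enough to display.
my $max_matches = 5;
my @spans;
while ( $doc =~ /$pattern/g ) {
    push @spans, [ $-[0], $+[0] - $-[0] ];    # [offset, length]
    last if @spans >= $max_matches;
}

printf "found %d spans, last one at offset %d\n",
    scalar @spans, $spans[-1][0];
# -> found 5 spans, last one at offset 36
```

The @- and @+ arrays give the offsets of the current match for free, so the same loop could feed an offset-based highlighter directly.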

      I like the idea of storing the character offset in the index. We'd need to store the offset and length of the original word, since the indexed word might differ from what appears in the source.

      It will eat RAM during indexing, though. Indexing my /usr/doc tree gives 270K unique "words" but 20M individual word positions. At, say, five bytes to store each offset and length, that's 100MB of RAM.

      I also liked the idea of doing some general "qualification" test as dws suggested. I'm just not sure how to apply that in this case.

      Oh, and about /o: I meant the complete code example I posted from my machine. It basically checks $ENV{MOD_PERL} to decide whether to use /o.

      I've never really understood why compiled regexps can't fall out of scope when you want them to. I'm sure there's a way, but I haven't thought about it in a while. It's fun tracing down those /o modifiers when first running something under mod_perl!

      Thanks for everyone's time, and Happy New Year.

        I still can't see the code example you referred to, but if you want an alternative to /o that can be used under mod_perl, take a look at one of merlyn's columns here.