Just to follow up.

The parsing of the source into a structure that can be searched for phrase matches should be cached. I've played with that a little using Storable. It's a bit of a problem with Swish, though, since all we have is a Swish index file and the script that does the highlighting. Swish stores the text of the document gzipped in its index, and since Swish is written in C it would take a specialized function to get a Perl data structure stored in the index. Doable, but not really general purpose.
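For what it's worth, the Storable experiment looks roughly like the sketch below. The parse_words() and cached_parse() names and the one-cache-file-per-document layout are made up for illustration; the real parser builds a richer structure than a flat word list, and this does nothing about getting the data into the Swish index itself.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);

# Hypothetical parser: split text into lowercase word tokens.
sub parse_words {
    my ($text) = @_;
    return [ map { lc($_) } ($text =~ /(\w+)/g) ];
}

# Return the parsed structure, reading it from a Storable file when one exists.
sub cached_parse {
    my ($cache_file, $text) = @_;
    return retrieve($cache_file) if -e $cache_file;
    my $words = parse_words($text);
    store($words, $cache_file);
    return $words;
}

my $dir   = tempdir(CLEANUP => 1);
my $file  = "$dir/doc1.sto";
my $first = cached_parse($file, "Perl is too slow");    # parses and stores
my $again = cached_parse($file, "ignored this time");   # read back from the cache
print "@$again\n";
```

The second call never looks at its text argument, which is the whole point: the parse cost is paid once per document rather than once per search.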

But the real problem is actually finding the phrases. If you look back at the profile output you can see that the parsing takes time, but not much compared to the work of finding the matches. Splitting the text up could probably be optimized a little by using a stream parser and assuming that in many cases only part of the doc needs to be parsed (if only displaying the first, say, five matches).
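The "only parse part of the doc" idea could look something like this. It is a simplified sketch that scans the raw text with a regex rather than a real stream parser, and first_matches() is a name I made up; the point is just that the loop quits once it has enough hits.

```perl
use strict;
use warnings;

# Return up to $limit [offset, length] pairs for a phrase, scanning left to
# right and stopping as soon as enough matches are found.
sub first_matches {
    my ($text, $phrase, $limit) = @_;
    # Allow any run of non-word characters between the words of the phrase.
    my $pattern = join '\W+', map { quotemeta($_) } split ' ', $phrase;
    my $re = qr/\b$pattern\b/i;
    my @hits;
    while ($text =~ /$re/g) {
        push @hits, [ $-[0], $+[0] - $-[0] ];
        last if @hits >= $limit;    # the rest of the document is never scanned
    }
    return @hits;
}

my $doc  = "Perl is slow. No, perl  is fast. Perl is fun.";
my @hits = first_matches($doc, "perl is", 2);
print join(", ", map { "$_->[0]+$_->[1]" } @hits), "\n";   # prints "0+7, 18+8"
```

With a limit of two, the third "Perl is" is never looked at. On a long document where the matches come early, that is most of the work skipped.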

I like the idea of storing the character offset in the index. It would need to store both the offset and the length of the original word, since the indexed word might be different from the word as it appears in the text.
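Here is a minimal sketch of why the length has to be stored along with the offset. The normalize() routine is a stand-in I invented (lowercase, strip a trailing "s"); whatever the indexer really does to a word, the stored pair has to point back at the original text, not at the indexed form.

```perl
use strict;
use warnings;

# Hypothetical normalizer standing in for whatever the indexer does to a word.
# Here: lowercase it and strip a trailing "s".
sub normalize {
    my ($word) = @_;
    (my $w = lc $word) =~ s/s$//;
    return $w;
}

# Map each indexed word to a list of [offset, length] pairs that point
# back at the original word in the source text.
sub index_positions {
    my ($text) = @_;
    my %index;
    while ($text =~ /(\w+)/g) {
        my ($word, $offset) = ($1, $-[1]);   # copy before calling normalize()
        push @{ $index{ normalize($word) } }, [ $offset, length $word ];
    }
    return \%index;
}

my $text  = "Phrases and phrase matching";
my $index = index_positions($text);
for my $pos (@{ $index->{phrase} }) {
    my ($offset, $len) = @$pos;
    print "$offset,$len: ", substr($text, $offset, $len), "\n";
}
```

Both "Phrases" (7 characters) and "phrase" (6 characters) end up under the same indexed word, so the offset alone would not tell the highlighter how much text to wrap.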

It will eat RAM during indexing, though. My /usr/doc has 270K unique "words" but 20M individual word positions. Say five bytes to store each offset and length, and we just ate 100MB of RAM.
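To make the five-byte guess concrete, one possible layout is a 4-byte offset plus a 1-byte length. That layout is purely my assumption for the arithmetic, not anything Swish does today.

```perl
use strict;
use warnings;

# One position record: 4-byte unsigned offset ("N") plus 1-byte length ("C").
# This layout is an assumption; it caps word length at 255 characters.
my $record = pack 'N C', 123_456, 7;
my $bytes_per_position = length $record;

my $positions = 20_000_000;    # individual word positions in /usr/doc
my $total     = $positions * $bytes_per_position;
printf "%d bytes per position, %d MB total\n",
    $bytes_per_position, $total / 1_000_000;

# The record round-trips back to the original offset and length.
my ($offset, $len) = unpack 'N C', $record;
```

That prints "5 bytes per position, 100 MB total", and that is before any per-element overhead if the positions are held in Perl arrays rather than packed strings.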

I also liked the idea of doing some general "qualification" test as dws suggested. I'm just not sure how to apply that in this case.

Oh, and about /o, I meant the complete code example I posted on my machine. It basically checks for $ENV{MOD_PERL} to decide /o or not.

I've never really understood why compiled regexps can't fall out of scope if you want them to. I'm sure there's a way, but I haven't thought about it for a while. It's fun tracking down those /o modifiers when first running something under mod_perl!
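One way that probably covers it is qr//, which compiles the pattern into an ordinary scalar that goes out of scope like any other lexical. A minimal sketch follows; make_highlighter() is a made-up name, not the code on my machine.

```perl
use strict;
use warnings;

# Build a highlighter for one query. The compiled pattern lives in a lexical,
# so it is freed when the returned closure goes away.
sub make_highlighter {
    my (@terms) = @_;
    my $alternation = join '|', map { quotemeta($_) } @terms;
    my $re = qr/\b($alternation)\b/i;      # compiled once per query
    return sub {
        my ($text) = @_;
        $text =~ s/$re/<b>$1<\/b>/g;
        return $text;
    };
}

my $first  = make_highlighter('perl');
my $second = make_highlighter('slow');     # gets its own fresh pattern
print $first->("Perl is too slow"), "\n";
print $second->("Perl is too slow"), "\n";
```

With /o the second query would keep matching the first query's terms in a persistent interpreter. Here each request gets its own pattern, so there is no need to check $ENV{MOD_PERL} at all.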

Thanks for everyone's time, and Happy New Year.


In reply to Re: Re: Re: Re: Context search term highlighting - Perl is too slow by moseley
in thread Context search term highlighting - Perl is too slow by moseley
