Greetings,

I have a large CMS that I just upgraded from Plucene to KinoSearch. Whenever the site changes, I crawl it and index the various pages and files.

When I try to index a file on the order of 3 MB, the $index->add_doc($doc) command spins the CPU for over 30 minutes before completing.

The fields are specified as:

$index->spec_field(name => 'url', analyzed => 0, vectorized => 0);
$index->spec_field(name => 'filetype', indexed => 0, analyzed => 0, vectorized => 0);
$index->spec_field(name => 'title', boost => 3, vectorized => 0);
$index->spec_field(name => 'section', boost => 3, vectorized => 0);
$index->spec_field(name => 'content');

In these cases all of the fields except content are small. Content may be several MB in size. In the one example of this problem, I use wvWare to convert a 3 MB DOC to text ($content). ($indexer is an object that extracts information from some source file.)

my $doc = $index->new_doc;
$doc->set_value(url => $url || '');
$doc->set_value(title => $indexer->title || '');
$doc->set_value(filetype => $indexer->fileType || '');
$doc->set_value(section => $indexer->section || '');
my $content = $indexer->content;
$content =~ s/ / /g;
$content =~ s/\xA0/ /g;
$doc->set_value(content => $content || '');
$index->add_doc($doc);

At 3 MB, the CPU spins at add_doc for over 30 wallclock minutes before I give up. If I substring $content smaller, the time reduces (almost exponentially). 512 KB is on the order of 2 minutes.

My server is a dual 3 GHz Intel with 1 GB DDR2 RAM.

So, am I doing something wrong? Or am I nuts for trying to index a 3MB document at a time?

Thanks,


UPDATE:

creamygoodness said I should look for use of $&, $`, and $'. A quick search against my codebase revealed that some of my older code (6+ years old now) was using them. At the time, I thought, "Well, if I am going to accept the penalty, I may as well use them!" And I did.

I didn't realize the penalty was so severe. In fact, it really never came up until I was trying to tokenize a 3 MB string.
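For anyone wanting to run the same kind of search, here is a rough sketch of a scanner. The helper name lines_using_match_vars is made up for illustration, and it also looks for $MATCH, $PREMATCH, and $POSTMATCH on the assumption that English.pm might be in use. It does not parse Perl, so strings and comments can give false positives.

```perl
use strict;
use warnings;

# Hypothetical helper: return the line numbers in $source that mention
# $&, $`, or $' (or their English.pm names). This is a crude text match,
# not a Perl parser, so expect some false positives.
sub lines_using_match_vars {
    my ($source) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $source) {
        $line_no++;
        push @hits, $line_no
            if $line =~ /\$(?:[&`']|(?:PRE|POST)?MATCH\b)/;
    }
    return @hits;
}

# Small made-up sample; only the second line uses a match variable.
my $sample = join "\n",
    'my $x = "abc";',
    'print $& if $x =~ /b/;',
    'print "ok\n";';
print join(',', lines_using_match_vars($sample)), "\n";   # prints 2
```

In practice you would slurp each source file and pass its contents in; a plain grep over the tree does the same job.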

By removing all uses of $&, $`, and $', I took indexing time for an entire site (with a couple of 3 MB docs) down from over an hour to under a minute! I will add a warning about this to my note How to build a Search Engine.
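A minimal sketch of one way to make that replacement, not necessarily what my old code ended up looking like: take the offsets from @- and @+ and use substr, so the three special variables never appear in the program. The helper name split_around_match is hypothetical. As I understand it, the penalty applies to perls older than 5.20, and on 5.10 and later the /p modifier with ${^PREMATCH}, ${^MATCH}, and ${^POSTMATCH} is another option.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text around the first match of $re
# without touching $&, $`, or $'.
sub split_around_match {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my ($start, $end) = ($-[0], $+[0]);        # offsets of the whole match
    return (
        substr($text, 0, $start),              # stands in for $`
        substr($text, $start, $end - $start),  # stands in for $&
        substr($text, $end),                   # stands in for $'
    );
}

my ($pre, $match, $post) =
    split_around_match('index the 3 MB document', qr/\d+ MB/);
print "[$pre][$match][$post]\n";   # prints [index the ][3 MB][ document]
```

When only the matched text is needed, wrapping the pattern in capturing parentheses and reading $1 is simpler still.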

This is one of those cases where I wish I could ++ creamygoodness more than once. Maybe we need a new field added to our profiles; "Where to send beer". Thanks again!

Ted Young

($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)

In reply to KinoSearch & Large Documents by TedYoung
