TedYoung has asked for the wisdom of the Perl Monks concerning the following question:
Greetings,
I have a large CMS that I just upgraded from Plucene to KinoSearch. Whenever the site changes, I crawl it and index the various pages and files. When I try to index a file on the order of 3 MB, the $index->add_doc($doc) call spins the CPU for over 30 minutes before completing.
The fields are specced as:
    $index->spec_field(name => 'url',      analyzed => 0, vectorized => 0);
    $index->spec_field(name => 'filetype', indexed => 0, analyzed => 0, vectorized => 0);
    $index->spec_field(name => 'title',    boost => 3, vectorized => 0);
    $index->spec_field(name => 'section',  boost => 3, vectorized => 0);
    $index->spec_field(name => 'content');
In these cases all of the fields except content are small. Content may be several MB in size. In the one example of this problem, I use wvWare to convert a 3 MB DOC to text ($content). ($indexer is an object that extracts information from some source file.)
    my $doc = $index->new_doc;
    $doc->set_value(url      => $url || '');
    $doc->set_value(title    => $indexer->title || '');
    $doc->set_value(filetype => $indexer->fileType || '');
    $doc->set_value(section  => $indexer->section || '');
    my $content = $indexer->content;
    $content =~ s/ / /g;
    $content =~ s/\xA0/ /g;
    $doc->set_value(content => $content || '');
    $index->add_doc($doc);
At 3 MB, the CPU spins at add_doc for over 30 wallclock minutes before I give up. If I truncate $content with substr, the time drops much faster than linearly: 512 KB takes on the order of 2 minutes.
My server is a dual 3 GHz Intel with 1 GB DDR2 RAM.
So, am I doing something wrong? Or am I nuts for trying to index a 3MB document at a time?
Thanks,
UPDATE:
creamygoodness said I should look for use of $&, $`, and $'. A quick search of my codebase revealed that some of my older code (6+ years old now) was using them. At the time, I thought, "Well, if I am going to accept the penalty, I may as well use it!" And I did.
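A search like the one described can be sketched in a few lines of Perl. This is only a rough illustration: find_match_vars is a hypothetical helper name, and the check is purely textual, so it will also flag occurrences inside strings, comments, and POD. It also flags a plain "use English;", since the English module's documentation notes that loading it without -no_match_vars brings in the same match variables on older Perls.

```perl
use strict;
use warnings;

# Return [line_number, line_text] pairs for lines that appear to use
# $&, $` or $', or that load English without -no_match_vars.
# Crude textual check: review each hit by hand.
sub find_match_vars {
    my ($source) = @_;
    my @hits;
    my $line_no = 0;
    for my $line (split /\n/, $source) {
        $line_no++;
        if (   $line =~ /\$[&`']/
            || $line =~ /^\s*use\s+English\b(?!.*-no_match_vars)/)
        {
            push @hits, [$line_no, $line];
        }
    }
    return @hits;
}

# Scan every file named on the command line and report candidate lines.
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Cannot open $file: $!\n"; next };
    my $source = do { local $/; <$fh> };
    close $fh;
    printf "%s:%d: %s\n", $file, @$_ for find_match_vars($source);
}
```

Saved under a name of your choosing and run with a list of .pl and .pm files as arguments, it prints one file:line: text entry per suspicious line.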
I didn't realize the penalty was so severe. In fact, it really never came up until I was trying to tokenize a 3 MB string.
By removing all uses of $&, $`, and $', I took indexing time for an entire site (with a couple of 3 MB docs) down from over an hour to under a minute!!! I will add a warning about this to my note How to build a Search Engine.
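For anyone making the same change, here is a minimal sketch of how $&, $`, and $' can usually be replaced. On the Perls of this era, mentioning any of the three anywhere in a program makes every successful match copy its target string, which is why repeated matching against a 3 MB string hurts so much. Capture groups and the @- and @+ offset arrays avoid that. The names tokenize and match_parts and the \w+ token pattern are illustrative assumptions, not KinoSearch's actual analyzer.

```perl
use strict;
use warnings;

# Tokenize $text into [token, start_offset, end_offset] triples without
# touching $&, $` or $'. The (\w+) pattern is only an example.
sub tokenize {
    my ($text) = @_;
    my @tokens;
    while ($text =~ /(\w+)/g) {
        # $1 stands in for $&; $-[0] and $+[0] are the start and end
        # offsets of the whole match.
        push @tokens, [$1, $-[0], $+[0]];
    }
    return \@tokens;
}

# Compute the equivalents of ($`, $&, $') on demand for a single match.
# Returns an empty list when the pattern does not match.
sub match_parts {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    my $pre   = substr($text, 0, $-[0]);
    my $match = substr($text, $-[0], $+[0] - $-[0]);
    my $post  = substr($text, $+[0]);
    return ($pre, $match, $post);
}
```

The point of the design is that the substr copies happen only where the surrounding text is actually needed, instead of being paid on every match in the program.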
This is one of those cases where I wish I could ++ creamygoodness more than once. Maybe we need a new field added to our profiles: "Where to send beer". Thanks again!
Ted Young
($$<<$$=>$$<=>$$<=$$>>$$) always returns 1. :-)
Replies are listed 'Best First'.

Re: KinoSearch & Large Documents
by creamygoodness (Curate) on Feb 11, 2007 at 04:57 UTC
by TedYoung (Deacon) on Feb 11, 2007 at 13:37 UTC

Re: KinoSearch & Large Documents
by Khen1950fx (Canon) on Feb 10, 2007 at 16:59 UTC
by TedYoung (Deacon) on Feb 10, 2007 at 17:28 UTC
by Khen1950fx (Canon) on Feb 10, 2007 at 19:53 UTC
by TedYoung (Deacon) on Feb 11, 2007 at 01:30 UTC