Wow. You could write a book (and some have) on search technology.
IMHO, there are no good Perl search modules (please, someone prove me wrong). Check out
searchtools for a pretty
comprehensive list of available apps/libraries. A lot of the products there cost money ($$), and a lot of them focus on spidering
information rather than indexing/searching, but you should find it
a good starting point. Just be prepared to spend a significant
chunk of time integrating.
The main problem with your approach is that it will not scale well. It
may work fine for your current doc set, but add a few thousand more and it will become unbearably slow. Doing all that regex work in real time will also become burdensome. Most people approach this problem by indexing offline and then using those indexes for searching. You run the risk of stale search results if you have extremely dynamic docs, but most people don't, so reindexing on a periodic basis (say, weekly) will do the trick.
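To make the "index offline, search the index" idea concrete, here is a minimal sketch of an inverted index in plain Perl. It is not taken from any of the modules mentioned in this thread; the function names `build_index` and `search`, the document ids, and the sample text are all made up for illustration, and a real indexer would also want stemming, stop words, and on-disk storage.

```perl
use strict;
use warnings;

# Offline step: build an inverted index that maps each lowercased
# term to the documents containing it, with per-document counts.
# Input (hypothetical): a list of doc id => text pairs.
sub build_index {
    my (%docs) = @_;
    my %index;
    for my $id (keys %docs) {
        for my $term (map { lc $_ } $docs{$id} =~ /(\w+)/g) {
            $index{$term}{$id}++;
        }
    }
    return \%index;
}

# Query step: AND search. Returns the ids of documents that contain
# every query term, ranked by total term count (ties broken by id).
# Only hash lookups happen here; no document text is scanned.
sub search {
    my ($index, @terms) = @_;
    @terms = map { lc $_ } @terms;
    return () unless @terms;

    my (%hits, %score);
    for my $term (@terms) {
        # A term missing from the index means no document can match.
        my $postings = $index->{$term} or return ();
        for my $id (keys %$postings) {
            $hits{$id}++;
            $score{$id} += $postings->{$id};
        }
    }
    my @matches = grep { $hits{$_} == @terms } keys %hits;
    return sort { $score{$b} <=> $score{$a} or $a cmp $b } @matches;
}

# Tiny demo with made-up documents.
my $index = build_index(
    'a.html' => 'Perl search modules and search tools',
    'b.html' => 'Indexing documents offline with Perl',
    'c.html' => 'Spidering the web',
);
print join(', ', search($index, 'perl', 'search')), "\n";   # prints "a.html"
```

In practice the index would be built by a cron job and saved to disk (the core Storable module's `store` and `retrieve` would do for a small doc set), so the search script only has to load it and do the lookups.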
An example of a Perl library found at searchtools would be perlfect.
-derby
update: Thanks perrin. I'll look into Search::InvertedIndex. I've looked at DBIx::FullTextSearch before but didn't want the MySQL
overhead.
update again: Just to clarify, I would really like
to see something like Lucene in the Perl world.
update yet again: perrin is right. I need to look
at CPAN more closely. Besides the two mentioned below, WAIT is a Perl/XS implementation of the once ubiquitous
WAIS.