The first is the roll-your-own solution which, while perhaps the most involved, is guaranteed to be the most customised to your needs :-) For this, I would advise using an indexing module such as my own Local::SiteRobot or WWW::SimpleRobot (whose author has updated it based on submitted patches). In following this path, I would also advise you to do a bit of research and code audit of some of the existing search and indexing solutions - for example, flat-file text storage simply won't scale; the better option, employed by most other scripts of this nature, is tied hash or DBM file storage. For content and meta-tag parsing, existing modules such as HTML::TokeParser will shorten your development time and programmer migraines immensely.
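To make the above a little more concrete, here is a minimal sketch of that approach - parsing a page with HTML::TokeParser to pull out its title and meta keywords, then storing the index in a DBM file through a tied hash rather than a flat text file. The file names and the keyword-to-file index layout are illustrative assumptions only, not part of any of the modules mentioned.

    # index_page.pl - a bare-bones sketch, not a production indexer
    use strict;
    use warnings;
    use DB_File;
    use HTML::TokeParser;

    # Tie the index hash to a DBM file so it scales beyond flat text
    my %index;
    tie %index, 'DB_File', 'site_index.db'
        or die "Cannot tie index file: $!";

    my $file = shift or die "Usage: $0 <html-file>\n";
    my $p = HTML::TokeParser->new($file)
        or die "Cannot parse $file: $!";

    my ($title, @keywords) = ('');
    while (my $token = $p->get_tag('title', 'meta')) {
        if ($token->[0] eq 'title') {
            $title = $p->get_trimmed_text('/title');
        }
        elsif (lc($token->[1]{'name'} || '') eq 'keywords') {
            push @keywords,
                split /\s*,\s*/, $token->[1]{'content'} || '';
        }
    }

    # One entry per keyword - a real indexer would also weight
    # terms and de-duplicate the document lists here
    $index{lc $_} .= "$file\0" for @keywords;
    $index{"title:$file"} = $title;

    untie %index;

A search script would then simply re-tie the same DBM file read-only and split the null-delimited file list for each query term.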
The other way is to explore some of the existing solutions. One of the better options I found when looking into this issue was the Perlfect Search script, which offers HTTP and file system indexing, PDF text extraction, meta-tag following, ranking support and template-based output. I haven't yet had a chance to set this up on one of my development boxes, and while a preliminary code review turned up a couple of quirky style and indexing issues, the package looks fairly solid.
Good luck ... And feel free to /msg me if you have any questions.
perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'
In reply to Re: Script generated site index db
by rob_au
in thread Script generated site index db
by S_Shrum