in reply to Script generated site index db

There are a couple of ways you can go with this ...

The first is the roll-your-own solution, which, while perhaps the most involved, is guaranteed to be the most customised to your needs :-) For this, I would advise using an indexing module such as my own Local::SiteRobot or WWW::SimpleRobot (whose author has updated it based on submitted patches). If you follow this path, I would also advise doing a bit of research and a code audit of some of the existing search and indexing solutions - for example, flat-file text storage simply won't scale; the better option, employed by most other scripts of this nature, is tied hash or DBM file storage. For content and meta-tag following, existing modules such as HTML::TokeParser will shorten your development time and reduce programmer migraines immensely.
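
As a rough, untested sketch of how those two pieces might fit together - the index file name and the pipe-delimited record format below are just placeholders of mine, not anything from Local::SiteRobot or WWW::SimpleRobot:

#!/usr/bin/perl -w
use strict;
use Fcntl;
use SDBM_File;
use HTML::TokeParser;

# Tie a hash to an on-disk DBM file so the index scales past a flat text file.
tie my %index, 'SDBM_File', 'siteindex', O_RDWR|O_CREAT, 0644
    or die "Cannot tie siteindex: $!";

my $file = shift or die "Usage: $0 page.html\n";
my $p = HTML::TokeParser->new($file) or die "Cannot parse $file: $!";

# Pull the title and the description meta tag out of the document.
my ($title, $description) = ('', '');
while (my $token = $p->get_tag('title', 'meta')) {
    if ($token->[0] eq 'title') {
        $title = $p->get_trimmed_text('/title');
    }
    elsif (lc($token->[1]{name} || '') eq 'description') {
        $description = $token->[1]{content} || '';
    }
}

# One pipe-delimited record per document, keyed by its path.
$index{$file} = join '|', $title, $description;
untie %index;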

The other way is to explore some of the existing solutions. One of the better options I found when I was looking into this issue was the Perlfect Search script, which offers HTTP and file-system indexing, PDF text extraction, meta-tag following, ranking support and template-based output. I haven't yet had a chance to set it up on one of my development boxes, and while a preliminary code review turned up a couple of quirky style and indexing issues, the package looks fairly solid.

Good luck ... And feel free to /msg me if you have any questions.

 

perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'

Re: Re: Script generated site index db
by S_Shrum (Pilgrim) on Mar 19, 2002 at 10:12 UTC

    Meta-tag following?!?!? Nah! Nothing that complex.

    My only concern at present is getting the links out of the document for following. Basically, anything that can be seen publicly in the document is what I want to index on. This will let me keep pages on my site that have no referring links pointing to them, and those pages will not be indexed for general public viewing.

    Some "off-the-top-of-my-head" planning
    =====================================
    What I could use (and I think I saw it on CPAN somewhere) is a module that extracts all the links from a specified URL. I could then check, in a loop, that each link string begins with/contains/ends with a user-defined string (e.g. "http://www.shrum.net") to keep traversals under control.
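
    Something along these lines is what I'm picturing for that step - a rough sketch only, using HTML::LinkExtor, with my own site as the example prefix:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;

    my $base = 'http://www.shrum.net';   # user-defined prefix to stay on-site
    my $page = "$base/";

    my $html = get($page);
    die "Could not fetch $page\n" unless defined $html;

    # The second argument to new() makes HTML::LinkExtor resolve relative
    # hrefs against the page URL before handing them to the callback.
    my @found;
    my $extor = HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        push @found, "$attr{href}" if $tag eq 'a' and $attr{href};
    }, $page);
    $extor->parse($html);
    $extor->eof;

    # Keep only links that begin with the user-defined prefix.
    my @on_site = grep { index($_, $base) == 0 } @found;
    print "$_\n" for @on_site;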

    I could then check a hash where I store the links, to make sure duplicate link entries aren't created, and introduce the new links into the hash.

    At the same time, the contents of the page would be stripped of their HTML (I think I saw a module that does this on CPAN too) and the raw document text would be stored in the hash under its corresponding URI.

    I guess I would need an INDEXED flag for each URI to indicate that the URI has been visited and its page content copied. This would be done last, to visit any pages that had links pointing to them but were skipped during the first traversal.

    Once all the INDEXED flags are taken care of, the data in the hash would be written to a flat file that could then be searched with whatever db engine you wanted.
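
    Putting the above together, a first-pass, untested sketch of the whole loop might look something like this - the %pages hash, the INDEXED flag and the site.idx file name are all just placeholders of mine:

    #!/usr/bin/perl -w
    use strict;
    use LWP::Simple qw(get);
    use HTML::LinkExtor;
    use HTML::TokeParser;
    use URI;

    my $base  = 'http://www.shrum.net';            # user-defined prefix
    my %pages = ( "$base/" => { INDEXED => 0 } );  # URI => { INDEXED, TEXT }

    # Keep going until every URI we know about has been visited.
    while (my @todo = grep { !$pages{$_}{INDEXED} } keys %pages) {
        for my $uri (@todo) {
            my $html = get($uri);
            $pages{$uri}{INDEXED} = 1;             # mark visited either way
            next unless defined $html;

            # Collect on-site links; the hash keys kill duplicates for free.
            my $extor = HTML::LinkExtor->new(sub {
                my ($tag, %attr) = @_;
                return unless $tag eq 'a' and $attr{href};
                my $abs = URI->new_abs($attr{href}, $uri)->as_string;
                $pages{$abs} ||= { INDEXED => 0 } if index($abs, $base) == 0;
            });
            $extor->parse($html);
            $extor->eof;

            # Strip the HTML and keep only the raw text for searching.
            my $p    = HTML::TokeParser->new(\$html);
            my $text = '';
            while (my $token = $p->get_token) {
                $text .= ' ' . $token->[1] if $token->[0] eq 'T';
            }
            $pages{$uri}{TEXT} = $text;
        }
    }

    # One record per line: URI, a tab, then the whitespace-collapsed text.
    open my $out, '>', 'site.idx' or die "Cannot write site.idx: $!";
    for my $uri (sort keys %pages) {
        (my $flat = $pages{$uri}{TEXT} || '') =~ s/\s+/ /g;
        print $out "$uri\t$flat\n";
    }
    close $out;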

    ==========================

    This is just my first-pass line of thinking. I will sleep on it tonight and digest it some more tomorrow. If you have any suggestions, I am all ears.

    Thanx for the help so far!

    ======================
    Sean Shrum
    http://www.shrum.net

      I write in further support of Perlfect Search. It will let you create a list of excluded folders/files that won't get indexed, and just because it has complex features that you don't need (at the moment) doesn't mean you have to use them.

      I use Perlfect on one of my sites, and it's very good.
      The only minus point, for me, is that it doesn't support live searches, so it's necessary to create a cron job to regularly re-index the site.