S_Shrum has asked for the wisdom of the Perl Monks concerning the following question:

I am at a junction....I looked around and read the node about Local::SiteRobot, and I also looked on CPAN at the various traversal modules, but I am not seeing anything that does what I am about to propose. I am looking for input on whether this is a good idea, or whether something out there already does this for free. REMEMBER: Google's SiteSearch costs $ and they "reserve the right to place ads in the results".

Background
==========
Currently, most of my content is in a single db (docs.dat), but I am beginning to deplore this. There are a number of problems with it, but they are not important at the moment, so I digress.

I am looking at leaving my content in the original html pages instead of my current "cut-n-paste-into-my-db" method. This would allow me to use the content db as a document compiler as well as a content repository. The problem with this idea is that html file content will not be in the content db and therefore any search on it will not return any results for those pages.

What I am looking to do
=======================
I want to traverse the links within a site (http://www.mysite.com) and create a raw text db of the content. As the robot traverses the site, the information it collects will be stored in a flat-file db in the following (preliminary) fashion:

URI|Title|Content where:

The URI will be used to make sure the robot isn't backtracking by checking to see if the URI already exists in the file/hash.

The Title will be used in the search result template and will be wrapped in the URI later.

The Content will be a raw text dump (sans HTML) of the content that was on the page located under the URI.

As long as all the pages have links to each other, no document will be left out and the file will be (in essence) a snapshot of the entire website (content-wise, at least).

The resultant file could then be used with whatever db tool you want (in my case, DBI and DBD::AnyData) to do site content searches.
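For illustration, a search against such a pipe-delimited file might look roughly like this with DBI and DBD::AnyData. This is only a sketch: the file name, table name, and column names are placeholders, and it assumes the first line of the file holds the column names.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # Connect to the in-memory AnyData driver.
    my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):');

    # Map the pipe-delimited flat file to a table named 'pages'.
    # Assumes the first line of site_index.dat is "uri|title|content".
    $dbh->func( 'pages', 'Pipe', 'site_index.dat', 'ad_catalog' );

    # A simple content search.
    my $sth = $dbh->prepare('SELECT uri, title FROM pages WHERE content LIKE ?');
    $sth->execute('%camel%');

    while ( my ($uri, $title) = $sth->fetchrow_array ) {
        print "$title => $uri\n";
    }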

The script will be set up with a parameter that allows for db (re)creation, so if you update your site with new data or remove pages (links), the content db stays up-to-date. Since the db's sole purpose is searching, a rebuild does not impact the content (only searching will be down until the traversal is complete).

Does this sound like something that already exists, or should I start writing my own? Does this seem like a good idea?

As always, pro/con input appreciated.

======================
Sean Shrum
http://www.shrum.net

Replies are listed 'Best First'.
Re: Script generated site index db
by simon.proctor (Vicar) on Mar 19, 2002 at 09:28 UTC
    I did this recently for a site that uses the Template Toolkit. I used HTML::TokeParser to parse the HTML and grab the text to index, and a combination of MLDBM and Storable to build the search index. You can also use this or HTML::LinkExtor to extract the links from your pages. Just don't parse them out by hand (please).
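    A rough sketch of that combination, assuming the page content is already in $html (the variable names and the base URL are only illustrative):

        use strict;
        use warnings;
        use HTML::TokeParser;
        use HTML::LinkExtor;

        # A stand-in for a fetched page.
        my $html = '<html><head><title>Example</title></head>'
                 . '<body><p>Some content.</p><a href="/other.html">other</a></body></html>';

        # Grab the title and the visible text with HTML::TokeParser.
        my $p = HTML::TokeParser->new(\$html) or die "Can't parse HTML";
        my ($title, $text) = ('', '');
        $title = $p->get_trimmed_text('/title') if $p->get_tag('title');
        while (my $token = $p->get_token) {
            $text .= ' ' . $token->[1] if $token->[0] eq 'T';   # text tokens only
        }

        # Grab the links with HTML::LinkExtor, absolutised against a base URL.
        my @links;
        my $extor = HTML::LinkExtor->new(
            sub {
                my ($tag, %attr) = @_;
                push @links, $attr{href} if $tag eq 'a' && $attr{href};
            },
            'http://www.example.com/'
        );
        $extor->parse($html);

        print "Title: $title\n";
        print "Link:  $_\n" for @links;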

    I used stemming to reduce the size of the content and saved the relevant info about a page (the meta info) into a separate DBM cache. I did this so I could display the results with some of the page meta info (though on reflection this may not have been necessary).

    If you don't already know, stemming reduces words to a shorter form based on a set of rules; thus trains, trainers and training could all be considered to be 'train'. I used Paice/Husk stemming, although you could consider using Lingua::Stem, which uses Porter's algorithm.
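    For what it's worth, Lingua::Stem is simple to drop in; a minimal sketch (the word list is just an example):

        use strict;
        use warnings;
        use Lingua::Stem qw( stem );

        # stem() takes a list of words and returns a reference to an
        # array of their stems, in the same order.
        my @words   = qw( trains trainers training );
        my $stemmed = stem(@words);

        print "$words[$_] -> $stemmed->[$_]\n" for 0 .. $#words;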

    The search index used weighted, term-based indexing.

    If you would rather just get one pre-built, then go to Perlfect as that's pretty good. I didn't use it because it didn't use stemming (that I could see) and it didn't do phrase searching, which mine does.

    I mention all of this because your search index is not at all optimised and some pre-processing would speed things up.

    I would recommend this even though you want to go about fetching stuff via a robot.

    HTH

    Simon

      Stemming, eh? Hmmm...interesting concept. I'll research it.

      Overall, the total size of my current content file (even if I add in all the html template pages) is just over a meg. This generates over 100 pages of content at my site currently.

      If the content size were, say, 10 times greater, I would agree with you, as searching a 10 meg file would take time. I think for the first version I will probably not stem the content...or at least not until I have a better grasp of the specifics. Also, by leaving things this way, anybody could search the data with whatever db engine they wanted, including phrase searching, say through DBI, DBD::AnyData, and SQL::Statement. ;D

      Thanks for that bit of info about stemming...I will look into it.

      ======================
      Sean Shrum
      http://www.shrum.net

        I have posted a node enquiring into stemming algorithms in Perl here, in particular, the Porter algorithm employed in Lingua::Stem, which may be of interest to you.

         

Re: Script generated site index db
by rob_au (Abbot) on Mar 19, 2002 at 09:28 UTC
    There are a couple of ways which you can go with this ...

    The first is the roll-your-own solution, which, while perhaps the most involved, is guaranteed to be the most customised to your needs :-) For this, I would advise using an indexing module such as my own Local::SiteRobot or WWW::SimpleRobot (the author of which has updated it based on submitted patches). In following this path, I would advise you to do a bit of research and code audit of some of the existing search and indexing solutions. For example, a flat-file text storage base simply won't scale; the better option, employed by most other scripts of this nature, is a tied hash or DBM file storage. For content and meta-tag following, existing modules such as HTML::TokeParser will shorten your development time and programmer migraines immensely.
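    To illustrate the tied hash / DBM point, a minimal sketch along the MLDBM lines mentioned above (the file name and record layout are only placeholders):

        use strict;
        use warnings;
        use MLDBM qw( DB_File Storable );   # serialise nested values with Storable
        use Fcntl;

        tie my %index, 'MLDBM', 'site_index.dbm', O_CREAT|O_RDWR, 0644
            or die "Cannot open site_index.dbm: $!";

        # One record per URI; the value can be an arbitrary structure,
        # and lookups don't require slurping the whole index into memory.
        $index{'http://www.example.com/about.html'} = {
            title   => 'About',
            content => 'raw text of the page ...',
        };

        my $page = $index{'http://www.example.com/about.html'};
        print $page->{title}, "\n";

        untie %index;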

    The other way is to explore some of the existing solutions. One of the better options I found when I was looking into this issue was the Perlfect Search script, which offers HTTP and file-system indexing, PDF text extraction, meta-tag following, ranking support and template-system output. I haven't yet had a chance to set it up on one of my development boxes, and while a preliminary code review found a couple of quirky style and indexing issues, the package looks fairly solid.

    Good luck ... And feel free to /msg me if you have any questions.

     

    perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'

      Meta-tag following?!?!? Nah! Nothing that complex.

      My only concern at present is getting the links out of the document for following. Basically, anything that can be seen publicly in the document is what I want to index on. This will also let me have pages on my site that have no referring links to them and so will not be indexed for general public viewing.

      Some "off-the-top-of-my-head" planning
      =====================================
      What I could use (and I think I saw on CPAN somewhere) is a mod that extracts all the links on a specified URL. I could then check, via a loop, that each link string begins with/contains/ends with a user-defined string (e.g. "http://www.shrum.net") to keep traversals under control.

      I could then check a hash where I store the links, to make sure duplicate link entries aren't created, and introduce the new links into the hash.

      At the same time, the contents of the page would be stripped of their HTML (I think I saw a mod on CPAN that does this too) and the raw document data stored in the hash, associated with its corresponding URI.

      I guess I would need an INDEXED flag for each URI to indicate that the URI has been visited and its page content copied. This would be done last, to visit any pages that had links to them but were skipped during the first traversal.

      Once all the INDEXED flags are taken care of, the data in the hash would be written to a flat-file that could then be searched with whatever db engine you wanted.
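      As a rough sketch of the loop described above, assuming LWP::Simple for fetching, HTML::LinkExtor for the links, and HTML::TokeParser for stripping the markup (the base URL, output file name, and field layout are just the preliminary ones from the proposal):

          #!/usr/bin/perl
          use strict;
          use warnings;
          use LWP::Simple qw( get );
          use HTML::LinkExtor;
          use HTML::TokeParser;
          use URI;

          my $base    = 'http://www.shrum.net/';   # user-defined string to stay on-site
          my $outfile = 'site_index.dat';

          my %seen;                                # URI => 1 once visited
          my @queue = ($base);
          my @records;

          while (my $uri = shift @queue) {
              next if $seen{$uri}++;               # don't index the same URI twice
              my $html = get($uri) or next;

              # Title and raw text, sans HTML, in a single pass.
              my $p = HTML::TokeParser->new(\$html) or next;
              my ($title, $text, $in_title) = ('', '', 0);
              while (my $t = $p->get_token) {
                  if    ($t->[0] eq 'S' and $t->[1] eq 'title') { $in_title = 1 }
                  elsif ($t->[0] eq 'E' and $t->[1] eq 'title') { $in_title = 0 }
                  elsif ($t->[0] eq 'T') {
                      $in_title ? ($title .= $t->[1]) : ($text .= ' ' . $t->[1]);
                  }
              }
              s/[|\r\n]+/ /g for $title, $text;    # keep the pipe-delimited rows intact
              push @records, join '|', $uri, $title, $text;

              # Queue any on-site links we haven't seen yet.
              my $extor = HTML::LinkExtor->new(
                  sub {
                      my ($tag, %attr) = @_;
                      return unless $tag eq 'a' && $attr{href};
                      my $link = URI->new_abs($attr{href}, $uri)->canonical;
                      $link->fragment(undef);
                      push @queue, "$link" if index("$link", $base) == 0 && !$seen{"$link"};
                  }
              );
              $extor->parse($html);
          }

          # (Re)create the flat-file db in one pass.
          open my $fh, '>', $outfile or die "Cannot write $outfile: $!";
          print {$fh} "uri|title|content\n";       # header row for whatever db tool reads it
          print {$fh} "$_\n" for @records;
          close $fh;

      Re-running the script recreates site_index.dat from scratch, which would match the (re)creation parameter described in the original proposal.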

      ==========================

      This is just my first-pass line of thinking. I will sleep on it tonight and digest it some more tomorrow. If you have any suggestions, I am all ears.

      Thanx for the help so far!

      ======================
      Sean Shrum
      http://www.shrum.net

        I write in further support of Perlfect Search. It lets you create a list of excluded folders/files that won't get indexed, and just because it has complex features that you don't need (at the moment) doesn't mean you have to use them.

        I use Perlfect on one of my sites, and it's very good.
        The only minus point, for me, is that it doesn't support live searches, so it's necessary to create a cron job to regularly re-index the site.
Re: Script generated site index db
by shotgunefx (Parson) on Mar 19, 2002 at 08:48 UTC
    If you have access to your document root, why go through a bot at all? It seems like it would be easier to just grep through all your HTML files in htdocs (or wherever they are) than to deal with the error handling, off-site link following, etc. that a bot requires. Quicker too.

    Just a thought.

    -Lee

    "To be civilized is to deny one's nature."

      The problem there is that my site uses nested HTML documents...the URI will be a Perl script call that takes the following:

      Page = General site identity document
      Table = Content header (table results, document abstract info, etc.)
      Record = Document content (tabled data, documents, etc.)

      ...and nest them in the following order: Record -> Table -> Page

      Just searching the files would create a list of all the words at the site; however, it would not take into consideration the layout of the files and their nesting order...the URI is the key.

      ======================
      Sean Shrum
      http://www.shrum.net

        You could expand the filename. I am assuming when you say nested you mean subdirectories.

        Let's say your document root is /usr/local/apache/htdocs
        #!/usr/bin/perl
        use File::Find;

        my $filedir = '/usr/local/apache/htdocs';
        my $baseurl = "http://someplace.com/shrum";
        my @docs    = ();

        sub process_file {
            return if -d;                        # Skip directories.
            push @docs, [$File::Find::dir, $_];  # [0] is dir, [1] is filename.
        }

        find( \&process_file, ($filedir) );

        foreach my $doc (@docs) {
            $doc->[0] =~ s/$filedir/$baseurl/o;  # Map the directory to a URL.
            print "URL is " . $doc->[0] . '/' . $doc->[1], "\n";
        }


        -Lee

        "To be civilized is to deny one's nature."
Re: Script generated site index db
by S_Shrum (Pilgrim) on Mar 19, 2002 at 09:08 UTC

    A few other points to clarify...

    Some of the site content is dynamically generated by Perl scripts. Using a robot would allow me to index this content as well. As came up in the reply to shotgunefx above, a low-level file dump would not produce the results I am looking for.

    In any event, the script I am thinking of making would allow for things like multi-domain searches (say, if you had "http://www.mysite.com" and "http://search.mysite.com", etc.) but would also allow you to index remote sites (if you wanted to; it would probably piss off the domain owner if they found out).

    If this sounds like a good idea and you happen to have some first-hand experience with a mod(s) that might help me achieve my proposed end-result, let me know. Also, if you have sample code to invoke those mods, please pass that along too! Thanx.

    ======================
    Sean Shrum
    http://www.shrum.net

      Generically indexing dynamic pages might be tricky. For a start on the indexing, you may want to look at merlyn's parallel link checker column.

      -Lee

      "To be civilized is to deny one's nature."

      Maybe not directly a Perl solution, but you might want to check out SWISH-E.