in reply to Script generated site index db

If you have access to your document root, why go through a bot at all? It seems like it would be easier to just grep through all your HTML files in htdocs (or wherever they live) than to deal with the error handling, off-site link following, etc. that a bot requires. Quicker, too.

Just a thought.

-Lee

"To be civilized is to deny one's nature."

Replies are listed 'Best First'.
Re: Re: Script generated site index db
by S_Shrum (Pilgrim) on Mar 19, 2002 at 08:55 UTC

    The problem there is that my site uses nested HTML documents...the URI will be a Perl script call that takes the following:

    Page = General site identity document
    Table = Content header (table results, document abstract info, etc.)
    Record = Document content (tabled data, documents, etc.)

    ...and nest them in the following order: Record -> Table -> Page

    Just searching the files would create a list of all the words on the site, but it would not take into account the layout of the files and their nesting order...the URI is the key.
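
    To sketch what I mean (the script name view.pl and the parameter names below are made up for illustration, not my real script), the unit of indexing is the generated URI built from the Page/Table/Record triple, not any single file on disk:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sketch: each indexable "page" is really a script call whose
# parameters name the Page, Table, and Record pieces. The script name and
# parameter names are assumptions for illustration only.
my @entries = (
    { page => 'main', table => 'docs', record => 'install.html' },
    { page => 'main', table => 'docs', record => 'readme.html'  },
);

for my $e (@entries) {
    my $uri = sprintf 'http://www.shrum.net/cgi-bin/view.pl?page=%s&table=%s&record=%s',
        @{$e}{qw(page table record)};
    print "$uri\n";    # this URI, not the record file, is the index key
}
```

    So a grep over the files would find the words, but could never reconstruct these URIs.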

    ======================
    Sean Shrum
    http://www.shrum.net

      You could expand the filename. I am assuming when you say nested you mean subdirectories.

      Let's say your document root is /usr/local/apache/htdocs
      #!/usr/bin/perl
      use strict;
      use warnings;
      use File::Find;

      my $filedir = '/usr/local/apache/htdocs';
      my $baseurl = 'http://someplace.com/shrum';
      my @docs;

      sub process_file {
          return if -d;                         # Skip directories.
          push @docs, [$File::Find::dir, $_];
      }

      find(\&process_file, $filedir);

      foreach my $doc (@docs) {
          $doc->[0] =~ s/\Q$filedir\E/$baseurl/;    # [0] is dir, [1] is filename.
          print "URL is $doc->[0]/$doc->[1]\n";
      }


      -Lee

      "To be civilized is to deny one's nature."
        This is still very much a half-solution for sites that incorporate dynamic content or server-side includes. For sites such as these, for which I first looked into this issue, a locally run HTTP indexing engine is an absolute must. It should also be noted that indexing can be scheduled for low-utilisation times, so the impact on the server is minimal.

        The other advantage this approach offers is the ability to fold website maintenance tasks, such as broken-link checking and content auditing, into the same process.

         

        perl -e 's&&rob@cowsnet.com.au&&&split/[@.]/&&s&.com.&_&&&print'

        My bad...

        By nested, I literally mean nested in that:

        I have a script that makes LWP::Simple calls to the Page, Table, and Record files specified in the script call and stores them in variables. These files are templates into which I substitute the site content data. Once the substitutions are complete, I place the record template into the table template, and that into the page template.
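
        The nesting order can be sketched like this. In the real script the three templates are fetched with LWP::Simple's get(); literal strings stand in for them here, and the %%...%% placeholder names are assumptions for illustration, not my actual markers:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Stand-ins for the three templates (really fetched via LWP::Simple::get).
my $page   = "<html><body>%%TABLE%%</body></html>";
my $table  = "<table><tr><td>%%RECORD%%</td></tr></table>";
my $record = "<p>%%CONTENT%%</p>";

my $content = 'row data pulled from docs.dat';

$record =~ s/%%CONTENT%%/$content/;   # fill the record template
$table  =~ s/%%RECORD%%/$record/;     # nest: record -> table
$page   =~ s/%%TABLE%%/$table/;       # nest: table -> page

print $page, "\n";
```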

        To give you a better idea, here is an example of a completed URL from my site.

        The above URI uses 3 template files:

        Page
        Table, and
        Record

        The majority of the page content that you see (at present) is from the docs.dat.

        Currently, this setup allows me to create web pages either with content from the docs.dat or by simply specifying a document HTML page as the RECORD, in which case the user is none the wiser...the page displays as if it had come from the docs.dat (even though it's not). Hence the problem: the document content (in this case) will not be in the docs.dat (only the reference to the document will be listed in the URI).

        It doesn't make for the cleanest HTML but it works 99% of the time. ;D

        ======================
        Sean Shrum
        http://www.shrum.net