Meta-tag following?!?!? Nah! Nothing that complex.

My only concern at present is getting the links out of the document so they can be followed. Basically, anything that can be seen publicly in the document is what I want to index. This lets me keep pages on my site that have no links referring to them, and those pages will not be indexed for general public viewing.

Some "off-the-top-of-my-head" planning
=====================================
What I could use (and I think I saw one on CPAN somewhere) is a module that extracts all the links from a specified URL. I could then check, in a loop, that each link string begins with, contains, or ends with a user-defined string (e.g. "http://www.shrum.net") to keep traversals under control.
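The extract-and-filter step could look roughly like the sketch below. It is written in Python with only the standard library, not with the CPAN module alluded to above, and the names LinkCollector and extract_site_links are made up for illustration. It checks only the "begins with" case.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Turn relative links such as "/about.html" into absolute URLs.
                self.links.append(urljoin(self.base_url, value))


def extract_site_links(html, base_url, allowed_prefix):
    """Return the links in html that start with allowed_prefix."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return [link for link in collector.links if link.startswith(allowed_prefix)]
```

A plain prefix test would also accept a host like "http://www.shrum.net.example.com", so a stricter check on the parsed host name may be worth adding.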

I could then look each link up in the hash where I store links, so that duplicate entries aren't created, and add only the new links to the hash.
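As a rough sketch of that duplicate check, in Python with a dict standing in for the hash (add_new_links is a hypothetical name). Dropping the "#fragment" part first is my own assumption, so that "page.html" and "page.html#top" count as one entry.

```python
from urllib.parse import urldefrag


def add_new_links(index, links):
    """Add unseen links to the index dict and return the ones that were new."""
    added = []
    for link in links:
        # Assumption: a fragment points inside the same page, so ignore it.
        uri = urldefrag(link).url
        if uri not in index:
            index[uri] = None  # Page text gets filled in when the URI is visited.
            added.append(uri)
    return added
```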

At the same time, the contents of the page will be stripped of their HTML (I think I saw a module that does this on CPAN too) and the raw document text will be stored in the hash under its corresponding URI.
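The HTML stripping might be sketched like this, again in Python's standard library instead of a CPAN module. TextExtractor and strip_html are illustrative names, and skipping script and style content is an assumption about what counts as visible text.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keeps the text between tags, ignoring script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.parts.append(data)


def strip_html(html):
    """Return the visible text of html with runs of whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # Flush any text still buffered by the parser.
    return " ".join(" ".join(extractor.parts).split())
```

Joining the pieces with a space keeps words in neighbouring elements apart, at the cost of splitting a word that has a tag in the middle of it.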

I guess I would need an INDEXED flag for each URI to indicate that the URI has been visited and its page content copied. Unflagged URIs would be handled last, to visit any pages that had links to them but were skipped during the first traversal.
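One way the INDEXED flag could drive the traversal, sketched in Python. The three callables (fetch_page, find_links, to_text) are hypothetical placeholders for the fetching, link-extracting and HTML-stripping steps, passed in so the loop does not depend on any particular module.

```python
def crawl(start_url, fetch_page, find_links, to_text):
    """Visit every reachable URI once, flagging each as indexed when done."""
    index = {start_url: {"indexed": False, "text": ""}}
    while True:
        # Anything still unflagged was discovered but not yet visited.
        pending = [uri for uri, entry in index.items() if not entry["indexed"]]
        if not pending:
            break
        for uri in pending:
            html = fetch_page(uri)
            for link in find_links(html, uri):
                # setdefault leaves already-known URIs (and their flags) alone.
                index.setdefault(link, {"indexed": False, "text": ""})
            index[uri]["text"] = to_text(html)
            index[uri]["indexed"] = True
    return index
```

Repeating the scan until nothing is left unflagged covers the pages that were linked to but skipped on an earlier pass.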

Once every URI has its INDEXED flag set, the data in the hash would be written to a flat file that could then be searched with whatever db engine you wanted.
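A minimal sketch of the flat-file step in Python, assuming one record per line with the URI and the page text separated by a tab. The layout and the function names are my own choice, not something settled above.

```python
def format_record(uri, text):
    """Build one line: URI, a tab, then the text with tabs and newlines flattened."""
    flat = " ".join(text.split())
    return f"{uri}\t{flat}\n"


def write_flat_file(index, path):
    """Write a dict of URI -> page text to path, one record per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for uri in sorted(index):
            handle.write(format_record(uri, index[uri]))
```

Flattening the whitespace keeps each record on a single line, so the file can be split on the first tab when it is read back.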

==========================

This is just my first-pass line of thinking. I will sleep on it tonight and digest it some more tomorrow. If you have any suggestions, I am all ears.

Thanx for the help so far!

======================
Sean Shrum
http://www.shrum.net


In reply to Re: Re: Script generated site index db by S_Shrum
in thread Script generated site index db by S_Shrum
