Meta-tag following?!?!? Nah! Nothing that complex.
My only concern at present is getting the links out of the document so they can be followed. Basically, anything that can be seen publicly in the document is what I want to index. This also means that pages on my site with no links pointing to them will never be reached, so they will stay out of the index that the general public sees.
Some "off-the-top-of-my-head" planning
=====================================
What I could use (and I think I saw one in CPAN somewhere) is a module that extracts all the links from the page at a specified URL. I could then check, in a loop, that each link begins with, contains, or ends with a user-defined string (e.g. "http://www.shrum.net") to keep traversals under control.
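A rough sketch of that step, written in Python rather than Perl purely to pin down the logic. It only uses the standard library; the names `LinkCollector` and `extract_site_links` are made up for illustration, and I am assuming a simple "begins with" check is enough for the filter.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Make relative links absolute and drop any "#fragment" part.
                absolute, _fragment = urldefrag(urljoin(self.base_url, value))
                self.links.append(absolute)


def extract_site_links(html, base_url, allowed_prefix):
    """Return the links in `html` that begin with `allowed_prefix`."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    return [link for link in parser.links if link.startswith(allowed_prefix)]


sample = '<a href="/about.html">About</a> <a href="http://example.com/">Elsewhere</a>'
print(extract_site_links(sample, "http://www.shrum.net/index.html", "http://www.shrum.net"))
```

One thing to watch: a bare prefix like "http://www.shrum.net" would also let through a host such as "http://www.shrum.net.example.com", so a trailing slash on the prefix (or a comparison on the parsed host name) is probably safer.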
I could then check each link against the hash where I store links, so that duplicate entries aren't created, and add only the new links to the hash.
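The duplicate check is nothing more than a key lookup. A minimal sketch in Python (a dict standing in for the Perl hash); `add_new_links` is a hypothetical helper name, and I am assuming the value slot starts out empty until the page has been fetched.

```python
def add_new_links(seen, links):
    """Add links not yet in `seen`; return the ones that were newly added."""
    added = []
    for link in links:
        if link not in seen:
            seen[link] = None  # placeholder until the page content is stored
            added.append(link)
    return added


seen = {}
add_new_links(seen, ["http://www.shrum.net/", "http://www.shrum.net/a.html"])
# The second call only adds b.html, because a.html is already a key.
add_new_links(seen, ["http://www.shrum.net/a.html", "http://www.shrum.net/b.html"])
```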
At the same time, the contents of the page will be stripped of their HTML (I think I saw a module at CPAN that does this too) and the raw document text will be stored in the hash under its corresponding URI.
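Here is roughly what I mean by stripping the HTML, again sketched in Python with only the standard library. `TextExtractor` and `strip_html` are illustrative names, and skipping the contents of script and style blocks is my own assumption about what "seen publicly" should exclude.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Keep the visible text of a page and ignore the markup."""

    SKIPPED = {"script", "style"}  # assumption: their contents are not visible text

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def strip_html(html):
    """Return the text of `html` with tags removed and whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Joining with a space keeps words in adjacent tags apart; it can also
    # split a word that is only partly wrapped in a tag, which is fine for a rough index.
    return " ".join(" ".join(parser.chunks).split())


pages = {}
pages["http://www.shrum.net/"] = strip_html("<h1>Hi</h1><p>Some <b>bold</b> text.</p>")
```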
I guess I would need an INDEXED flag for each URI to indicate that the URI has been visited and its page content copied. Checking the flags would be the last step: a final pass would visit any pages that had links to them but were skipped during the first traversal.
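To convince myself the flag idea works, here is a small Python sketch of the traversal. Everything in it is hypothetical: `crawl`, the `fetch` callback (assumed to return the page text plus its links), and the in-memory `FAKE_SITE` that stands in for real HTTP requests.

```python
def crawl(start_url, fetch, allowed_prefix):
    """Visit every reachable on-site URL once, tracking an 'indexed' flag per URI."""
    index = {start_url: {"indexed": False, "text": None}}
    while True:
        # Anything still unflagged was discovered but not yet visited.
        pending = [url for url, entry in index.items() if not entry["indexed"]]
        if not pending:
            break
        for url in pending:
            text, links = fetch(url)
            index[url]["text"] = text
            index[url]["indexed"] = True
            for link in links:
                if link.startswith(allowed_prefix) and link not in index:
                    index[link] = {"indexed": False, "text": None}
    return index


# Stand-in for the real site: URL -> (page text, links found on the page).
FAKE_SITE = {
    "http://www.shrum.net/": ("Home", ["http://www.shrum.net/a.html", "http://example.com/"]),
    "http://www.shrum.net/a.html": ("Page A", ["http://www.shrum.net/", "http://www.shrum.net/b.html"]),
    "http://www.shrum.net/b.html": ("Page B", []),
    "http://www.shrum.net/hidden.html": ("Unlinked", []),
}


def fake_fetch(url):
    return FAKE_SITE.get(url, ("", []))


index = crawl("http://www.shrum.net/", fake_fetch, "http://www.shrum.net/")
```

In this toy run hidden.html never shows up in `index`, since nothing links to it, and the off-site example.com link is dropped by the prefix check.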
Once all the INDEXED flags are set, the data in the hash would be written to a flat file that could then be searched with whatever db engine you wanted.
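For the flat file I have not settled on a layout. One possibility, sketched in Python, is a tab-separated file with one record per URI; the `write_index` and `read_index` names and the two-column layout are assumptions for illustration only.

```python
import csv


def write_index(index, path):
    """Write {url: text} to a tab-separated flat file, one record per URI."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle, delimiter="\t")
        for url in sorted(index):
            writer.writerow([url, index[url]])


def read_index(path):
    """Load the flat file back into a {url: text} dict."""
    with open(path, newline="", encoding="utf-8") as handle:
        return {url: text for url, text in csv.reader(handle, delimiter="\t")}
```

Going through the csv module means any tabs or newlines inside the page text get quoted, so a record cannot spill into the next one.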
==========================
This is just my first-pass line of thinking. I will sleep on it tonight and digest it some more tomorrow. If you have any suggestions, I am all ears.
Thanx for the help so far!
======================
Sean Shrum
http://www.shrum.net
In reply to Re: Re: Script generated site index db
by S_Shrum
in thread Script generated site index db
by S_Shrum