I am at a junction. I read the node about Local::SiteRobot and also looked through CPAN at the various traversal modules, but I am not seeing anything that does what I am about to propose. I am looking for input on whether this is a good idea, or whether something already exists that does this for free. REMEMBER: Google's SiteSearch costs money, and they "reserve the right to place ads in the results".
Background
==========
Currently, most of my content is in a single db (docs.dat), but I am beginning to deplore this. There are a number of problems with that approach, but they are not important right now, so I digress.
I am looking at leaving my content in the original HTML pages instead of my current "cut-n-paste-into-my-db" method. This would allow me to use the content db as a document compiler as well as a content repository. The problem with this idea is that the HTML file content will not be in the content db, and therefore any search against the db will not return results for those pages.
What I am looking to do
=======================
I want to traverse the links within a site (http://www.mysite.com) and create a raw text db of the content. As the robot traverses the site, the information it collects will be stored into a flat-file db in the following (preliminary) fashion:
URI|Title|Content, where:
The URI will be used to make sure the robot isn't backtracking, by checking whether the URI already exists in the file/hash.
The Title will be used in the search result template and will later be wrapped in a link to the URI.
The Content will be a raw text dump (sans HTML) of the content of the page located at the URI.
As long as every page is reachable by a link from some other page on the site, no document will be left out, and the file will be (in essence) a snapshot of the entire website (content-wise, at least).
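Here is a rough, minimal sketch of the traversal loop I have in mind. The module choices (LWP::UserAgent, HTML::LinkExtor, HTML::TreeBuilder, HTML::FormatText), the start URL, and the file name sitesearch.dat are just my assumptions, and the pipe-delimited one-record-per-line layout follows the URI|Title|Content idea above; a real robot would probably want LWP::RobotUA so robots.txt is honored.

#!/usr/bin/perl
use strict;
use LWP::UserAgent;
use HTML::LinkExtor;
use HTML::TreeBuilder;
use HTML::FormatText;

my $start = 'http://www.mysite.com/';          # placeholder start page
my $ua    = LWP::UserAgent->new(agent => 'SiteIndexer/0.1');

my %seen;                                      # URIs already visited (no backtracking)
my @queue = ($start);

open my $db, '>', 'sitesearch.dat' or die $!;  # placeholder flat-file db name

while (my $uri = shift @queue) {
    next if $seen{$uri}++;
    my $resp = $ua->get($uri);
    next unless $resp->is_success and $resp->content_type eq 'text/html';

    my $html = $resp->content;

    # Pull the Title and a raw text dump (sans HTML) for the Content field
    my $tree  = HTML::TreeBuilder->new_from_content($html);
    my $title = $tree->look_down(_tag => 'title');
    $title    = $title ? $title->as_trimmed_text : $uri;
    my $text  = HTML::FormatText->new(leftmargin => 0, rightmargin => 9999)->format($tree);
    $tree->delete;

    $text =~ s/[\r\n|]+/ /g;                   # keep each record on one pipe-delimited line
    print $db join('|', $uri, $title, $text), "\n";

    # Queue every on-site link for traversal
    HTML::LinkExtor->new(sub {
        my ($tag, %attr) = @_;
        return unless $tag eq 'a' and $attr{href};
        my $link = "$attr{href}";              # absolutized by the base passed below
        $link =~ s/#.*$//;                     # drop #anchors
        push @queue, $link if $link =~ /^\Q$start\E/;
    }, $resp->base)->parse($html);
}
close $db;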
The resulting file could then be used with whatever db tool you like (in my case, DBI and AnyData) to do site content searches.
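For example, a search could look roughly like the following with DBI and DBD::AnyData. The table name, column names, file name, and search term are assumptions on my part, and it presumes the first line of the flat file carries the column headings URI|Title|Content so AnyData can name the columns.

use strict;
use DBI;

# Register the pipe-delimited flat file as a table named "site"
my $dbh = DBI->connect('dbi:AnyData(RaiseError=>1):');
$dbh->func('site', 'Pipe', 'sitesearch.dat', 'ad_catalog');

my $term = 'widgets';                  # hypothetical search term
my $sth  = $dbh->prepare('SELECT URI, Title FROM site WHERE Content LIKE ?');
$sth->execute("%$term%");

while (my ($uri, $title) = $sth->fetchrow_array) {
    # Feed these into the search result template: Title wrapped in a link to the URI
    print qq{<a href="$uri">$title</a>\n};
}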
The script will be set up with a parameter that allows for db (re)creation, so if you add new pages or remove pages (links), the content db can be brought up to date. Since the db's sole purpose is searching, a rebuild does not impact the content itself (only searching will be down until the traversal is complete).
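To keep that search downtime to a minimum, one option (just a sketch, assuming a --rebuild switch and the same placeholder file names; build_index() is a hypothetical wrapper around the traversal loop above) is to write the new db to a temporary file and only rename it over the old one once the traversal finishes, so searches keep hitting the old db until the very last moment.

use strict;
use Getopt::Long;

GetOptions('rebuild' => \my $rebuild);

if ($rebuild) {
    my $tmp = 'sitesearch.tmp';
    build_index($tmp);                         # traversal loop writes to the temp file
    rename $tmp, 'sitesearch.dat'
        or die "Could not replace db: $!";     # old db stays searchable until this point
}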
Does this sound like something that already exists, or should I start writing my own? Does this seem like a good idea?
As always, pro/con input appreciated.
======================
Sean Shrum
http://www.shrum.net