Hi,

I am hoping to add a module to CPAN and would like some feedback, comments, and ideas. The functionality of the module is detailed below. I am not sure what to call it; so far I have the following options in mind:

WWW::Spider::URI_Detail
URI::Detail
URI::URL::Detail

The module is intended to be used as part of a web crawler although I have found myself using parts of it elsewhere.

The basic functionality of the proposed module will include:

Given the HTML of a page:
- Find all anchor elements, split into "this domain" links and "other domain" links.
- Find the title, description, and other such metadata.
Split an anchor tag into the URL, the alt text, and the anchor text.
Given a potentially relative URL and the current URL, return the absolute URL.
Given a potentially redirecting URL, return the final destination URL.
Break up a URL into protocol, domain, and URI.
"Clean" a URL so it can be used as a string (in, say, regular expressions or MySQL insert statements).

I intend to make these functions available both independently and together in an Object Oriented structure.
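
To make the "this domain" / "other domain" split from the list above concrete, here is a rough sketch of one standalone function. It leans on HTML::LinkExtor and URI rather than anything new; the function name split_links_by_domain and the example.com URLs are made up for illustration, and comparing hosts exactly (so that "example.com" and "www.example.com" count as different domains) is an assumption, not a decision about the module.

```perl
use strict;
use warnings;
use HTML::LinkExtor;
use URI;

# Hypothetical helper: returns two array refs of absolute URLs taken from
# <a href="..."> elements, split by whether the host matches the page's host.
sub split_links_by_domain {
    my ($html, $current_url) = @_;
    my $base_host = lc( URI->new($current_url)->host );

    # When a base URL is given, HTML::LinkExtor returns absolute URI objects.
    my $parser = HTML::LinkExtor->new(undef, $current_url);
    $parser->parse($html);
    $parser->eof;

    my (@same, @other);
    for my $link ($parser->links) {
        my ($tag, %attr) = @$link;
        next unless $tag eq 'a' && defined $attr{href};
        my $uri = $attr{href};
        next unless $uri->can('host');    # skips schemes such as mailto:
        my $host = $uri->host;
        next unless defined $host;
        if (lc($host) eq $base_host) { push @same,  "$uri" }
        else                         { push @other, "$uri" }
    }
    return (\@same, \@other);
}

my $html = <<'HTML';
<html><body>
<a href="/about.html">About</a>
<a href="http://www.example.org/">Elsewhere</a>
<a href="mailto:someone@example.com">Mail</a>
</body></html>
HTML

my ($same, $other) =
    split_links_by_domain($html, 'http://www.example.com/index.html');
print "same:  @$same\n";
print "other: @$other\n";
```

This does not capture the anchor text, which would need a fuller parser such as HTML::TreeBuilder or HTML::Parser event handlers.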

The OO part would look something like this:

my $detail = foo::bar->new({

   CURRENT_URL              => 'www.site_i_am_crawling.com/page_i_am_crawling.html', ## new() will croak if this is not provided.
   FIND_CONTAINED_URLS      => 1 , ## Default 1
   BREAK_CONTAINED_URLS     => 1 , ## Default 1
   ABSOLUTE_CONTAINED_URLS  => 1 , ## Default 1
   CLEAN_URLS               => 1 , ## Default 1

   CURRENT_URL_HTML         => "long string here", ## Optional, will be extracted if this is not provided.

   'USER-AGENT'             => '' , ## Quoted: a bare USER-AGENT would parse as a subtraction.
   TIMEOUT                  => 5  ,

   DEBUG                    => 0

});

$detail->get_url_info(

    ## Can reset object parameters here.
    ## All processing will be performed only when this function is called.

);


my @array_of_urls = $detail->get_contained_urls();



## ALSO for NON-OO

my @contained_urls = get_contained_urls( URL => '', HTML => '' );
...



my $all_results = $detail->get_all_results();
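
For what it is worth, the constructor behaviour described in the comments (croak when CURRENT_URL is missing, everything else defaulting) might look roughly like the sketch below. The package name foo::bar is just the placeholder from the post, the defaults for the user agent and DEBUG are guesses based on the values shown, and USER_AGENT is spelled with an underscore here as an assumption so it works as an unquoted hash key.

```perl
package foo::bar;    # placeholder name taken from the post
use strict;
use warnings;
use Carp qw(croak);

# Defaults as listed in the post (assumed where the post does not say).
my %DEFAULTS = (
    FIND_CONTAINED_URLS     => 1,
    BREAK_CONTAINED_URLS    => 1,
    ABSOLUTE_CONTAINED_URLS => 1,
    CLEAN_URLS              => 1,
    USER_AGENT              => '',
    TIMEOUT                 => 5,
    DEBUG                   => 0,
);

sub new {
    my ($class, $args) = @_;
    $args ||= {};
    croak 'CURRENT_URL is required' unless defined $args->{CURRENT_URL};

    # Caller-supplied values override the defaults.
    my $self = { %DEFAULTS, %$args };
    return bless $self, $class;
}

package main;

my $detail = foo::bar->new({
    CURRENT_URL => 'http://www.example.com/page.html',
    TIMEOUT     => 10,
});
print "timeout: $detail->{TIMEOUT}, clean: $detail->{CLEAN_URLS}\n";

# Constructing without CURRENT_URL should die inside the eval.
my $ok = eval { foo::bar->new({}); 1 };
print "missing CURRENT_URL croaks: ", ($ok ? 'no' : 'yes'), "\n";
```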

The following is a list of existing CPAN modules that are similar to the one proposed here.

WWW::Spider - Far too advanced to be used in this context.

Similar to "Find absolute"

HTML::ResolveLink
URI
URI::URL
URI::ImpliedBase
URI::SmartURI
URI::WithBase
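
As a point of comparison for the "find absolute" feature, the plain URI module already resolves relative references via new_abs. A short sketch, with example.com URLs invented for illustration:

```perl
use strict;
use warnings;
use URI;

my $base = 'http://www.example.com/dir/page.html';

# Resolve a few kinds of reference against the current page's URL.
my %resolved;
for my $href ('other.html', '../up.html', '/root.html', 'http://www.example.org/x') {
    my $abs = URI->new_abs($href, $base);
    $resolved{$href} = "$abs";    # stringify the URI object
    print "$href => $abs\n";
}
```

An already absolute URL passes through unchanged, so the same call can be applied to every href found on a page.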

Get the HTML (and find elements)

URI::Title::HTML - No POD, gets titles only.
HTML::HeadParser - Parses only the HEAD.
HTML::TreeBuilder - Overkill?
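
For the title and description part, HTML::HeadParser may already be enough. A small sketch of how it exposes the HEAD contents as header-style fields; the sample HTML is invented:

```perl
use strict;
use warnings;
use HTML::HeadParser;

my $html = <<'HTML';
<html><head>
<title>Example page</title>
<meta name="description" content="A short description">
<meta name="keywords" content="perl, uri">
</head><body><p>Body text</p></body></html>
HTML

my $parser = HTML::HeadParser->new;
$parser->parse($html);

# <title> is exposed as the "Title" header; <meta name="..."> elements
# are exposed as "X-Meta-<Name>" headers.
my $title       = $parser->header('Title');
my $description = $parser->header('X-Meta-Description');
my $keywords    = $parser->header('X-Meta-Keywords');

print "title:       $title\n";
print "description: $description\n";
print "keywords:    $keywords\n";
```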

Clean string (for MySQL and regex)

CGI::Untaint - Indirect use.
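
For the regex half of "cleaning", the built-in quotemeta (or \Q...\E inside a pattern) may cover it. The sketch below shows why an unescaped URL is a problem as a pattern; the URL is invented. For the MySQL half, bind parameters are the usual alternative to escaping by hand; that part is shown only as a comment since it needs DBI and a live database handle, and the table name is hypothetical.

```perl
use strict;
use warnings;

my $url = 'http://www.example.com/search?q=a+b&page=1';

# quotemeta escapes regex metacharacters, so '.', '?' and '+' in the
# URL are matched literally.
my $escaped = quotemeta $url;
print "escaped: $escaped\n";

my $text  = "see http://www.example.com/search?q=a+b&page=1 for details";
my $found = $text =~ /\Q$url\E/ ? 1 : 0;

# Without escaping, "h?" means an optional "h" and "a+" means one or
# more "a", so the URL does not even match itself.
my $naive = $url =~ /^$url$/ ? 1 : 0;

print "escaped pattern matches: $found\n";
print "raw pattern matches:     $naive\n";

# For SQL, placeholders avoid manual escaping (assumes a DBI handle $dbh
# and a hypothetical table "urls"):
#   $dbh->do('INSERT INTO urls (url) VALUES (?)', undef, $url);
```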

Break contained URLs

URI - There are several ways to achieve this, including a simple regex. This functionality is included here for completeness.
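
To illustrate both routes mentioned for breaking up a URL, here is a sketch using the URI accessors next to the generic pattern from RFC 3986, Appendix B (rewritten with non-capturing groups). The URL is invented.

```perl
use strict;
use warnings;
use URI;

my $uri = URI->new('http://www.example.com:8080/dir/page.html?id=7#top');

# Route 1: accessor methods on the URI object.
my %parts = (
    scheme   => $uri->scheme,      # "http"
    host     => $uri->host,        # "www.example.com"
    port     => $uri->port,        # 8080
    path     => $uri->path,        # "/dir/page.html"
    query    => $uri->query,       # "id=7"
    fragment => $uri->fragment,    # "top"
);
print "$_: $parts{$_}\n" for sort keys %parts;

# Route 2: a single regex, adapted from RFC 3986 Appendix B.
# Note that the authority still contains the port here.
my ($scheme, $authority, $path, $query, $fragment) =
    ("$uri" =~ m{^(?:([^:/?\#]+):)?(?://([^/?\#]*))?([^?\#]*)(?:\?([^\#]*))?(?:\#(.*))?});
print "regex: $scheme | $authority | $path | $query | $fragment\n";
```

The regex only splits the string; it does no validation or normalisation, which is one reason the URI accessors may be the safer default.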

In reply to RFC: URI::URL::Detail by tmharish
