Hi,

I am planning to add a module to CPAN and was hoping to get some feedback/comments/ideas. The functionality of this module is detailed below. I am not really sure what I should call it; so far I have the following options in mind:

The module is intended to be used as part of a web crawler although I have found myself using parts of it elsewhere.


The basic functionality of the proposed module will include:


I intend to make these functions available both independently and together in an Object Oriented structure.

The OO part would look something like this:

my $b = new foo::bar {
    CURRENT_URL             => 'www.site_i_am_crawling.com/page_i_am_crawling.html',  ## new will croak if this is not provided.
    FIND_CONTAINED_URLS     => 1,                   ## Default 1
    BREAK_CONTAINED_URLS    => 1,                   ## Default 1
    ABSOLUTE_CONTAINED_URLS => 1,                   ## Default 1
    CLEAN_URLS              => 1,                   ## Default 1
    CURRENT_URL_HTML        => "long string here",  ## Optional, will be extracted if this is not provided.
    USER_AGENT              => '',
    TIMEOUT                 => 5,
    DEBUG                   => 0
};

$b->get_url_info(
    ## Can reset object parameters here.
    ## All processing will be performed only when this function is called.
);

my @array_of_urls = $b->get_contained_urls();

## ALSO for NON-OO
my @array_of_urls = get_contained_urls( URL => '', HTML => '' );

...

my $all_results = $b->get_all_results();

The following is a list of existing CPAN modules that are similar to the one proposed here.

Similar to "Find absolute"
Get the html ( and find elements )
Clean string ( for MySql, and RegEx )
Break contained URLs

Re: RFC: URI::URL::Detail
by moritz (Cardinal) on Aug 07, 2009 at 13:26 UTC

    I wanted to propose HTML::LinkExtract as a name, but then I found HTML::LinkExtor and HTML::LinkExtractor - maybe you can build it on top of one of those modules, or maybe extend them, or maybe they already do what you want?
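    For reference, a minimal sketch of link extraction with HTML::LinkExtor (the sample HTML fragment is made up for illustration):

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# A made-up HTML fragment for illustration.
my $html = '<a href="/about.html">About</a> <img src="logo.png">';

# With no callback, HTML::LinkExtor accumulates the links it finds.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

# links() returns one arrayref per link: [ $tag, $attr => $value, ... ]
for my $link ( $extor->links ) {
    my ( $tag, %attrs ) = @$link;
    print "$tag: ", join( ', ', values %attrs ), "\n";
}
```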

    "Clean" a URL so it can be used as a string ( in say Regular expressions or MySql insert statements ).

    I don't think you need that. For regular expressions you just use /\Q$url\E/ or quotemeta, and for SQL inserts you should use placeholders anyway, no need to escape or clean anything.
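    Both points can be shown with core Perl alone (the DBI line is a sketch only and is not run here, since it needs a connected handle):

```perl
use strict;
use warnings;

# A URL is full of regex metacharacters ('.', '?', '/', ...).
my $url  = 'http://example.com/page?id=1';
my $text = 'see http://example.com/page?id=1 for details';

# Unescaped, the '?' would act as a quantifier; \Q...\E disables
# metacharacters so the URL matches literally.
print "matched\n" if $text =~ /\Q$url\E/;

# quotemeta() is the function form of \Q...\E:
my $escaped = quotemeta($url);   # every non-word character is backslashed

# For SQL, placeholders avoid the problem entirely (sketch only; needs a
# connected DBI handle $dbh):
# $dbh->do('INSERT INTO urls (url) VALUES (?)', undef, $url);
```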

    HTML::TreeBuilder - Overkill?

    It is never overkill to use a proper HTML parser for such a task. I don't know if that's the best for this task, but it should certainly work.
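    For instance, pulling anchors out with HTML::TreeBuilder might look roughly like this (a sketch; the page content is made up):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up page content for illustration.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>'
);

# look_down() walks the tree and returns matching elements.
for my $a ( $tree->look_down( _tag => 'a' ) ) {
    printf "%s -> %s\n", $a->as_text, $a->attr('href');
}

$tree->delete;   # HTML::Element trees hold circular refs; free them explicitly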

      Thanks moritz.
      I wanted to propose HTML::LinkExtract as a name, but then I found HTML::LinkExtor and HTML::LinkExtractor - maybe you can build it on top of one of those modules, or maybe extend them, or maybe they already do what you want?

      I believe HTML::LinkExtract does a fair bit of what I am proposing, but I would like to provide for some additional functionality and subtly different access methods.

      I find that these access methods make the writing of a crawler slightly easier.

      Maybe it could be "HTML::LinkExtor::Simple" or "HTML::LinkExtor::Spider". I am not sure if it should be altogether different and be named "HTML::Spider::LinkExtor".

      Also, I am not sure I understand the difference between "extend" and "build on top of". If I am looking at, say, "HTML::LinkExtor::Simple", should I get in touch with the author of "HTML::LinkExtor"?

      With regard to the rest (Clean and TreeBuilder), that makes perfect sense. Thanks for that.

        Also I am not sure I understand the difference between extend and build on top of

        That wasn't very precise of me. What I meant was either to inherit from the classes and add your own methods (extending), or use it as a backend but using your own API instead (building on top).
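        In code, the two approaches might look like this (a sketch with a stand-in base class, since the real backend is undecided; My::Backend and its parse() method are hypothetical):

```perl
use strict;
use warnings;

# A stand-in for whichever existing module ends up as the backend.
package My::Backend;
sub new   { return bless {}, shift }
sub parse { my ( $self, $html ) = @_; return "parsed: $html" }

# Extending: inherit from the backend and add methods in the subclass.
package My::Extended;
use parent -norequire, 'My::Backend';
sub get_contained_urls { my $self = shift; return $self->parse(@_) }

# Building on top: hold the backend privately and expose your own API.
package My::Wrapper;
sub new { return bless { backend => My::Backend->new }, shift }
sub get_contained_urls {
    my ( $self, $html ) = @_;
    return $self->{backend}->parse($html);
}

package main;
print My::Extended->new->get_contained_urls('<a>x</a>'), "\n";
print My::Wrapper->new->get_contained_urls('<a>x</a>'), "\n";
```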

        If I am looking at say "HTML::LinkExtor::Simple" should I get in touch with the author of "HTML::LinkExtor"?

        Yes.

Re: RFC: URI::URL::Detail
by Anonymous Monk on Aug 08, 2009 at 10:05 UTC
      Thanks Anonymous Monk
      I will work with WWW::Mechanize when coding.
Re: RFC: URI::URL::Detail
by tmharish (Friar) on Aug 09, 2009 at 08:27 UTC
    Hey Guys

    Thanks for the replies. I went through all the modules that were suggested. I will build on top of a couple of them, and I have also gotten in touch with the authors of some of these modules.

    I believe "HTML::ElementExtractor" is a good name for the module. I understand that that would make it similar to TreeBuilder but I think the difference in functionality will make up for that.

    Hoping for further comments!!

      If it gets named "::ElementExtractor" I would expect it to be able to extract the info pertaining to any element, not just links. Will there be a way to extract for example the list of <IMG> tags with the attributes?

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        Hey Jenda,

        Actually yes - here is a list of features the module will have in V 0.01:

        1. Given the HTML of a page
          1. Find all anchor elements - broken into "this domain links" and "other domain links".
          2. Find images on a page - broken as above.
          3. Find the Title, description and other such meta data.
          4. Find meta keywords and description of the page.
          5. Extract lists (ul and ol) from the HTML of a page
          6. Find RSS Feeds of a page, if any.
          7. anything else I / You guys can think of ...
        2. Split up an anchor tag into : The URL, the alt text and the anchor text.
        3. Given possible anchor/alt text, find the related link. [Given "Home", the link <a href=""> home page </a> will be extracted.]
        4. Given a potentially relative URL and the current URL, returns the absolute URL.
        5. Given a potential redirecting URL, returns the final destination URL.
        6. Breaks up a URL into protocol, domain, and URI.

        I am still looking for additional features I can add, so please do suggest anything else you can think of.