Hi,
I am hoping to add a module to CPAN and would like some feedback, comments and ideas. The functionality of the module is detailed below. I am not really sure what I should call it. So far I have the following options in mind:
- WWW::Spider::URI_Detail
- URI::Detail
- URI::URL::Detail
The module is intended to be used as part of a web crawler although I have found myself using parts of it elsewhere.
The basic functionality of the proposed module will include:
- Given the HTML of a page:
  - Find all anchor elements, broken into "this domain links" and "other domain links".
  - Find the title, description and other such metadata.
  - Split up an anchor tag into the URL, the alt text and the anchor text.
- Given a potentially relative URL and the current URL, return the absolute URL.
- Given a potentially redirecting URL, return the final destination URL.
- Break up a URL into protocol, domain and URI.
- "Clean" a URL so it can be used as a string (in, say, regular expressions or MySQL insert statements).
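To make the first group of items concrete, here is a rough sketch of the anchor handling built on HTML::TreeBuilder and URI. The helper name split_links and the keys url, title and text are made up for illustration, and I am assuming "same domain" simply means "same host as the current URL". I read the title attribute here because anchors do not normally carry alt text; that is my assumption, not a settled design.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use URI;

# Hypothetical helper: returns two array refs of link records, one for
# links on the same host as $base_url and one for everything else.
sub split_links {
    my ($html, $base_url) = @_;
    my $base = URI->new($base_url);
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my (@same, @other);

    for my $anchor ($tree->look_down(_tag => 'a')) {
        my $href = $anchor->attr('href');
        next unless defined $href;

        # Resolve relative links against the page they were found on.
        my $abs  = URI->new_abs($href, $base);
        my %link = (
            url   => $abs->as_string,
            title => $anchor->attr('title'),
            text  => $anchor->as_text,
        );

        # Schemes such as mailto: have no host method; treat them as "other".
        my $host = $abs->can('host') ? $abs->host : undef;
        if (defined $host && lc $host eq lc $base->host) {
            push @same, \%link;
        }
        else {
            push @other, \%link;
        }
    }

    $tree->delete;    # free the parse tree
    return (\@same, \@other);
}
```

Called as my ($same, $other) = split_links($html, $url), it expects $url to include the scheme (http://...), otherwise URI cannot tell what the host is.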
I intend to make these functions available both independently and together in an object-oriented structure.
The OO part would look something like this:
my $b = foo::bar->new({
    CURRENT_URL             => 'www.site_i_am_crawling.com/page_i_am_crawling.html', ## new() will croak if this is not provided.
    FIND_CONTAINED_URLS     => 1,              ## Default 1
    BREAK_CONTAINED_URLS    => 1,              ## Default 1
    ABSOLUTE_CONTAINED_URLS => 1,              ## Default 1
    CLEAN_URLS              => 1,              ## Default 1
    CURRENT_URL_HTML        => "long string here", ## Optional, will be extracted if this is not provided.
    'USER-AGENT'            => '',             ## Quoted, because a bare USER-AGENT would parse as a subtraction.
    TIMEOUT                 => 5,
    DEBUG                   => 0,
});
$b->get_url_info(
    ## Can reset object parameters here.
    ## All processing will be performed only when this function is called.
);
my @array_of_urls = $b->get_contained_urls();
## ALSO for NON-OO
@array_of_urls = get_contained_urls( URL => '', HTML => '' );
...
my $all_results = $b->get_all_results();
The following is a list of existing CPAN modules that are similar to the one proposed here.
- WWW::Spider - Far too advanced to be used in this context.
Similar to "Find absolute"
- HTML::ResolveLink
- URI
- URI::URL
- URI::ImpliedBase
- URI::SmartURI
- URI::WithBase
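For comparison, this is roughly how the relative-to-absolute step looks with plain URI, which is what I would most likely wrap. The function name absolute_url is only a placeholder and the example.com URLs are invented.

```perl
use strict;
use warnings;
use URI;

# Hypothetical wrapper: resolve a possibly relative URL against the
# URL of the page it was found on.
sub absolute_url {
    my ($maybe_relative, $current_url) = @_;
    return URI->new_abs($maybe_relative, $current_url)->as_string;
}

my $current = 'http://www.example.com/dir/page.html';
for my $link ('other.html', '../up.html', '/root.html', 'http://elsewhere.example/') {
    print "$link => ", absolute_url($link, $current), "\n";
}
```

A URL that is already absolute comes back unchanged, so the caller does not need to check first.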
Get the HTML (and find elements)
- URI::Title::HTML - No POD, gets titles only.
- HTML::HeadParser - Parses only the HEAD.
- HTML::TreeBuilder - Overkill?
Clean string (for MySQL and regex)
- CGI::Untaint - Indirect use.
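For the regex half of "cleaning", the built-in quotemeta may already be enough. A small sketch with an invented URL and text follows. For the MySQL half, I suspect DBI placeholders (the ? markers) would remove the need for cleaning altogether, but that is an assumption about how the module would be used.

```perl
use strict;
use warnings;

my $url  = 'http://www.example.com/page.html?id=1';
my $text = 'see http://www.example.com/page.html?id=1 for details';

# quotemeta (or \Q...\E inside the pattern) escapes ., ?, / and similar
# characters so the URL is matched literally rather than as a pattern.
my $pattern = quotemeta $url;

print "matched\n" if $text =~ /$pattern/;
```

Without the escaping, the ? would make the preceding "l" optional and each . would match any character, so near-miss URLs would match too.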
Break contained URLs
- URI - There are several ways to achieve this, including a simple regex. This functionality is included here for completeness.
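As a sketch of the URI route for breaking up a URL into protocol, domain and URI: the helper name break_url and the three hash keys are placeholders I picked for this post, and the URL must include its scheme, since URI cannot work out the host otherwise.

```perl
use strict;
use warnings;
use URI;

# Hypothetical helper: split a URL into protocol, domain and the rest.
sub break_url {
    my ($url) = @_;
    my $uri = URI->new($url);
    return (
        protocol => $uri->scheme,
        domain   => $uri->host,
        uri      => $uri->path_query,    # path plus query string
    );
}

my %parts = break_url('http://www.example.com/dir/page.html?id=1');
print "$_: $parts{$_}\n" for qw(protocol domain uri);
```

A URL like the 'www.site_i_am_crawling.com/...' value in the constructor example has no scheme, so the module would have to prepend one (or croak) before handing it to URI.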