Hi,

I am hoping to add a module to CPAN and would like some feedback, comments, and ideas. The functionality of the module is detailed below. I am not really sure what I should call it. So far I have the following options in mind:

The module is intended to be used as part of a web crawler, although I have found myself using parts of it elsewhere.


The basic functionality of the proposed module will include:


I intend to make these functions available both independently and together in an object-oriented structure.

The OO part would look something like this:

    my $b = foo::bar->new({
        CURRENT_URL             => 'www.site_i_am_crawling.com/page_i_am_crawling.html', ## new() will croak if this is not provided.
        FIND_CONTAINED_URLS     => 1,   ## Default 1
        BREAK_CONTAINED_URLS    => 1,   ## Default 1
        ABSOLUTE_CONTAINED_URLS => 1,   ## Default 1
        CLEAN_URLS              => 1,   ## Default 1
        CURRENT_URL_HTML        => "long string here", ## Optional, will be extracted if this is not provided.
        'USER-AGENT'            => '',  ## Quoted: a bare USER-AGENT would parse as a subtraction.
        TIMEOUT                 => 5,
        DEBUG                   => 0,
    });

    $b->get_url_info(
        ## Can reset object parameters here.
        ## All processing will be performed only when this function is called.
    );

    my @array_of_urls = $b->get_contained_urls();

    ## ALSO for NON-OO
    my @array_of_urls = get_contained_urls( URL => '', HTML => '' );

    ...

    my $all_results = $b->get_all_results();
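As a rough illustration of what the BREAK_CONTAINED_URLS step could look like, here is a minimal sketch in core Perl. It splits a URL into its generic components with the regular expression given in RFC 3986, Appendix B. The function name break_url and the hash keys it returns are hypothetical and are not part of the proposed interface.

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption, not the module's actual API):
# split a URL into scheme, authority, path, query, and fragment using
# the regex from RFC 3986, Appendix B. Components that are absent
# come back as undef.
sub break_url {
    my ($url) = @_;
    my ( $scheme, $authority, $path, $query, $fragment ) =
      $url =~ m{^(?:([^:/?#]+):)?(?://([^/?#]*))?([^?#]*)(?:\?([^#]*))?(?:#(.*))?}s;
    return {
        scheme    => $scheme,
        authority => $authority,
        path      => $path,
        query     => $query,
        fragment  => $fragment,
    };
}

# Usage: print each component, marking missing ones.
my $parts = break_url('http://www.example.com/dir/page.html?id=7#top');
printf "%-9s %s\n", $_, $parts->{$_} // '(none)'
  for qw(scheme authority path query fragment);
```

This only splits the string; it does not validate, clean, or absolutize the URL, so a relative link such as /a/b.html simply yields an undefined scheme and authority.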

The following is a list of existing CPAN modules that are similar to the one proposed here.

Similar to "Find absolute"
Get the html ( and find elements )
Clean string (for MySQL and regex)
Break contained URLs

In reply to RFC: URI::URL::Detail by tmharish
