Hi,

I am planning to add a module to CPAN and was hoping to get some feedback/comments/ideas. The functionality of this module is detailed below. I am not really sure what I should call it; so far I have the following options in mind:

The module is intended to be used as part of a web crawler although I have found myself using parts of it elsewhere.


The basic functionality of the proposed module will include:


I intend to make these functions available both independently and together in an Object Oriented structure.

The OO part would look something like this:

my $b = new foo::bar {
    CURRENT_URL             => 'www.site_i_am_crawling.com/page_i_am_crawling.html',  ## new will croak if this is not provided.
    FIND_CONTAINED_URLS     => 1,                   ## Default 1
    BREAK_CONTAINED_URLS    => 1,                   ## Default 1
    ABSOLUTE_CONTAINED_URLS => 1,                   ## Default 1
    CLEAN_URLS              => 1,                   ## Default 1
    CURRENT_URL_HTML        => "long string here",  ## Optional, will be extracted if this is not provided.
    USER_AGENT              => '',
    TIMEOUT                 => 5,
    DEBUG                   => 0
};

$b->get_url_info(
    ## Can reset object parameters here.
    ## All processing will be performed only when this function is called.
);

my @array_of_urls = $b->get_contained_urls();

## ALSO for NON-OO
my @array_of_urls = get_contained_urls( URL => '', HTML => '' );

...

my $all_results = $b->get_all_results();

The following is a list of existing CPAN modules that are similar to the one proposed here.

Similar to "Find absolute"
Get the html ( and find elements )
Clean string ( for MySql, and RegEx )
Break contained URLs

Re: RFC: URI::URL::Detail
by moritz (Cardinal) on Aug 07, 2009 at 13:26 UTC

    I wanted to propose HTML::LinkExtract as a name, but then I found HTML::LinkExtor and HTML::LinkExtractor - maybe you can build it on top of one of those modules, or maybe extend them, or maybe they already do what you want?
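    For reference, a minimal sketch of link extraction with HTML::LinkExtor (the sample HTML fragment is made up for illustration):

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# A made-up HTML fragment for illustration.
my $html = '<a href="/about.html">About</a> <img src="logo.png">';

# With no callback, HTML::LinkExtor accumulates the links it finds.
my $extor = HTML::LinkExtor->new;
$extor->parse($html);
$extor->eof;

# links() returns one arrayref per link: [ $tag, $attr => $value, ... ]
for my $link ( $extor->links ) {
    my ( $tag, %attrs ) = @$link;
    print "$tag: ", join( ', ', values %attrs ), "\n";
}
```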

    "Clean" a URL so it can be used as a string ( in say Regular expressions or MySql insert statements ).

    I don't think you need that. For regular expressions you just use /\Q$url\E/ or quotemeta, and for SQL inserts you should use placeholders anyway, no need to escape or clean anything.
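    Both points can be shown with core Perl alone (the DBI line is a sketch only and is not run here, since it needs a connected handle):

```perl
use strict;
use warnings;

# A URL is full of regex metacharacters ('.', '?', '/', ...).
my $url  = 'http://example.com/page?id=1';
my $text = 'see http://example.com/page?id=1 for details';

# Unescaped, the '?' would act as a quantifier; \Q...\E disables
# metacharacters so the URL matches literally.
print "matched\n" if $text =~ /\Q$url\E/;

# quotemeta() is the function form of \Q...\E:
my $escaped = quotemeta($url);   # every non-word character is backslashed

# For SQL, placeholders avoid the problem entirely (sketch only; needs a
# connected DBI handle $dbh):
# $dbh->do('INSERT INTO urls (url) VALUES (?)', undef, $url);
```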

    HTML::TreeBuilder - Overkill?

    It is never overkill to use a proper HTML parser for such a task. I don't know if that's the best for this task, but it should certainly work.
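    For instance, pulling anchors out with HTML::TreeBuilder might look roughly like this (a sketch; the page content is made up):

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up page content for illustration.
my $tree = HTML::TreeBuilder->new_from_content(
    '<html><body><a href="/a.html">A</a> <a href="/b.html">B</a></body></html>'
);

# look_down() walks the tree and returns matching elements.
for my $a ( $tree->look_down( _tag => 'a' ) ) {
    printf "%s -> %s\n", $a->as_text, $a->attr('href');
}

$tree->delete;   # HTML::Element trees hold circular refs; free them explicitly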

      Thanks moritz.
      I wanted to propose HTML::LinkExtract as a name, but then I found HTML::LinkExtor and HTML::LinkExtractor - maybe you can build it on top of one of those modules, or maybe extend them, or maybe they already do what you want?

      I believe HTML::LinkExtract does a fair bit of what I am proposing, but I would like to provide for some additional functionality and subtly different access methods.

      I find that these access methods make the writing of a crawler slightly easier.

      Maybe it could be "HTML::LinkExtor::Simple" or "HTML::LinkExtor::Spider". I am not sure if it should be altogether different and be named "HTML::Spider::LinkExtor".

      Also, I am not sure I understand the difference between "extend" and "build on top of". If I am looking at, say, "HTML::LinkExtor::Simple", should I get in touch with the author of "HTML::LinkExtor"?

      With regard to the rest (Clean and TreeBuilder), that makes perfect sense. Thanks for that.

        Also I am not sure I understand the difference between extend and build on top of

        That wasn't very precise of me. What I meant was either to inherit from the classes and add your own methods (extending), or use it as a backend but using your own API instead (building on top).
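        In code, the two approaches might look like this (a sketch with a stand-in base class, since the real backend is undecided; My::Backend and its parse() method are hypothetical):

```perl
use strict;
use warnings;

# A stand-in for whichever existing module ends up as the backend.
package My::Backend;
sub new   { return bless {}, shift }
sub parse { my ( $self, $html ) = @_; return "parsed: $html" }

# Extending: inherit from the backend and add methods in the subclass.
package My::Extended;
use parent -norequire, 'My::Backend';
sub get_contained_urls { my $self = shift; return $self->parse(@_) }

# Building on top: hold the backend privately and expose your own API.
package My::Wrapper;
sub new { return bless { backend => My::Backend->new }, shift }
sub get_contained_urls {
    my ( $self, $html ) = @_;
    return $self->{backend}->parse($html);
}

package main;
print My::Extended->new->get_contained_urls('<a>x</a>'), "\n";
print My::Wrapper->new->get_contained_urls('<a>x</a>'), "\n";
```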

        If I am looking at say "HTML::LinkExtor::Simple" should I get in touch with the author of "HTML::LinkExtor"?

        Yes.

Re: RFC: URI::URL::Detail
by Anonymous Monk on Aug 08, 2009 at 10:05 UTC
      Thanks Anonymous Monk
      I will work with WWW::Mechanize when coding.
Re: RFC: URI::URL::Detail
by tmharish (Friar) on Aug 09, 2009 at 08:27 UTC
    Hey Guys

    Thanks for the replies. I went through all the modules that were suggested. I will build on top of a couple of them, and I have also gotten in touch with the authors of some of these modules.

    I believe "HTML::ElementExtractor" is a good name for the module. I understand that that would make it similar to TreeBuilder but I think the difference in functionality will make up for that.

    Hoping for further comments!!

      If it gets named "::ElementExtractor" I would expect it to be able to extract the info pertaining to any element, not just links. Will there be a way to extract for example the list of <IMG> tags with the attributes?

      Jenda
      Enoch was right!
      Enjoy the last years of Rome.

        Hey Jenda,

        Actually yes - here is a list of features the module will have in V 0.01:

        1. Given the HTML of a page
          1. Find all anchor elements - broken into "this domain links" and "other domain links".
          2. Find images on a page - broken as above.
          3. Find the Title, description and other such meta data.
          4. Find meta keywords and description of the page.
          5. Extract lists (ul and ol) from the HTML of a page
          6. Find RSS Feeds of a page, if any.
          7. anything else I / You guys can think of ...
        2. Split up an anchor tag into : The URL, the alt text and the anchor text.
        3. Given possible anchor/alt text, find the related link. [Given "Home", the link <a href=""> home page </a> will be extracted.]
        4. Given a potentially relative URL and the current URL, returns the absolute URL.
        5. Given a potential redirecting URL, returns the final destination URL.
        6. Breaks up a URL into protocol, domain, and URI.

        I am still looking for additional features I can add, so please do suggest anything else you can think of.