in reply to Crawling Relative Links from Webpages

Mech provides for this. There should be no need for HTML::LinkExtor or any hackery. :)

use warnings;
no warnings "uninitialized";
use strict;
use WWW::Mechanize;
use MIME::Types;

my $mt   = MIME::Types->new;
my $mech = WWW::Mechanize->new();
$mech->get("http://dspace.mit.edu/handle/1721.1/53720");

for my $link ( $mech->links() ) {
    my $uri = $link->url_abs();
    print $uri, $/
        if $mt->mimeTypeOf($uri->path) eq "application/pdf";
}

# http://dspace.mit.edu/bitstream/handle/1721.1/53720/MIT-CSAIL-TR-2010-018.pdf?sequence=1

(Update: please only use these tools under the Terms of Service of any given site. The better hackers behave, the more likely we are to get open resources facing us.)

(Update #2: I should add that checking for a .pdf extension is a fairly artificial test. A .pdf URL could serve something else, and something like /taco could serve a PDF. You'll have to make actual GET or HEAD requests to find out the declared MIME type of any given resource, and verify it after receiving it.)
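
As a rough sketch of the check described in Update #2, the snippet below asks the server for the declared Content-Type with a HEAD request instead of trusting the file extension. The helper names looks_like_pdf, is_pdf_url and body_is_pdf are made up for illustration, and it uses LWP::UserAgent directly (WWW::Mechanize is built on it). The "%PDF-" test on the body is only a cheap sanity check after a full GET, not real validation.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# Hypothetical helper: true if a response succeeded and declares itself a PDF.
# content_type() returns the bare, lowercased media type ('' if the header is absent).
sub looks_like_pdf {
    my ($res) = @_;
    return $res->is_success && $res->content_type eq 'application/pdf';
}

# Hypothetical helper: issue a HEAD request and test the declared type.
sub is_pdf_url {
    my ($ua, $url) = @_;
    return looks_like_pdf( $ua->head($url) );
}

# Hypothetical helper: after a GET, check that the body starts with the PDF header.
sub body_is_pdf {
    my ($bytes) = @_;
    return defined $bytes && $bytes =~ /\A%PDF-/;
}

# Only touches the network when URLs are given on the command line.
if (@ARGV) {
    my $ua = LWP::UserAgent->new( timeout => 10 );
    for my $url (@ARGV) {
        print $url, "\n" if is_pdf_url($ua, $url);
    }
}
```

Some servers answer HEAD differently from GET, or not at all, so treat a negative result as "unknown" rather than "not a PDF".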

Re^2: Crawling Relative Links from Webpages
by listanand (Sexton) on May 08, 2010 at 13:41 UTC
    OK so maybe I am missing something here, because I am just unable to understand what's being said :(

    $mech above uses a hard coded link, which would of course work for this page. What about those from other domains (say "xyz.com")?

    How do I make the method generalizable?

      There is only one hard-coded address in the code:

      my $mech = WWW::Mechanize->new();
      $mech->get("http://dspace.mit.edu/handle/1721.1/53720");

      If you want to make that variable, maybe you want to pass the starting link from the command line? It will then be available via @ARGV:

      my $mech = WWW::Mechanize->new();
      warn "Fetching $ARGV[0]\n";
      $mech->get($ARGV[0]);

      Call it as

      perl -w listanand.pl http://google.com
        Ah yes of course. What was I even saying. I get it now.

        Thank you very much, everyone. This has solved my problem!

        Although I still get a warning "Use of uninitialized value in string eq at crawler.pl line <line where I check for pdf mime type>". Makes me wonder...

        Andy
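
A likely cause of that warning, assuming the check was copied without the no warnings "uninitialized" line (or the script is run with -w): MIME::Types' mimeTypeOf returns undef when it does not recognise the extension, and comparing undef with eq triggers "Use of uninitialized value". A minimal sketch of a guarded check follows; path_is_pdf is a hypothetical helper name, and the sample paths are only for illustration.

```perl
use strict;
use warnings;
use MIME::Types;

my $mt = MIME::Types->new;

# Hypothetical helper: true only when the path's extension maps to application/pdf.
sub path_is_pdf {
    my ($path) = @_;
    my $type = $mt->mimeTypeOf($path);    # undef when the extension is unknown
    return defined $type && "$type" eq 'application/pdf';
}

# One path with a known extension, one without any usable extension.
for my $path ( '/paper.pdf', '/handle/1721.1/53720' ) {
    printf "%-25s %s\n", $path, path_is_pdf($path) ? 'pdf' : 'not pdf';
}
```

In the crawler loop this would replace the bare eq test, for example: print $uri, $/ if path_is_pdf($uri->path);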