Re: Crawling Relative Links from Webpages

WWW::Mechanize (Mech) provides for this: each link object can resolve itself to an absolute URL via url_abs(). There should be no need for HTML::LinkExtor or any hackery. :)

use warnings;
no warnings "uninitialized";   # mimeTypeOf returns undef for unknown extensions
use strict;
use WWW::Mechanize;

use MIME::Types;
my $mt = MIME::Types->new;

my $mech = WWW::Mechanize->new();
$mech->get("http://dspace.mit.edu/handle/1721.1/53720");

for my $link ( $mech->links() )
{
    my $uri = $link->url_abs();
    print $uri, $/
        if $mt->mimeTypeOf($uri->path) eq "application/pdf";
}

# http://dspace.mit.edu/bitstream/handle/1721.1/53720/MIT-CSAIL-TR-2010-018.pdf?sequence=1
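If you would rather not switch off the "uninitialized" warnings, here is a minimal sketch that guards the comparison instead. mimeTypeOf returns undef when it does not recognize the extension, and comparing that undef with eq is what triggers the warning. The helper name looks_like_pdf is made up for illustration.

```perl
use strict;
use warnings;
use MIME::Types;

my $mt = MIME::Types->new;

# Hypothetical helper: true only when the path's extension maps to a PDF type.
sub looks_like_pdf {
    my ($path) = @_;
    my $type = $mt->mimeTypeOf($path);    # undef when the extension is unknown
    return defined($type) && $type eq "application/pdf";
}
```

Inside the loop the test would then read `print $uri, $/ if looks_like_pdf($uri->path);`.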

(Update: please use these tools only within the Terms of Service of any given site. The better hackers behave, the more likely sites are to keep resources open to us.)

(Update #2: I should add that checking for a .pdf extension is a fairly artificial test. A URL ending in .pdf could serve something else, and something like /taco could serve a PDF. You'll have to make actual GET or HEAD requests to find out the reputed MIME type of any given resource, and verify it after receiving it.)
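A rough sketch of what Update #2 describes, not a definitive implementation. It uses a plain LWP::UserAgent (the class Mech is built on) so the checks do not disturb the Mech object's current page. The helper names claims_pdf, is_pdf_bytes and fetch_pdf are invented here, and the byte check assumes a well-formed PDF starts with "%PDF-".

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new( timeout => 10 );

# Hypothetical helper: does the server claim this URL is a PDF?
# HEAD asks for the headers only, not the body.
sub claims_pdf {
    my ($url) = @_;
    my $res = $ua->head($url);
    return $res->is_success
        && $res->content_type eq "application/pdf";
}

# Hypothetical helper: does the downloaded content start like a PDF?
# Assumption: a well-formed PDF begins with the bytes "%PDF-".
sub is_pdf_bytes {
    my ($content) = @_;
    return defined($content) && substr( $content, 0, 5 ) eq "%PDF-";
}

# Fetch and verify: returns the PDF bytes, or nothing on any mismatch.
sub fetch_pdf {
    my ($url) = @_;
    return unless claims_pdf($url);
    my $res = $ua->get($url);
    return unless $res->is_success && is_pdf_bytes( $res->content );
    return $res->content;
}
```

In the loop you would call fetch_pdf($uri) for each link instead of testing the extension. Some servers answer HEAD badly or label PDFs as application/octet-stream, so treat the header check as a hint and the byte check as the real test.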

Re^2: Crawling Relative Links from Webpages by listanand (Sexton) on May 08, 2010 at 13:41 UTC
OK so maybe I am missing something here, because I am just unable to understand what's being said :( $mech above uses a hard coded link, which would of course work for this page. What about those from other domains (say "xyz.com")? How do I make the method generalizable?
Re^3: Crawling Relative Links from Webpages by Corion (Patriarch) on May 08, 2010 at 14:34 UTC
There is only one hard-coded address in the code:

my $mech = WWW::Mechanize->new();
$mech->get("http://dspace.mit.edu/handle/1721.1/53720");

If you want to make that variable, maybe you want to pass the starting link from the command line? It will then be available via `@ARGV`:

my $mech = WWW::Mechanize->new();
warn "Fetching $ARGV[0]\n";
$mech->get($ARGV[0]);

Call it as

perl -w listanand.pl http://google.com
Re^4: Crawling Relative Links from Webpages by listanand (Sexton) on May 08, 2010 at 15:32 UTC
Ah yes, of course. What was I even saying. I get it now. Thank you very much everyone. This has solved my problem! Although I still get a warning "Use of uninitialized value in string eq at crawler.pl line <line where I check for pdf mime type>". Makes me wonder... Andy
Re^5: Crawling Relative Links from Webpages by Your Mother (Archbishop) on May 08, 2010 at 17:04 UTC
Re^6: Crawling Relative Links from Webpages by listanand (Sexton) on May 09, 2010 at 01:12 UTC