in reply to Crawling Relative Links from Webpages
Mech provides for this. There should be no need for HTML::LinkExtor or any hackery. :)
use warnings;
no warnings "uninitialized";
use strict;

use WWW::Mechanize;
use MIME::Types;

my $mt   = MIME::Types->new;
my $mech = WWW::Mechanize->new();
$mech->get("http://dspace.mit.edu/handle/1721.1/53720");

for my $link ( $mech->links() ) {
    my $uri = $link->url_abs();
    # Guess the MIME type from the path's extension.
    print $uri, $/
        if $mt->mimeTypeOf( $uri->path ) eq "application/pdf";
}

# Output:
# http://dspace.mit.edu/bitstream/handle/1721.1/53720/MIT-CSAIL-TR-2010-018.pdf?sequence=1
(Update: please only use these tools within the Terms of Service of any given site. The better hackers behave, the more likely we are to keep getting open, public-facing resources.)
(Update #2: I should add that checking for a .pdf extension is a fairly artificial test. A URL ending in .pdf could serve something else, and a URL like /taco could serve a PDF. You'd have to make an actual GET or HEAD request to learn the reputed MIME type of a given resource, and then verify it after receiving it.)
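For illustration, here's a minimal sketch of that HEAD check using LWP::UserAgent; the URL is a hypothetical stand-in for whatever link you're vetting:

use warnings;
use strict;
use LWP::UserAgent;

my $ua  = LWP::UserAgent->new;
# HEAD asks the server what it claims to serve without fetching the body.
my $res = $ua->head("http://example.com/maybe-a-pdf"); # hypothetical URL
if ( $res->is_success and $res->content_type eq "application/pdf" ) {
    print "Server reports application/pdf", $/;
}
# A Content-Type header can lie too, so verify the bytes after a full GET.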