in reply to Re: Crawling Relative Links from Webpages
in thread Crawling Relative Links from Webpages

OK so maybe I am missing something here, because I am just unable to understand what's being said :(

$mech above uses a hard-coded link, which would of course work for this page. What about pages from other domains (say "xyz.com")?

How do I make the method generalizable?


Re^3: Crawling Relative Links from Webpages
by Corion (Patriarch) on May 08, 2010 at 14:34 UTC

    There is only one hard-coded address in the code:

    my $mech = WWW::Mechanize->new();
    $mech->get("http://dspace.mit.edu/handle/1721.1/53720");

    If you want to make that variable, maybe you want to pass the starting link from the command line? It will then be available as $ARGV[0], the first element of @ARGV:

    my $mech = WWW::Mechanize->new();
    warn "Fetching $ARGV[0]\n";
    $mech->get($ARGV[0]);

    Call it as

    perl -w listanand.pl http://google.com
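
    If the script is run without an argument, $ARGV[0] is undef and the get call fails with a less helpful message. A minimal sketch of a guard, assuming a helper named start_url_from_args (hypothetical, not part of the code in this thread) and assuming only http and https links are wanted:

```perl
use strict;
use warnings;

# Hypothetical helper: return the start URL from the argument list,
# or die with a usage message when it is missing or not http(s).
sub start_url_from_args {
    my @args = @_;
    my $url  = $args[0];

    die "Usage: $0 <start-url>\n"
        unless defined $url && length $url;
    die "Expected an http:// or https:// URL, got '$url'\n"
        unless $url =~ m{^https?://}i;

    return $url;
}

# In the crawler this would feed the existing call, for example:
#   $mech->get( start_url_from_args(@ARGV) );
```

    The scheme check is a design choice for this sketch; drop it if other kinds of links should be accepted.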
      Ah yes of course. What was I even saying. I get it now.

      Thank you very much everyone. This has solved my problem!

      Although I still get a warning "Use of uninitialized value in string eq at crawler.pl line <line where I check for pdf mime type>". Makes me wonder...

      Andy

        I still get a warning "Use of uninitialized value in string eq at crawler.pl

        This line-

        no warnings "uninitialized";

        -isn't for show. :) A path that is a directory, like /, will not have a MIME type, and various other paths will fail to be found too, so the value compared with eq is undef.
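
        If you would rather keep the warning enabled, an alternative is to test for undef before comparing. The check in crawler.pl is not shown in this thread, so the sketch below is only an assumption about its shape; is_pdf_type is a hypothetical helper and its argument stands for whatever MIME type lookup the crawler does:

```perl
use strict;
use warnings;

# Hypothetical helper: true only when a MIME type is known and is PDF.
# $type may be undef, for example for "/" or a path with no extension.
sub is_pdf_type {
    my ($type) = @_;

    # Return early so that eq never sees an undefined value.
    return 0 unless defined $type;

    return lc($type) eq 'application/pdf' ? 1 : 0;
}
```

        With this guard in place, paths without a MIME type are skipped quietly and no "uninitialized" warning is raised by the comparison.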