jaytan has asked for the wisdom of the Perl Monks concerning the following question:

OK, so I want a script that will go to a URL, the index of a page, and download x number of pdf, zip, rar and chm files. Here is what I have. The regex is copied, so I'm not sure whether it covers all the file types. Let me know if this will work.

    use strict;
    use WWW::Mechanize;

    my $start = "http://www.domain.com";
    my $mech  = WWW::Mechanize->new( autocheck => 1 );
    $mech->get( $start );

    my @links = $mech->find_all_links( url_regex => qr/\d+.+\.pdf$/ );

    for my $link ( @links ) {
        my $url      = $link->url_abs;
        my $filename = $url;
        $filename =~ s[^.+/][];

        print "Fetching $url";
        $mech->get( $url, ':content_file' => $filename );
        print " ", -s $filename, " bytes\n";
    }

Replies are listed 'Best First'.
Re: www::mechanize file download script
by ELISHEVA (Prior) on Feb 15, 2009 at 05:30 UTC
    I have to agree with Anonymous Monk - "test it and see". Unfortunately, testing this sort of code takes a bit of work. If you are new to testing, what follows might feel a bit overwhelming. If so, read Test::Simple and feel free to ask lots of questions.

    So, some tips:

    • Define some sample dummy input - in this case one or more dummy pdf files.
    • For each dummy file, define what links should be found.
    • Move your code from a script to subroutines that allow you to test each stage of your algorithm - this makes it easier to compare inputs and outputs. Also during debugging you will find it easier to pinpoint the source of a problem.
    • Use Test::More to compare actual to expected outputs from each subroutine.

    I've included an example of what I mean. First, here's what your script might look like after it's been broken up into subroutines. I've put in two: one to find the links (getAllLinks(...)) and one to retrieve the byte count for each link (getByteCount(...)). I've done it this way because the techniques for testing those two parts of your script are very different. Please forgive typos: this is only a reorganization for demonstration purposes. It hasn't been run through a compiler.
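
    A minimal sketch of such a reorganization, assuming a package named MyModule that exports the two subroutines. The export list and the choice to return absolute URLs as plain strings are assumptions for illustration, not a record of the original listing:

```perl
package MyModule;
use strict;
use warnings;
use Exporter 'import';

our @EXPORT = qw(getAllLinks getByteCount);

# Fetch the page at $start and return an array ref holding the absolute
# URL (as a string) of every link whose URL matches $regex.
# $mech is a WWW::Mechanize object supplied by the caller.
sub getAllLinks {
    my ($mech, $start, $regex) = @_;
    $mech->get($start);
    my @links = $mech->find_all_links( url_regex => $regex );
    return [ map { $_->url_abs->as_string } @links ];
}

# Download $url into the current directory, naming the file after the
# last path segment, and return the size of the saved file in bytes.
sub getByteCount {
    my ($mech, $url) = @_;
    (my $filename = $url) =~ s[^.+/][];
    $mech->get( $url, ':content_file' => $filename );
    return -s $filename;
}

1;
```

    Passing $mech in as a parameter keeps the subroutines free of global state, so a test can hand them a stand-in object instead of touching the network. Because this sketch returns absolute URLs, the expected lists in a test would need to be absolute URLs too.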

    Now, here's an example of a test script. A test script is just a plain old script whose name ends, by convention, with .t rather than .pl. What this test script does is pass various combinations of inputs to the subroutines getAllLinks(...) and getByteCount(...). To compare the actual outputs of those functions with the expected outputs, we wrap each subroutine call with one of two special testing functions: is(...) or is_deeply(...).

    Your test script might look something like this. Again, this code hasn't been run through a compiler - consider it more as a demonstration of how to use Test::More:

        use strict;
        use warnings;
        use Test::More qw(no_plan);   # imports testing tools
        use WWW::Mechanize;
        use MyModule;                 # that's your code

        my $mech = WWW::Mechanize->new( autocheck => 1 );

        # call repeatedly with various values of $start, $regex
        # is_deeply compares data structures element by element
        # is_deeply($got, $expected, $description_of_test)
        my $start     = "http://www.domain.com";
        my $regex     = qr/\d+.+\.pdf$/;
        my $aExpected = [ 'foo.pdf', 'baz.pdf' ];
        is_deeply(getAllLinks($mech, $start, $regex), $aExpected,
            "getAllLinks: start=$start, regex=$regex");

        # call repeatedly with various urls
        # is compares simple scalars
        # is($got, $expected, $description_of_test)
        my $url       = "$start/foo.pdf";   # placeholder url
        my $iExpected = 1024;               # placeholder: expected size in bytes
        is(getByteCount($mech, $url), $iExpected,
            "getByteCount: url=$url");

    Best, beth

Re: www::mechanize file download script
by CountZero (Bishop) on Feb 15, 2009 at 07:24 UTC
    If you want to get all of the pdf, zip, rar and chm files, you will have to change your regex. As it is now it will only grab some of the pdf files, namely those with one or more digits somewhere in the file-name, then something else and ending with '.pdf'.

    To get all of the file types use \.(?:pdf|zip|rar|chm)$. The leading \. keeps it from matching URLs that merely end in those letters.
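
    A quick way to see the difference is to run both patterns over a handful of URLs. The sketch below uses made-up sample URLs. The literal dot before the extension group and the /i flag for upper-case extensions are assumptions on top of the pattern suggested above:

```perl
use strict;
use warnings;

# Hypothetical sample URLs, for illustration only.
my @urls = qw(
    http://www.domain.com/docs/manual.pdf
    http://www.domain.com/docs/report2009.pdf
    http://www.domain.com/files/archive.zip
    http://www.domain.com/files/backup.RAR
    http://www.domain.com/help/guide.chm
    http://www.domain.com/index.html
);

# The pattern from the original script: digits, then anything, then .pdf
my $original = qr/\d+.+\.pdf$/;

# A broader pattern: any of the four extensions, case-insensitively
my $broader  = qr/\.(?:pdf|zip|rar|chm)$/i;

my @old_hits = grep { $_ =~ $original } @urls;
my @new_hits = grep { $_ =~ $broader  } @urls;

printf "original regex matched %d of %d urls\n", scalar @old_hits, scalar @urls;
printf "broader regex matched  %d of %d urls\n", scalar @new_hits, scalar @urls;
print  "  $_\n" for @new_hits;
```

    Either pattern can be passed to find_all_links as the url_regex argument, just as in the original script.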

    Although it is rare, nothing guarantees you that the pdf, zip, rar and chm files on the web-pages will have a pdf, zip, rar or chm extension anywhere in the url. HTTP being what it is, one can serve arbitrary files with an arbitrary URL. The only way to make sure is to follow every link, check what gets sent back and hope they did not goof up the headers.
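
    One way to do that check is to issue a HEAD request for each link and look at the Content-Type header. This is only a sketch: the list of MIME types is an assumption (servers vary, and many send application/octet-stream for everything), and the subroutine names are made up for illustration. $ua can be any LWP::UserAgent-style object, which includes a WWW::Mechanize instance:

```perl
use strict;
use warnings;

# MIME types often used for pdf, zip, rar and chm files.
# Assumed list for illustration; real servers may send something else.
my %WANTED_TYPE = map { $_ => 1 } qw(
    application/pdf
    application/zip
    application/x-rar-compressed
    application/vnd.rar
    application/vnd.ms-htmlhelp
);

# Return 1 if a Content-Type header value names one of the wanted types.
sub is_wanted_type {
    my ($content_type) = @_;
    return 0 unless defined $content_type;
    my ($type) = split /;/, lc $content_type;   # drop "; charset=..." etc.
    return 0 unless defined $type;
    $type =~ s/^\s+|\s+$//g;                    # trim surrounding whitespace
    return $WANTED_TYPE{$type} ? 1 : 0;
}

# Send a HEAD request for each url and keep those whose response is
# successful and whose Content-Type is in the wanted list.
# With WWW::Mechanize, create the object with autocheck => 0 so that a
# failed request is skipped here instead of ending the script.
sub filter_by_content_type {
    my ($ua, @urls) = @_;
    my @keep;
    for my $url (@urls) {
        my $response = $ua->head($url);
        next unless $response->is_success;
        my $content_type = $response->header('Content-Type');
        push @keep, $url if is_wanted_type($content_type);
    }
    return @keep;
}
```

    A call might look like filter_by_content_type($mech, map { $_->url_abs } $mech->links), at the cost of one extra request per link on the page.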

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: www::mechanize file download script
by Anonymous Monk on Feb 15, 2009 at 03:00 UTC
    Let me know if this will work
    Test it and see :)