Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

I would like to mirror a site that has a structured URL scheme. The pages look like http://www.my-site.com/mydir/?Page=x, where x runs from 1 to 273. I would like to download all of these pages (mirroring) without having to click through each one. Do you know of a Perl script (or other free software) that could do this? Thank you.

Replies are listed 'Best First'.
Re: Mirroring Script
by moodster (Hermit) on Feb 05, 2002 at 14:29 UTC
    Using the GNU utility wget will probably do the trick. It has tons of features and switches and will download whole sites for you (the switch --mirror is probably what you're looking for). It's included with most Linux distributions and with Cygwin.

    Of course, another way to do it would be using LWP::UserAgent, but I'm way too lazy for that.

    Cheers,
    -- moodster

Re: Mirroring Script
by Dominus (Parson) on Feb 06, 2002 at 03:11 UTC
    For stuff like that, I use something like:
    for i in `seq 273`; do GET "http://www.my-site.com/mydir/?Page=$i" > $i.html && echo $i; done
    GET is very handy. It comes with the Perl LWP modules. If you don't have seq, you should write it; it's a three-line Perl program.
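
    For anyone without seq, a minimal Perl stand-in could look like the sketch below. It handles only the "seq LAST" and "seq FIRST LAST" forms with integer arguments, and it is an illustration rather than the three-liner mentioned above; the sub name seq_list is made up here.

```perl
#!/usr/bin/perl
# Minimal stand-in for seq: "seq LAST" or "seq FIRST LAST", integers only.
use strict;
use warnings;

# Return the list of integers from FIRST (default 1) to LAST.
# seq_list is a hypothetical name chosen for this sketch.
sub seq_list {
    my @args  = @_;
    my $last  = pop @args;
    my $first = @args ? $args[0] : 1;
    return ($first .. $last);
}

# Print one number per line, as seq does, when arguments are given.
if (@ARGV) {
    print "$_\n" for seq_list(@ARGV);
}
```

    Saved as an executable file named seq somewhere on the PATH, it should slot into the shell loop above unchanged.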

    --
    Mark Dominus
    Perl Paraphernalia

Re: Mirroring Script
by BazB (Priest) on Feb 05, 2002 at 14:38 UTC

    The dirtiest way to do this is:

    #!/usr/bin/perl -w
    use strict;
    for (1..273) {
        `wget http://www.my-site.com/mydir/?Page=$_`;
    }

    Note: this code is untested, has no error checking of any sort, and assumes you're using a Unix-like system with the wget command installed. Not much of a solution.
    Do not trust this code, but you get the idea.

    A better way would be to have a look at the modules and utilities in Bundle::LWP and base a more robust script on those.

    Implementation is left as a task for the reader.

    Update: Beaten to it by moodster! :-) Great minds think alike. ;)

    Further update: I feel quite honored to have merlyn's attention :-)
    I posted the incorrect/dirty/horrific snippet as the basis of one solution. LWP::Simple is clearly a better solution, and I made a point of mentioning Bundle::LWP, which includes the mirror functionality that merlyn has demonstrated.

    So, ignoring the obvious benefits of LWP::Simple, I have seen the error of my ways with the use of backticks here: system() should have been used instead to run wget.
    That method still isn't fantastic - I stand by my disclaimer :-)
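
    As a rough sketch of that system() variant: passing the command as a list keeps the shell out of the picture entirely, so the ? in the URL is never treated as a glob character, and the exit status can be checked. This assumes wget is on the PATH; the helper names wget_command and fetch_all and the output names pageN.html are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $base = 'http://www.my-site.com/mydir/?Page=';   # URL pattern from the question

# Build the argument list for one page (hypothetical helper).
# -q silences wget's progress output, -O names the output file.
sub wget_command {
    my ($n) = @_;
    return ('wget', '-q', '-O', "page$n.html", $base . $n);
}

# List-form system() runs wget directly, with no intermediate shell.
sub fetch_all {
    my ($first, $last) = @_;
    for my $n ($first .. $last) {
        system(wget_command($n)) == 0
            or warn "wget failed for page $n (status $?)\n";
    }
}

# Download only when a first and last page number are given on the command line.
fetch_all(@ARGV) if @ARGV == 2;
```

    It would be run with the first and last page numbers as arguments, here 1 and 273.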

      The dirtiest way to do this is:
      #!/usr/bin/perl -w
      use strict;
      for (1..273) {
          `wget http://www.my-site.com/mydir/?Page=$_`;
      }
      Ouch! You're not kidding. Using backquotes in a void context! Not only is it messy because it forks needlessly (thanks to the question mark), but you're also capturing the output just to discard it!

      Here's code that will be much saner and faster. In fact, you can run it multiple times, and it downloads only the changed files, if "if-modified-since" is supported by the server:

      use LWP::Simple qw(mirror);
      for (1..273) {
          mirror "http://www.my-site.com/mydir/?Page=$_", "file$_";
      }

      -- Randal L. Schwartz, Perl hacker
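
      The loop above ignores the status code that mirror returns. A hedged sketch of how it could be checked, using the is_success helper and the RC_NOT_MODIFIED constant that LWP::Simple exports by default; the sub names classify and mirror_pages and the tallying are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;   # exports mirror, is_success and the RC_* status constants

my $base = 'http://www.my-site.com/mydir/?Page=';   # URL pattern from the question

# Sort a status code returned by mirror() into one of three outcomes
# (hypothetical helper).
sub classify {
    my ($rc) = @_;
    return 'unchanged' if $rc == RC_NOT_MODIFIED;   # 304: local copy is current
    return is_success($rc) ? 'fetched' : 'failed';
}

# Mirror pages FIRST..LAST into file1, file2, ... and count the outcomes.
sub mirror_pages {
    my ($first, $last) = @_;
    my %count = (fetched => 0, unchanged => 0, failed => 0);
    for my $n ($first .. $last) {
        my $rc      = mirror($base . $n, "file$n");
        my $outcome = classify($rc);
        warn "page $n failed with status $rc\n" if $outcome eq 'failed';
        $count{$outcome}++;
    }
    return %count;
}

# Run only when a first and last page number are given on the command line.
if (@ARGV == 2) {
    my %count = mirror_pages(@ARGV);
    print "$_: $count{$_}\n" for sort keys %count;
}
```

      On a second run, pages the server reports as not modified should land in the "unchanged" tally rather than being downloaded again.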

        Note: this code is untested, has no error checking of any sort, and assumes you're using a Unix-like system with the wget command installed. Not much of a solution.
        Do not trust this code, but you get the idea.

        I think this disclaimer shows how little faith I put in that snippet :-)

        Certainly LWP::Simple is the way to go. merlyn++

      In fact, LWP includes an lwp-mirror utility that may well do what is wanted.

      /J\