in reply to Best way to recursively grab a website

maybe i'm just not as techy as you guys, but that first bit was double dutch to me :)

you could try using wget, that's a program i've used in the past for mirroring websites. otherwise there is a mirror routine in the LWP::Simple module on CPAN which might do it.

http://search.cpan.org/~gaas/libwww-perl-5.803/lib/LWP/Simple.pm
mirror($url, $file)
Get and store a document identified by a URL, using If-modified-since, and checking the Content-Length. Returns the HTTP response code.
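
For what it's worth, here is a minimal sketch of calling mirror() on a single URL and reading its return code. The describe_status helper and the command-line handling are illustrative names of my own, not part of LWP::Simple. The module is loaded lazily only so the helper can be tried without it installed.

```perl
use strict;
use warnings;

# Hypothetical helper: turn the HTTP status code returned by mirror()
# into a short human-readable message.
sub describe_status {
    my ($rc) = @_;
    return 'not modified, local copy kept' if $rc == 304;
    return 'fetched and stored'            if $rc >= 200 && $rc < 300;
    return "failed with HTTP status $rc";
}

# Usage (assumed script name): perl mirror_one.pl <url> <local-file>
if (@ARGV == 2) {
    my ($url, $file) = @ARGV;
    require LWP::Simple;                       # loaded only when fetching
    my $rc = LWP::Simple::mirror($url, $file); # one URL, one local file
    print "$url: ", describe_status($rc), "\n";
}
```

Note that this handles exactly one document per call; nothing in it follows links.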

Hope that's some help.

Learning without thought is labor lost; thought without learning is perilous. - Confucius
WebChalkboard.com | For the love of art...

Re^2: Best way to recursively grab a website
by ghenry (Vicar) on Mar 29, 2005 at 10:54 UTC

    Yeah, sorry. I added the background of the problem just so people who are familiar with Subversion understand where I am coming from.

    I have looked at LWP::Simple, but the mirror function looks like it works per URL, not per website?

    I could use wget, but I want to do it all in perl, in case wget is not available.

    Thanks.

    Walking the road to enlightenment... I found a penguin and a camel on the way..... Fancy a yourname@perl.me.uk? Just ask!!!
      LWP ships with a number of example applications / utilities such as GET, POST etc. One of these utilities is lwp-mirror.

      From the documentation:

      This program can be used to mirror a document from a WWW server. The document is only transferred if the remote copy is newer than the local copy. If the local copy is newer nothing happens.

        Or lwp-rget

        This program will retrieve a document and store it in a local file. It will follow any links found in the document and store these documents as well, patching links so that they refer to these local copies. This process continues until there are no more unvisited links or the process is stopped by one or more of the limits which can be controlled by the command line arguments.
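
If you would rather do it all in Perl, the follow-links-until-a-limit idea described above can be sketched roughly as below. This is not how lwp-rget itself is written; crawl and make_lwp_fetcher are hypothetical names, the numbered file naming is an assumption, only <a href> links on the starting host are followed, and links are not patched to point at the local copies.

```perl
use strict;
use warnings;

# Breadth-first crawl: visit $start, then every link reachable from it,
# stopping when no unvisited links remain or $max_pages have been visited.
# $fetch_links is a code ref that takes a URL, stores the page, and
# returns the absolute URLs that page links to.
sub crawl {
    my ($start, $fetch_links, $max_pages) = @_;
    my %seen  = ($start => 1);
    my @queue = ($start);
    my @visited;
    while (@queue && @visited < $max_pages) {
        my $url = shift @queue;
        push @visited, $url;
        for my $link ($fetch_links->($url)) {
            next if $seen{$link}++;    # already queued or visited
            push @queue, $link;
        }
    }
    return @visited;
}

# One possible $fetch_links, built on LWP::Simple, HTML::LinkExtor and URI.
# Assumption: pages are stored as $dir/page0001.html, page0002.html, ...
sub make_lwp_fetcher {
    my ($start, $dir) = @_;
    require LWP::Simple;
    require HTML::LinkExtor;
    require URI;
    my $host  = URI->new($start)->host;
    my $count = 0;
    return sub {
        my ($url) = @_;
        my $file = sprintf '%s/page%04d.html', $dir, ++$count;
        my $rc = LWP::Simple::mirror($url, $file);
        return () unless $rc == 200 || $rc == 304;
        my $parser = HTML::LinkExtor->new;
        $parser->parse_file($file);
        my @links;
        for my $link ($parser->links) {
            my ($tag, %attr) = @$link;
            next unless $tag eq 'a' && defined $attr{href};
            my $u = URI->new_abs($attr{href}, $url);
            next unless $u->scheme && $u->scheme =~ /^https?\z/;
            next unless $u->host eq $host;     # stay on the same site
            $u->fragment(undef);               # ignore #anchors
            push @links, $u->canonical->as_string;
        }
        return @links;
    };
}

# Usage (assumed script name): perl grab_site.pl <start-url> <existing-dir>
if (@ARGV == 2) {
    my ($start, $dir) = @ARGV;
    print "$_\n" for crawl($start, make_lwp_fetcher($start, $dir), 50);
}
```

Keeping the traversal separate from the fetching means the page limit and the seen-list can be checked against a fake in-memory site before pointing it at a real server.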
        --
        b10m

        All code is usually tested, but rarely trusted.