Preceptor has asked for the wisdom of the Perl Monks concerning the following question:

I've been using wget to download a site. One of the things that bothers me about it is that its 'reject' option _first_ downloads a page, and then deletes it.

With this particular site, pretty much every page is reproduced multiple times under CGI URLs (e.g. page.html?thread=2012 or post.html?mode=reply).

I'm sure that I do not want all these, but wget is quite insistent that I must download them and then delete them.

So I was wondering: could anyone point me in the right direction for solving this problem - recursive HTTP download, using cookies, with the ability to filter URLs before downloading (or to coalesce the duplicates)? And ideally, the ability to 'skip' anything I've already downloaded by comparing against an on-disk structure, rsync style.

I'd be thinking something along the lines of File::Find, but for HTTP. Before I charge off re-inventing the wheel, does such a beast exist?


Re: Recursive HTTP Downloads - without using WGET
by MidLifeXis (Monsignor) on Jul 15, 2012 at 11:48 UTC

    If you are rejecting a URL based on the URL itself, then LWP::UserAgent may help you create something like what you describe. On the other hand, if you are rejecting a URL based on its contents, then the wget approach you describe is how it would have to be done anyway.

    --MidLifeXis

      Mostly, I can discard based on URL - I've got a lot of links to pages with CGI variables set, which I know are duplicates or otherwise irrelevant.

      So I'll have a look at doing something with LWP::UserAgent, and be careful and hope the recursion gremlins don't get me :)
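
      A minimal sketch of that approach -- LWP::UserAgent with a cookie jar, a URL filter applied before the GET, and an on-disk check so already-mirrored pages are skipped. The start URL, the query-string filter, and the URL-to-file mapping below are placeholders, not anything production-ready:

      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTTP::Cookies;
      use HTML::LinkExtor;
      use URI;
      use File::Path qw(make_path);
      use File::Basename qw(dirname);

      my $start = 'http://example.com/';   # placeholder start page
      my $site  = URI->new($start)->host;
      my $ua    = LWP::UserAgent->new(
          cookie_jar => HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ),
      );

      my %seen;
      my @queue = ($start);
      while ( my $url = shift @queue ) {
          next if $seen{$url}++;
          next if $url =~ /\?/;            # filter *before* fetching: skip CGI-style URLs

          my $path = local_path($url);
          next if -e $path;                # rsync-style skip: already mirrored

          my $res = $ua->get($url);
          next unless $res->is_success;

          make_path( dirname($path) );
          open my $fh, '>', $path or die "$path: $!";
          binmode $fh;
          print {$fh} $res->content;       # raw bytes, as served
          close $fh;

          # Only HTML pages are parsed for further links.
          next unless $res->content_type eq 'text/html';
          my $extor = HTML::LinkExtor->new;
          $extor->parse( $res->decoded_content );
          for my $link ( $extor->links ) {
              my ( $tag, %attr ) = @$link;
              next unless $tag eq 'a' and $attr{href};
              my $abs = URI->new_abs( $attr{href}, $url );
              next unless $abs->scheme =~ /^https?$/;
              $abs->fragment(undef);       # ignore #anchors
              push @queue, $abs->as_string if $abs->host eq $site;   # stay on one site
          }
      }

      # Crude URL-to-filename mapping; adjust to taste.
      sub local_path {
          my $uri  = URI->new(shift);
          my $path = $uri->host . $uri->path;
          $path .= 'index.html' if $path =~ m{/$};
          return "./mirror/$path";
      }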

Re: Recursive HTTP Downloads - without using WGET
by BrowserUk (Patriarch) on Jul 15, 2012 at 12:27 UTC

    You could try using HEADs first and comparing the metadata provided by the headers: size, expiry date, etc. Perhaps coalesce all the non-variable headers into a single string and then MD5-hash it.
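
    A rough sketch of that idea, using LWP::UserAgent and Digest::MD5 -- which headers count as 'non-variable' is a judgement call, so the list below is only a guess:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Digest::MD5 qw(md5_hex);

    my $ua = LWP::UserAgent->new;

    # Fingerprint a URL from headers that should not differ between
    # identical copies of the same document.
    sub head_fingerprint {
        my ($url) = @_;
        my $res = $ua->head($url);
        return unless $res->is_success;
        return md5_hex( join '|',
            map { $res->header($_) // '' }
            qw(Content-Length Content-Type Last-Modified ETag) );
    }

    # Two URLs with the same fingerprint are *probably* the same page.
    my %seen;
    for my $url (@ARGV) {
        my $fp = head_fingerprint($url) or next;
        next if $seen{$fp}++;            # skip likely duplicates
        # ... GET and save $url here ...
    }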

    But, on most servers, it costs almost the same in resources to respond to a HEAD request as to a GET -- indeed, often the only difference is that the body of the response is discarded after its size has been measured. And for the pages you wish to keep, you then have to do the GET anyway, so the net cost is greater.

    That's why wget does it the way it does. Other than for large binary downloads -- images, video, music, etc. -- the net cost of doing plain GETs rather than a mix of HEADs and GETs is lower with the former.

    As you cannot measure the size of the content, or checksum it, until you have downloaded it, there is little better you can do -- unless you are entirely comfortable that rejecting links on the basis of their URLs is a viable option for you.


Re: Recursive HTTP Downloads - without using WGET
by zentara (Cardinal) on Jul 15, 2012 at 12:58 UTC
Re: Recursive HTTP Downloads - without using WGET
by Cody Fendant (Hermit) on Jul 15, 2012 at 11:58 UTC
    curl has a recursive-downloading syntax, not based on spidering but on patterns you specify: http://www.numericals.com/file[1-100:10].txt is a valid curl URL which will download files 1 to 100, incrementing by 10. See http://curl.haxx.se/docs/manpage.html.
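
    For comparison, a rough Perl equivalent of that numeric range, using the same example URL from curl's documentation:

    use strict;
    use warnings;
    use LWP::Simple qw(getstore);

    # The same series curl's [1-100:10] range expands to: 1, 11, 21, ... 91.
    for ( my $n = 1; $n <= 100; $n += 10 ) {
        getstore( "http://www.numericals.com/file$n.txt", "file$n.txt" );
    }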
Re: Recursive HTTP Downloads - without using WGET
by Anonymous Monk on Jul 15, 2012 at 15:58 UTC

    May I recommend Higher Order Perl by Mark Jason Dominus (it is freely available online, but also purchasable)? See Chapter 4.7, "An extended example: Web spiders".

    The code he presents uses LWP::Simple and lacks cookie support, but I don't think it would be difficult to modify it to fit your needs.
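
    The main change would be swapping LWP::Simple's get() for a cookie-aware user agent -- something along these lines (a sketch, not HOP's actual code):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Cookies;

    my $ua = LWP::UserAgent->new(
        cookie_jar => HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ),
    );

    # Drop-in replacement for LWP::Simple::get(): returns the page body on
    # success, undef otherwise, but carries cookies across requests.
    sub get_with_cookies {
        my ($url) = @_;
        my $res = $ua->get($url);
        return $res->is_success ? $res->decoded_content : undef;
    }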