Preceptor has asked for the wisdom of the Perl Monks concerning the following question:

I've been using wget to download a site. One of the things that bothers me about it is that its 'reject' option _first_ downloads a page, and then deletes it.

With this particular site, pretty much every page is reproduced multiple times under CGI URLs (e.g. page.html?thread=2012 or post.html?mode=reply).

I'm sure that I do not want all these, but wget is quite insistent that I must download them and then delete them.

So I was wondering: could anyone point me in the right direction for solving this problem - recursive HTTP download, using cookies, with the ability to filter URLs before downloading (or to coalesce the duplicates)? And ideally, the ability to 'skip' anything I've already downloaded by comparing against an on-disk structure, rsync style.

I'd be thinking something along the lines of File::Find, but for HTTP. Before I charge off re-inventing the wheel, does such a beast exist?


Re: Recursive HTTP Downloads - without using WGET
by MidLifeXis (Monsignor) on Jul 15, 2012 at 11:48 UTC

    If you are rejecting a URL based on the URL itself, then LWP::UserAgent may help you create something like what you describe. On the other hand, if you are rejecting a URL based on its contents, then the wget approach you describe is how it would have to be done anyway.

    --MidLifeXis

      Mostly, I can discard based on URL - I've got a lot of links to pages with CGI variables set, which I know are duplicates or otherwise irrelevant.

      So I'll have a look at doing something with LWP::UserAgent, and be careful and hope the recursion gremlins don't get me :)
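
      A minimal sketch of that approach -- LWP::UserAgent with a cookie jar, a URL filter applied before the GET, and an on-disk check so already-mirrored pages are skipped. The start URL, the query-string filter, and the URL-to-file mapping below are placeholders, not anything production-ready:

      use strict;
      use warnings;
      use LWP::UserAgent;
      use HTTP::Cookies;
      use HTML::LinkExtor;
      use URI;
      use File::Path qw(make_path);
      use File::Basename qw(dirname);

      my $start = 'http://example.com/';   # placeholder start page
      my $site  = URI->new($start)->host;
      my $ua    = LWP::UserAgent->new(
          cookie_jar => HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ),
      );

      my %seen;
      my @queue = ($start);
      while ( my $url = shift @queue ) {
          next if $seen{$url}++;
          next if $url =~ /\?/;            # filter *before* fetching: skip CGI-style URLs

          my $path = local_path($url);
          next if -e $path;                # rsync-style skip: already mirrored

          my $res = $ua->get($url);
          next unless $res->is_success;

          make_path( dirname($path) );
          open my $fh, '>', $path or die "$path: $!";
          binmode $fh;
          print {$fh} $res->content;       # raw bytes, as served
          close $fh;

          # Only HTML pages are parsed for further links.
          next unless $res->content_type eq 'text/html';
          my $extor = HTML::LinkExtor->new;
          $extor->parse( $res->decoded_content );
          for my $link ( $extor->links ) {
              my ( $tag, %attr ) = @$link;
              next unless $tag eq 'a' and $attr{href};
              my $abs = URI->new_abs( $attr{href}, $url );
              next unless $abs->scheme =~ /^https?$/;
              $abs->fragment(undef);       # ignore #anchors
              push @queue, $abs->as_string if $abs->host eq $site;   # stay on one site
          }
      }

      # Crude URL-to-filename mapping; adjust to taste.
      sub local_path {
          my $uri  = URI->new(shift);
          my $path = $uri->host . $uri->path;
          $path .= 'index.html' if $path =~ m{/$};
          return "./mirror/$path";
      }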

Re: Recursive HTTP Downloads - without using WGET
by BrowserUk (Patriarch) on Jul 15, 2012 at 12:27 UTC

    You could try using HEADs first and comparing the metadata provided by the headers: size, expiry date, etc. Perhaps coalesce all the non-variable headers into a single string and then MD5-hash it.
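
    A rough sketch of that idea, using LWP::UserAgent and Digest::MD5 -- which headers count as 'non-variable' is a judgement call, so the list below is only a guess:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use Digest::MD5 qw(md5_hex);

    my $ua = LWP::UserAgent->new;

    # Fingerprint a URL from headers that should not differ between
    # identical copies of the same document.
    sub head_fingerprint {
        my ($url) = @_;
        my $res = $ua->head($url);
        return unless $res->is_success;
        return md5_hex( join '|',
            map { $res->header($_) // '' }
            qw(Content-Length Content-Type Last-Modified ETag) );
    }

    # Two URLs with the same fingerprint are *probably* the same page.
    my %seen;
    for my $url (@ARGV) {
        my $fp = head_fingerprint($url) or next;
        next if $seen{$fp}++;            # skip likely duplicates
        # ... GET and save $url here ...
    }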

    But, on most servers, it costs almost the same in resources to respond to a HEAD request as to a GET -- indeed, often the only difference is that the body of the response is discarded after its size has been measured. And for the pages you wish to keep, you then have to do the GET anyway, so the net cost is greater.

    That's why wget does it the way it does. Other than for large binary downloads -- images, video, music, etc. -- the net cost of doing plain GETs rather than a mix of HEADs and GETs is lower with the former.

    As you cannot measure the size of the content, or checksum it, until you have downloaded it, there is little better you can do -- unless you are entirely comfortable that rejecting links on the basis of their URLs is a viable option for you.


Re: Recursive HTTP Downloads - without using WGET
by zentara (Cardinal) on Jul 15, 2012 at 12:58 UTC
Re: Recursive HTTP Downloads - without using WGET
by Cody Fendant (Hermit) on Jul 15, 2012 at 11:58 UTC
    curl has a recursive-downloading syntax, not based on spidering but on patterns you specify: http://www.numericals.com/file[1-100:10].txt is a valid curl URL which will download files 1 to 100, incrementing by 10. See http://curl.haxx.se/docs/manpage.html.
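
    For comparison, a rough Perl equivalent of that numeric range, using the same example URL from curl's documentation:

    use strict;
    use warnings;
    use LWP::Simple qw(getstore);

    # The same series curl's [1-100:10] range expands to: 1, 11, 21, ... 91.
    for ( my $n = 1; $n <= 100; $n += 10 ) {
        getstore( "http://www.numericals.com/file$n.txt", "file$n.txt" );
    }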
Re: Recursive HTTP Downloads - without using WGET
by Anonymous Monk on Jul 15, 2012 at 15:58 UTC

    May I recommend Higher Order Perl by Mark Jason Dominus (it is freely available online, but also purchasable)? See Chapter 4.7, "An extended example: Web spiders".

    The code he presents uses LWP::Simple and lacks cookie support, but I don't think it would be difficult to modify it to fit your needs.
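
    The main change would be swapping LWP::Simple's get() for a cookie-aware user agent -- something along these lines (a sketch, not HOP's actual code):

    use strict;
    use warnings;
    use LWP::UserAgent;
    use HTTP::Cookies;

    my $ua = LWP::UserAgent->new(
        cookie_jar => HTTP::Cookies->new( file => 'cookies.txt', autosave => 1 ),
    );

    # Drop-in replacement for LWP::Simple::get(): returns the page body on
    # success, undef otherwise, but carries cookies across requests.
    sub get_with_cookies {
        my ($url) = @_;
        my $res = $ua->get($url);
        return $res->is_success ? $res->decoded_content : undef;
    }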