Preceptor has asked for the wisdom of the Perl Monks concerning the following question:
With this particular site, pretty much every page is reproduced multiple times under CGI-style URLs (e.g. page.html?thread=2012 or post.html?mode=reply).

I'm sure I don't want all of these, but wget is quite insistent that I must download them first and delete them afterwards.

So I was wondering: could anyone point me in the right direction? What I'm after is recursive HTTP download, using cookies, with the ability to filter URLs before they are fetched (or to coalesce the duplicates). Ideally it would also let me skip anything I've already downloaded, by comparing against an on-disk structure, rsync style.
I'm thinking of something along the lines of File::Find, but for HTTP. Before I charge off reinventing the wheel, does such a beast exist?
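To make it concrete, here is roughly what I have in mind: a minimal sketch using WWW::Mechanize (which keeps cookies across requests automatically). The start URL, the reject pattern, and the output directory are just placeholders for illustration, and the breadth-first queue is only one way the traversal could be done.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;
use File::Path qw(make_path);
use File::Basename qw(dirname);
use URI;

# Placeholders -- adjust for the real site.
my $start   = 'http://example.com/forum/';
my $outdir  = 'mirror';
my $skip_re = qr/\?(?:thread|mode)=/;    # CGI-style duplicates I don't want

my $host = URI->new($start)->host;
my $mech = WWW::Mechanize->new( autocheck => 0 );    # cookies kept across requests

my %seen;
my @queue = ($start);

while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    next if $url =~ $skip_re;    # filter before fetching

    # Map the URL onto a local path; skip if it's already on disk (rsync-ish).
    my $path = URI->new($url)->path || '/';
    $path .= 'index.html' if $path =~ m{/$};
    my $file = $outdir . $path;
    next if -e $file;

    $mech->get($url);
    next unless $mech->success;

    make_path( dirname($file) );
    $mech->save_content($file);

    # Queue same-host links for the next round.
    next unless $mech->is_html;
    for my $link ( $mech->links ) {
        my $abs = $link->url_abs;
        next unless $abs->scheme =~ /^https?$/;
        push @queue, $abs->as_string if $abs->host eq $host;
    }
}
```

The -e test against the on-disk path is what would give me the rsync-style "skip what I already have" behaviour, and the reject regex runs before the GET, which is exactly what wget won't do for me.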
Replies are listed 'Best First'.

Re: Recursive HTTP Downloads - without using WGET
by MidLifeXis (Monsignor) on Jul 15, 2012 at 11:48 UTC
    by Preceptor (Deacon) on Jul 16, 2012 at 07:36 UTC

Re: Recursive HTTP Downloads - without using WGET
by BrowserUk (Patriarch) on Jul 15, 2012 at 12:27 UTC

Re: Recursive HTTP Downloads - without using WGET
by zentara (Cardinal) on Jul 15, 2012 at 12:58 UTC

Re: Recursive HTTP Downloads - without using WGET
by Cody Fendant (Hermit) on Jul 15, 2012 at 11:58 UTC

Re: Recursive HTTP Downloads - without using WGET
by Anonymous Monk on Jul 15, 2012 at 15:58 UTC