in reply to Re^2: Parallel downloading under Win32?
in thread Parallel downloading under Win32?

Seriously, it looks like you wrote that with the intent to make it as unreadable as possible.

Is that a request for clarification?

Suggestion: Run it standalone as posted first, to convince yourself that it actually works on your system.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^4: Parallel downloading under Win32?
by Xenofur (Monk) on Apr 29, 2009 at 21:36 UTC
    Oh, I had no doubt that it worked. I had trouble understanding *how* it worked. I write my Perl in a very declarative and verbose manner, had never had reason to use map before, didn't know you could string commands together with commas to act on $_ without wrapping them in braces, and didn't know why you were pushing undefs into the array.

    In short: The syntax and lack of any explanation completely stumped me.
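
    For anyone else tripped up by the same constructs, here is a small hypothetical snippet (not code from this thread; example.com, the id range and the thread count are placeholders) showing each idiom in isolation:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        # 1. map: transform one list into another in a single expression.
        my @urls = map "http://example.com/page?id=$_", 1 .. 5;

        # 2. The comma operator plus a statement-modifier 'for': two expressions
        #    strung together, both acting on $_, with no { } block required.
        my @lines = ( "one\n", "two\n" );
        chomp, print "got: $_\n" for @lines;

        # 3. Undefs pushed onto the queue act as end-of-work markers: each
        #    worker thread exits when it dequeues an undef.
        my $T = 4;                              # number of worker threads
        my $Q = Thread::Queue->new;
        $Q->enqueue( @urls, (undef) x $T );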

    Either way, I have to admit that it is a superior solution to the wget method, as long as enough RAM is available. Getting it to run enough threads to run at comparable speed to the wget method required 300 MB. However, because it keeps the RAM usage under control and runs entirely on Perl modules, it is the better solution.

    As such, thanks a lot. :)

    FWIW, this is how I'm using it now:
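
    (What follows is only a hedged reconstruction of the general pattern being discussed -- a pool of worker threads pulling URLs from a Thread::Queue -- not the exact code; the thread count, the typeid list via @ARGV, and the per-page handling are assumptions for illustration.)

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::Simple qw( get );

        my $thread_count = 20;                  # assumed; tune as discussed below
        my $Q            = Thread::Queue->new;

        # Build the work list (typeids taken from @ARGV purely for illustration),
        # then add one undef per worker as a termination marker.
        my @urls = map "http://api.eve-central.com/api/quicklook?typeid=$_", @ARGV;
        $Q->enqueue( @urls, (undef) x $thread_count );

        my @workers = map {
            threads->create( sub {
                while ( defined( my $url = $Q->dequeue ) ) {
                    my $content = get( $url );  # returns undef on failure
                    # ... parse/store $content here ...
                }
            } );
        } 1 .. $thread_count;

        $_->join for @workers;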
      Getting it to run enough threads to run at comparable speed to the wget method required 300 MB.

      How many wget instances were you running?

      I'd be really surprised if it is necessary to run 20 threads in order to saturate your bandwidth, unless the server you are connecting to is severely restricting the throughput of individual connections. And when that happens--for example, if the site is using thttpd or similar--then unless the webmaster is very naive, the throttling rates will apply across all concurrent connections from any given IP.

      Running 2 or 3 connections concurrently usually serves to maximise throughput. Beyond that, thread thrash tends to deteriorate throughput rather than increase it. Threads newbies tend to think 'more is better', but in reality that is rarely the case.

      Especially with TCP connections. TCP has been tuned over decades to utilise as much bandwidth as is available for each connection. Whilst using two concurrent connections will usually allow the second to 'mop up' any bandwidth under-utilised by the first, unless you have more than one processor/core, a third thread will usually hurt the performance of the first two through thread thrash. (Assuming unrestricted and infinite bandwidth from the server.)

      As a rule of thumb, I would suggest that you set $T (or $thread_count, as you would have it :) to no more than 2 * NoOfCores (sorry, $no_of_cores :).
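
      A hedged sketch of that rule of thumb, assuming the NUMBER_OF_PROCESSORS environment variable that Windows sets:

        # Rule of thumb: no more than 2 threads per core.
        # NUMBER_OF_PROCESSORS is set by Windows; fall back to 1 if it is missing.
        my $no_of_cores = $ENV{NUMBER_OF_PROCESSORS} || 1;
        my $T           = 2 * $no_of_cores;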


      Caveat: From the code you posted, you are pushing your entire URL list onto the queue prior to starting your threads. If your URL list is relatively small--say < 1e3--no harm done. But if your URL list is bigger than that, then I would highly recommend starting your threads first and including a call to yield() in your URL enqueue loop.
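
      A sketch of that ordering ($T, @urls and the body of worker() are assumed from the earlier discussion); threads->yield() gives the already-running workers a chance to drain the queue while it is still being filled, which keeps its peak size down:

        my $Q = Thread::Queue->new;

        # Start the workers *before* filling the queue ...
        my @workers = map { threads->create( \&worker, $Q ) } 1 .. $T;

        # ... then enqueue one URL at a time, yielding so the workers can start
        # consuming immediately rather than after the whole list is queued.
        for my $url ( @urls ) {
            $Q->enqueue( $url );
            threads->yield();
        }
        $Q->enqueue( (undef) x $T );    # termination markers, as before

        $_->join for @workers;

        sub worker {
            my( $q ) = @_;
            while ( defined( my $url = $q->dequeue ) ) {
                # ... fetch and process $url ...
            }
        }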

      Caveat 2: If you are seriously seeking to minimise memory usage, then you should consider starting your thread pool prior to loading (use-ing) the vast majority of whatever code or modules are needed by the main body of your application.

      The reason for this advice is that, for good or bad, the original author(s) of threads decided that each spawned thread would inherit everything already loaded by the main thread at the point of thread creation(*). (E.g. he/they decided to emulate the fork way of working!) By starting your worker threads early--remember that use takes effect at compile time--you can minimise the size of the primary thread and therefore the size of every subsequently spawned thread (see the sketch below the footnote).

      (*) Yes. I know it is dumb, but you try convincing those that have the power to change things of that!
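
      To make Caveat 2 concrete, a common shape for this is to spawn the pool inside a BEGIN block placed above the heavy use statements, so each worker is cloned from a still-small interpreter (a hedged sketch, not code from this thread; the pool size and CGI::Application are stand-ins):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        our $Q;
        our @workers;

        BEGIN {
            # Spawn the pool *before* the heavy modules below are compiled,
            # so each worker is cloned from a still-small main thread.
            $Q = Thread::Queue->new;
            @workers = map {
                threads->create( sub {
                    require LWP::Simple;        # loaded inside the worker only
                    while ( defined( my $url = $Q->dequeue ) ) {
                        my $content = LWP::Simple::get( $url );
                        # ... handle $content ...
                    }
                } );
            } 1 .. 4;                           # pool size assumed
        }

        # Heavy application modules are loaded *after* the workers exist,
        # so the workers never carry them.
        use CGI::Application;

        # ... main application: enqueue work, enqueue terminators, join @workers ...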


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I've been running 40 instances of wget at a time, using this to monitor network activity: http://www.hageltech.com/dumeter/ . That is as opposed to 20 threads with your solution.

        If you want to try it out for yourself, I'm loading from this URL: http://api.eve-central.com/api/quicklook?typeid=24312 , with the typeid parameter cycling through these indexes:
        Regarding the preloading of URLs: the maximum number of URLs I'll need to load is ~10000. From what I can tell, the overhead of pre-loading is negligible compared to the actual downloading itself. Plus, as it is, it makes reading the code easier for me. :)

        Memory use itself is not THAT much of an issue. I'm fine with taking up half a GB; what I was not fine with were other solutions that would quickly balloon to 1.5 GB. I know that the best way to handle threads is to create them at the start of the app in a BEGIN block, but that isn't really an option here, as it's a CGI::App web application and there isn't really a way to know whether it'll actually do the downloading without loading the CGI::App stuff as well.

        Thanks for the information and advice in either case; I'll keep them in mind. :)