in reply to Re^2: Parallel downloading under Win32?
in thread Parallel downloading under Win32?

Seriously, it looks like you wrote that with the intent to make it as unreadable as possible.

Is that a request for clarification?

Suggestion: Run it standalone as posted first, to convince yourself that it actually works on your system.


Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
"Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."

Re^4: Parallel downloading under Win32?
by Xenofur (Monk) on Apr 29, 2009 at 21:36 UTC
    Oh, I had no doubt that it worked. I had trouble understanding *how* it worked. I write my Perl in a very declarative and verbose manner, had never had reason to use map before, didn't know you could string commands together with commas to act on $_ without wrapping them in braces, and didn't know why you were pushing undefs into the array.

    In short: The syntax and lack of any explanation completely stumped me.
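
    For anyone else tripped up by the same constructs, here is a small hypothetical snippet (not code from this thread; example.com, the id range and the thread count are placeholders) showing each idiom in isolation:

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        # 1. map: transform one list into another in a single expression.
        my @urls = map "http://example.com/page?id=$_", 1 .. 5;

        # 2. The comma operator plus a statement-modifier 'for': two expressions
        #    strung together, both acting on $_, with no { } block required.
        my @lines = ( "one\n", "two\n" );
        chomp, print "got: $_\n" for @lines;

        # 3. Undefs pushed onto the queue act as end-of-work markers: each
        #    worker thread exits when it dequeues an undef.
        my $T = 4;                              # number of worker threads
        my $Q = Thread::Queue->new;
        $Q->enqueue( @urls, (undef) x $T );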

    Either way, I have to admit that it is a superior solution to the wget method, as long as enough RAM is available. Getting it to run enough threads to run at comparable speed to the wget method required 300 MB. However, because it keeps the RAM usage under control and runs entirely on Perl modules, it is the better solution.

    As such, thanks a lot. :)

    FWIW, this is how I'm using it now:
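
    (What follows is only a hedged reconstruction of the general pattern being discussed -- a pool of worker threads pulling URLs from a Thread::Queue -- not the exact code; the thread count, the typeid list via @ARGV, and the per-page handling are assumptions for illustration.)

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;
        use LWP::Simple qw( get );

        my $thread_count = 20;                  # assumed; tune as discussed below
        my $Q            = Thread::Queue->new;

        # Build the work list (typeids taken from @ARGV purely for illustration),
        # then add one undef per worker as a termination marker.
        my @urls = map "http://api.eve-central.com/api/quicklook?typeid=$_", @ARGV;
        $Q->enqueue( @urls, (undef) x $thread_count );

        my @workers = map {
            threads->create( sub {
                while ( defined( my $url = $Q->dequeue ) ) {
                    my $content = get( $url );  # returns undef on failure
                    # ... parse/store $content here ...
                }
            } );
        } 1 .. $thread_count;

        $_->join for @workers;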
      Getting it to run enough threads to run at comparable speed to the wget method required 300 MB.

      How many wget instances were you running?

      I'd be really surprised if it is necessary to run 20 threads in order to saturate your bandwidth, unless the server you are connecting to is severely restricting the throughput of individual connections. And when that happens--for example, if the site is using thttpd or similar--then unless the webmaster is very naive, the throttling rates will apply across all concurrent connections from any given IP.

      Running 2 or 3 connections concurrently usually serves to maximise throughput. Beyond that, thread thrash tends to deteriorate throughput rather than increase it. Threads newbies tend to think 'more is better', but in reality that is rarely the case.

      Especially with TCP connections. TCP has been tuned over decades to utilise as much bandwidth as is available for each connection. Whilst using two concurrent connections will usually allow the second to 'mop up' any bandwidth under-utilised by the first, unless you have more than one processor/core, a third thread will usually hurt the performance of the first two through thread thrash. (Assuming unrestricted and infinite bandwidth from the server.)

      As a rule of thumb, I would suggest that you set $T (or $thread_count, as you would have it :) to no more than 2 * NoOfCores (sorry, $no_of_cores :).
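
      A hedged sketch of that rule of thumb, assuming the NUMBER_OF_PROCESSORS environment variable that Windows sets:

        # Rule of thumb: no more than 2 threads per core.
        # NUMBER_OF_PROCESSORS is set by Windows; fall back to 1 if it is missing.
        my $no_of_cores = $ENV{NUMBER_OF_PROCESSORS} || 1;
        my $T           = 2 * $no_of_cores;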


      Caveat: From the code you posted, you are pushing your entire URL list onto the queue prior to starting your threads. If your URL list is relatively small--say < 1e3--no harm done. But if your URL list is bigger than that, then I would highly recommend starting your threads first and including a call to yield() in your URL enqueue loop.
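
      A sketch of that ordering ($T, @urls and the body of worker() are assumed from the earlier discussion); threads->yield() gives the already-running workers a chance to drain the queue while it is still being filled, which keeps its peak size down:

        my $Q = Thread::Queue->new;

        # Start the workers *before* filling the queue ...
        my @workers = map { threads->create( \&worker, $Q ) } 1 .. $T;

        # ... then enqueue one URL at a time, yielding so the workers can start
        # consuming immediately rather than after the whole list is queued.
        for my $url ( @urls ) {
            $Q->enqueue( $url );
            threads->yield();
        }
        $Q->enqueue( (undef) x $T );    # termination markers, as before

        $_->join for @workers;

        sub worker {
            my( $q ) = @_;
            while ( defined( my $url = $q->dequeue ) ) {
                # ... fetch and process $url ...
            }
        }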

      Caveat 2: If you are seriously seeking to minimise memory usage, then you should consider starting your thread pool prior to loading (use-ing) the vast majority of whatever code or modules are needed by the main body of your application.

      The reason for this advice is that, for good or bad, the original author(s) of threads decided that each spawned thread would inherit everything already loaded by the main thread at the point of thread creation(*). (E.g. he/they decided to emulate the fork way of working!) By starting your worker threads early--remember that use takes effect at compile time--you can minimise the size of the primary thread and therefore the size of every subsequently spawned thread (see the sketch below the footnote).

      (*) Yes. I know it is dumb, but you try convincing those that have the power to change things of that!
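
      To make Caveat 2 concrete, a common shape for this is to spawn the pool inside a BEGIN block placed above the heavy use statements, so each worker is cloned from a still-small interpreter (a hedged sketch, not code from this thread; the pool size and CGI::Application are stand-ins):

        use strict;
        use warnings;
        use threads;
        use Thread::Queue;

        our $Q;
        our @workers;

        BEGIN {
            # Spawn the pool *before* the heavy modules below are compiled,
            # so each worker is cloned from a still-small main thread.
            $Q = Thread::Queue->new;
            @workers = map {
                threads->create( sub {
                    require LWP::Simple;        # loaded inside the worker only
                    while ( defined( my $url = $Q->dequeue ) ) {
                        my $content = LWP::Simple::get( $url );
                        # ... handle $content ...
                    }
                } );
            } 1 .. 4;                           # pool size assumed
        }

        # Heavy application modules are loaded *after* the workers exist,
        # so the workers never carry them.
        use CGI::Application;

        # ... main application: enqueue work, enqueue terminators, join @workers ...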


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        I've been running 40 instances of wget at a time, using this to monitor network activity: http://www.hageltech.com/dumeter/ . That is as opposed to 20 threads with your solution.

        If you want to try it out for yourself, I'm loading from this URL: http://api.eve-central.com/api/quicklook?typeid=24312 , with the typeid parameter cycling through these indexes:
        Regarding the preloading of URLs: the maximum number of URLs I'll need to load is ~10000. From what I can tell, the overhead of pre-loading is negligible compared to the actual downloading itself. Plus, as it is, it makes reading the code easier for me. :)

        Memory use itself is not THAT much of an issue. I'm fine with taking up half a GB; what I was not fine with were other solutions that would quickly balloon to 1.5 GB. I know that the best way to handle threads is to create them at the start of the app in a BEGIN block, but that isn't really an option here, as it's a CGI::App web application and there isn't really a way to know whether it'll actually do the downloading without loading the CGI::App stuff as well.

        Thanks for the information and advice in either case; I'll keep them in mind. :)