in reply to Getting/handling big files w/ perl

Although the computer that you are using has apparently beefy specs, a laptop-class computer does not have nearly the same data-throughput capability as a rack-style Macintosh server with a comparable CPU.   This machine has an impressive CPU, but that really isn’t what will most affect the completion time of a workload of this sort.   The biggest impact, I expect, will come from the slowest component ... the network ... and, in this case, from the manner in which that network is now being used.   (For instance, is the HTTP data stream gzip-compressed?)   The performance characteristics of SSDs can also surprise you.
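
For example, a short check with LWP::UserAgent can show whether a server will gzip the stream for you.   This is only a sketch, and the URL is purely hypothetical:

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url = 'http://example.com/export/big_download.dat';   # hypothetical URL

    my $ua = LWP::UserAgent->new;

    # Advertise gzip support; a cooperating server can then compress
    # the stream, trading a little CPU time for far less network traffic.
    my $resp = $ua->get( $url, 'Accept-Encoding' => 'gzip' );
    die $resp->status_line unless $resp->is_success;

    # decoded_content() transparently gunzips if the server compressed.
    printf "payload: %d bytes, on the wire: %d bytes\n",
        length( $resp->decoded_content ), length( $resp->content );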

Beyond that, I would look at bringing additional computers into the mix ... if the local network can support it ... and consider re-defining the problem itself, if that is possible.   For example, if the 0.5GB download leads to the 49 files, could the source instead provide several pre-compressed files (say, five of them) that several computers could simultaneously download, decompress locally, and then move to the destination location?
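
As a sketch of that idea, here is one worker per pre-compressed chunk, using LWP::UserAgent, Parallel::ForkManager, and IO::Uncompress::Gunzip from CPAN.   The URLs and the five-way split are illustrative assumptions, nothing more:

    use strict;
    use warnings;
    use LWP::UserAgent;
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
    use Parallel::ForkManager;

    # Hypothetical pre-split, pre-compressed chunks; the names and the
    # five-way split are illustrative only.
    my @chunks = map { "http://example.com/export/part$_.gz" } 1 .. 5;

    my $pm = Parallel::ForkManager->new( scalar @chunks );   # one worker per chunk

    for my $url (@chunks) {
        $pm->start and next;              # parent keeps looping; child does the work

        ( my $gz  = $url ) =~ s{.*/}{};   # local compressed filename
        ( my $out = $gz  ) =~ s/\.gz$//;  # decompressed filename

        # Stream the body straight to disk rather than holding it in RAM.
        my $resp = LWP::UserAgent->new->get( $url, ':content_file' => $gz );
        die "$url: ", $resp->status_line unless $resp->is_success;

        gunzip( $gz => $out )
            or die "gunzip of $gz failed: $GunzipError";

        $pm->finish;
    }
    $pm->wait_all_children;

The same split works across machines just as well: give each computer its own subset of the chunk list.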

From your description, I doubt that the process could be tremendously improved as it stands:   the process is I/O-bound and the I/O capabilities of the machine are lackluster.   The process could realistically (and perhaps significantly) be improved by re-defining it and then, as others suggested, “throwing silicon at” the re-defined process.

Re^2: Getting/handling big files w/ perl
by roboticus (Chancellor) on Nov 17, 2014 at 12:38 UTC

    sundialsvc4:

    The Mac Pro is a rather odd-looking laptop!

    In all seriousness, the OP didn't say "laptop" anywhere, so you should at least check the basics before making a silly guess. The Mac Pro is a serious server, at the top of the Apple line, not a laptop.

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.

Re^2: Getting/handling big files w/ perl
by BrowserUk (Patriarch) on Nov 17, 2014 at 11:48 UTC
    From your description, I doubt that the process could be tremendously improved as it stands: the process is I/O-bound and the I/O capabilities of the machine are lackluster. The process could realistically (and perhaps significantly) be improved by re-defining it and then, as others suggested, “throwing silicon at” the re-defined process.

    Now to debunk Yet Another of your Inglorious Theories.

    This shows a perl program downloading an 11MB file using 1, 2, 4, 8, 16 & 32 concurrent streams, on my 4-core CPU, across my relatively tardy 20Mb/s connection:

    C:\test>1107326 -T=1 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Fetching 11549785 bytes in 1 x 11549786 streams
    Received 11549785 bytes
    Took 55.765514851 secs

    C:\test>1107326 -T=2 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Fetching 11549785 bytes in 2 x 5774893 streams
    Received 11549785 bytes
    Took 27.654243946 secs

    C:\test>1107326 -T=4 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Fetching 11549785 bytes in 4 x 2887447 streams
    Received 11549785 bytes
    Took 15.210910082 secs

    C:\test>1107326 -T=8 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Fetching 11549785 bytes in 8 x 1443724 streams
    Received 11549785 bytes
    Took 9.515606880 secs

    C:\test>1107326 -T=16 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Fetching 11549785 bytes in 16 x 721862 streams
    Received 11549785 bytes
    Took 8.902327061 secs

    C:\test>1107326 -T=32 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Fetching 11549785 bytes in 32 x 360931 streams
    Received 11549785 bytes
    Took 16.386690140 secs

    As you can see, you get diminishing returns from the concurrency, but over-provisioning the 4-core CPU to manage 16 concurrent I/O-bound threads results in the best throughput.
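
    The program itself isn't reproduced above, but for the curious, a minimal sketch of the general technique (fetching N byte-ranges of one file concurrently via HTTP Range requests and Perl's threads) might look like this.   The URL, and the assumption that the server honours Range requests, are mine; this is not the posted program:

        use strict;
        use warnings;
        use threads;
        use LWP::UserAgent;

        # Sketch only: split one download into $n ranged GETs and stitch
        # the pieces back together. Assumes the server supports Range.
        my $n   = 16;
        my $url = 'http://example.com/goldenPath/hg19/chromosomes/chr21.fa.gz';  # hypothetical host

        my $head = LWP::UserAgent->new->head($url);
        die $head->status_line unless $head->is_success;
        my $size = $head->header('Content-Length');
        die "no Content-Length" unless defined $size;
        my $chunk = int( $size / $n ) + 1;
        print "Fetching $size bytes in $n x $chunk streams\n";

        my @workers = map {
            my $from = $_ * $chunk;
            my $to   = $from + $chunk - 1;
            $to = $size - 1 if $to >= $size;
            threads->create( sub {
                # Each thread issues its own ranged GET (HTTP 206).
                my $r = LWP::UserAgent->new->get( $url, Range => "bytes=$from-$to" );
                die $r->status_line unless $r->is_success;
                return $r->content;
            } );
        } 0 .. $n - 1;

        my $data = join '', map { $_->join } @workers;
        printf "Received %d bytes\n", length $data;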

    And how does that compare to using wget and a single thread on the same connection and processor:

    C:\test>wget http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    --2014-11-17 11:38:37--  http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
    Resolving ************... 128.114.119.***
    Connecting to ************|128.114.119.***|:80... connected.
    HTTP request sent, awaiting response... 200 OK
    Length: 11549785 (11M) [application/x-gzip]
    Saving to: `chr21.fa.gz'

    100%[====================================================================>] 11,549,785  379K/s   in 32s

    2014-11-17 11:39:30 (352 KB/s) - `chr21.fa.gz' saved [11549785/11549785]

    The concurrent version beats it hands down!

    ************ server name redacted to discourage the world+dog from hitting them by way of comparison.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.