Gisel has asked for the wisdom of the Perl Monks concerning the following question:

I am currently rewriting/moosing a very old perl script (did I really write this horrid code?) that glues together a numerical weather prediction system. (BTW, perl rocks for this application!)

One of the tasks here is to use "wget" to download a 0.5 Gb file. Another is to compress/uncompress 49 files, each of which is on the order of 300Mb. This is currently implemented using syscalls to wget/gzip/gunzip. The forecast model (FORTRAN, C, C++) itself is run as multiple parallel processes on several machines using MPI. The file handling, however, is NOT parallelized; a single machine is responsible for this task.

This was all conceived and constructed in an era (2004) when hardware was much less muscular. These days, my master node is an 8-core 64-Gb MacPro w/ 2 Tb of SSD. During the file getting/manipulation phases of the master process, this is all the machine is doing. I suspect that some latent compute capability could be used to enhance/speed-up the file manipulation process.

Speed is everything for this application, and a few minutes saved is worth a lot. Should I manipulate files within perl (perhaps avoiding things like unnecessary IO buffering) rather than do the sys calls? (Obviously network speed remains a wild card here.)

I have researched this a bit and already have some (possibly erroneous) thoughts, but thought I would toss the global concept out there to my perlish betters. This may save me some spurious bunny trails. Not that I don't like bunnies…

The difficulty lies, not in thinking the new ideas, but in escaping from the old ones.

Re: Getting/handling big files w/ perl
by BrowserUk (Patriarch) on Nov 16, 2014 at 05:34 UTC
    download a 0.5 Gb file.

    There are three limiting factors to the potential to speed up the download:

    1. Do you have the bandwidth at your machine to carry 2 or more parallel download streams?
    2. Does the server have the capacity to serve 2 or more parallel download streams?
    3. Will requesting parallel download streams break the serving site's terms and conditions? (Or just upset them?)

    If you can answer yes to the first two and (honestly) no to the third, then you might be able to decrease the download time by concurrently requesting two or more partial downloads using the Range: bytes=<start>-<end> header.

    This (I fuzzily recall) can be achieved using LWP, though I found it both easier and faster to use IO::Socket and raw HTTP.
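
    Something along these lines is the basic idea (a rough sketch only, untested here; the URL, part count, and file names are placeholders rather than the code I actually used):

    #!/usr/bin/perl
    # Sketch: fetch one file as several concurrent byte ranges, then stitch
    # the parts together. Assumes the server honours Range requests and
    # reports Content-Length.
    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url   = 'http://example.gov/path/to/big_file';   # placeholder URL
    my $parts = 4;                                        # concurrent streams

    my $ua   = LWP::UserAgent->new;
    my $head = $ua->head($url);
    die "HEAD failed: ", $head->status_line unless $head->is_success;
    my $size = $head->header('Content-Length')
        or die "Server did not report Content-Length";

    my $chunk = int( $size / $parts ) + 1;
    my @pids;

    for my $i ( 0 .. $parts - 1 ) {
        my $start = $i * $chunk;
        my $end   = $start + $chunk - 1;
        $end = $size - 1 if $end >= $size;

        defined( my $pid = fork() ) or die "fork failed: $!";
        if ( $pid == 0 ) {    # child: fetch one range into its own part file
            my $res = $ua->get( $url,
                Range           => "bytes=$start-$end",
                ':content_file' => "part.$i",
            );
            exit( $res->is_success ? 0 : 1 );
        }
        push @pids, $pid;
    }
    waitpid( $_, 0 ) for @pids;    # a real version would check exit codes

    # Reassemble the parts in order.
    open my $out, '>:raw', 'big_file' or die $!;
    for my $i ( 0 .. $parts - 1 ) {
        open my $in, '<:raw', "part.$i" or die $!;
        print {$out} do { local $/; <$in> };
        close $in;
        unlink "part.$i";
    }
    close $out;

    Whether 2, 4 or 8 streams is the sweet spot is something you'd have to measure against your own link and the server.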

    Another is to compress/uncompress 49 files, each of which is on the order of 300Mb.

    Which is it? Compress them or uncompress them? Or both (in that order or the reverse)? And are they anything to do with the 0.5GB file?

    On the surface it seems likely that there is potential to overlap at least some of this work; but you'll need to make it a lot clearer what you are actually doing.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      BrowserUk writes: (1) Do you have the bandwidth at your machine to carry 2 or more parallel download streams? (2) Does the server have the capacity to serve 2 or more parallel download streams? (3) Will requesting parallel download streams break the serving site's terms and conditions? (Or just upset them?)

      Theoretically yes on (1) and (2), with the caveat that I am in Alaska, and even though I am on a university trunk, realizable (vs. theoretical) bandwidth can at times be sub-optimal. Depends on time of day, day of month, tide height…

      For (3), the source site is a “.gov” site and has different rules for different parties, so I am not sure where I stand. Personal contacts have advised “try it and see what happens and who (if anyone) squawks, and work it from there”. So, I am ready to give it a go. For example, several simultaneous “curls” seem to work OK.

      “you might be able to decrease the download time by concurrently requesting two or more partial downloads using the range…” The idea of directly decomposing the file and simultaneously downloading component parts is an idea I had not thought of. Actually, I had kind of thought of it but had hoped that such code might already exist. If someone has done it already, why recreate this wheel…

      Which is it? Compress them or uncompress them? Or both (in that order or the reverse)? And are they anything to do with the 0.5GB file? On the surface it seems likely that there is potential to overlap at least some of this work; but you'll need to make it a lot clearer what you are actually doing.

      Both, actually, though not sequentially. The big (0.5 Gb) file is used to initialize a numerical weather prediction (NWP) model. (Actually, I need several of these.) The output from the model is a set of hourly forecast states (~300 Mb each, in NetCDF format), one for each hour of the 48 h forecast period, plus one for the initial state, hence 49 files. Each file gets “post-processed” upon output (one about every 7 minutes). Ultimately, the output files get gzipped (either as a collection at the end of the run or one at a time immediately after processing).

      The files are then finally moved out of the working directory and written to RAID. Rinse. Wash. Repeat. 4 to 6 times each day, every day. So keeping the working directory clean of uncompressed files is essential and the most likely place for failure of the whole process.
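
      By way of illustration, the "one at a time" variant done in Perl rather than via syscalls would amount to something like this (a sketch only; the paths and file name are made up):

      use strict;
      use warnings;
      use IO::Compress::Gzip qw(gzip $GzipError);
      use File::Copy qw(move);

      # Sketch: compress one post-processed output file and move the result
      # out of the working directory onto the RAID. Paths are hypothetical.
      my $work_dir = '/work/wrf';
      my $raid_dir = '/raid/archive';
      my $file     = 'wrfout_d01_2014-11-16_00.nc';

      gzip "$work_dir/$file" => "$work_dir/$file.gz"
          or die "gzip failed: $GzipError";

      unlink "$work_dir/$file" or warn "could not remove $work_dir/$file: $!";
      move( "$work_dir/$file.gz", "$raid_dir/$file.gz" )
          or die "move failed: $!";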

      Uncompression: In the course of research, the RAID-archived .gz files are not infrequently uncompressed into a working area (usually as a 49-file batch) for further interrogation. If several minutes can be saved (somehow) in uncompressing said 49 files (and this might need to be done for a 30-day period: 30 x 49 ~ 1500 files), a few seconds for each file might really add up.

      I doubt that sys calls to "gzip" and "gunzip" are optimal for this. There has to be a big IO buffering price to pay here.
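
      For the batch case, what I am imagining is roughly the following: fan the gunzips out across the cores (only a sketch, not benchmarked; the directory names and worker count are invented):

      use strict;
      use warnings;
      use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
      use Parallel::ForkManager;

      # Sketch: uncompress a batch of archived .gz files in parallel,
      # one child process per file, at most $workers at a time.
      my $archive_dir = '/raid/archive/2014-10';
      my $work_dir    = '/work/interrogate';
      my $workers     = 8;

      my $pm = Parallel::ForkManager->new($workers);

      for my $gz ( glob "$archive_dir/*.nc.gz" ) {
          $pm->start and next;    # parent: queue the next file
          ( my $out = $gz ) =~ s{^\Q$archive_dir\E}{$work_dir};
          $out =~ s/\.gz$//;
          gunzip $gz => $out
              or die "gunzip $gz failed: $GunzipError";
          $pm->finish;            # child exits
      }
      $pm->wait_all_children;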

      We are all vivified… only to be ultimately garbage-collected.

        The idea of directly decomposing the file and simultaneously downloading component parts is an idea I had not thought of. Actually, I had kind of thought of it but had hoped that such code might already exist. If someone has done it already, why recreate this wheel…

        I'm sure the code is out there somewhere, but it's not that hard to implement. I have code that worked for my purposes a couple of years since, but I am reluctant to pass it along because I know that it worked for one server, but froze when I tried it on a couple of others -- for reasons I never bothered to investigate. (If you /msg me a url for one of your files I can dig the code out and try it.)

        You might also look at "download managers", e.g. uGet or similar. Several of those available for Windows will do multiple concurrent streams. I don't use them, but they might be worth investigating/comparing to wget for the servers you use.

        Uncompression: In the course of research, the RAID-archived .gz files are not infrequently uncompressed into a working area (usually as a 49-file batch) for further interrogation. If several minutes can be saved (somehow) in uncompressing said 49 files (and this might need to be done for a 30-day period: 30 x 49 ~ 1500 files), a few seconds for each file might really add up.

        The most time efficient (and cost effective) way to speed up your compression/decompression, would be to skip it entirely!

        30 * 49 * 300MB = 441GB. Whilst that is a substantial amount of disk, you're only likely to cut it in half using per-file compression.

        However, if your RAID drives do block-level de-duplication, you're likely to save more space by allowing those algorithms to operate upon (and across the whole set of) uncompressed files than you will by compressing the files individually and asking them to dedup those, because per-file compression tends to mask the cross-file commonalities.

        This is especially true for generational datasets (which it sounds like yours may be). That is to say, where the second file produced represents the same set of data points as the first, but having moved on through some elapsed process. E.g. weather over time. This type of dataset often has considerable commonalities between successive generations, which block-level deduping can exploit to achieve seemingly miraculous space savings.

        Worth a look at least.

        If that is not possible, or proves to be less effective, then the next thing I would look at is doing the compression on-the-fly on output, and the decompression on-the-fly on input, rather than writing uncompressed data to disk and then reading it back/processing/writing it again.

        The best way to go about that will depend upon whether the writing processes are Perl/C/other languages. If they are Perl, then it may be as simple as adding an appropriate IO layer when you open the output file, and the compression will be otherwise transparent to the writing processes, thus requiring minimal changes. But there is little point in speculating further until you give more information.
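
        For the Perl case, the kind of thing I mean is this (a sketch only, assuming PerlIO::gzip is installed; the file name and data source are made up):

        use strict;
        use warnings;
        use PerlIO::gzip;    # provides the :gzip IO layer

        # Sketch: write gzip-compressed output directly, so no separate gzip
        # pass over an uncompressed file is ever needed.
        open my $out, '>:gzip', 'forecast_hour_01.dat.gz'
            or die "open failed: $!";
        print {$out} $_ for generate_records();    # hypothetical data source
        close $out or die "close failed: $!";

        # Reading it back is the mirror image (line-by-line shown only for
        # illustration; binary reads work the same way through the layer).
        open my $in, '<:gzip', 'forecast_hour_01.dat.gz'
            or die "open failed: $!";
        while ( my $line = <$in> ) {
            # process on the fly, never touching an uncompressed copy on disk
        }
        close $in;

        sub generate_records { return ("example record\n") }    # stand-in only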

        I strongly recommend you investigate the actual on-disk space requirements of storing your files uncompressed -- assuming that your hardware/software supports on-the-fly dedupe, which most NAS/SAN hardware produced in the last 5 or so years does.

        I've spent a good part of my career optimising processes, and the single most effective optimisation you can ever do is to avoid doing things that aren't absolutely necessary. In your case, compressing to later have to decompress is not necessary! It is a (cost) optimisation that itself can be optimised, by not doing it.


        With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.

        Gisel:

        In addition to BrowserUk's tips, one I found useful was to filter the data if you don't need it all. I had to deal with processing a horrendous amount of credit card transaction information in the past, and filtering out the data I didn't need allowed me to save quite a bit of storage space[*]. So if the resulting files have a large amount of data in them you won't ever use, you may find it worthwhile to filter the data before storing it.

        You mention that the input files are in NetCDF format, so I did a quick surf to Wikipedia's NetCDF article, and see that there are some unix command-line tools for file surgery already available. So if you know the items you need from the files, you may be able to chop out a good bit of data from them and avoid compression altogether. If you're storing the files locally, you can probably avoid the time cost of filtering the data by using your filtering operation as the operation you use to copy to long-term storage (saving some network traffic to your SAN in the bargain).
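
        For example, something along these lines (just a sketch; I don't use NetCDF myself, so the variable names are invented and I'm assuming NCO's ncks tool is installed):

        use strict;
        use warnings;

        # Sketch: keep only the variables you actually interrogate while
        # copying to long-term storage. Variable names and paths are
        # hypothetical; assumes the NCO tool 'ncks' is on the PATH.
        my @keep    = qw(T2 U10 V10 PSFC);
        my $src     = '/work/wrfout_d01_2014-11-16_00.nc';
        my $archive = '/raid/archive/wrfout_d01_2014-11-16_00_subset.nc';

        my @cmd = ( 'ncks', '-v', join( ',', @keep ), $src, $archive );
        system(@cmd) == 0
            or die 'ncks failed (exit ' . ( $? >> 8 ) . ')';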

        *: My original purpose wasn't to save the disk space, but to use a single file format for my process. The incoming data was in multiple very different format types. (About 15 different file formats, IIRC.) The resulting space savings (substantial!) were just a by-product of the change in file format.

        Update: Fixed acronym... (I wonder what IIRS might mean? D'oh!)

        ...roboticus

        When your only tool is a hammer, all problems look like your thumb.

Re: Getting/handling big files w/ perl
by fishmonger (Chaplain) on Nov 16, 2014 at 00:05 UTC
    The first step in optimizing any perl script is to profile it to find out where it's spending most of its time, and focus on those parts before moving on. The best module that I'm aware of to do the profiling is Devel::NYTProf.

      I am a big booster of Devel::NYTProf, and have found some surprising and counter-intuitive results from it. However, in the present case my time is currently being spent outside of perl, in system calls to other executables that I do not have control of (and can't profile inside of). I do use Timer::Simple and Time::HiRes to keep track of time spent on system calls.
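
      The timing wrapper amounts to little more than this (a stripped-down sketch with a placeholder URL, not my production code):

      use strict;
      use warnings;
      use Time::HiRes qw(gettimeofday tv_interval);

      # Sketch: time an external command the same way I time the real
      # wget/gzip steps.
      my @cmd = ( 'wget', '-q', 'http://example.gov/path/to/init_file.grb2' );

      my $t0 = [gettimeofday];
      system(@cmd) == 0 or warn 'command failed (exit ' . ( $? >> 8 ) . ')';
      printf "%s took %.2f s\n", "@cmd", tv_interval($t0);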

      My lack of knowledge is along these lines: are "wget" (downloaded and compiled) and /usr/bin/gzip (part of the distro) blunt tools for my needs, or are they sharp tools via pedigree and refinement? My sense is that they are designed to address a whole class of problems, whereas a perl module (or family of modules) is designed to address a specific problem. Ergo, a more efficient perlish solution is probably worth the effort. Unless I hear otherwise.

      Also, given CPAN's numerous code trees (seemingly) providing different approaches, a nudge in the most productive direction would be appreciated.

      Thanx
Re: Getting/handling big files w/ perl
by karlgoethebier (Abbot) on Nov 16, 2014 at 12:48 UTC

    HTTP::Range (OK, unmaintained for 10 years) as well as LWP::Parallel::UserAgent might be worth a look.

    Update: I just reread the curl manpage.

    Curl has a --range option:

    host:~ # curl --range 0-79 -o out.html perlmonks.org
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
      0    80    0    80    0     0    368      0 --:--:-- --:--:-- --:--:--     0
    host:~ # wc -c out.html
    80 out.html
    host:~ # cat out.html
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">

    See also WWW::Curl

    Regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

      Karl—

      Actually I have been looking pretty hard at LWP::Parallel::UserAgent. It might be the tool I am looking for, but I am just not clear yet on how I would fit my problem into it.

      Also, this whole "information transmission" issue is not one I really have a firm footing on. The spin-up is taking me a while...

      Thanx for your input.

Re: Getting/handling big files w/ perl
by flexvault (Monsignor) on Nov 16, 2014 at 15:20 UTC

    Gisel,

    First -- Welcome!

      One of the tasks here is to use "wget" to download a 0.5 Gb file. Another is to compress/uncompress 49 files, each of which is on the order of 300Mb. This is currently implemented using syscalls to wget/gzip/gunzip.

    Now, this is a guess, but it sounds like (from your description) the compressed files you need are static and the only dynamic file is the 'wget ... 0.5 GB file'. In my experience, syscalls become academic if the file size is larger than 2MB, so continue to use the syscalls. But if the 49 files are static, then forget about the gzip/gunzip steps and leave them as raw data files. With your current equipment that should be easy!

    If all files are dynamic, then I would spend the time updating the network (if possible). GigE is inexpensive today also.

    And I agree with the earlier suggestion to use the 'Devel::NYTProf' profiler to find any *real* bottlenecks.

    Good Luck...Ed

    "Well done is better than well said." - Benjamin Franklin

Re: Getting/handling big files w/ perl
by oiskuu (Hermit) on Nov 16, 2014 at 20:07 UTC

    Batch processing, sequential tasks with dependencies? Sounds like an excellent candidate for make automation. Makefile recipes specify dependencies and the necessary build steps. Going parallel can be as easy as make -j8.

    Probably the foremost design concern is to think of your data as streams. How fast can you stream over the net and to the disk, what is the (aggregate) bandwidth of decompression. Dimension the pipes and assemble accordingly.

    There are other tools besides wget: rsync is efficient, flexible, and can do on-the-fly compression. It might be applicable to your situation, but we're lacking the details.
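
    For instance, the archive step could be handled with something like this (a sketch; the host and paths are placeholders, and -z only pays off when the link is slower than the compressor):

    use strict;
    use warnings;

    # Sketch: push the run's output to the archive host with rsync's
    # on-the-fly compression and resumable transfers.
    my @cmd = (
        'rsync', '-az', '--partial',
        '/work/wrf/output/',                    # trailing slash: contents only
        'archive-host:/raid/archive/2014-11-16/',
    );
    system(@cmd) == 0
        or die 'rsync failed (exit ' . ( $? >> 8 ) . ')';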

Re: Getting/handling big files w/ perl
by Anonymous Monk on Nov 16, 2014 at 14:38 UTC
    My guess is that you should throw silicon at it. The bottleneck, whatever it is, is probably hardware, not software ... download speed; the fact that you are using WGET (thus HTTP encoding/decoding) ... the speed of the storage subsystem. Parallelizing the operations of the CPU, even though there is more than one core, probably will not improve the situation. Measure to prove otherwise before proceeding.
      "...throw silicon at it..."

      He already throws it: "8-core 64-Gb MacPro w/ 2 Tb of SSD".

      Regards, Karl

      P.S.: I wish for this nice gear for Christmas ;-)

      «The Crux of the Biscuit is the Apostrophe»

      My guess is that you should throw silicon at it.

      So intuitive you are. Not!

      He has the hardware. What he's asking is how he can make good use of it.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
Re: Getting/handling big files w/ perl
by sundialsvc4 (Abbot) on Nov 17, 2014 at 04:10 UTC

    Although the computer that you are using has apparently beefy “specs,” a laptop-class computer does not have nearly the same data-throughput capabilities as a rack-style Macintosh server with a comparable CPU. This machine has an impressive CPU capability, but that really isn't what is going to most affect the completion time of a workload of this sort. The biggest impact, I expect, will come from the slowest component ... the network ... and, in this case, from the manner in which that network is now being used. (For instance, is the HTTP data-stream gzipped?) The performance characteristics of SSDs can also surprise you.

    Beyond that, I would look at bringing additional computers into the mix ... if the local network can support it ... and consider re-defining the problem itself, if that is possible. For example, if the 0.5GB download leads to the 49 files, could the source instead provide (say, 5...) multiple files, each of them pre-compressed, that (say, 5 ...) multiple computers could simultaneously download, decompress locally, and then move to the destination location?

    From your description, I doubt that the process could be tremendously improved as it stands: the process is I/O-bound and the I/O capabilities of the machine are lackluster. The process could be realistically (but, perhaps significantly) improved by re-defining it and then, as others suggested, “throwing silicon at” (the re-defined process).

      sundialsvc4:

      The Mac Pro is a rather odd-looking laptop!

      In all seriousness, the OP didn't say laptop anywhere, so you should at least check the basics before making a silly guess. The MacPro is a serious server, at the top of the Apple line, not a laptop.

      ...roboticus

      When your only tool is a hammer, all problems look like your thumb.

      From your description, I doubt that the process could be tremendously improved as it stands: the process is I/O-bound and the I/O capabilities of the machine are lackluster. The process could be realistically (but, perhaps significantly) improved by re-defining it and then, as others suggested, “throwing silicon at” (the re-defined process).

      Now to debunk Yet Another of your Inglorious Theories.

      This shows a perl program downloading an 11MB file using 1,2,4,8,16 & 32 concurrent streams, on my 4 core CPU, across my relatively tardy 20Mb/s connection:

      C:\test>1107326 -T=1 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Fetching 11549785 bytes in 1 x 11549786 streams
      Received 11549785 bytes
      Took 55.765514851 secs

      C:\test>1107326 -T=2 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Fetching 11549785 bytes in 2 x 5774893 streams
      Received 11549785 bytes
      Took 27.654243946 secs

      C:\test>1107326 -T=4 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Fetching 11549785 bytes in 4 x 2887447 streams
      Received 11549785 bytes
      Took 15.210910082 secs

      C:\test>1107326 -T=8 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Fetching 11549785 bytes in 8 x 1443724 streams
      Received 11549785 bytes
      Took 9.515606880 secs

      C:\test>1107326 -T=16 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Fetching 11549785 bytes in 16 x 721862 streams
      Received 11549785 bytes
      Took 8.902327061 secs

      C:\test>1107326 -T=32 http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Fetching 11549785 bytes in 32 x 360931 streams
      Received 11549785 bytes
      Took 16.386690140 secs

      As you can see, you get diminishing returns from the concurrency, but over-provisioning the 4-core CPU to manage 16 concurrent IO-bound threads results in the best throughput.

      And how does that compare to using WGET and a single thread on the same connection and processor:

      C:\test>wget http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      --2014-11-17 11:38:37--  http://************/goldenPath/hg19/chromosomes/chr21.fa.gz
      Resolving ************... 128.114.119.***
      Connecting to ************|128.114.119.***|:80... connected.
      HTTP request sent, awaiting response... 200 OK
      Length: 11549785 (11M) [application/x-gzip]
      Saving to: `chr21.fa.gz'

      100%[====================================================================>] 11,549,785   379K/s   in 32s

      2014-11-17 11:39:30 (352 KB/s) - `chr21.fa.gz' saved [11549785/11549785]

      It beats it hands down!

      ************ server name redacted to discourage the world+dog from hitting them by way of comparison.


      With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.