in reply to Re: Getting/handling big files w/ perl
in thread Getting/handling big files w/ perl

BrowserUK writes: (1) Do you have the bandwidth at your machine to carry 2 or more parallel download streams? (2) Does the server have the capacity to serve 2 or more parallel download streams? (3) Will requesting parallel download streams break the serving site's terms and conditions? (Or just upset them?)

Theoretically yes on (1) and (2), with the caveat that I am in Alaska, and even though I am on a university trunk, realizable (vs. theoretical) bandwidth can at times be sub-optimal. Depends on time of day, day of month, tide height…

For (3), the source site is a “.gov” site and has different rules for different parties, so I am not sure where I stand. Personal contacts have advised “try it and see what happens and who (if anyone) squawks, and work it from there”. So, I am ready to give it a go. For example, several simultaneous “curls” seem to work OK.

“you might be able to decrease the download time by concurrently requesting two or more partial downloads using the range…” The idea of directly decomposing the file and simultaneously downloading component parts is an idea I had not thought of. Actually, I had kind of thought of it but had hoped that such code might already exist. If someone has done it already, why recreate this wheel…
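Roughly, I imagine something along these lines (a sketch only, untested; the URL, chunk count, and file names are placeholders, and it assumes the server honours HTTP Range requests and reports a Content-Length):

    use strict;
    use warnings;
    use LWP::UserAgent;

    my $url    = 'https://example.gov/data/big_model_input.grb';  # placeholder URL
    my $chunks = 4;                                               # number of parallel streams
    my $ua     = LWP::UserAgent->new;

    # Ask the server how big the file is.
    my $head = $ua->head($url);
    die 'HEAD failed: ', $head->status_line unless $head->is_success;
    my $size = $head->header('Content-Length')
        or die "Server did not report Content-Length; ranged download is out";

    my $piece = int( $size / $chunks ) + 1;
    my @kids;

    for my $i ( 0 .. $chunks - 1 ) {
        my $start = $i * $piece;
        my $end   = $start + $piece - 1;
        $end = $size - 1 if $end >= $size;

        defined( my $pid = fork() ) or die "fork failed: $!";
        if ( $pid == 0 ) {    # child: fetch one byte range into its own part file
            my $res = LWP::UserAgent->new->get(
                $url,
                Range           => "bytes=$start-$end",
                ':content_file' => "part.$i",
            );
            exit( $res->is_success ? 0 : 1 );
        }
        push @kids, $pid;
    }
    waitpid $_, 0 for @kids;

    # Stitch the parts back together in order.
    open my $out, '>:raw', 'big_model_input.grb' or die $!;
    for my $i ( 0 .. $chunks - 1 ) {
        open my $in, '<:raw', "part.$i" or die "part.$i missing: $!";
        local $/;                        # slurp the whole part
        print {$out} scalar <$in>;
        close $in;
        unlink "part.$i";
    }
    close $out;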

“Which is it? Compress them or uncompress them? Or both (in that order or the reverse)? And are they anything to do with the 0.5GB file? On the surface it seems likely there is potential to overlap at least some of this work; but you'll need to make it a lot clearer what you are actually doing.”

Both, actually, though not sequentially. The big (0.5 GB) file is used to initialize a numerical weather prediction (NWP) model. (Actually, I need several of these.) The outputs from the model are hourly forecast states (~300 MB each, in NetCDF format), one for each hour of the 48-hour forecast period, plus one for the initial state, hence 49 files. Each file gets “post-processed” upon output (one about every 7 minutes). Ultimately, the output files get gzipped (either as a collection at the end of the run, or one at a time immediately after processing).

The files are then finally moved out of the working directory and written to RAID. Rinse. Wash. Repeat. 4 to 6 times each day, every day. So keeping the working directory clean of uncompressed files is essential, and it is the most likely point of failure for the whole process.
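For the "one at a time immediately after processing" variant, I picture something like the following (a sketch only; the paths and file name are placeholders, and the error handling would need beefing up before trusting it with the real archive):

    use strict;
    use warnings;
    use IO::Compress::Gzip qw(gzip $GzipError);
    use File::Copy qw(move);

    my $work_dir = '/scratch/work';                   # placeholder paths
    my $raid_dir = '/raid/archive/run_2014111700';

    # Compress one freshly post-processed output file, then get it off the
    # working disk immediately so uncompressed files never accumulate.
    sub archive_one {
        my ($nc_file) = @_;

        my $gz_file = "$nc_file.gz";
        gzip $nc_file => $gz_file, BinModeIn => 1
            or die "gzip of $nc_file failed: $GzipError";

        unlink $nc_file or warn "could not remove $nc_file: $!";

        ( my $base = $gz_file ) =~ s{^.*/}{};
        move( $gz_file, "$raid_dir/$base" )
            or die "move of $gz_file to RAID failed: $!";
    }

    archive_one("$work_dir/forecast_f06.nc");         # hypothetical file name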

Uncompression: In the course of research, the RAID-archived .gz files are not infrequently uncompressed into a working area (usually as a 49-file batch) for further interrogation. If several minutes can be saved (somehow) in uncompressing said 49 files (and this might need to be done for a 30-day period: 30 x 49 ~ 1500 files), a few seconds for each file might really add up.

I doubt that shelling out to "gzip" and "gunzip" is optimal for this. There has to be a big I/O buffering price to pay here.
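For the batch-decompression case, I'm imagining something in-process along these lines (a sketch only; the paths and worker count are placeholders, and whether it actually beats a plain external gunzip would need benchmarking against the real RAID):

    use strict;
    use warnings;
    use File::Basename qw(basename);
    use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
    use Parallel::ForkManager;

    my $src_dir  = '/raid/archive/run_2014111700';   # placeholder paths
    my $work_dir = '/scratch/work';
    my $workers  = 4;                                # tune to CPUs/spindles

    my $pm = Parallel::ForkManager->new($workers);

    for my $gz ( glob("$src_dir/*.gz") ) {
        $pm->start and next;                         # parent queues the next file

        # decompress foo.nc.gz from the archive into the working area as foo.nc
        my $out = "$work_dir/" . basename( $gz, '.gz' );
        gunzip $gz => $out, BinModeOut => 1
            or warn "gunzip of $gz failed: $GunzipError\n";

        $pm->finish;                                 # child exits
    }
    $pm->wait_all_children;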

We are all vivified… only to be ultimately garbage-collected.


Re^3: Getting/handling big files w/ perl
by BrowserUk (Patriarch) on Nov 17, 2014 at 08:10 UTC
    The idea of directly decomposing the file and simultaneously downloading component parts is an idea I had not thought of. Actually, I had kind of thought of it but had hoped that such code might already exist. If someone has done it already, why recreate this wheel…

    I'm sure the code is out there somewhere, but it's not that hard to implement. I have code that worked for my purposes a couple of years since, but I am reluctant to pass it along because I know that it worked for one server, but froze when I tried it on a couple of others -- for reasons I never bothered to investigate. (If you /msg me a url for one of your files I can dig the code out and try it.)

    You might also look at "download managers" eg. uGet or similar. Several of those available for Windows will do multiple concurrent streams. I don't use them, but they might be worth investigating/comparing to wget for the servers you use.

    Uncompression: In the course of research, the RAID-archived .gz files are not infrequently uncompressed into a working area (usually as a 49-file batch) for further interrogation. If several minutes can be saved (somehow) in uncompressing said 49 files (and this might need to be done for a 30-day period: 30 x 49 ~ 1500 files), a few seconds for each file might really add up.

    The most time efficient (and cost effective) way to speed up your compression/decompression, would be to skip it entirely!

    30 * 49 * 300 MB ≈ 441 GB. Whilst that is a substantial amount of disk, you're only likely to cut it in half using per-file compression.

    However, if your RAID drives do block-level de-duplication, you're likely to save more space by allowing those algorithms to operate upon (and across) the whole set of uncompressed files than you will by compressing each file individually and asking them to dedup those, because the compression on the individual files tends to mask the cross-file commonalities.

    This is especially true for generational datasets (which it sounds like yours may be). That is to say, where the second file produced represents the same set of data points as the first, but having moved on through some elapsed process. Eg. weather over time. This type of dataset often has considerable commonalities between successive generations, which block-level deduping can exploit to achieve seemingly miraculous space savings.

    Worth a look at least.

    If that is not possible, or proves to be less effective, then the next thing I would look at is doing the compression on-the-fly on output, and the decompression on-the-fly on input, rather than writing uncompressed data to disk and then reading it back/processing it/writing it again.

    The best way to go about that will depend upon whether the writing processes are Perl/C/other languages. If they are Perl, then it may be as simple as adding an appropriate IO-layer when you open the output file, and the compression will be otherwise transparent to the writing processes, thus requiring minimal changes. But there is little point in speculating further until you give more information.
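    As a sketch only, one way that transparent layer could look is PerlIO::gzip from CPAN (the file names and data source below are made up, and you'd want to confirm the module plays nicely with your NetCDF post-processing chain):

        use strict;
        use warnings;
        use PerlIO::gzip;    # provides the :gzip layer

        # Writing: data is gzip-compressed as it is written; no uncompressed
        # copy ever lands on the working disk.
        my $some_netcdf_bytes = "...";    # placeholder for real model output
        open my $out, '>:gzip', 'forecast_f06.nc.gz'
            or die "open for gzip write failed: $!";
        print {$out} $some_netcdf_bytes;
        close $out;

        # Reading: the same layer decompresses transparently on input.
        open my $in, '<:gzip', 'forecast_f06.nc.gz'
            or die "open for gzip read failed: $!";
        my $chunk;
        while ( read $in, $chunk, 1 << 20 ) {    # process 1 MB at a time
            # ... hand $chunk to whatever does the interrogation ...
        }
        close $in;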

    I strongly recommend you investigate the actual on-disk space requirements of storing your files uncompressed -- assuming that your hardware/software supports on-the-fly dedupe, which most NAS/SAS etc. produced in the last 5 or so years does.

    I've spent a good part of my career optimising processes, and the single most effective optimisation you can ever do is to avoid doing things that aren't absolutely necessary. In your case, compressing only to later have to decompress is not necessary! It is a (cost) optimisation that can itself be optimised, by not doing it.


    With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re^3: Getting/handling big files w/ perl
by roboticus (Chancellor) on Nov 17, 2014 at 12:28 UTC

    Gisel:

    In addition to BrowserUk's tips, one I found useful was to filter the data if you don't need it all. I had to deal with processing a horrendous amount of credit card transaction information in the past, and filtering out the data I didn't need allowed me to save quite a bit of storage space[*]. So if the resulting files have a large amount of data in them that you won't ever use, you may find it worthwhile to filter the data before storing it.

    You mention that the input files are in NetCDF format, so I did a quick surf to Wikipedia's NetCDF article, and saw that there are some Unix command-line tools for file surgery already available. So if you know the items you need from the files, you may be able to chop out a good bit of data from them and avoid compression altogether. If you're storing the files locally, you can probably avoid the time cost of filtering the data by using your filtering operation as the operation you use to copy to long-term storage (saving some network traffic to your SAN in the bargain).
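    As a rough illustration of that idea, a little Perl wrapper around NCO's ncks (one of the command-line tools that article points at) might look something like the sketch below; the variable names and paths are made up, and you'd substitute whichever fields you actually interrogate:

        use strict;
        use warnings;
        use File::Basename qw(basename);

        # Keep only the variables actually used downstream (names here are made up).
        my @keep_vars = qw(T2 U10 V10 PSFC RAINNC);
        my $var_list  = join ',', @keep_vars;

        my $src_dir  = '/scratch/run_output';     # where the full NetCDF files land
        my $dest_dir = '/raid/archive/filtered';  # long-term storage

        for my $nc ( glob("$src_dir/*.nc") ) {
            my $out = "$dest_dir/" . basename($nc);

            # ncks -v extracts just the named variables into a new, smaller file;
            # -O overwrites the output if it already exists.
            my $rc = system 'ncks', '-O', '-v', $var_list, $nc, $out;
            warn "ncks failed on $nc (exit " . ( $rc >> 8 ) . ")\n" if $rc != 0;
        }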

    *: My original purpose wasn't to save the disk space, but to use a single file format for my process. The incoming data was in multiple very different format types. (About 15 different file formats, IIRC.) The processor needed the files sorted and in a different format. The resulting space savings (substantial!) were just a product of the input file format.

    Update: Fixed acronym... (I wonder what IIRS might mean? D'oh!)

    ...roboticus

    When your only tool is a hammer, all problems look like your thumb.