Directly decomposing the file and simultaneously downloading the component parts is an idea I had not thought of. Actually, I had kind of thought of it, but had hoped that such code might already exist. If someone has done it already, why recreate the wheel…

I'm sure the code is out there somewhere, but it's not that hard to implement. I have code that worked for my purposes a couple of years ago, but I am reluctant to pass it along because I know that it worked for one server, yet froze when I tried it on a couple of others -- for reasons I never bothered to investigate. (If you /msg me a url for one of your files I can dig the code out and try it.)
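For what it's worth, the basic shape of such code is simple enough: HEAD the file to learn its size, split the byte range into N pieces, fetch each piece concurrently with an HTTP Range request, then stitch the pieces back together. Here is a minimal sketch along those lines (not my old code); the URL, output name and part count are placeholders, and it assumes the server honours Range requests and reports a Content-Length:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use LWP::UserAgent;

    # Placeholders -- substitute your own URL, output name and part count.
    my $url   = 'http://example.com/big/file.dat';
    my $out   = 'file.dat';
    my $parts = 4;

    my $ua   = LWP::UserAgent->new;
    my $head = $ua->head( $url );
    die "HEAD failed: ", $head->status_line unless $head->is_success;

    my $size = $head->header( 'Content-Length' )
        or die "Server did not report a Content-Length";
    warn "Server does not advertise 'Accept-Ranges: bytes'; this may not work\n"
        unless ( $head->header( 'Accept-Ranges' ) || '' ) =~ /bytes/;

    my $chunk = int( $size / $parts ) + 1;

    # One child process per part, each fetching its own byte range straight to disk.
    my @kids;
    for my $i ( 0 .. $parts - 1 ) {
        my $start = $i * $chunk;
        my $end   = $start + $chunk - 1;
        $end = $size - 1 if $end > $size - 1;

        defined( my $pid = fork() ) or die "fork: $!";
        if( $pid == 0 ) {
            my $res = LWP::UserAgent->new->get( $url,
                Range           => "bytes=$start-$end",
                ':content_file' => "$out.part$i",
            );
            # Expect 206 Partial Content; a 200 means the server ignored the Range.
            exit( $res->code == 206 ? 0 : 1 );
        }
        push @kids, $pid;
    }
    for my $pid ( @kids ) {
        waitpid( $pid, 0 );
        die "a part download failed\n" if $?;
    }

    # Stitch the parts back together in order, then discard them.
    open my $final, '>:raw', $out or die $!;
    for my $i ( 0 .. $parts - 1 ) {
        open my $part, '<:raw', "$out.part$i" or die $!;
        my $buf;
        print {$final} $buf while read( $part, $buf, 1 << 20 );
        close $part;
        unlink "$out.part$i";
    }
    close $final;

Whether 4 parts (or 40) actually buys you anything depends on where the bottleneck is: servers that throttle per connection respond well to this; an already-saturated pipe does not.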

You might also look at "download managers", e.g. uGet or similar. Several of those available for Windows will do multiple concurrent streams. I don't use them myself, but they might be worth investigating/comparing to wget for the servers you use.

Uncompression: In the course of research, the RAID-archived .gz files are not infrequently uncompressed into a working area (usually as a 49-file batch) for further interrogation. If several minutes can be saved (somehow) in uncompressing said 49 files (and this might need to be done for a 30-day period: 30 x 49 ~ 1500 files), a few seconds saved on each file might really add up.

The most time-efficient (and cost-effective) way to speed up your compression/decompression would be to skip it entirely!

30 * 49 * 300GB = 441TB. Whilst that is a substantial amount of disk, you're only likely to cut it roughly in half using per-file compression.

However, if your RAID drives do block-level de-duplication, you're likely to save more space by allowing those algorithms to operate upon (and across) the whole set of uncompressed files than you will by compressing the individual files and asking them to dedup those, because per-file compression tends to mask the cross-file commonalities.

This is especially true for generational datasets (which it sounds like yours may be). That is to say, datasets where the second file represents the same set of data points as the first, but having moved on through some elapsed process -- e.g. weather over time. This type of dataset often has considerable commonality between successive generations, which block-level deduping can exploit to achieve seemingly miraculous space savings.

Worth a look at least.

If that is not possible, or proves to be less effective, then the next thing I would look at is doing the compression on-the-fly on output, and the decompression on-the-fly on input, rather than writing uncompressed data to disk and then reading it back/processing it/writing it again.

The best way to go about that will depend upon whether the writing processes are Perl/C/other languages. If they are Perl, then it may be as simple as adding an appropriate IO layer when you open the output file; the compression will be otherwise transparent to the writing processes, thus requiring minimal changes. But there is little point in speculating further without more information from you.
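To make the Perl case concrete, here is a minimal sketch using the core IO::Compress::Gzip / IO::Uncompress::Gunzip modules to read a .gz and write compressed output on the fly, so no uncompressed copy ever touches disk. The filenames are placeholders; and if the (non-core) PerlIO::gzip module is installed, much the same effect can be had by opening with a ':gzip' IO layer instead:

    use strict;
    use warnings;
    use IO::Compress::Gzip     qw( $GzipError );
    use IO::Uncompress::Gunzip qw( $GunzipError );

    # Read the archived file directly; decompression happens as you read.
    my $in = IO::Uncompress::Gunzip->new( 'day01.gz' )
        or die "gunzip failed: $GunzipError";

    # Write results compressed as you go; no intermediate plain-text file.
    my $out = IO::Compress::Gzip->new( 'results.gz' )
        or die "gzip failed: $GzipError";

    while ( my $line = <$in> ) {
        # ... interrogate/transform $line exactly as if it came from plain text ...
        print {$out} $line;
    }

    $in->close;
    $out->close;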

I strongly recommend you investigate the actual on-disk space requirements of storing your files uncompressed -- assuming that your hardware/software supports on-the-fly dedupe, which most NAS/SAS etc. produced in the last 5 or so years do.
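If you want a rough feel for whether dedupe will pay off before committing the space, one crude approach is to hash fixed-size blocks across a sample of the uncompressed files and count how many blocks repeat. This is only a sketch -- real dedup engines use their own (sometimes variable) block sizes, so treat the number as indicative -- and the 4KB block size is an assumption; the files to sample are taken from the command line:

    use strict;
    use warnings;
    use Digest::MD5 qw( md5 );

    my $blocksize = 4096;        # assumed; match your storage's block size
    my %seen;
    my ( $total, $dupes ) = ( 0, 0 );

    for my $file ( @ARGV ) {
        open my $fh, '<:raw', $file or die "$file: $!";
        my $buf;
        while ( read( $fh, $buf, $blocksize ) ) {
            ++$total;
            ++$dupes if $seen{ md5( $buf ) }++;
        }
        close $fh;
    }

    printf "%d of %d blocks (%.1f%%) duplicated across the sample\n",
        $dupes, $total, $total ? 100 * $dupes / $total : 0;

Run it over a few days' worth of decompressed files rather than the whole month: it keeps one hash entry per unique block in memory, so the full 441TB set almost certainly would not fit.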

I've spent a good part of my career optimising processes, and the single most effective optimisation you can ever do is to avoid doing things that aren't absolutely necessary. In your case, compressing only to later have to decompress is not necessary! It is a (cost) optimisation that can itself be optimised, by not doing it.


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
