With 4 CPUs, there ought to be some benefit available from parallelising the process, but I doubt you would see any benefit from using threads rather than processes for this. Threads only really come into their own when there is a need to share data in memory. If your reformatter could accept its input from a RAM buffer, threads might make sense, but when the interchange medium has to be disk, processes will serve you better.

You say that the process seems to be CPU-bound "whilst decompressing". Is that on just 1 of the CPUs?

You indicate that there are 600 files and a total uncompressed size of 120 GB. That implies an average file size of around 200 MB? If this has to run on a single disc, I think I would sacrifice 0.5 - 1.0 GB of my RAM to a RAM drive. I would then use:

  1. One process to unzip the files onto the RAM drive.
  2. A second process to do the re-formatting.
  3. A third process to zip the re-formatted files back to the hard drive.

By using a RAM disc to store the intermediate files, you should reduce the competition for the one drive.
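As a rough check on the RAM drive size, the arithmetic can be sketched as below (Python, purely for illustration). It uses only the figures quoted above, assumes decimal units and that every file is close to the average, and the helper name `files_that_fit` is made up for this example. Real files will vary in size, and the re-formatted files may be larger or smaller than their inputs.

```python
# Figures quoted in the post; decimal units (1 GB = 1000 MB) are an assumption.
TOTAL_UNCOMPRESSED_MB = 120 * 1000
FILE_COUNT = 600

# Average uncompressed size per file.
avg_file_mb = TOTAL_UNCOMPRESSED_MB / FILE_COUNT


def files_that_fit(ram_drive_mb, file_mb=avg_file_mb):
    """Number of average-sized files a RAM drive of this size can hold."""
    return int(ram_drive_mb // file_mb)


print(f"average file size: {avg_file_mb:.0f} MB")
for size_mb in (500, 1000):
    print(f"{size_mb} MB RAM drive holds {files_that_fit(size_mb)} average-sized files")
```

On these averages the low end of the range only has room for a couple of files in flight, so if memory allows, the upper end gives more headroom.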

If the re-formatting process is slow, then a second process performing that function might help, but that's a suck-it-and-see test.

By splitting the overall task into 3, you stand the best chance of overlapping the CPU-intensive parts with the IO-bound parts. Have each of the processes controlled by watching the RAM drive.

  1. The first process would decompress two files onto the RAM drive and then wait for one to disappear before it started on the third.
  2. The second (and maybe third) process(es) would wait for a decompressed file to appear and then run the formatting process on it, deleting the input file once it is done.
  3. The last process waits for a re-formatted file to appear, zips it back to the hard drive, and then removes it from the RAM drive.

This means that you have 3 files on the RAM drive at a time: two waiting to be re-formatted, one waiting to be zipped. It also means that each stage is event-driven and self-limiting, giving the best chance of extracting the maximum throughput.
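The three stages might be sketched along these lines. This is only an illustration in Python under several assumptions that are not in the original question: the files are gzip-compressed, `reformat` is a hypothetical stand-in for the real reformatter, the `.part`/`.raw`/`.fmt` suffixes and the `*.done` marker files are invented conventions, and each stage polls the directory rather than using OS file notifications. Files are renamed into place only when complete, so a watcher never picks up a half-written file.

```python
import gzip
import os
import shutil
import time
from multiprocessing import Process
from pathlib import Path

POLL_SECONDS = 0.05  # hypothetical polling interval
MAX_PENDING = 2      # decompressed files allowed to wait at once


def reformat(src, dst):
    """Hypothetical stand-in for the real reformatter: upper-cases the data."""
    Path(dst).write_bytes(Path(src).read_bytes().upper())


def unzip_stage(src_dir, ram_dir):
    ram = Path(ram_dir)
    for gz in sorted(Path(src_dir).glob("*.gz")):
        # Self-limiting: wait until a decompressed file has been consumed.
        while len(list(ram.glob("*.raw"))) >= MAX_PENDING:
            time.sleep(POLL_SECONDS)
        tmp = ram / (gz.stem + ".part")
        with gzip.open(gz, "rb") as fin, open(tmp, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        os.replace(tmp, tmp.with_suffix(".raw"))  # publish the finished file
    (ram / "unzip.done").touch()  # tell the next stage no more input is coming


def reformat_stage(ram_dir):
    ram = Path(ram_dir)
    while True:
        upstream_done = (ram / "unzip.done").exists()  # check before listing
        raws = sorted(ram.glob("*.raw"))
        if not raws:
            if upstream_done:
                break
            time.sleep(POLL_SECONDS)
            continue
        raw = raws[0]
        tmp = raw.with_suffix(".tmp")
        reformat(raw, tmp)
        raw.unlink()  # frees a slot, which lets the unzip stage continue
        os.replace(tmp, raw.with_suffix(".fmt"))
    (ram / "reformat.done").touch()


def zip_stage(ram_dir, out_dir):
    ram = Path(ram_dir)
    while True:
        upstream_done = (ram / "reformat.done").exists()
        fmts = sorted(ram.glob("*.fmt"))
        if not fmts:
            if upstream_done:
                break
            time.sleep(POLL_SECONDS)
            continue
        for fmt in fmts:
            target = Path(out_dir) / (fmt.stem + ".gz")
            with open(fmt, "rb") as fin, gzip.open(target, "wb") as fout:
                shutil.copyfileobj(fin, fout)
            fmt.unlink()  # remove the intermediate file from the RAM drive


def run_pipeline(src_dir, ram_dir, out_dir):
    """Run the three stages as separate processes and wait for all of them."""
    stages = [
        Process(target=unzip_stage, args=(src_dir, ram_dir)),
        Process(target=reformat_stage, args=(ram_dir,)),
        Process(target=zip_stage, args=(ram_dir, out_dir)),
    ]
    for p in stages:
        p.start()
    for p in stages:
        p.join()
```

A call such as `run_pipeline("/data/in", "/mnt/ramdisk", "/data/out")` (paths hypothetical, and placed under an `if __name__ == "__main__":` guard) would start all three. The sketch runs a single reformatter; with a second one, each worker would first need to claim a file, for example by renaming it, so that two workers never pick up the same input.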

Just throwing lots of threads or processes at it, especially if those processes are all doing the complete task, is unlikely to benefit you, as you would have no way of controlling them in any meaningful way. The chances are that each of your threads would end up hitting the disk at the same time, slowing the IO-bound parts, and running more than 1 CPU-intensive process or thread per CPU will slow things down with context switching.
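One way to keep the number of CPU-intensive workers in check is to derive it from the CPU count rather than picking a number out of the air. The sketch below (Python, for illustration) assumes that the unzip and zip stages each keep roughly one CPU busy, which would need measuring, and the function name `reformatter_budget` is made up for this example.

```python
import os


def reformatter_budget(cpus=None, other_cpu_stages=2):
    """Reformatter processes to run so CPU-heavy workers do not exceed CPUs.

    other_cpu_stages counts the unzip and zip processes, assumed here to
    occupy about one CPU each. At least one reformatter always runs.
    """
    if cpus is None:
        cpus = os.cpu_count() or 1  # cpu_count() may return None
    return max(1, cpus - other_cpu_stages)


print(reformatter_budget(4))  # the 4-CPU box from the question
```

On a 4-CPU box this leaves room for two reformatters alongside the unzip and zip processes, which is a starting point for the suck-it-and-see test rather than a guarantee.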

This kind of assumes that the box will be dedicated to this task whilst it is running. It also makes a lot of (hopefully not too wild) assumptions about your setup and processing.


Examine what is said, not who speaks.
"Efficiency is intelligent laziness." -David Dunham
"When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller
If I understand your problem, I can solve it! Of course, the same can be said for you.


In reply to Re: Best Practices for Uncompressing/Recompressing Files? by BrowserUk
in thread Best Practices for Uncompressing/Recompressing Files? by biosysadmin
