Sometimes, it is only one part (let's say part 3 of 16) that keeps running whereas the remaining 15 CPUs are idle

My numbers were only by way of example.

Let's say your chunk 3 takes an hour whereas the other 15 chunks take 5 minutes each. Your overall processing time is 1 hour, with 15 * 55 minutes = 13.75 hours of idle processor time wasted by the time you complete.
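
If you want to check that arithmetic, here is a minimal sketch in Python (used only for brevity). The timings are the made-up figures from the example, not measurements, and the variable names are mine.

```python
# Made-up figures from the example: 16 chunks, one chunk per CPU.
chunk_minutes = [5.0] * 16
chunk_minutes[2] = 60.0  # chunk 3 is the slow one

# With one chunk per CPU, the job ends when the slowest chunk ends.
wall_minutes = max(chunk_minutes)

# Each CPU sits idle from the moment its own chunk finishes until the job ends.
idle_minutes = sum(wall_minutes - m for m in chunk_minutes)

print(f"wall time: {wall_minutes:.0f} min")                                    # wall time: 60 min
print(f"idle time: {idle_minutes:.0f} min = {idle_minutes / 60:.2f} h")       # idle time: 825 min = 13.75 h
```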

Now let's say that you split each of the 16 chunks into 16 bits, for a total of 256 bits; and assume that processing each bit costs 1/16th of the time of the chunk it came from.

You now have 240 bits that take 0.3125 minutes each; and 16 bits that take 3.75 minutes each.

  1. The 16 bits of chunk 1 are processed in parallel in 0.3125 minutes.
  2. The 16 bits of chunk 2 are processed in parallel in 0.3125 minutes.
  3. The 16 bits of chunk 3 are processed in parallel in 3.75 minutes.
  4. The 16 bits of chunks 4 through 16 are processed in parallel in 0.3125 minutes each.

Total processing time is 15*0.3125 + 1*3.75 = 8.4375 minutes, with 0 wasted CPU. You've saved about 86% of your runtime (8.4375 minutes instead of 60), and fully utilised the cluster.
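
The same schedule can be simulated in a few lines. This is a sketch under my own assumptions: `simulate` is a hypothetical helper, the bits are handed out in file order from a shared queue to whichever CPU frees up first, and the per-bit timings are the example figures above.

```python
import heapq

def simulate(durations, n_cpus):
    """Greedy work queue: each task goes to whichever CPU becomes free first.
    Returns (wall_time, total_idle_time_at_the_end)."""
    free_at = [0.0] * n_cpus          # time at which each CPU is next free
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)        # earliest-free CPU takes the task
        heapq.heappush(free_at, start + d)
    wall = max(free_at)
    idle = sum(wall - t for t in free_at)     # CPUs that finished before the last one
    return wall, idle

# 16 chunks of 16 bits each; only the bits of chunk 3 are slow.
bits = []
for chunk in range(1, 17):
    per_bit = 60.0 / 16 if chunk == 3 else 5.0 / 16
    bits.extend([per_bit] * 16)

wall, idle = simulate(bits, 16)
print(f"wall time: {wall} min, idle: {idle} min")
print(f"saved: {(1 - wall / 60.0) * 100:.1f}% of the original hour")
```

Because every CPU is handed one bit per round, they all finish together, which is where the zero idle time comes from.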

It won't split as nicely and evenly as that; but the smaller the bits you process, the more evenly the processing will be divided between the processors, no matter how the slow bits are (randomly) distributed through the file.
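
To see why smaller bits even things out, here is a rough experiment. Everything in it is hypothetical: the workload (4096 records, roughly 5% of them 50 times slower, placed at random), the `wall_time` helper, and the greedy first-free-CPU queue. With such a queue, the finish time can overshoot the ideal (total work divided by the CPU count) by no more than the cost of the largest single piece, so shrinking the pieces shrinks the worst case.

```python
import heapq
import random

def wall_time(piece_costs, n_cpus):
    """Finish time when pieces are handed, in order, to the first CPU that is free."""
    free_at = [0.0] * n_cpus                  # a list of equal values is already a valid heap
    for cost in piece_costs:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + cost)
    return max(free_at)

rng = random.Random(1)                        # fixed seed so runs are repeatable
record_costs = [50.0 if rng.random() < 0.05 else 1.0 for _ in range(4096)]
n_cpus = 16
ideal = sum(record_costs) / n_cpus            # perfect balance, no idle CPU at all

results = {}
for n_pieces in (16, 64, 256, 1024):
    size = len(record_costs) // n_pieces
    pieces = [sum(record_costs[i:i + size])
              for i in range(0, len(record_costs), size)]
    results[n_pieces] = (wall_time(pieces, n_cpus), max(pieces))
    wall, biggest = results[n_pieces]
    print(f"{n_pieces:5d} pieces: wall {wall:8.1f}  ideal {ideal:8.1f}  largest piece {biggest:7.1f}")
```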

The reason I don't want to split the files further is because I have to deal with enough files already, and I would rather avoid the confusion.

More seriously, the simple way to avoid the too-many-files syndrome is: don't create many files. Instead of having each process write a separate output file, have them all write to a single output file.

I know, I know. You're thinking that the output file will get all mixed up, and that it involves a mess of locking to prevent corruption; but it doesn't have to be that way!

Instead of writing variable length output lines (records) sequentially, you write fixed length records using random access. (Bear with me!)

Let's say you normally write a variable length output line, of between 30 and 120 characters, for each input record. Instead, you round that up (and pad with spaces) to the largest possible output record (say 128 bytes), and then seek to position input_record_number * 128 and write this fixed-length, space-padded record.

Because each input record is only ever read by one process, each output position will only ever be written by one process; thus you don't need locking. And as each output record has a known (calculated) fixed position in the file, it doesn't matter in what order, or by which processor, the records are written.
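
A minimal sketch of the write side, in Python for brevity. The names `RECORD_SIZE` and `write_record` are mine, and I am assuming the 128-byte slot includes a trailing newline so that the padded file stays line oriented. The demo writes from a single process, deliberately out of order; in real use each worker would open its own handle on the shared file.

```python
import os
import tempfile

RECORD_SIZE = 128  # assumed slot size: 127 bytes of text and padding, plus a newline

def write_record(path, record_number, text):
    """Write one space-padded, newline-terminated record into its fixed slot."""
    data = text.encode("ascii")
    if len(data) > RECORD_SIZE - 1:
        raise ValueError("record is longer than the fixed slot")
    padded = data.ljust(RECORD_SIZE - 1, b" ") + b"\n"
    # "r+b" opens an existing file for in-place update without truncating it.
    with open(path, "r+b") as fh:
        fh.seek(record_number * RECORD_SIZE)
        fh.write(padded)

# Create the (empty) output file once, before any worker starts.
out_path = os.path.join(tempfile.mkdtemp(), "padded.out")
open(out_path, "wb").close()

# Records arrive in arbitrary order; each one still lands in its own slot.
for n in (2, 0, 1):
    write_record(out_path, n, f"result for input record {n}")

with open(out_path, "rb") as fh:
    lines = fh.read().split(b"\n")
print([line.rstrip().decode("ascii") for line in lines[:3]])
```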

Once the input file is processed and the output file completely written, you run a simple, line oriented filter program on the output file that reads each line, trims off any trailing space-padding and writes the result to a new output file. You then delete the padded one. You end up with your required variable length record output file all in the correct order.
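
That clean-up pass could look something like this. Again a sketch with assumed names (`trim_padding`, the file names) and an assumed layout of 128-byte, space-padded, newline-terminated records. Note that it also strips any trailing spaces that were genuinely part of a record.

```python
import os
import tempfile

def trim_padding(padded_path, final_path):
    """Copy padded_path to final_path line by line, dropping trailing space padding,
    then delete the padded file."""
    with open(padded_path, "rb") as src, open(final_path, "wb") as dst:
        for line in src:                      # binary files iterate on b"\n"
            dst.write(line.rstrip(b" \n") + b"\n")
    os.remove(padded_path)

# Build a small padded file to run the filter on.
workdir = tempfile.mkdtemp()
padded_path = os.path.join(workdir, "padded.out")
final_path = os.path.join(workdir, "final.out")
with open(padded_path, "wb") as fh:
    for text in (b"alpha", b"beta gamma", b"x"):
        fh.write(text.ljust(127, b" ") + b"\n")

trim_padding(padded_path, final_path)

with open(final_path, "rb") as fh:
    final_bytes = fh.read()
print(final_bytes)
```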

This final processing is far more efficient than merging many small output files together, and the whole too-many-small-files problem simply disappears.

Your call.



In reply to Re^3: Wait for individual sub processes by BrowserUk
in thread Wait for individual sub processes [SOLVED] by crackerjack.tej
