My idea is to split the input file into 16 parts, and run 16 instances of the text processing tool, and once all of them are done, I parse the output and merge it into a single file.
There is a small problem. Due to the nature of the text processing tool, some parts get completed much before others, in no specific order. The difference is in hours, which means that many cores of the CPU are idle for a long time, just waiting for a few parts to finish. I want to keep checking which part (or corresponding process) has exited successfully, so that I can start the processing of the same part of the next input file.

The problem with that is that Sod's Law guarantees it will always be the first, middle and last chunks of the files that take the longest, so you'll still end up with 13 CPUs standing idle while those 3 run on for hours trying to catch up. (Or some other twist of the numbers.)

My suggestion is to split your files into 256 chunks (or more, depending on their size) and feed the next chunk to whichever processor finishes first. This will have the effect of distributing the processing far more evenly amongst the cores -- some cores may process many more than their 16 chunks, whilst others many fewer -- minimising the elapsed time for the whole file.

Within reason -- i.e. allowing for the startup/teardown/merge costs -- the more, smaller chunks you divide the work into, the more evenly distributed the processing will be.
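To make the many-small-chunks idea concrete, here is a minimal sketch in Python (chosen only for illustration; the same shape works in Perl). It assumes the input fits in memory as a list of lines. `process_chunk` is a hypothetical stand-in for one run of your text processing tool; in practice it would launch the tool on a chunk file and return its output. The names `split_into_chunks` and `process_file` are likewise made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(lines, n_chunks):
    """Split a list of lines into at most n_chunks contiguous, near-equal chunks."""
    n_chunks = max(1, min(n_chunks, len(lines)))
    size, extra = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional line each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks


def process_chunk(chunk):
    # Hypothetical stand-in for the real tool: a real version would
    # write the chunk out, run the tool on it, and read the result back.
    return [line.upper() for line in chunk]


def process_file(lines, n_chunks=256, n_workers=16):
    """Process many small chunks with a fixed pool of workers."""
    chunks = split_into_chunks(lines, n_chunks)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # Each worker picks up the next pending chunk as soon as it is
        # free; map() still yields the results in chunk order, so the
        # merge is a plain concatenation.
        results = pool.map(process_chunk, chunks)
        return [line for chunk in results for line in chunk]
```

With 256 chunks and 16 workers, a slow chunk delays only the one worker handling it; the other 15 keep pulling chunks until the queue is empty.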

Ideally, if this were a shared-memory machine, I'd use a shared-memory queue feeding 16 persistent threads, each processing one minimal processing unit (i.e. one line or multi-line record) at a time, and have another thread gathering and reassembling the output as it is produced and writing it back to a single output file.

That eliminates both process startup and teardown time, and ensures the best possible fairness of workload distribution across the processors. Arranging the merge-back of the output on the fly takes some thought, but is doable. I'll go into detail if this approach is feasible for you and interests you.
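As a rough illustration of the queue-plus-gatherer arrangement, here is a minimal sketch, again in Python. It assumes each record can be processed by an in-process function (`process_record`, a hypothetical parameter) rather than by an external tool, and that `write` is whatever appends one result to the output file. Each record is tagged with a sequence number on the way in, and the gatherer holds back any result that arrives early until its predecessors have been written. Note that in standard CPython, threads do not run pure-Python code on several cores at once, so this shows the structure rather than the speed-up.

```python
import queue
import threading

SENTINEL = None  # tells a worker there is no more input


def worker(in_q, out_q, process_record):
    """Persistent worker: take (seq, record) pairs until the sentinel arrives."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            break
        seq, record = item
        out_q.put((seq, process_record(record)))


def gatherer(out_q, total, write):
    """Write results in input order, buffering any that arrive early."""
    pending = {}
    next_seq = 0
    while next_seq < total:
        seq, result = out_q.get()
        pending[seq] = result
        # Flush every result that is now contiguous with what was written.
        while next_seq in pending:
            write(pending.pop(next_seq))
            next_seq += 1


def run_pipeline(records, process_record, write, n_workers=16):
    records = list(records)
    in_q, out_q = queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=worker, args=(in_q, out_q, process_record))
        for _ in range(n_workers)
    ]
    collector = threading.Thread(target=gatherer, args=(out_q, len(records), write))
    for t in workers + [collector]:
        t.start()
    for item in enumerate(records):  # (sequence number, record)
        in_q.put(item)
    for _ in workers:  # one sentinel per worker
        in_q.put(SENTINEL)
    for t in workers + [collector]:
        t.join()
```

For example, `run_pipeline(open_records, str.upper, out.append)` with `out = []` leaves the processed records in `out` in their original order. The sketch does no error handling: if `process_record` raises, the gatherer waits forever, so a real version would need to catch and report failures.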


With the rise and rise of 'Social' network sites: 'Computers are making people easier to use everyday'
Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority". I'm with torvalds on this
In the absence of evidence, opinion is indistinguishable from prejudice. Agile (and TDD) debunked

In reply to Re: Wait for individual sub processes by BrowserUk
in thread Wait for individual sub processes [SOLVED] by crackerjack.tej
