Forking certainly can make processing a large number of large files go much faster. We have a system that does exactly that, and forking lets it run many times faster. But then, our processing of files is mostly CPU-bound, since we are transcoding the files, and it runs on a system with 32 cores for exactly that reason.

We just finished benchmarks on a revamp of this and it is about 4x faster than it used to be (even though the old version forked more workers than there are CPU cores). The old process worked pretty much exactly like Parallel::ForkManager, forking a fresh child for each file. The new strategy pre-forks the same number of workers and just continuously feeds filenames to them over a simple pipe.

There are several advantages to the new approach. The children are forked off of the parent before it has built up the list of files to be processed, which will often be a huge list, so there are far fewer copy-on-write pages to eventually be copied. The children live a very long time now, so there is less overhead from fork()ing (once per worker instead of once per file). Those two properties also mean that it makes sense for the children to be the ones to talk to the database, which is probably the biggest "win". The new approach also significantly simplified the code.
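
To make the pre-fork idea concrete, here is a minimal sketch of that kind of worker pool. It is not our production code. The name run_pool, its three arguments, the one-pipe-per-worker layout, and the round-robin dispatch are all assumptions made for illustration. It also assumes filenames contain no newlines, since each job is sent as one line.

```perl
use strict;
use warnings;
use IO::Handle;

# Hypothetical helper: pre-fork $n workers, then stream jobs to them.
# $handler is called in a child as $handler->($job, $worker_id).
# $next_job is an iterator returning the next job, or undef when done.
sub run_pool {
    my ($n, $handler, $next_job) = @_;
    my @workers;

    # Fork first, while the parent is still small.
    for my $id (1 .. $n) {
        pipe(my $reader, my $writer) or die "pipe: $!";
        my $pid = fork();
        die "fork: $!" unless defined $pid;

        if ($pid == 0) {                    # child
            close $writer;
            # Drop write ends inherited from earlier siblings so
            # each of them sees EOF as soon as the parent closes.
            close $_->{fh} for @workers;
            while (my $job = <$reader>) {
                chomp $job;
                $handler->($job, $id);
            }
            exit 0;
        }

        close $reader;                      # parent keeps the write end
        $writer->autoflush(1);
        push @workers, { pid => $pid, fh => $writer };
    }

    # Only now enumerate the work, handing jobs out round-robin.
    my $i = 0;
    while (defined(my $job = $next_job->())) {
        my $w = $workers[$i++ % @workers];
        print { $w->{fh} } "$job\n" or die "write: $!";
    }

    # Closing the pipes is the "no more work" signal.
    close $_->{fh} for @workers;

    my $failed = 0;
    for my $w (@workers) {
        waitpid($w->{pid}, 0);
        $failed++ if $?;
    }
    return $failed;                         # workers that exited non-zero
}
```

Because $next_job is an iterator, the big list of files only comes into existence after the children have been forked, and each child can open its own database handle inside $handler. Round-robin is the simplest dispatch, not the best one: a worker stuck on a slow file still gets its share of the queue, so something smarter is worth it if file sizes vary a lot.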

If your processing of files is mostly I/O-bound, then doing a bunch of them in parallel could actually be slower than doing them serially. Though, I would expect that your processing of one file isn't perfectly I/O-bound, so having at least two running will provide some speed-up, as often one can use the CPU while the other is waiting for I/O.

Once you have enough processes that you are maxing out the available throughput of either your CPU cores or your I/O subsystem, then adding more processes will just add overhead.

- tye        


In reply to Re: To Fork or Not to Fork (bottle necks) by tye
in thread To Fork or Not to Fork. Tis a question for PerlMonks by pimperator
