I'd be surprised if a straightforward rewrite bought you major performance gains. Let's sanity check. You end up with 20,000 files. If you have to start 5 processes per file, then you're launching 100,000 processes. If it takes 0.001 seconds to launch a process, that is 100 seconds of improvement from removing that overhead. Even if it takes an absurd 0.01 seconds per process, that is 1,000 seconds, or only about 14% of your 2 hour run. Probably not worth it. Update: But you should still run a benchmark to see what the overhead really is for you. It may be much larger than I'm estimating.
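
For what it's worth, here is a rough sketch of that benchmark. It assumes a Unix-like box with `true` on the PATH, and the names `launch_cost` and `measure_launch` are made up for illustration. It times a few hundred launches of a do-nothing command, then scales the result to the 100,000 launches estimated above and compares it with a 2 hour run.

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Total seconds spent purely on launching processes.
sub launch_cost {
    my ($files, $procs_per_file, $secs_per_launch) = @_;
    return $files * $procs_per_file * $secs_per_launch;
}

# Average wall-clock seconds to start and reap one trivial command.
sub measure_launch {
    my ($runs, @cmd) = @_;
    my $start = time();
    for (1 .. $runs) {
        system(@cmd) == 0 or die "could not run @cmd: $?";
    }
    return (time() - $start) / $runs;
}

unless (caller) {
    # ASSUMPTION: 'true' stands in for whatever your script really launches.
    my $per_launch = measure_launch(200, 'true');
    printf "measured: %.4f s per launch\n", $per_launch;

    # Compare the two guesses from the text with the measured figure.
    for my $secs (0.001, 0.01, $per_launch) {
        my $cost = launch_cost(20_000, 5, $secs);
        printf "%.4f s/launch -> %6.0f s total, %4.1f%% of a 2 hour run\n",
            $secs, $cost, 100 * $cost / 7200;
    }
}
```

A trivial command gives you a floor, not the real figure. Swapping in the commands your script actually runs (against an empty input) should give a more honest number.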

However, your tasks all look heavily I/O bound, and I/O tends to lend itself well to parallelization. It would take a lot more work, but if you made good use of something like Parallel::ForkManager to parallelize the work, you could get big wins. Suppose that you found that you could run 4 processes at once without them interfering with each other. If you rewrote the whole thing to take advantage of that, then your 2 hour job would drop to about 30 minutes!
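
To make the idea concrete, here is a minimal sketch of running jobs with a cap on how many run at once. It uses only core `fork` and `wait` as a stand-in for Parallel::ForkManager, which is a CPAN module that packages the same pattern more conveniently. The name `run_parallel`, the limit of 4, and the file-writing job are all hypothetical placeholders for your real per-file work.

```perl
use strict;
use warnings;
use POSIX ();
use File::Temp qw(tempdir);

# Run $work->($item) in a child process for each item, keeping at most
# $max_procs children alive at once. Returns how many children failed.
sub run_parallel {
    my ($max_procs, $work, @items) = @_;
    my ($running, $failed) = (0, 0);

    # Block until one child finishes, then record its exit status.
    my $reap = sub {
        my $pid = wait();
        if ($pid == -1) { $running = 0; return }   # no children left
        $running--;
        $failed++ if $? != 0;
    };

    for my $item (@items) {
        $reap->() while $running >= $max_procs;    # wait for a free slot
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: do the work, report success or failure via exit code.
            # The error message itself is discarded in this sketch.
            my $ok = eval { $work->($item); 1 };
            POSIX::_exit($ok ? 0 : 1);             # skip the parent's END blocks
        }
        $running++;
    }
    $reap->() while $running > 0;                  # drain the stragglers
    return $failed;
}

unless (caller) {
    my $dir = tempdir(CLEANUP => 1);
    # ASSUMPTION: writing a small file stands in for the real per-file job.
    my $failed = run_parallel(4, sub {
        my ($n) = @_;
        open my $fh, '>', "$dir/out.$n" or die "open: $!";
        print {$fh} "processed $n\n";
        close $fh or die "close: $!";
    }, 1 .. 8);
    my $done = grep { -e "$dir/out.$_" } 1 .. 8;
    printf "%d jobs done, %d failed\n", $done, $failed;
}
```

Children cannot hand Perl data back to the parent this way, only an exit status and whatever they write to disk. If you need results returned, that is one of the things Parallel::ForkManager is meant to help with.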

You'll have to benchmark to find where you hit the point of diminishing returns from parallelizing, but I'd consider only being able to benefit from 4 processes at once to be a disappointing gain. Before you start having visions of being able to run 8 or 16 processes at once, though, note that you undoubtedly spend at least a little bit of time doing non-parallelizable work. Time spent with, for instance, a remote connection saturated on bandwidth is not going to go away when you parallelize.
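
A back-of-the-envelope way to see the diminishing returns is the usual Amdahl's law estimate: the serial part of the job stays the same length no matter how many workers you add. The sketch below assumes, purely for illustration, that 10% of the 2 hour run cannot be parallelized. That figure is invented, and `parallel_time` is a hypothetical helper, so plug in whatever your own measurements say.

```perl
use strict;
use warnings;

# Estimated wall-clock seconds with $workers processes, when a fraction
# $serial of the original run cannot be parallelized at all.
sub parallel_time {
    my ($total_secs, $serial, $workers) = @_;
    return $total_secs * ($serial + (1 - $serial) / $workers);
}

unless (caller) {
    my $total  = 2 * 60 * 60;   # the 2 hour job, in seconds
    my $serial = 0.10;          # ASSUMPTION: illustrative only, not measured
    for my $workers (1, 2, 4, 8, 16) {
        printf "%2d workers: %5.1f minutes\n",
            $workers, parallel_time($total, $serial, $workers) / 60;
    }
}
```

Under that assumed 10%, 4 workers get you to roughly 39 minutes rather than 30, and going all the way to 16 workers only gets you to about 19.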

So it will take more work than you were planning on, but a rewrite should be able to achieve significant performance gains. But only if you look for the performance gains in a different place than you were looking.

UPDATE: graff's benchmark at Re^3: BASH vs Perl performance suggests that the overhead for launching a process is much higher than I'd have thought, on the order of 0.035 seconds per process on his laptop. If that holds true on the hardware that you're running, then the 100,000 launches assumed above would cost about 3,500 seconds, close to half of a 2 hour run. In that case, no longer launching those processes could be worth a lot more performance than I would have thought.


In reply to Re: BASH vs Perl performance by tilly
in thread BASH vs Perl performance by jcoxen
