Or, to look at it another way, you are passing the input a line at a time to the worker threads -- so two things are happening:

  1. all of the thread communication overhead is being carried by every line -- I think this is the real issue.

  2. you are streaming the output, so Thread::Pool arranges for the output of each job to be done in job order.

    Having looked at the code, the way it does this is quite cunning: when a do routine completes a job before earlier jobs have completed, the results are put to one side, and the thread loops round to do the next job. When a job completes and all earlier jobs have completed, the thread outputs its results, along with any results put aside earlier which can now be output (see the sketch after this list).

    Note that this means that if one job takes a very long time indeed, all the other threads will continue taking jobs and generating and storing the output. This is a Good Thing, because slower jobs are not blocking faster ones. However, I don't see a mechanism to stop jobs being dispatched if the streamed output backs up. But that could be because I haven't studied the code hard enough.

    So apart from the cost of all this cunning, I don't think this is the issue.
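
To make the put-aside trick concrete, here's a minimal single-threaded sketch of just the reordering logic -- my own illustration of the idea, not Thread::Pool's actual internals; result_ready and the variable names are invented for the example:

    use strict;
    use warnings;

    my %set_aside;         # finished-early results, keyed by job number
    my $next_to_emit = 0;  # the job whose output must come out next

    sub result_ready {
        my ($jobid, $result) = @_;
        $set_aside{$jobid} = $result;
        # emit the contiguous run of results starting at $next_to_emit
        while (exists $set_aside{$next_to_emit}) {
            print delete $set_aside{$next_to_emit};
            $next_to_emit++;
        }
    }

    # jobs 1 and 2 finish before job 0; output still comes out 0, 1, 2
    result_ready(1, "result 1\n");
    result_ready(2, "result 2\n");
    result_ready(0, "result 0\n");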

Without changing the structure of what you have too much, I'd change:

    $pool->job($_) while (<STDIN>);

to something that reads a bunch of lines (say 100K bytes' worth), something like:

    while (!eof(STDIN)) {
        my $buffer;
        read STDIN, $buffer, 100*1024 or die "$!";
        if (!eof(STDIN)) {
            # step the file position back to just after the last complete line,
            # and trim the partial line off the end of the buffer
            my $nl = rindex($buffer, "\n") + 1 or die "No \\n in lump ?";
            if (my $trim = $nl - length($buffer)) {
                seek STDIN, $trim, 1 or die "Seek failed ??";
                substr($buffer, $trim) = '';
            }
        }
        $pool->job($buffer);
    }
(I haven't tested this, so caveat emptor!) Then the processing subroutine needs to break its input into lines, process them into another buffer, and return that lump of results. Something like:
    sub do {
        my ($lump) = @_;
        my $result = '';
        while ($lump =~ m/(.+)/g) {   # ignores blank lines
            my $line = $1;
            ... whatever ...
            $result .= ... whatever ... . "\n";
        }
        return $result;
    }
(also untested). There's some saving to be made by removing my ($lump) = @_ and doing while ($_[0] =~ m/(.+)/g) directly instead.
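
For concreteness, the $_[0] version would look something like this (again untested; the uc() transform is just a stand-in for the real per-line work):

    sub do {
        my $result = '';
        while ($_[0] =~ m/(.+)/g) {   # match against the argument in place
            $result .= uc($1) . "\n";
        }
        return $result;
    }

The saving comes from not copying the (potentially 100K) lump into a lexical; the m//g loop keeps its match position on $_[0] directly.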

This will amortise the job overhead over a reasonable chunk of lines. Obviously you can experiment to see whether larger chunks mean faster processing.
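
A throwaway harness along these lines makes the experiment easy -- this is my sketch, not from the original post; it takes the chunk size from the command line and just counts lines as a stand-in for the real work:

    use strict;
    use warnings;
    use Time::HiRes qw(time);

    my $chunk_size = $ARGV[0] // 100 * 1024;   # bytes per chunk
    my $t0    = time;
    my $lines = 0;
    while (!eof(STDIN)) {
        read STDIN, my $buffer, $chunk_size or die "$!";
        $lines += ($buffer =~ tr/\n//);        # stand-in for the per-chunk work
    }
    printf STDERR "chunk=%d lines=%d elapsed=%.3fs\n",
        $chunk_size, $lines, time - $t0;

Run it with a few different sizes (e.g. 64K, 100K, 1M) against the same input and compare the elapsed times.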

BTW, your do subroutine starts:

    sub do { chomp; my $input = $_; $input = shift;
which doesn't look right to me. The arguments are passed to do() in the usual way, so hacking around $_ appears redundant.
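
If the intent is simply to take the line that was passed in, the opening can collapse to something like this (my guess at the intent):

    sub do {
        my ($input) = @_;   # arguments arrive in @_ in the usual way
        chomp $input;
        # ... rest of the processing ...
    }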


In reply to Re^2: How do you parallelize STDIN for large file processing? by gone2015
in thread How do you parallelize STDIN for large file processing? by forsaken75
