
How does it go about determining the boundary for a 64K chunk? Is there a chance that a word or a line could get split between chunks?

Great question. After reading a chunk, a worker continues reading until the end of the current record (end of line), so a line is not split between chunks.
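The boundary handling described above can be sketched in plain Perl. This is only an illustration of the idea, not MCE's internal code: the helper name read_chunk and the tiny chunk size of 8 bytes are assumptions made for the demo.

```perl
use strict;
use warnings;

# Hypothetical helper (not MCE's implementation): read up to $size bytes,
# then extend the chunk to the end of the current line.
sub read_chunk {
    my ($fh, $size) = @_;
    my $n = read($fh, my $chunk, $size);
    return undef unless $n;               # EOF (0) or read error (undef)
    if (substr($chunk, -1) ne "\n") {
        my $rest = <$fh>;                 # remainder of the partial line
        $chunk .= $rest if defined $rest;
    }
    return $chunk;
}

# Demo on an in-memory file, using a deliberately small chunk size.
my $data = "alpha beta\ngamma delta\nepsilon\n";
open my $fh, '<', \$data or die "open error: $!";

my @chunks;
while (defined(my $chunk = read_chunk($fh, 8))) {
    push @chunks, $chunk;
}
close $fh;

print scalar(@chunks), " chunks, each ending on a line boundary\n";
```

Each chunk ends up slightly larger than the requested size, but concatenating the chunks in order reproduces the input exactly.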

I guess that there is some slight inefficiency because MCE has to enforce a sequential finishing - and again.

The demonstration has workers await their turn so that output is written in order, which is by chunk_id behind the scenes. The code between MCE->relay_lock and MCE->relay_unlock runs serially. Below are the MCE results from a Fedora Linux box.

use threads;  # add line before loading MCE (to spin threads)
1 thread     12.473 seconds
4 threads     3.182 seconds  3.92x

fork 
1 process     7.331 seconds
4 processes   1.848 seconds  3.97x

In the OP's situation, he/she says that there could be millions of files to process.

You may have missed an earlier post where I mentioned something similar.

This fork business is weird on Windows and this code may work a lot better on Unix which can do "real" forks.

Something to try is the MCE use_threads => 0 option. That will cause MCE to spin workers via the emulated fork.
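A minimal sketch of where that option goes, assuming the MCE module is installed. The worker count and the body of user_func are placeholders, not taken from the code in this thread.

```perl
use strict;
use warnings;
use MCE;

my $mce = MCE->new(
    max_workers => 4,          # placeholder value
    use_threads => 0,          # spawn workers via fork instead of threads
    user_func   => sub {
        my ($mce) = @_;
        # Placeholder body: each worker reports its worker ID.
        MCE->say("worker ", MCE->wid, " running");
    },
);

$mce->run;
```

On Windows, fork is emulated by Perl, so this selects the emulated fork mentioned above rather than a real one.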

... there is some variability between runs depending upon how the O/S does the core assignment and what else is going on in the machine.

Dividing the work equally by the number of workers can, in unlucky cases, leave CPU time idle. One cannot know in advance whether the regular expressions will run faster on the first part of the file or somewhere in between, so a worker may finish sooner than the others and then sit idle. That idle CPU time may be greater than the time MCE workers spend awaiting their turn to output serially. I'm not seeing that imbalance for this workload.

For reference, the MCE code above completes in 1.848 seconds (fork). One reason it completes faster is that the MCE code processes an entire chunk at once rather than line by line. For an apples-to-apples comparison, I will update the MCE code to process line by line and report back.

-rw-r--r--. 1 mario mario 80348000 Oct  9 22:15 nightfall.txt
-rw-r--r--. 1 mario mario 20087000 Oct  9 22:12 nightfall1.txt
-rw-r--r--. 1 mario mario 20087000 Oct  9 22:13 nightfall2.txt
-rw-r--r--. 1 mario mario 20087000 Oct  9 22:13 nightfall3.txt
-rw-r--r--. 1 mario mario 20087000 Oct  9 22:14 nightfall4.txt

0.000 secs Spawned child pid: 11194 for nightfall1.txt
0.000 secs This is child pid 11194 for nightfall1.txt. I am alive and working!
0.000 secs Spawned child pid: 11195 for nightfall2.txt
0.001 secs opened nightfall1.txt and nightfall1.out
0.001 secs Spawned child pid: 11196 for nightfall3.txt
0.001 secs This is child pid 11195 for nightfall2.txt. I am alive and working!
0.001 secs Spawned child pid: 11197 for nightfall4.txt
0.001 secs This is child pid 11196 for nightfall3.txt. I am alive and working!
0.001 secs opened nightfall2.txt and nightfall2.out
0.001 secs This is child pid 11197 for nightfall4.txt. I am alive and working!
0.001 secs opened nightfall3.txt and nightfall3.out
0.001 secs opened nightfall4.txt and nightfall4.out
2.243 secs Child 11194 finished working on nightfall1.txt!
2.261 secs Child 11197 finished working on nightfall4.txt!
2.274 secs Child 11196 finished working on nightfall3.txt!
2.275 secs Child 11195 finished working on nightfall2.txt!
2.276 secs Parenting talking...all my children are finished! Hooray!

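To put a number on the idle CPU time in the four-way split, the finish times from the log above can be compared against the slowest child. A small sketch using only the figures reported in that log; the variable names are illustrative.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Finish times in seconds, copied from the log above.
my %finish = (
    'nightfall1.txt' => 2.243,
    'nightfall2.txt' => 2.275,
    'nightfall3.txt' => 2.274,
    'nightfall4.txt' => 2.261,
);

# The run ends when the slowest child finishes; every other child
# is idle from its own finish time until then.
my $last = max(values %finish);
my $idle = 0;
$idle += $last - $_ for values %finish;

printf "slowest child: %.3f s, total idle time: %.3f s\n", $last, $idle;
```

For this run the children finish within a few hundredths of a second of each other, which is consistent with not seeing the imbalance for this workload.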
I'm back. I updated the MCE code to process line-by-line and ran again.

mce-process-entire-chunk.pl  1.848 seconds
mce-process-line-by-line.pl  2.264 seconds  2.235 ~ 2.281

Workers processing entire chunk:

    user_func => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;

        $$chunk_ref =~ tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
        $$chunk_ref =~ s/w(as|ere)/be/gi;
        $$chunk_ref =~ s/$RE1/ $W1{lc $1} /g;
        $$chunk_ref =~ s/$RE2/ $W2{lc $1} /g;
        $$chunk_ref =~ s/$RE3/ $W3{lc $1} /g;

        # Output orderly and serially.
        MCE->relay_lock;
        print $OUT_FH $$chunk_ref; $OUT_FH->flush;
        MCE->relay_unlock;
    }

Workers processing line-by-line:

    user_func => sub {
        # worker chunk routine
        my ($mce, $chunk_ref, $chunk_id) = @_;
        my $output = '';

        # Read the chunk line by line through an in-memory file handle.
        open my $fh, '<', $chunk_ref or die "open error: $!";
        while (<$fh>) {
            tr/-!"#%&'()*,.\/:;?@\[\\\]_{}0123456789//d;
            s/w(as|ere)/be/gi;
            s/$RE1/ $W1{lc $1} /g;
            s/$RE2/ $W2{lc $1} /g;
            s/$RE3/ $W3{lc $1} /g;
            $output .= $_;
        }
        close $fh;

        # Output orderly and serially.
        MCE->relay_lock;
        print $OUT_FH $output; $OUT_FH->flush;
        MCE->relay_unlock;
    }

In reply to Re^7: Need to speed up many regex substitutions and somehow make them a here-doc list by marioroy
in thread Need to speed up many regex substitutions and somehow make them a here-doc list by xnous
