1nickt has asked for the wisdom of the Perl Monks concerning the following question:

Hello all,

I've read through the docs and made some experiments with basic usage of MCE, but I'm not sure if I'm barking up the wrong tree.

I'm unclear on how to know which MCE function is the right one for my use case.

I have an arrayref of hashrefs, and am outputting an arrayref of hashrefs. Processing each hashref is quite slow: takes about 0.1s. There are 7,500 hashes in the arrayref; that could grow to some tens of thousands.

The code is running on an Ubuntu AWS instance whose lscpu outputs:

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                2
On-line CPU(s) list:   0,1
Thread(s) per core:    2
Core(s) per socket:    1
Socket(s):             1

The MCE manager is splitting the array into 25 chunks using 'auto'.

I am seeing almost no difference in execution time between MCE and a sequential foreach loop; in fact, the sequential loop appears faster, which I would not have expected.

The CPU usage does look quite different, though:

Benchmark: timing 5 iterations of MCE loop, MCE map, Sequential loop...
       MCE loop: 83 wallclock secs ( 3.28 usr  0.19 sys + 30.94 cusr 24.89 csys = 59.30 CPU) @  0.08/s (n=5)
        MCE map: 75 wallclock secs ( 4.48 usr  0.28 sys + 41.29 cusr 37.93 csys = 83.98 CPU) @  0.06/s (n=5)
Sequential loop: 76 wallclock secs (37.79 usr + 28.91 sys = 66.70 CPU) @  0.07/s (n=5)
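
For reference, the comparison is structured roughly like this (a minimal sketch rather than my actual code; process_record() here is just a hypothetical CPU-burning stand-in for the real ~0.1s of work per hashref):

use strict;
use warnings;

use Benchmark qw( timethese );
use MCE::Map;   # chunk_size and max_workers default to 'auto'

# Hypothetical stand-in for the real per-hashref work.
sub process_record {
    my ($h) = @_;
    my $x = 0;
    $x += sqrt($_) for 1 .. 200_000;   # burn some CPU
    return { %{ $h }, result => $x };
}

my $input = [ map { { id => $_ } } 1 .. 500 ];

timethese( 5, {
    'MCE map'         => sub { my @out = mce_map { process_record($_) } @{ $input } },
    'Sequential loop' => sub { my @out = map { process_record($_) } @{ $input } },
});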

Am I missing something obvious? Or non-obvious? Doing something wrong? Can anyone shed any light please?


The way forward always starts with a minimal test.

Re: MCE -- how to know which function to use
by marioroy (Prior) on Sep 29, 2016 at 17:14 UTC

    Data passing in MCE, including MCE::Shared, involves IPC over sockets. Whether a job is CPU-bound or network-bound depends on the application: one polling metrics via SNMP may run with 100 workers and a chunk size of 300 on a box with 24 logical cores. Each application is unique, which makes it hard to recommend a single value for max_workers.
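
    To illustrate only (the values are the SNMP example's above, not a recommendation for your workload), such a network-bound job might be set up along these lines:

    use strict;
    use warnings;

    use MCE::Loop;

    # A network-bound job (e.g. SNMP polling) spends most of its time waiting,
    # so the worker count can far exceed the number of logical cores.
    MCE::Loop::init(
        max_workers => 100,
        chunk_size  => 300,
    );

    # Hypothetical list of devices to poll.
    my @hosts = map { "10.0." . int( $_ / 254 ) . "." . ( $_ % 254 + 1 ) } 0 .. 9999;

    mce_loop {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;
        # the real job would poll each device in @{ $chunk_ref } here
    } \@hosts;

    MCE::Loop::finish();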

    MCE::Map and MCE::Grep closely resemble the native map and grep functions, respectively, which means that output order matches input order. If I had to do this over again, I would likely merge MCE::Loop and MCE::Flow into one module.
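
    For example (a minimal sketch), mce_map is a drop-in parallel replacement for map, and the results come back in input order:

    use strict;
    use warnings;

    use MCE::Map;

    my @in  = ( 1 .. 10 );
    my @out = mce_map { $_ * 2 } @in;   # same values and order as: map { $_ * 2 } @in

    print "@out\n";   # 2 4 6 8 10 12 14 16 18 20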

    It is hard to give tips for minimizing IPC overhead without knowing what the application is doing. One approach is to batch updates into a local array or hash, then populate the shared array or hash with a single IPC call. Combined with chunking, this is one way to decrease IPC overhead.

    It is quite possible that other AWS instances are running on the same physical box. Try setting max_workers to 3 or 4, even though only 2 logical CPUs are allocated to your Ubuntu instance.

    Regards, Mario.

      The following provides a demonstration. Notice how workers populate a local array before sending to the manager process using one gather call. This is one way to minimize IPC overhead.

      use strict;
      use warnings;

      use MCE::Loop;
      use MCE::Candy;

      my $count  = 0;
      my $input  = [ ];
      my $output = [ ];

      my $sample = { a => ++$count, b => ++$count, c => ++$count };
      push @{ $input }, $sample for ( 1 .. 100000 );

      MCE::Loop::init(
          max_workers => 3,
          chunk_size  => 100,
          gather      => MCE::Candy::out_iter_array( $output ),
      );

      mce_loop {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          my @ret;

          # Batch results into a local array ...
          for my $h ( @{ $chunk_ref } ) {
              push @ret, {
                  a => $h->{a} * 2,
                  b => $h->{b} * 2,
                  c => $h->{c} * 2,
              };
          }

          # ... then send the whole chunk back with a single gather call.
          MCE->gather( $chunk_id, @ret );

      } $input;

      MCE::Loop::finish();

      print scalar( @{ $output } ), "\n";

      Regards, Mario.

        Thanks, Mario. After implementing this suggestion I get the following with my real data:

        Benchmark: timing 5 iterations of MCE loop, MCE loop/batch gather, MCE map, Sequential loop...
                     MCE loop: 56 wallclock secs ( 1.26 usr  0.57 sys + 36.11 cusr 42.06 csys = 80.00 CPU) @  0.06/s (n=5)
        MCE loop/batch gather: 57 wallclock secs ( 0.55 usr  0.17 sys + 36.20 cusr 40.75 csys = 77.67 CPU) @  0.06/s (n=5)
                      MCE map: 66 wallclock secs ( 3.75 usr  0.32 sys + 36.34 cusr 40.67 csys = 81.08 CPU) @  0.06/s (n=5)
              Sequential loop: 73 wallclock secs (35.16 usr + 28.66 sys = 63.82 CPU) @  0.08/s (n=5)

        Surprisingly, the batching didn't seem to help. I'm going to keep working with it. Also, thanks for your PMs.

        The way forward always starts with a minimal test.