in reply to Count byte/character occurrence (quickly)

The following is a parallel demonstration based on the code by BrowserUk. The length check below works around a bug in MCE::Shared 1.001. MCE 1.704 and MCE::Shared 1.002, which contain the fix, should be released in about a week.

MCE::Flow and MCE::Shared

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

use Time::HiRes qw[ time ];

my $start = time;

my $fh = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] );
my @seen;

sub tally {
    my ($aref) = @_;
    for ( 0 .. 255 ) {
        $seen[$_] += $aref->[$_] if $aref->[$_];
    }
    return;
}

mce_flow { max_workers => 8 }, sub {
    my @_seen;
    while( read( $fh, my $buf, 16384 * 4 ) ) {
        # the length check may be omitted with MCE::Shared 1.002+
        last unless length($buf);
        ++$_seen[$_] for unpack 'C*', $buf;
    }
    MCE->do('tally', \@_seen);
};

close $fh;

printf "Took %f secs\n", time() - $start;

# for ( 0 .. 255 ) {
#     printf "%c : %u\n", $_, $seen[$_] if $seen[$_];
# }

The serial code takes 8.390 seconds. In comparison, the parallel code completes in 2.253 seconds for a 126 MB file on a machine with 4 real cores and 4 hyper-threads.
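
The serial code behind the 8.390-second figure is BrowserUk's and is not reproduced in this post. As a rough stand-in for comparison, a serial baseline might look like the sketch below. It is not the code that was timed: count_bytes is a hypothetical helper name, and the read size is simply copied from the parallel example.

```perl
use strict;
use warnings;
use Time::HiRes qw[ time ];

# Hypothetical helper (not from the thread): tally byte values serially.
# Returns a reference to a 256-element array of counts.
sub count_bytes {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "open error: $!";
    my @seen = (0) x 256;
    # Same read size as the parallel example: 16384 * 4 bytes per read.
    while ( read( $fh, my $buf, 16384 * 4 ) ) {
        ++$seen[$_] for unpack 'C*', $buf;
    }
    close $fh;
    return \@seen;
}

# Only time a run when a file name is given on the command line.
if (@ARGV) {
    my $start = time;
    my $seen  = count_bytes( $ARGV[0] );
    printf "Took %f secs\n", time() - $start;
}
```

Running both scripts against the same file gives a like-for-like comparison on a given machine; the numbers will differ from the ones quoted above.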

Update:

The upcoming MCE::Shared 1.002 release will support the following construction, which lets the main or worker process handle the error itself. I have long wanted the shared open call to feel like the native open call.

use MCE::Shared 1.002;

mce_open my $IN, "<:gzip", "wat.paths.gz" or die "open error: $!";
mce_open my $OUT, ">", \*STDOUT or die "open error: $!";

Re^2: Count byte/character occurrence (quickly)
by marioroy (Prior) on Apr 01, 2016 at 15:01 UTC

    The following are parallel demonstrations using MCE::Hobo and threads.

    MCE::Hobo and MCE::Shared

    A Hobo is a migratory worker inside the machine that carries the asynchronous gene. Hobos are equipped with threads-like capability for running code asynchronously. Unlike threads, each hobo is a unique process to the underlying OS. The IPC is managed by MCE::Shared, which runs on all the major platforms including Cygwin.

    use strict;
    use warnings;

    use MCE::Hobo;
    use MCE::Shared;

    use Time::HiRes qw[ time ];

    my $start = time;

    my $fh   = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] );
    my $seen = MCE::Shared->array;

    sub task {
        my @_seen;
        while( read( $fh, my $buf, 16384 * 4 ) ) {
            # the length check may be omitted with MCE::Shared 1.002+
            last unless length($buf);
            ++$_seen[$_] for unpack 'C*', $buf;
        }
        for ( 0 .. 255 ) {
            $seen->incrby($_, $_seen[$_]) if $_seen[$_];
        }
    }

    MCE::Hobo->create('task') for 1 .. 8;

    # do other stuff if desired

    $_->join for MCE::Hobo->list;

    close $fh;

    printf "Took %f secs\n", time() - $start;

    # export and destroy the shared array into a local non-shared array
    $seen = $seen->destroy;

    # for ( 0 .. 255 ) {
    #     printf "%c : %u\n", $_, $seen->[$_] if $seen->[$_];
    # }

    threads and MCE::Shared

    The code for MCE::Hobo and the code for threads are very similar.

    use strict;
    use warnings;

    use threads;
    use MCE::Shared;

    use Time::HiRes qw[ time ];

    my $start = time;

    my $fh   = MCE::Shared->handle( "<:raw", $ARGV[ 0 ] );
    my $seen = MCE::Shared->array;

    sub task {
        my @_seen;
        while( read( $fh, my $buf, 16384 * 4 ) ) {
            # the length check may be omitted with MCE::Shared 1.002+
            last unless length($buf);
            ++$_seen[$_] for unpack 'C*', $buf;
        }
        for ( 0 .. 255 ) {
            $seen->incrby($_, $_seen[$_]) if $_seen[$_];
        }
    }

    threads->create('task') for 1 .. 8;

    # do other stuff if desired

    $_->join for threads->list;

    close $fh;

    printf "Took %f secs\n", time() - $start;

    # export and destroy the shared array into a local non-shared array
    $seen = $seen->destroy;

    # for ( 0 .. 255 ) {
    #     printf "%c : %u\n", $_, $seen->[$_] if $seen->[$_];
    # }

      You guys are awesome. Thanks for these good examples :) When I run this code, it calculates byte occurrence in .9 secs or less!

      Edit: I do have a few questions as well. I don't have time to ask right now, but I will be back!
        Just noticed that you are the author of that module, haha. Anyway, I have read a file into a buffer and then opened it like this:
        my $arg = shift;
        my $len = -s $arg;
        open my $file, '<', $arg;
        binmode $file;
        read $file, my $buf, $len;
        close $file;
        open my $mem_file, '<', \$buf;
        binmode $mem_file;
        .....do stuff....
        When I try to use mce_open with $mem_file, I get an error:

        open error: Invalid argument at C:/Perl/site/lib/MCE/Shared/Server.pm line 1035 thread 1, <__ANONIO__> line 6.
        MCE::Shared::Server::__ANON__() called at C:/Perl/site/lib/MCE/Shared/Server.pm line 1324 thread 1
        MCE::Shared::Server::_loop(0, 6624) called at C:/Perl/site/lib/MCE/Shared/Server.pm line 335 thread 1
        eval {...} called at C:/Perl/site/lib/MCE/Shared/Server.pm line 335 thread 1

        Is there any way I can get this to work? I have passed this $mem_file around in my script and would like to use it instead of having to re-read the actual file. If I need to elaborate any more, please let me know :)

        EDIT: I will go ahead and elaborate a little more. When I pass $mem_file to the sub and try to open it like this:

        stat_check($mem_file);

        sub stat_check{
            my ($mem_file) = @_;
            my $fh = MCE::Shared->handle( "<:raw", \$mem_file );
            ....rest of threaded function...
        }

        I get this error:

        Not a GLOB reference at C:/Perl/site/lib/MCE/Shared/Server.pm line 2036, <__ANONIO__> line 3.

        If I try:

        stat_check($mem_file);

        sub stat_check{
            my ($mem_file) = @_;
            my $fh = MCE::Shared->handle( "<:raw", $mem_file );
            ....rest of threaded function...
        }

        I get this error:

        open error: Invalid argument at C:/Perl/site/lib/MCE/Shared/Server.pm line 1035 thread 1, <__ANONIO__> line 6.
        MCE::Shared::Server::__ANON__() called at C:/Perl/site/lib/MCE/Shared/Server.pm line 1324 thread 1
        MCE::Shared::Server::_loop(0, 3232) called at C:/Perl/site/lib/MCE/Shared/Server.pm line 335 thread 1
        eval {...} called at C:/Perl/site/lib/MCE/Shared/Server.pm line 335 thread 1
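
For reference, the situation above can be reproduced in core Perl without MCE. The sketch below shows that a handle opened on a scalar has no OS-level file descriptor (fileno reports -1), which may be part of why it cannot be handed to a shared-handle server living in another process or thread; that explanation is an assumption, not something confirmed in this thread. It also shows one possible way to tally bytes straight from the buffer that is already in memory. count_buffer is a hypothetical helper name, and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): tally byte values straight from
# a scalar buffer, slice by slice, without opening any filehandle on it.
sub count_buffer {
    my ( $buf_ref, $chunk ) = @_;
    $chunk ||= 16384 * 4;    # same slice size as the read size used above
    my @seen = (0) x 256;
    my $len  = length $$buf_ref;
    for ( my $pos = 0; $pos < $len; $pos += $chunk ) {
        ++$seen[$_] for unpack 'C*', substr( $$buf_ref, $pos, $chunk );
    }
    return \@seen;
}

# Made-up sample data standing in for the file contents read into $buf.
my $buf = "abca\0\xff";

open my $mem_file, '<', \$buf or die "open error: $!";
binmode $mem_file;

# An in-memory handle is not backed by a real file descriptor.
printf "fileno: %d\n", fileno $mem_file;

# Tally from the buffer itself rather than through the handle.
my $seen = count_buffer( \$buf );
printf "%d : %u\n", $_, $seen->[$_] for grep { $seen->[$_] } 0 .. 255;
```

Under the same assumption, each worker could call the helper on its own slice of the buffer and the per-worker tallies could then be summed, as in the tally and incrby examples earlier in the thread.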