Update: Changed the chunk_size option from '1m' to '24m'. With the FS cache purged ( sudo purge ) before running on a MacBook Pro laptop, the MCE time drops to 3.2 seconds, down from 6.2 seconds with chunk_size => '1m'. The time is ~ 1 second if the file already resides in the FS cache.
Update: Added the 'm' modifier to the regex operation.
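To illustrate why /m matters here (a standalone sketch using a made-up string, not code from the thread): without /m, $ anchors only at the end of the whole slurped chunk, so at most one match per chunk would be counted.

use strict;
use warnings;

# Made-up 3-line chunk; two lines end in 123456.
my $chunk = "foo123456\nbar\nbaz123456\n";

my $without_m = () = $chunk =~ /123456\r?$/g;    # 1: $ anchors only at the chunk's end
my $with_m    = () = $chunk =~ /123456\r?$/mg;   # 2: $ anchors at every line ending

print "without /m: $without_m\n";
print "with /m   : $with_m\n";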
Update: With the file not in the FS cache, the ~ 2 GB plain text file takes 7.8 seconds serially and 6.2 seconds on many cores. Once the file is in the FS cache, the times are 5.4 seconds serially and 0.9 seconds via MCE.
Update: Unzipping the file meant that it resided in the FS cache afterwards. One doesn't normally flush FS memory, but I meant to do so before running; I have already removed the zip and plain text files and did not run again. IO is fast when processing a file directly, because workers do not involve the manager process when reading.
Anonymous Monk, the following is a parallel demonstration of the online code. Yes, reading line by line is not necessary; dropping it gives roughly a 5x gain over the serial version. This is also several times faster than the previous parallel demonstrations.
The parallel example below parses the ~ 2 GB plain text file in 0.9 seconds. The online serial demonstration completes in 5.2 seconds. My laptop has 4 real cores and 4 hyper-threads. Seeing nearly a 6x speedup is really good; I did not expect that.
use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

# Shared counters, updated by the workers.
my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size  => '24m', max_workers => 8,
    use_slurpio => 1,   # pass each chunk as a scalar ref, not line by line
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    # tr/// in scalar context counts newlines without copying the chunk.
    my $numlines = $$chunk_ref =~ tr/\n//;

    # Count lines ending in 123456; /m lets $ match at every line ending.
    my $occurrences = () = $$chunk_ref =~ /123456\r?$/mg;

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurrences );

}, "Dictionary2GB.txt";

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";
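For comparison, a serial version along the same lines might look like the sketch below. This is my own approximation, not the exact online demonstration: it reads the same file in 24 MB blocks (mirroring the chunk_size above), tops each block up to the next newline so a match never straddles two blocks, and applies the same counting logic.

use strict;
use warnings;

my $file = "Dictionary2GB.txt";   # same input file as the MCE example
open my $fh, '<', $file or die "open '$file': $!";

my ( $numlines, $occurrences ) = ( 0, 0 );

while ( read $fh, my $buf, 24 * 1024 * 1024 ) {
    # Complete the partial last line of this block so a trailing
    # match is not split across two blocks.
    if ( defined( my $rest = <$fh> ) ) {
        $buf .= $rest;
    }
    $numlines    += $buf =~ tr/\n//;
    $occurrences += () = $buf =~ /123456\r?$/mg;
}

close $fh;
print "Num lines  : $numlines\n";
print "Occurrences: $occurrences\n";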