The assumptions and the testing are valid. Even so, MCE is quite fast at this kind of job. MCE ships with bin/mce_grep (a parallel wrapper around the grep binary) and examples/egrep.pl (100% Perl code). Both finish sooner in wall-clock time than the plain grep command:
$ time grep karl very_huge.file
nose cuke karl
real 0m2.127s
user 0m1.845s
sys 0m0.283s
$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl
real 0m1.061s
user 0m2.176s
sys 0m1.616s
$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl
real 0m0.690s
user 0m2.165s
sys 0m0.362s
MCE::Grep also has an alternative mode: append the "_f" suffix and it reads the file directly. That runs in 8.5 seconds. The overhead comes from calling the code block once per line. So use egrep.pl from the examples directory instead.
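use MCE::Grep;
## Alternative: pass an open file handle.
# open( my $fh, '<', 'very_huge.file' );
# my @result = mce_grep { /karl/ } $fh;
# close $fh;
## The "_f" variant takes the file name and reads the file directly.
my @result = mce_grep_f { /karl/ } 'very_huge.file';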
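To make the per-line overhead concrete, here is a minimal sketch in core Perl only, with no MCE involved. The sample data is made up and held in memory, so it illustrates the call count rather than the timings reported here: the block form runs once for every line, while a single regex pass over the whole buffer finds the same lines without a block call per line.

```perl
use strict;
use warnings;

# Hypothetical in-memory sample standing in for the real file.
my $data = join '', map { "line $_\n" } 1 .. 1000;
$data .= "nose cuke karl\n";

# Per-line style: the block runs once for every line in the buffer.
my $calls = 0;
my @per_line = grep { $calls++; /karl/ } split /^/, $data;

# Chunk style: one regex pass over the whole buffer returns the matching lines.
my @per_chunk = $data =~ /^.*karl.*\n?/mg;

print "block calls: $calls\n";
print "per-line matches: ", scalar(@per_line), "\n";
print "per-chunk matches: ", scalar(@per_chunk), "\n";
```

Both approaches return the same line; only the number of block invocations differs.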
The following code snippet parses the 2 GiB file in 1 second.
use MCE::Loop;
use Time::HiRes qw( time );
MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );
my $start = time;
my $pattern = 'karl';
my @result = mce_loop_f {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    ## Quickly determine if a match is found.
    ## Basically, only process slurped chunk if true.
    if ($$slurp_ref =~ /$pattern/im) {
        my @matches;
        open my $MEM_FH, '<', $slurp_ref;
        binmode $MEM_FH, ':raw';
        while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
        close $MEM_FH;
        MCE->gather(@matches);
    }
} 'very_huge.file';
print join('', @result);
printf "Took %.3f seconds\n", time - $start;
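The trick inside the block can be tried without MCE. The sketch below is an illustration in core Perl only; grep_chunk is a hypothetical helper name and the two sample chunks are made up. It runs one regex test over the whole chunk first, and only when that succeeds does it open the chunk as an in-memory file handle and scan it line by line.

```perl
use strict;
use warnings;

# grep_chunk is a hypothetical helper for illustration; it is not part of MCE.
sub grep_chunk {
    my ($chunk_ref, $pattern) = @_;

    # Cheap test over the whole chunk: skip the line scan when nothing matches.
    return () unless $$chunk_ref =~ /$pattern/;

    # Read the chunk line by line through an in-memory file handle.
    my @matches;
    open my $mem_fh, '<', $chunk_ref
        or die "Cannot open in-memory handle: $!";
    while (my $line = <$mem_fh>) {
        push @matches, $line if $line =~ /$pattern/;
    }
    close $mem_fh;

    return @matches;
}

# Made-up sample chunks: one without a match, one with two matching lines.
my $miss = "alpha\nbeta\ngamma\n";
my $hit  = "nose cuke karl\nfoo\nkarl again\n";

my @none  = grep_chunk(\$miss, 'karl');
my @found = grep_chunk(\$hit,  'karl');

print scalar(@none), " matches in the first chunk\n";
print @found;
```

When matches are rare, most chunks fail the first test and are never split into lines, which is presumably where most of the time is saved.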