in reply to Threads From Hell #2: How To Search A Very Huge File [SOLVED]
The assumptions and testing are valid. However, MCE is quite fast at this. MCE has bin/mce_grep (a parallel wrapper for the grep binary) and examples/egrep.pl (100% Perl code). Both run faster than the grep command.
    $ time grep karl very_huge.file
    nose cuke karl

    real 0m2.127s
    user 0m1.845s
    sys  0m0.283s

    $ time ./MCE-1.608/bin/mce_grep karl very_huge.file
    nose cuke karl

    real 0m1.061s
    user 0m2.176s
    sys  0m1.616s

    $ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
    nose cuke karl

    real 0m0.690s
    user 0m2.165s
    sys  0m0.362s
MCE::Grep has an alternative mode, selected by appending the "_f" suffix to the function name, that reads the file directly. It runs in 8.5 seconds. The overhead comes from calling the code block once per line. Thus, use egrep.pl from the examples directory instead.
    # open( my $fh, '<', 'very_huge.file' );
    # my @result = mce_grep { /karl/ } $fh;
    # close $fh;

    my @result = mce_grep_f { /karl/ } 'very_huge.file';
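To illustrate the per-line overhead mentioned above, here is a minimal sketch in plain Perl (no MCE involved). The sample buffer is made up for the example. It contrasts a block that runs once per line with a single regex pass over the whole buffer; both collect the same lines.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one chunk of very_huge.file.
my $buffer = join '', map { "$_\n" } qw( nose cuke karl karlsson toe );

# Per-line approach: the grep block is called once for every line.
my @per_line = grep { /karl/ } split /^/m, $buffer;

# Buffer approach: one regex pass returns every line containing the pattern.
# "." never matches a newline here, so each match stays within one line.
my @per_buffer = $buffer =~ /^.*karl.*\n?/mg;

printf "%d lines matched per line, %d per buffer\n",
    scalar @per_line, scalar @per_buffer;
```

How much time this saves depends on the data and the pattern, so it is worth timing on the real file.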
The following code snippet parses the 2 GiB file in 1 second.
    use MCE::Loop;
    use Time::HiRes qw( time );

    MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

    my $start   = time;
    my $pattern = 'karl';

    my @result = mce_loop_f {
        my ($mce, $slurp_ref, $chunk_id) = @_;

        ## Quickly determine if a match is found.
        ## Basically, only process slurped chunk if true.

        if ($$slurp_ref =~ /$pattern/im) {
            my @matches;

            open my $MEM_FH, '<', $slurp_ref;
            binmode $MEM_FH, ':raw';
            while (<$MEM_FH>) {
                push @matches, $_ if (/$pattern/);
            }
            close $MEM_FH;

            MCE->gather(@matches);
        }

    } 'very_huge.file';

    print join('', @result);

    printf "Took %.3f seconds\n", time - $start;
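For readers without MCE installed, the chunk pre-check idea can be tried sequentially with core Perl only. This is a rough sketch, not how MCE reads chunks internally: `grep_chunked`, the chunk size, and the temporary test file are all hypothetical names and values chosen for the illustration.

```perl
use strict;
use warnings;
use File::Temp qw( tempfile );

# Hypothetical helper, not part of MCE: scan $file in chunks of roughly
# $chunk_size bytes and return the lines that match $pattern.
sub grep_chunked {
    my ($file, $pattern, $chunk_size) = @_;
    my $line_re  = qr/$pattern/;
    my $chunk_re = qr/$pattern/m;   # /m so ^ and $ work at any line in the chunk
    my @matches;

    open my $fh, '<:raw', $file or die "Cannot open $file: $!";
    while (read($fh, my $chunk, $chunk_size)) {
        # Extend the chunk to the end of a line so that no line is split.
        my $rest = <$fh>;
        $chunk .= $rest if defined $rest;

        # Cheap test on the whole chunk; skip it when nothing matches.
        next unless $chunk =~ $chunk_re;

        # Only now pay the per-line cost, via an in-memory file handle.
        open my $mem_fh, '<', \$chunk or die "Cannot open chunk: $!";
        while (my $line = <$mem_fh>) {
            push @matches, $line if $line =~ $line_re;
        }
        close $mem_fh;
    }
    close $fh;

    return @matches;
}

# Small made-up test file: 1000 repetitions of three short lines.
my ($tmp_fh, $tmp_name) = tempfile( UNLINK => 1 );
print {$tmp_fh} "nose\ncuke\nkarl\n" x 1000;
close $tmp_fh;

my @found = grep_chunked($tmp_name, 'karl', 64);
printf "%d matching lines\n", scalar @found;
```

The pre-check only pays off when most chunks contain no match. In this toy file every chunk matches, so the sketch shows the mechanics rather than the speedup.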
Replies are listed 'Best First'.
Re^2: Threads From Hell #2: How To Parse A Very Huge File
by BrowserUk (Patriarch) on May 24, 2015 at 07:56 UTC
by marioroy (Prior) on May 24, 2015 at 13:27 UTC