The assumptions and testing are valid. However, MCE is quite fast at this. MCE has bin/mce_grep (a parallel wrapper for the grep binary) and examples/egrep.pl (100% Perl code). Both run faster than the grep command.
$ time grep karl very_huge.file
nose cuke karl

real    0m2.127s
user    0m1.845s
sys     0m0.283s

$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl

real    0m1.061s
user    0m2.176s
sys     0m1.616s

$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl

real    0m0.690s
user    0m2.165s
sys     0m0.362s
MCE::Grep has an alternative mode: append the "_f" suffix and pass the file name so the workers read the file directly. That variant runs in 8.5 seconds here; the remaining overhead comes from calling the code block once per line. Thus, for a plain pattern search, use egrep.pl residing inside the examples directory.
use MCE::Grep;

# open( my $fh, '<', 'very_huge.file' );
# my @result = mce_grep { /karl/ } $fh;
# close $fh;

my @result = mce_grep_f { /karl/ } 'very_huge.file';
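If you stay with MCE::Grep, the worker count can be tuned up front via the model's init routine. A minimal sketch, assuming MCE::Grep::init accepts a hash reference the same way MCE::Loop::init does in the snippet further below (the max_workers value is only illustrative):

use MCE::Grep;

# Configure the underlying MCE instance before the first mce_grep_f call.
# max_workers => 4 is an example; tune it for your machine.
MCE::Grep::init( { max_workers => 4 } );

# Read the file directly. The code block still runs once per input line,
# which is where the 8.5 seconds mentioned above comes from.
my @result = mce_grep_f { /karl/ } 'very_huge.file';

print scalar(@result), " matching lines\n";

Even with more workers, the per-line block call keeps this approach well behind egrep.pl for a simple pattern search.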
The following code snippet parses the 2 GiB file in 1 second.
use MCE::Loop;
use Time::HiRes qw( time );

MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

my $start   = time;
my $pattern = 'karl';

my @result = mce_loop_f {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    ## Quickly determine if a match is found.
    ## Basically, only process slurped chunk if true.

    if ($$slurp_ref =~ /$pattern/im) {
        my @matches;

        open my $MEM_FH, '<', $slurp_ref;
        binmode $MEM_FH, ':raw';

        while (<$MEM_FH>) {
            push @matches, $_ if (/$pattern/);
        }
        close $MEM_FH;

        MCE->gather(@matches);
    }

} 'very_huge.file';

print join('', @result);
printf "Took %.3f seconds\n", time - $start;
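One thing to keep in mind: MCE->gather collects results in the order workers finish their chunks, so the matches above are not guaranteed to come back in file order. If ordering matters, MCE ships with MCE::Candy for ordered gathering. A minimal sketch of that variation, assuming MCE::Candy::out_iter_array is available in your MCE release; note that every chunk must gather its chunk_id, even with an empty result, so the ordered iterator can flush:

use MCE::Loop;
use MCE::Candy;

my @ordered;

MCE::Loop::init( {
    max_workers => 4, use_slurpio => 1,
    gather => MCE::Candy::out_iter_array(\@ordered),
} );

my $pattern = 'karl';

mce_loop_f {
    my ($mce, $slurp_ref, $chunk_id) = @_;
    my $matches = '';

    ## Same quick pre-check as above; skip the line-by-line work
    ## when the slurped chunk contains no match at all.
    if ($$slurp_ref =~ /$pattern/) {
        open my $MEM_FH, '<', $slurp_ref;
        binmode $MEM_FH, ':raw';
        while (<$MEM_FH>) { $matches .= $_ if (/$pattern/); }
        close $MEM_FH;
    }

    ## Gather once per chunk (even an empty string) so the
    ## ordered iterator sees every chunk_id.
    MCE->gather($chunk_id, $matches);

} 'very_huge.file';

print join('', @ordered);

The pre-check on the slurped chunk is what keeps the slurpio version near 1 second: chunks without a hit never pay for the per-line loop.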