The assumptions and testing are valid. However, MCE is quite fast at this. MCE has bin/mce_grep (a parallel wrapper for the grep binary) and examples/egrep.pl (100% Perl code). Both run faster than the grep command.
$ time grep karl very_huge.file
nose cuke karl

real    0m2.127s
user    0m1.845s
sys     0m0.283s

$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl

real    0m1.061s
user    0m2.176s
sys     0m1.616s

$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl

real    0m0.690s
user    0m2.165s
sys     0m0.362s
MCE::Grep has an alternative mode: append the "_f" suffix and pass the file name so the workers read the file directly. That variant runs in 8.5 seconds here; the remaining overhead comes from calling the code block once per line. Thus, for a plain pattern search, use egrep.pl residing inside the examples directory.
use MCE::Grep;

# open( my $fh, '<', 'very_huge.file' );
# my @result = mce_grep { /karl/ } $fh;
# close $fh;

my @result = mce_grep_f { /karl/ } 'very_huge.file';
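If you stay with MCE::Grep, the worker count can be tuned up front via the model's init routine. A minimal sketch, assuming MCE::Grep::init accepts a hash reference the same way MCE::Loop::init does in the snippet further below (the max_workers value is only illustrative):

use MCE::Grep;

# Configure the underlying MCE instance before the first mce_grep_f call.
# max_workers => 4 is an example; tune it for your machine.
MCE::Grep::init( { max_workers => 4 } );

# Read the file directly. The code block still runs once per input line,
# which is where the 8.5 seconds mentioned above comes from.
my @result = mce_grep_f { /karl/ } 'very_huge.file';

print scalar(@result), " matching lines\n";

Even with more workers, the per-line block call keeps this approach well behind egrep.pl for a simple pattern search.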
The following code snippet parses the 2 GiB file in 1 second.
use MCE::Loop;
use Time::HiRes qw( time );

MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

my $start   = time;
my $pattern = 'karl';

my @result = mce_loop_f {
    my ($mce, $slurp_ref, $chunk_id) = @_;

    ## Quickly determine if a match is found.
    ## Basically, only process slurped chunk if true.

    if ($$slurp_ref =~ /$pattern/im) {
        my @matches;

        open my $MEM_FH, '<', $slurp_ref;
        binmode $MEM_FH, ':raw';

        while (<$MEM_FH>) {
            push @matches, $_ if (/$pattern/);
        }
        close $MEM_FH;

        MCE->gather(@matches);
    }

} 'very_huge.file';

print join('', @result);
printf "Took %.3f seconds\n", time - $start;
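One thing to keep in mind: MCE->gather collects results in the order workers finish their chunks, so the matches above are not guaranteed to come back in file order. If ordering matters, MCE ships with MCE::Candy for ordered gathering. A minimal sketch of that variation, assuming MCE::Candy::out_iter_array is available in your MCE release; note that every chunk must gather its chunk_id, even with an empty result, so the ordered iterator can flush:

use MCE::Loop;
use MCE::Candy;

my @ordered;

MCE::Loop::init( {
    max_workers => 4, use_slurpio => 1,
    gather => MCE::Candy::out_iter_array(\@ordered),
} );

my $pattern = 'karl';

mce_loop_f {
    my ($mce, $slurp_ref, $chunk_id) = @_;
    my $matches = '';

    ## Same quick pre-check as above; skip the line-by-line work
    ## when the slurped chunk contains no match at all.
    if ($$slurp_ref =~ /$pattern/) {
        open my $MEM_FH, '<', $slurp_ref;
        binmode $MEM_FH, ':raw';
        while (<$MEM_FH>) { $matches .= $_ if (/$pattern/); }
        close $MEM_FH;
    }

    ## Gather once per chunk (even an empty string) so the
    ## ordered iterator sees every chunk_id.
    MCE->gather($chunk_id, $matches);

} 'very_huge.file';

print join('', @ordered);

The pre-check on the slurped chunk is what keeps the slurpio version near 1 second: chunks without a hit never pay for the per-line loop.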