
The OP seemed interested in whether parallelism is possible for such a task. Please disregard my posts if I have misunderstood. In the spirit of parallelism, I tested a 20 GiB file under the host OS (a laptop with 16 GiB of RAM), comparing the grep command, bin/mce_grep, examples/egrep.pl, and the script using MCE::Loop.

Recap: bin/mce_grep is a parallel wrapper for the grep command; examples/egrep.pl is 100% Perl code.

I am getting the impression that you do not like MCE. If that is the case, then I should refrain from posting here. Have you not tried MCE against your 10 GiB file, e.g. with bin/mce_grep or examples/egrep.pl?

$ ls -lh very_huge.file
-rw-r--r--  1 mario  staff    20G May 24 14:53 very_huge.file

## grep command

$ time grep karl very_huge.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl

real    6m47.048s    ( 407 seconds )
user    6m42.372s
sys     0m 4.669s

## bin/mce_grep

$ time ./MCE-1.608/bin/mce_grep karl very_huge.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl

real    2m17.003s    ( 137 seconds )
user   17m 9.223s
sys     0m33.223s

## examples/egrep.pl

$ time ./MCE-1.608/examples/egrep.pl karl very_huge.file
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl

real    0m26.447s
user    0m22.527s
sys     0m 8.459s

## MCE::Loop script

$ time ./mce_loop_script.pl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
nose cuke karl
Took 25.650 seconds

real    0m25.764s
user    0m42.494s
sys     0m 7.264s
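
To put the four runs side by side, here is a small sketch that turns the "real" times reported above into speedups relative to plain grep. The `speedup` helper and the `@runs` table are names made up for this illustration; the numbers are copied from the timings above, and each comes from a single run.

```perl
use strict;
use warnings;

# Wall-clock ("real") times in seconds, copied from the runs above.
my @runs = (
    [ 'grep command'      => 6 * 60 + 47.048 ],
    [ 'bin/mce_grep'      => 2 * 60 + 17.003 ],
    [ 'examples/egrep.pl' => 26.447 ],
    [ 'MCE::Loop script'  => 25.764 ],
);

# Hypothetical helper: speedup of a run relative to a baseline,
# computed as baseline time divided by the run's time.
sub speedup {
    my ($baseline, $elapsed) = @_;
    return $baseline / $elapsed;
}

my $baseline = $runs[0][1];
for my $run (@runs) {
    my ($name, $seconds) = @$run;
    printf "%-18s %8.3f s  %6.2fx\n",
        $name, $seconds, speedup($baseline, $seconds);
}
```

By this measure bin/mce_grep comes out at roughly 3x, and the two pure-Perl variants at roughly 15x to 16x, over the grep command on this machine.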

Below is the script using MCE::Loop.

use strict;
use warnings;

use MCE::Loop;
use Time::HiRes qw( time );

MCE::Loop::init( { max_workers => 4, use_slurpio => 1 } );

my $start = time;
my $pattern = 'karl';

my @result = mce_loop_f {
   my ($mce, $slurp_ref, $chunk_id) = @_;

   ## Quickly determine if a match is found.
   ## Basically, only process slurped chunk if true.

   if ($$slurp_ref =~ /$pattern/im) {
      my @matches;

      open my $MEM_FH, '<', $slurp_ref or die "Cannot open chunk: $!";
      binmode $MEM_FH, ':raw';
      while (<$MEM_FH>) { push @matches, $_ if (/$pattern/); }
      close   $MEM_FH;

      MCE->gather(@matches);
   }

} 'very_huge.file';

print join('', @result);

printf "Took %.3f seconds\n", time - $start;
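
The two-stage filter inside the worker block can be tried on its own, without MCE. The following is a minimal sketch under a few assumptions: `grep_chunk` is a hypothetical helper, not part of MCE, the chunks are small in-memory strings standing in for the slurped chunks, and both stages use the same case-sensitive pattern.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of MCE): return the lines of one chunk
# that match $pattern. The per-line scan is skipped entirely when the
# chunk as a whole does not match.
sub grep_chunk {
    my ($chunk_ref, $pattern) = @_;

    # Stage 1: a single regex test against the entire chunk.
    return () unless $$chunk_ref =~ /$pattern/m;

    # Stage 2: read the chunk line by line through an in-memory handle.
    my @matches;
    open my $mem_fh, '<', $chunk_ref
        or die "Cannot open in-memory handle: $!";
    while (my $line = <$mem_fh>) {
        push @matches, $line if $line =~ /$pattern/;
    }
    close $mem_fh;

    return @matches;
}

# Two small stand-in chunks: one with a match, one without.
my $chunk_a = "foo bar baz\nnose cuke karl\nqux\n";
my $chunk_b = "nothing here\nor here\n";

print grep_chunk(\$chunk_a, 'karl');
print "no match in second chunk\n" unless grep_chunk(\$chunk_b, 'karl');
```

The point of the first stage is that most chunks of a huge file contain no match at all, so one regex test per chunk is cheaper than splitting every chunk into lines.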

I have taken the time, my free time, to answer the OP's request. It is not worth it anymore at this site, especially when you (being at the Pope level) seem to disapprove of MCE.

Best regards to all, -mario

In reply to Re^7: Threads From Hell #2: How To Parse A Very Huge File by marioroy
in thread Threads From Hell #2: How To Search A Very Huge File [SOLVED] by karlgoethebier
