
Hi all,

For learning purposes I started to think about how to ~~parse~~ search a very huge file using the multithreading capabilities of Perl.

Since I like trivial examples, I started with something trivial and first created a huge file:

karls-mac-mini:monks karl$ ls -hl very_huge.file 
-rw-r--r--  1 karl  karl   2,0G 23 Mai 19:38 very_huge.file

karls-mac-mini:monks karl$ tail very_huge.file 
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
Lorem ipsum kizuaheli
nose cuke karl
 
karls-mac-mini:monks karl$ wc -l very_huge.file 
 100000001 very_huge.file
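For anyone who wants to reproduce the test file: the post does not show how it was created, so the following is only a sketch of one way to get a file of the same shape. The sub name make_test_file and its arguments are made up for illustration. At 22 bytes per filler line, 100_000_000 filler lines plus the final line give about 2 GB and the line count shown above.

```perl
#!/usr/bin/env perl
# Hypothetical generator for a file shaped like very_huge.file:
# many identical filler lines followed by one line containing the needle.
use strict;
use warnings;

sub make_test_file {
    my ( $path, $filler_lines ) = @_;
    open( my $out, '>', $path ) or die "Cannot write $path: $!";

    # Write the filler in batches of 1000 lines to reduce the number of prints.
    my $filler = "Lorem ipsum kizuaheli\n";
    my $batch  = $filler x 1000;
    print {$out} $batch for 1 .. int( $filler_lines / 1000 );
    print {$out} $filler x ( $filler_lines % 1000 );

    # The single line the search is supposed to find.
    print {$out} "nose cuke karl\n";
    close $out or die "Cannot close $path: $!";
    return $filler_lines + 1;    # total number of lines written
}

# Usage: ./make_file.pl very_huge.file 100000000
make_test_file( $ARGV[0], $ARGV[1] // 100_000 ) if @ARGV;
```

Without arguments the script does nothing; with only a file name it writes a small file of 100_001 lines, which is handy for quick experiments.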

After some RTFM I came up with this, using MCE::Grep:

#!/usr/bin/env perl 

use strict;
use warnings;
use MCE::Grep;
use Data::Dump;
use Time::HiRes qw (time);

MCE::Grep::init( { max_workers => 4 } );

my $start = time;

open( my $fh, '<', 'very_huge.file' ) or die "Cannot open very_huge.file: $!";

my @result = mce_grep { /karl/ } $fh;

close $fh;

printf "Took %.3f seconds\n", time - $start;

dd \@result;

__END__

karls-mac-mini:monks karl$ ./huge.pl 
Took 29.690 seconds
["nose cuke karl\n"]

Good old grep easily performs much better:

karls-mac-mini:monks karl$ time grep karl very_huge.file 
nose cuke karl

real    0m2.563s
user    0m2.176s
sys    0m0.309s

I don't know if this trivial exercise is embarrassingly parallel, but I'm wondering how to:

do this "by hand" (without using MCE::Grep)
...and improve the performance
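As a possible starting point for the "by hand" part, here is a minimal sketch. It is an assumption about how one might do it, not what MCE::Grep does internally, and it uses fork and pipes from core Perl rather than ithreads. The file is split into byte ranges, each worker handles the lines that start inside its range, and the parent collects the matches. The names scan_range and parallel_grep are made up for illustration, and nothing here has been timed against the 2 GB file.

```perl
#!/usr/bin/env perl
# Sketch: search a file in parallel by byte ranges, using fork and pipes.
use strict;
use warnings;
use POSIX ();

# Scan the lines that START inside [$start, $end) and print matches to $out.
sub scan_range {
    my ( $path, $regex, $start, $end, $out ) = @_;
    return if $start >= $end;
    open( my $fh, '<:raw', $path ) or die "Cannot open $path: $!";
    if ( $start > 0 ) {
        # Step back one byte and discard up to the next newline, so that we
        # continue at the first line beginning at or after $start.
        seek( $fh, $start - 1, 0 ) or die "Cannot seek in $path: $!";
        my $discard = <$fh>;
    }
    while ( tell($fh) < $end ) {
        my $line = <$fh>;
        last unless defined $line;
        print {$out} $line if $line =~ $regex;
    }
    close $fh;
    return;
}

# Fork $workers children, one per byte range, and collect their matches.
sub parallel_grep {
    my ( $path, $regex, $workers ) = @_;
    my $size = ( stat $path )[7];
    die "Cannot stat $path: $!" unless defined $size;
    my $chunk = int( $size / $workers ) + 1;
    my @readers;
    for my $i ( 0 .. $workers - 1 ) {
        pipe( my $reader, my $writer ) or die "pipe failed: $!";
        my $pid = fork() // die "fork failed: $!";
        if ( $pid == 0 ) {    # child: scan one range, report via the pipe
            close $reader;
            my $start = $i * $chunk;
            my $end   = $start + $chunk;
            $end = $size if $end > $size;
            scan_range( $path, $regex, $start, $end, $writer );
            close $writer;        # flushes the buffered matches
            POSIX::_exit(0);      # leave without running the parent's END blocks
        }
        close $writer;            # parent keeps only the read end
        push @readers, $reader;
    }
    my @matches;
    for my $reader (@readers) {   # ranges are read back in file order
        push @matches, <$reader>;
        close $reader;
    }
    1 while wait() != -1;         # reap all children
    return @matches;
}

# Usage: ./by_hand.pl very_huge.file karl
if (@ARGV) {
    my ( $path, $pattern ) = @ARGV;
    print parallel_grep( $path, qr/$pattern/, 4 );
}
```

Every worker opens its own filehandle, so no file position is shared between processes. The same chunking idea should carry over to threads; only the way results travel back to the parent would differ.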

Thank you very much for any hint and best regards,

Update:

Edit: Struck out nonsense.

Ouch! Perhaps more RTFM would have helped:

PID   Process name  User  % CPU  Physical memory  Virtual memory
1065  perl          karl  12,7   10,3 MB          2,33 GB
1068  perl          karl  83,7   3,9 MB           2,33 GB
1069  perl          karl  84,6   3,9 MB           2,33 GB
1070  perl          karl  83,5   3,9 MB           2,33 GB
1071  perl          karl  84,0   3,9 MB           2,33 GB

Edit 2: Renamed the thread

Update 3: Many thanks to marioroy and BrowserUk for their patience and their contributions to this interesting thread.

Karl

«The Crux of the Biscuit is the Apostrophe»

In reply to Threads From Hell #2: How To Search A Very Huge File [SOLVED] by karlgoethebier
