comment on

Update: Changed the chunk_size option from '1m' to '24m'. The time drops down to 3.2 seconds via MCE with FS cache purged ( sudo purge ) before running on a Macbook Pro laptop. Previously, this was taking 6.2 seconds for chunk_size => '1m'. The time is ~ 1 second if the file resides in FS cache.

Update: Added the 'm' modifier to the regex operation.

Update: Ensuring the file does not live in FS cache, the time is 7.8 seconds running serially and 6.2 seconds running on many cores for the ~ 2 GB plain text file. Once in FS cache, the time is 5.4 seconds serially and 0.9 seconds via MCE.

Update: The unzipping of the file met that the file resided in FS cache afterwards. One doesn't normally flush FS memory typically. But, I met to do so before running. I have already removed the zip and plain text files and did not run again. IO is fast when processing a file directly. The reason is that workers do not involved the manager process when reading.

Anonymous Monk, the following is a parallel demonstration of the online code. Yes, reading line by line is not necessary. Thus performance increases by 5x from the serial version. This is also faster than the previous parallel demonstrations by many factors.

The parallel example below parses the ~ 2 GB plain text file in 0.9 seconds. The online serial demonstration completes in 5.2 seconds. My laptop has 4 real cores and 4 hyper-threads. Seeing nearly 6x is really good and did not expect that.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
   chunk_size => '24m', max_workers => 8,
   use_slurpio => 1,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;

   my $numlines = $$chunk_ref =~ tr/\n//;
   my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

   $counter1->incrby( $numlines );
   $counter2->incrby( $occurances );

}, "Dictionary2GB.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";
[download]

In reply to Re^2: How to optimize a regex on a large file read line by line ? by marioroy
in thread How to optimize a regex on a large file read line by line ? by John FENDER

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.