Re: How to optimize a regex on a large file read line by line ?

The following is a parallel demonstration using MCE::Flow and MCE::Shared.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
   chunk_size => '1m', max_workers => 8,
   use_slurpio => 1,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $numlines, $occurances ) = ( 0, 0 );

   while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
      $numlines++;
      $occurances++ if ( $1 =~ /123456\r/ );
   }

   $counter1->incrby( $numlines   );
   $counter2->incrby( $occurances );

}, $fh;

close $fh;

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

The following construction reads the plain text file directly if already unzipped.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
   chunk_size => '1m', max_workers => 8,
   use_slurpio => 1,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $numlines, $occurances ) = ( 0, 0 );

   while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
      $numlines++;
      $occurances++ if ( $1 =~ /123456\r/ );
   }

   $counter1->incrby( $numlines   );
   $counter2->incrby( $occurances );

}, "10-million-combos.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";
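The slurping workers above run two regexes per line. As a minimal sketch of one possible alternative, written in core Perl only and outside MCE so it can be tried on its own, the newlines can be counted with `tr` and the query matched directly against the whole chunk. The helper name `count_in_chunk` and the sample data are hypothetical, and the pattern assumes CRLF line endings like the code above. Note that `tr/\n//` also counts empty lines, which the `[^\n]+\n` pattern skips, so the line totals can differ on files containing blank lines.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to a slurped chunk and returns
# ( number of newlines, number of lines ending in "123456\r\n" ).
sub count_in_chunk {
    my ( $chunk_ref ) = @_;

    # tr in counting mode: counts "\n" characters without modifying the string.
    my $numlines = ( $$chunk_ref =~ tr/\n// );

    # List-assignment idiom: number of non-overlapping matches in the chunk.
    my $occurrences = () = $$chunk_ref =~ /123456\r\n/g;

    return ( $numlines, $occurrences );
}

# Made-up sample data with CRLF line endings.
my $chunk = "abc123456\r\nxyz\r\n123456abc\r\n000123456\r\n";
my ( $lines, $hits ) = count_in_chunk( \$chunk );

print "Num lines  : $lines\n";
print "Occurrences: $hits\n";
```

Inside an MCE worker, the same two statements could replace the `while` loop, with the results passed to `incrby` as before. Whether this is actually faster would need to be measured on the real file.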


Replies are listed 'Best First'.
Re^2: How to optimize a regex on a large file read line by line ? by marioroy (Prior) on Apr 17, 2016 at 16:19 UTC
Update: Shortened the code.

Hello again,

Slurping requires two regular expressions: one to break the chunk into lines and another for the query. Below, workers receive an array reference containing some number of lines and run slightly faster, possibly because only one regex is needed.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
   chunk_size => '1m', max_workers => 8,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my $numlines   = @{ $chunk_ref };
   my $occurances = 0;

   for ( @{ $chunk_ref } ) {
      $occurances++ if /123456\r/;
   }

   $counter1->incrby( $numlines   );
   $counter2->incrby( $occurances );

}, $fh;

close $fh;

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

And finally, the construction for reading the plain text file directly.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
   chunk_size => '1m', max_workers => 8,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my $numlines   = @{ $chunk_ref };
   my $occurances = 0;

   for ( @{ $chunk_ref } ) {
      $occurances++ if /123456\r/;
   }

   $counter1->incrby( $numlines   );
   $counter2->incrby( $occurances );

}, "10-million-combos.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";
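The per-chunk work of the array-based workers above can also be written with `grep` in scalar context. The sketch below is core Perl only and runs without MCE; the helper name `count_in_lines` and the sample data are hypothetical. It also swaps in a slightly different pattern, `/123456\r?$/`, as an assumption for files that mix CRLF and LF line endings. Unlike `/123456\r/`, this pattern is anchored to the end of the line and also matches lines without a carriage return.

```perl
use strict;
use warnings;

# Hypothetical helper: takes a reference to an array of lines and returns
# ( number of lines, number of lines ending in "123456" ).
sub count_in_lines {
    my ( $lines_ref ) = @_;

    # An array in scalar context yields its element count.
    my $numlines = @{ $lines_ref };

    # grep in scalar context returns how many elements matched.
    # \r? makes the carriage return optional; $ matches just before
    # the trailing newline, so both CRLF and LF endings are accepted.
    my $occurrences = grep { /123456\r?$/ } @{ $lines_ref };

    return ( $numlines, $occurrences );
}

# Made-up sample data mixing CRLF and LF line endings.
my @lines = ( "abc123456\r\n", "xyz\n", "000123456\n", "123456abc\r\n" );
my ( $numlines, $occurrences ) = count_in_lines( \@lines );

print "Num lines  : $numlines\n";
print "Occurrences: $occurrences\n";
```

Keeping the counting in a plain function like this makes it easy to check the pattern on a few known lines before running it across all workers.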
Re^3: How to optimize a regex on a large file read line by line ? by John FENDER (Acolyte) on Apr 17, 2016 at 22:17 UTC
Hello marioroy,

It's impressive! I have tested your code successfully. In the end, the Strawberry distribution works best on my system. It even works well with the mixed file (multiple kinds of line endings)!

Benchmarking all three methods, I measured 32,53 and 33,07 for the two versions you kindly provided. My current code (which works only on a CR+LF or LF file) took 33,76.

It's impressive to see all 8 CPU cores at 100% at the same time with your demo! But the results differ only slightly from my code, which doesn't load the cores like that at all. Strange!

Very happy anyway: I'm now close to the best performance I can get on my laptop with Perl!

Grep : 10,71
Java : 25,95
C# : 30,05
Perl : 32,53
C++ : 41,3
PHP : 52,31
Free Pascal : 76,46
Delphi 7 : 78,14
VB.NET : 100,15
Python : 315,13
PowerShell : 681,93
VBS : 1031,63
Ruby : Failed to parse the file correctly.
Re^4: How to optimize a regex on a large file read line by line ? by Anonymous Monk on Apr 18, 2016 at 16:37 UTC
Hi John FENDER. It seems like you might have given up too early, before the cavalry arrived. See Re^2: How to optimize a regex on a large file read line by line ?. Looks like both are even faster than the grep+wc solution, though this would need to be confirmed on your machine.
Re^5: How to optimize a regex on a large file read line by line ? by John FENDER (Acolyte) on Apr 19, 2016 at 23:13 UTC
Re^6: How to optimize a regex on a large file read line by line ? by Anonymous Monk on Apr 20, 2016 at 20:48 UTC
Re^4: How to optimize a regex on a large file read line by line ? by LanX (Saint) on Apr 17, 2016 at 22:48 UTC
Look, I'd appreciate it if you'd stop posting benchmarks without showing the code. Our grep vs. Perl discussion alone showed that you are comparing apples with oranges. The only thing you are possibly benchmarking is your coding skills.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!
Re^5: How to optimize a regex on a large file read line by line ? by John FENDER (Acolyte) on Apr 18, 2016 at 06:31 UTC