in reply to How to optimize a regex on a large file read line by line ?

Hello John FENDER,

The following is a parallel demonstration using MCE::Flow and MCE::Shared.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
   chunk_size => '1m', max_workers => 8,
   use_slurpio => 1,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $numlines, $occurances ) = ( 0, 0 );

   while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
      $numlines++;
      $occurances++ if ( $1 =~ /123456\r/ );
   }

   $counter1->incrby( $numlines );
   $counter2->incrby( $occurances );

}, $fh;

close $fh;

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

The following construction reads the plain text file directly if already unzipped.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
   chunk_size => '1m', max_workers => 8,
   use_slurpio => 1,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my ( $numlines, $occurances ) = ( 0, 0 );

   while ( $$chunk_ref =~ /([^\n]+\n)/mg ) {
      $numlines++;
      $occurances++ if ( $1 =~ /123456\r/ );
   }

   $counter1->incrby( $numlines );
   $counter2->incrby( $occurances );

}, "10-million-combos.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

Re^2: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 17, 2016 at 16:19 UTC

    Update: Shortened the code.

    Hello again,

    Slurping requires two regular expressions: one to break the chunk into lines and another for the query. Below, workers instead receive an array reference containing some number of lines and run slightly faster, possibly because only one regex is needed.

    use strict;
    use warnings;

    use MCE::Flow;
    use MCE::Shared;

    open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow {
       chunk_size => '1m', max_workers => 8,
    },
    sub {
       my ( $mce, $chunk_ref, $chunk_id ) = @_;
       my $numlines   = @{ $chunk_ref };
       my $occurances = 0;

       for ( @{ $chunk_ref } ) {
          $occurances++ if /123456\r/;
       }

       $counter1->incrby( $numlines );
       $counter2->incrby( $occurances );

    }, $fh;

    close $fh;

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

    And finally, the construction for reading the plain text file directly.

    use strict;
    use warnings;

    use MCE::Flow;
    use MCE::Shared;

    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow_f {
       chunk_size => '1m', max_workers => 8,
    },
    sub {
       my ( $mce, $chunk_ref, $chunk_id ) = @_;
       my $numlines   = @{ $chunk_ref };
       my $occurances = 0;

       for ( @{ $chunk_ref } ) {
          $occurances++ if /123456\r/;
       }

       $counter1->incrby( $numlines );
       $counter2->incrby( $occurances );

    }, "10-million-combos.txt";

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";
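
    To compare the two styles in isolation, here is a minimal sketch in core Perl only, without MCE. The sample string and the helper names count_slurped and count_lines are made up for illustration and are not part of the demonstrations above. It only shows why the slurped variant needs a second regex; it says nothing about timing.

```perl
use strict;
use warnings;

# Hypothetical sample data (not from the thread): four CR+LF lines,
# two of which contain "123456" directly before the carriage return.
my $chunk = "abc123456\r\nfoo\r\n123456\r\nbar123456x\r\n";

# Slurp style: the worker receives one big string, so one regex breaks
# it into lines and a second regex runs the query on each line.
sub count_slurped {
    my ($chunk_ref) = @_;
    my ( $numlines, $occurrences ) = ( 0, 0 );
    while ( $$chunk_ref =~ /([^\n]+\n)/g ) {
        $numlines++;
        $occurrences++ if $1 =~ /123456\r/;
    }
    return ( $numlines, $occurrences );
}

# Line style: the worker receives an array of lines, so only the query
# regex is needed.
sub count_lines {
    my ($lines_ref) = @_;
    my $numlines    = @{$lines_ref};
    my $occurrences = grep { /123456\r/ } @{$lines_ref};
    return ( $numlines, $occurrences );
}

# Split after each newline so every element keeps its line terminator.
my @lines = split /(?<=\n)/, $chunk;

printf "slurped: %d lines, %d matches\n", count_slurped( \$chunk );
printf "lines  : %d lines, %d matches\n", count_lines( \@lines );
```

    Both calls should report 4 lines and 2 matches for this sample. One difference to keep in mind: the slurp-style pattern [^\n]+\n skips completely empty lines, while the line style counts them, so the line totals can differ on files that contain blank lines.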
      Hello marioroy,

      It's impressive! And I've successfully tested your code. Eventually, the Strawberry distribution works best on my system.

      "It even works well with the mixed file (multiple kinds of EOL)! Benchmarking the three methods, I found 32,53 and 33,07 for the two programs you kindly provided. My current code (which works only on a CR+LF or LF file) took 33,76".

      It's impressive to see all 8 CPU cores at 100% at the same time with your demo! But the results differ only slightly from my code, which doesn't load all the cores like that at all. Strange!

      Very happy anyway, I'm now close to the best performance I could get on my laptop with Perl!

      . Grep : 10,71
      . Java : 25,95
      . C# : 30,05
      . Perl : 32,53
      . C++ : 41,3
      . PHP : 52,31
      . Free Pascal : 76,46
      . Delphi 7 : 78,14
      . VB.NET : 100,15
      . Python : 315,13
      . PowerShell : 681,93
      . VBS : 1031,63
      . Ruby : Failed to parse the file correctly.

        Look, I'd appreciate it if you'd stop posting benchmarks without showing the code.

        Our grep vs. Perl discussion alone showed that you are comparing apples with oranges.

        The only thing you are possibly benchmarking are your coding skills.

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!