in reply to Re^2: How to optimize a regex on a large file read line by line ?
in thread How to optimize a regex on a large file read line by line ?

Update: I am providing updated results; background processes were running during the earlier tests. After rebooting my laptop, I realized that things were running faster, which meant having to re-run all the tests. Included are results for the upcoming MCE 1.706 release with faster IO (applies to use_slurpio => 1). Previously, I was unable to get below 3.0 seconds on the Mac with MCE 1.705. The run time is 2.2 seconds with MCE 1.706, which is close to the underlying hardware limit. MCE 1.706 will be released soon.

I ran the same tests from a Linux VM via Parallels Desktop, with the 2 GB plain text file residing on a virtual disk inside Fedora 22. Unlike on OS X, the grep binary runs much faster under Linux.

## FS cache purged inside Linux and on Mac OS X before running.

wc -l       : 1.732 secs.   from virtual disk
grep -c     : 1.912 secs.   from virtual disk
total       : 3.644 secs.

wc -l       : 1.732 secs.   from virtual disk
grep -c     : 0.884 secs.   from FS cache
total       : 2.616 secs.

Perl script : 3.910 secs.   non-MCE using 1 core

              MCE 1.705    MCE 1.706
with MCE    : 4.357 secs.  4.015 secs.  using 1 core
with MCE    : 3.228 secs.  2.979 secs.  using 2 cores
with MCE    : 2.884 secs.  2.624 secs.  using 3 cores
with MCE    : 2.908 secs.  2.501 secs.  using 4 cores

## Dictionary2GB.txt residing inside FS cache on Linux.

wc -l       : 1.035 secs.
grep -c     : 0.866 secs.
total       : 1.901 secs.

Perl script : 2.314 secs.   non-MCE using 1 core

              MCE 1.705    MCE 1.706
with MCE    : 2.344 secs.  2.337 secs.  using 1 core
with MCE    : 1.349 secs.  1.345 secs.  using 2 cores
with MCE    : 0.961 secs.  0.932 secs.  using 3 cores
with MCE    : 0.820 secs.  0.775 secs.  using 4 cores

On Linux, it takes at least 3 workers to run as fast as wc and grep combined, with grep reading from the FS cache.

Below are the serial code and the MCE code, respectively.

use strict;
use warnings;

my $size = 24 * 1024 * 1024;
my ( $numlines, $occurrences ) = ( 0, 0 );

open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

while ( read( $fh, my $b, $size ) ) {
    # Complete the last partial line so a match is never split
    # across two chunks.
    $b .= <$fh> unless ( eof $fh );

    $numlines    += $b =~ tr/\n//;
    $occurrences += () = $b =~ /123456\r?$/mg;
}

close $fh;

print "Num lines  : $numlines\n";
print "Occurrences: $occurrences\n";
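For comparison, a plain line-by-line version (the approach this thread started with) might look like the sketch below; it produces the same counts but pays per-line IO and regex overhead. The same file path and pattern are assumed.

use strict;
use warnings;

my ( $numlines, $occurrences ) = ( 0, 0 );

open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

while ( my $line = <$fh> ) {
    $numlines++;
    $occurrences++ if $line =~ /123456\r?$/;
}

close $fh;

print "Num lines  : $numlines\n";
print "Occurrences: $occurrences\n";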

Using MCE for running on multiple cores.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '24m', max_workers => 4, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    # Each worker receives a chunk ending on a line boundary.
    my $numlines    = $$chunk_ref =~ tr/\n//;
    my $occurrences = () = $$chunk_ref =~ /123456\r?$/mg;

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurrences );
},
"/home/mario/Dictionary2GB.txt";

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";
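As a side note, the shared counters could also be avoided with MCE's gather facility; the gather callback runs in the manager process, so plain scalars are safe to update there. A minimal sketch under that assumption, using the same file and pattern:

use strict;
use warnings;

use MCE::Flow;

my ( $numlines, $occurrences ) = ( 0, 0 );

mce_flow_f {
    chunk_size => '24m', max_workers => 4, use_slurpio => 1,

    # Runs serially in the manager process; no locking needed.
    gather => sub { $numlines += $_[0]; $occurrences += $_[1]; },
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    my $nl = $$chunk_ref =~ tr/\n//;
    my $oc = () = $$chunk_ref =~ /123456\r?$/mg;

    # Send the per-chunk counts to the manager process.
    MCE->gather( $nl, $oc );
},
"/home/mario/Dictionary2GB.txt";

print "Num lines  : $numlines\n";
print "Occurrences: $occurrences\n";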

Kind regards, Mario.

Replies are listed 'Best First'.
Re^4: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 23, 2016 at 03:15 UTC

    MCE 1.706 has been released, along with MCE::Shared 1.005. The MCE 1.706 release enables faster IO when use_slurpio => 1 is specified. Also, the chunk_size option is no longer necessary; the default (auto) chunk size performs close to optimally.

    mce_flow_f { max_workers => 4, use_slurpio => 1 }, sub { ... }, '/path/to/huge_file.txt';

    Kind regards, Mario.

Re^4: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 21, 2016 at 21:00 UTC

    Thank you. Finally some numbers from the same machine.

    It's good to see that Perl is in the same ballpark as (and sometimes faster than) grep+wc.

      I had to run the tests again after rebooting my Mac and noticing that things were snappier. Perhaps something was running in the background during the first round of testing, or maybe the virtual disk containing the 2 GB file was defragmented between then and now; I am not sure, really.

      Timings for the upcoming MCE 1.706 release are included for comparison.