in reply to How to optimize a regex on a large file read line by line ?

Update: The time is now 2.2 seconds using the same demonstration below on a Mac running the upcoming MCE 1.706 release. Running with four workers also completes in 2.2 seconds. Basically, this has reached the underlying hardware limit.

Today, I looked at MCE and compared runs against the 2 GB plain text file both residing in the FS cache and not. Increasing the chunk_size value is beneficial, especially when the file does not exist in the OS-level FS cache.

With an update to the code, simply increasing the chunk_size value from '1m' to '24m', the total run now takes 3.2 seconds to complete.

use strict; use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );   # total line count
my $counter2 = MCE::Shared->scalar( 0 );   # total pattern matches

mce_flow_f {
    chunk_size => '24m', max_workers => 8, use_slurpio => 1,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;

    # count lines and matches in this chunk, then update the shared counters
    my $numlines   = $$chunk_ref =~ tr/\n//;
    my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

    $counter1->incrby( $numlines );
    $counter2->incrby( $occurances );

}, "Dictionary2GB.txt";

print "Num lines : ", $counter1->get(), "\n";
print "Occurances: ", $counter2->get(), "\n";

One day, I will try another technique inside MCE to see if IO performance can be improved upon.

Resolved.

Replies are listed 'Best First'.
Re^2: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 21, 2016 at 15:59 UTC

    How fast is grep+wc on your machine?

      Grep and egrep run slowly on the Mac, and I do not know why.

      wc -l   :  2.162 seconds
      grep -c : 45.316 seconds
        > Grep and egrep run slowly on the Mac, and I do not know why.

        You have optimized the chunk size in your script manually; those tools were probably compiled with hardcoded chunk sizes which are no longer optimal.
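
        To illustrate the point (a rough sketch, not code from this thread): timing raw reads of the same Dictionary2GB.txt at a few buffer sizes shows the effect of the chunk size directly. Something along these lines:

        use strict; use warnings;
        use Time::HiRes qw( time );   # high-resolution clock

        # Rough sketch: time raw sequential reads of the file at different
        # buffer (chunk) sizes. After the first pass the file may sit in the
        # FS cache, so purge the cache between runs for meaningful numbers.
        my $file = 'Dictionary2GB.txt';   # file name taken from the posts above

        for my $mb ( 1, 4, 24 ) {
            my $size  = $mb * 1024 * 1024;
            my $start = time;

            open my $fh, '<', $file or die "open: $!";
            1 while read( $fh, my $buf, $size );
            close $fh;

            printf "%3d MiB chunks: %.3f secs.\n", $mb, time - $start;
        }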

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        update

        This was my 5000th post, time to retire....

Re^2: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 21, 2016 at 16:19 UTC
    I already wanted to remark that since the FS is the bottleneck, I'm not sure if parallelizing helps (there's only one FS).

    When comparing with grep/wc, please also compare the one-worker case, because grep shouldn't be parallelized (AFAIK).

    BTW: While we never saw the bash script, I suppose a second call (wc) is needed to get the total number of lines too, which makes comparing even more complicated, because that second call needs to read the file again.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      Update: Added serial code. I am happy that IO in MCE is not too far behind. One day, I will try another technique. IO aside, CPU-intensive operations such as the regex do benefit from running with multiple workers.

      Yes, IO will only go as fast as the underlying IO capabilities. MCE does sequential IO, meaning only one worker reads at any given time. The regex operation benefits from having multiple workers. Eventually, IO becomes the bottleneck.

      1 worker : 9.437 secs.
      2 workers: 4.480 secs.
      3 workers: 3.248 secs.
      4 workers: 3.236 secs.
      8 workers: 3.240 secs.
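
      For anyone wanting to collect numbers like the above, a rough sketch of a driver loop (it assumes the same file and pattern as in this thread, and worker spawn time is included in each measurement):

      use strict; use warnings;

      use MCE::Flow;
      use MCE::Shared;
      use Time::HiRes qw( time );

      my $file = 'Dictionary2GB.txt';   # same file as in the runs above

      for my $workers ( 1, 2, 3, 4, 8 ) {
          my $counter = MCE::Shared->scalar( 0 );
          my $start   = time;

          mce_flow_f {
              chunk_size => '24m', max_workers => $workers, use_slurpio => 1,
          },
          sub {
              my ( $mce, $chunk_ref, $chunk_id ) = @_;
              my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;
              $counter->incrby( $occurances );
          }, $file;

          MCE::Flow->finish;   # shut down workers before the next pass

          printf "%d worker(s): %.3f secs.\n", $workers, time - $start;
      }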

      Below, counting and the regex are removed from the equation and the run uses 1 worker. It completes as fast as IO allows, in 3.256 seconds.

      mce_flow_f {
          chunk_size => '24m', max_workers => 1, use_slurpio => 1,
      }, sub { }, 'Dictionary2GB.txt';

      The following serial code, reader only and without MCE, takes 2.864 seconds to read directly from the PCIe-based SSD drive, not from FS cache.

      use strict; use warnings;

      my $size = 24 * 1024 * 1024;

      open my $fh, '<', 'Dictionary2GB.txt' or die "$!";

      while ( read( $fh, my $b, $size ) ) {
          # append the remainder of the current line so the chunk ends on a newline
          $b .= <$fh> unless eof $fh;
      }

      close $fh;

      Update: I am providing updated results because background processes were running previously. I rebooted my laptop and realized that things were running faster, which meant having to re-run all the tests. Included are results for the upcoming MCE 1.706 release with faster IO (this applies to use_slurpio => 1). Previously, I was unable to run below 3.0 seconds on the Mac with MCE 1.705. The run time is 2.2 seconds with MCE 1.706, which is close to the underlying hardware limit. MCE 1.706 will be released soon.

      I ran the same tests from a Linux VM via Parallels Desktop with the 2 GB plain text file residing on a virtual disk inside Fedora 22. Unlike on OS X, the binary grep command runs much faster under Linux.

      ## FS cache purged inside Linux and on Mac OS X before running.

      wc -l       : 1.732 secs.  from virtual disk
      grep -c     : 1.912 secs.  from virtual disk
      total       : 3.644 secs.

      wc -l       : 1.732 secs.  from virtual disk
      grep -c     : 0.884 secs.  from FS cache
      total       : 2.616 secs.

      Perl script : 3.910 secs.  non-MCE using 1 core

                      MCE 1.705    MCE 1.706
      with MCE    : 4.357 secs.  4.015 secs.  using 1 core
      with MCE    : 3.228 secs.  2.979 secs.  using 2 cores
      with MCE    : 2.884 secs.  2.624 secs.  using 3 cores
      with MCE    : 2.908 secs.  2.501 secs.  using 4 cores

      ## Dictionary2GB.txt residing inside FS cache on Linux.

      wc -l       : 1.035 secs.
      grep -c     : 0.866 secs.
      total       : 1.901 secs.

      Perl script : 2.314 secs.  non-MCE using 1 core

                      MCE 1.705    MCE 1.706
      with MCE    : 2.344 secs.  2.337 secs.  using 1 core
      with MCE    : 1.349 secs.  1.345 secs.  using 2 cores
      with MCE    : 0.961 secs.  0.932 secs.  using 3 cores
      with MCE    : 0.820 secs.  0.775 secs.  using 4 cores

      On Linux, it takes at least 3 workers for MCE to run as fast as wc and grep combined, with grep reading from the FS cache.

      Below are the serial code and the MCE code, respectively.

      use strict; use warnings;

      my $size = 24 * 1024 * 1024;
      my ( $numlines, $occurances ) = ( 0, 0 );

      open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

      while ( read( $fh, my $b, $size ) ) {
          $b .= <$fh> unless ( eof $fh );

          $numlines   += $b =~ tr/\n//;
          $occurances += () = $b =~ /123456\r?$/mg;
      }

      close $fh;

      print "Num lines : $numlines\n";
      print "Occurances: $occurances\n";

      Using MCE for running on multiple cores.

      use strict; use warnings;

      use MCE::Flow;
      use MCE::Shared;

      my $counter1 = MCE::Shared->scalar( 0 );
      my $counter2 = MCE::Shared->scalar( 0 );

      mce_flow_f {
          chunk_size => '24m', max_workers => 4, use_slurpio => 1,
      },
      sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;

          my $numlines   = $$chunk_ref =~ tr/\n//;
          my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

          $counter1->incrby( $numlines );
          $counter2->incrby( $occurances );

      }, "/home/mario/Dictionary2GB.txt";

      print "Num lines : ", $counter1->get(), "\n";
      print "Occurances: ", $counter2->get(), "\n";

      Kind regards, Mario.

        MCE 1.706 has been released along with MCE::Shared 1.005. The MCE 1.706 release enables faster IO when use_slurpio => 1 is specified. Also, the chunk_size option is no longer necessary; performance is close to optimal with the automatic setting.

        mce_flow_f { max_workers => 4, use_slurpio => 1 }, sub { ... }, '/path/to/huge_file.txt';
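
        For completeness, a sketch of the counting example from above adapted to the automatic chunk size (same logic as the earlier code, with chunk_size simply omitted):

        use strict; use warnings;

        use MCE::Flow;
        use MCE::Shared;

        my $counter1 = MCE::Shared->scalar( 0 );
        my $counter2 = MCE::Shared->scalar( 0 );

        # chunk_size omitted; MCE 1.706 picks a suitable size automatically
        mce_flow_f {
            max_workers => 4, use_slurpio => 1,
        },
        sub {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;

            my $numlines   = $$chunk_ref =~ tr/\n//;
            my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

            $counter1->incrby( $numlines );
            $counter2->incrby( $occurances );

        }, '/path/to/huge_file.txt';

        print "Num lines : ", $counter1->get(), "\n";
        print "Occurances: ", $counter2->get(), "\n";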

        Kind regards, Mario.

        Thank you. Finally some numbers from the same machine.

        It's good to see that perl is in the same ballpark as (and sometimes better than) grep+wc.