in reply to Re: How to optimize a regex on a large file read line by line ?
in thread How to optimize a regex on a large file read line by line ?

I already wanted to remark that since the FS is the bottleneck, I'm not sure parallelizing helps (there's only one FS).

When comparing with grep/wc, please also compare the one-worker case, because grep shouldn't be parallelized (AFAIK).

BTW: While we never saw the bash script, I suppose wc has to be called twice to also get the total number of lines (which makes comparing even more complicated, because the second wc would need to read the file again).

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Replies are listed 'Best First'.
Re^3: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 21, 2016 at 16:53 UTC

    Update: Added serial code. Am happy that IO in MCE is not too far behind. One day, will try another technique. IO aside, any CPU-intensive operation such as a regex does benefit from running with multiple workers.

    Yes, IO will only go as fast as the underlying IO capabilities. MCE does sequential IO, meaning only one worker reads at any given time. The regex operation benefits from having multiple workers. Eventually, IO becomes the bottleneck.

    1 worker : 9.437 secs.
    2 workers: 4.480 secs.
    3 workers: 3.248 secs.
    4 workers: 3.236 secs.
    8 workers: 3.240 secs.
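
    Such a sweep can also be scripted. Below is a rough sketch, not the actual benchmark script: it assumes Time::HiRes for the timing, the same /123456\r?$/ counting workload as the full MCE example further down, and Dictionary2GB.txt as a placeholder file name.

    use strict; use warnings;

    use Time::HiRes qw( time );
    use MCE::Flow;
    use MCE::Shared;

    my $file = 'Dictionary2GB.txt';   # placeholder; substitute the real file

    for my $workers ( 1, 2, 3, 4, 8 ) {
        my $lines   = MCE::Shared->scalar( 0 );
        my $matches = MCE::Shared->scalar( 0 );
        my $start   = time;

        mce_flow_f {
            chunk_size => '24m', max_workers => $workers, use_slurpio => 1,
        }, sub {
            my ( $mce, $chunk_ref, $chunk_id ) = @_;
            $lines->incrby( $$chunk_ref =~ tr/\n// );
            $matches->incrby( scalar( () = $$chunk_ref =~ /123456\r?$/mg ) );
        }, $file;

        # Reap the workers so the next pass spawns a fresh pool.
        MCE::Flow->finish;

        printf "%d worker(s): %.3f secs.\n", $workers, time() - $start;
    }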

    Below, counting and the regex are removed from the equation, running with 1 worker. It completes as fast as IO allows, in 3.256 seconds.

    mce_flow_f {
        chunk_size => '24m', max_workers => 1, use_slurpio => 1,
    }, sub { }, 'Dictionary2GB.txt';

    The following serial code, reader only and without MCE, takes 2.864 seconds to read directly from the PCIe-based SSD drive, not from FS cache.

    use strict; use warnings;

    my $size = 24 * 1024 * 1024;

    open my $fh, '<', 'Dictionary2GB.txt' or die "$!";

    while ( read( $fh, my $b, $size ) ) {
        # Read to the end of the current line so each chunk ends on a newline.
        $b .= <$fh> unless ( eof $fh );
    }

    close $fh;
Re^3: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 21, 2016 at 20:42 UTC

    Update: Am providing updated results due to background processes running previously. I rebooted my laptop and realized that things were running faster. That meant having to re-run all the tests. Included are results for the upcoming MCE 1.706 release with faster IO (applies to use_slurpio => 1). Previously, was unable to run below 3.0 seconds on the Mac with MCE 1.705. The run time is 2.2 seconds with MCE 1.706, which is close to the underlying hardware limit. MCE 1.706 will be released soon.

    I ran the same tests from a Linux VM via Parallels Desktop with the 2 GB plain text file residing on a virtual disk inside Fedora 22. Unlike on OS X, the binary grep command runs much faster under Linux.

    ## FS cache purged inside Linux and on Mac OS X before running.

    wc -l       : 1.732 secs.  from virtual disk
    grep -c     : 1.912 secs.  from virtual disk
    total       : 3.644 secs.

    wc -l       : 1.732 secs.  from virtual disk
    grep -c     : 0.884 secs.  from FS cache
    total       : 2.616 secs.

    Perl script : 3.910 secs.  non-MCE using 1 core

                  MCE 1.705    MCE 1.706
    with MCE    : 4.357 secs.  4.015 secs.  using 1 core
    with MCE    : 3.228 secs.  2.979 secs.  using 2 cores
    with MCE    : 2.884 secs.  2.624 secs.  using 3 cores
    with MCE    : 2.908 secs.  2.501 secs.  using 4 cores

    ## Dictionary2GB.txt residing inside FS cache on Linux.

    wc -l       : 1.035 secs.
    grep -c     : 0.866 secs.
    total       : 1.901 secs.

    Perl script : 2.314 secs.  non-MCE using 1 core

                  MCE 1.705    MCE 1.706
    with MCE    : 2.344 secs.  2.337 secs.  using 1 core
    with MCE    : 1.349 secs.  1.345 secs.  using 2 cores
    with MCE    : 0.961 secs.  0.932 secs.  using 3 cores
    with MCE    : 0.820 secs.  0.775 secs.  using 4 cores

    On Linux, it takes at least 3 workers to run as fast as wc and grep combined, with grep reading from FS cache.

    Below are the serial code and the MCE code, respectively.

    use strict; use warnings;

    my $size = 24 * 1024 * 1024;
    my ( $numlines, $occurances ) = ( 0, 0 );

    open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

    while ( read( $fh, my $b, $size ) ) {
        # Append the remainder of the current line so the chunk ends on a newline.
        $b .= <$fh> unless ( eof $fh );

        $numlines   += $b =~ tr/\n//;
        $occurances += () = $b =~ /123456\r?$/mg;
    }

    close $fh;

    print "Num lines : $numlines\n";
    print "Occurances: $occurances\n";

    Using MCE for running on multiple cores.

    use strict; use warnings;

    use MCE::Flow;
    use MCE::Shared;

    # Shared counters, incremented by the workers via the shared-manager.
    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow_f {
        chunk_size => '24m', max_workers => 4, use_slurpio => 1,
    }, sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;

        my $numlines   = $$chunk_ref =~ tr/\n//;
        my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

        $counter1->incrby( $numlines );
        $counter2->incrby( $occurances );
    }, "/home/mario/Dictionary2GB.txt";

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

    Kind regards, Mario.

      MCE 1.706 has been released along with MCE::Shared 1.005. The MCE 1.706 release enables faster IO when use_slurpio => 1 is specified. Also, the chunk_size option is no longer necessary; performance with the automatic chunk size is close to optimal.

      mce_flow_f {
          max_workers => 4, use_slurpio => 1,
      }, sub {
          ...
      }, '/path/to/huge_file.txt';
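
      As a fuller illustration, the same counting task without chunk_size might look like the rough sketch below; the shared counters and the /123456\r?$/ pattern follow the example above, and the file path is a placeholder.

      use strict; use warnings;

      use MCE::Flow;
      use MCE::Shared;

      my $numlines = MCE::Shared->scalar( 0 );
      my $matches  = MCE::Shared->scalar( 0 );

      # No chunk_size given; MCE 1.706 chooses a suitable size automatically.
      mce_flow_f {
          max_workers => 4, use_slurpio => 1,
      }, sub {
          my ( $mce, $chunk_ref, $chunk_id ) = @_;
          $numlines->incrby( $$chunk_ref =~ tr/\n// );
          $matches->incrby( scalar( () = $$chunk_ref =~ /123456\r?$/mg ) );
      }, '/path/to/huge_file.txt';

      print "Num lines : ", $numlines->get(), "\n";
      print "Matches   : ", $matches->get(), "\n";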

      Kind regards, Mario.

      Thank you. Finally, some numbers from the same machine.

      It's good to see that perl is in the same ballpark as (and sometimes better than) grep+wc.

        I had to run the tests again after booting my Mac and seeing that things were snappier. Perhaps something was running in the background when testing the first time. Maybe the virtual disk containing the 2 GB file was defragmented between then and now. Am not sure, really.

        Timings for the upcoming MCE 1.706 release are included for comparison.