Update: I am providing updated results because background processes were running during the earlier runs. After rebooting my laptop, I noticed things were running faster, which meant having to re-run all the tests. Included are results for the upcoming MCE 1.706 release with faster IO (applies to use_slurpio => 1). Previously, I was unable to get below 3.0 seconds on the Mac with MCE 1.705. The run time is 2.2 seconds with MCE 1.706, which is close to the underlying hardware limit. MCE 1.706 will be released soon.

I ran the same tests from a Linux VM via Parallels Desktop, with the 2 GB plain-text file residing on a virtual disk inside Fedora 22. Unlike on OS X, the grep binary runs much faster under Linux.

## FS cache purged inside Linux and on Mac OS X before running.

   wc -l       : 1.732 secs.   from virtual disk
   grep -c     : 1.912 secs.   from virtual disk
   total       : 3.644 secs.

   wc -l       : 1.732 secs.   from virtual disk
   grep -c     : 0.884 secs.   from FS cache
   total       : 2.616 secs.

   Perl script : 3.910 secs.   non-MCE using 1 core

                 MCE 1.705     MCE 1.706
   with MCE    : 4.357 secs.   4.015 secs.   using 1 core
   with MCE    : 3.228 secs.   2.979 secs.   using 2 cores
   with MCE    : 2.884 secs.   2.624 secs.   using 3 cores
   with MCE    : 2.908 secs.   2.501 secs.   using 4 cores

## Dictionary2GB.txt residing inside FS cache on Linux.

   wc -l       : 1.035 secs.
   grep -c     : 0.866 secs.
   total       : 1.901 secs.

   Perl script : 2.314 secs.   non-MCE using 1 core

                 MCE 1.705     MCE 1.706
   with MCE    : 2.344 secs.   2.337 secs.   using 1 core
   with MCE    : 1.349 secs.   1.345 secs.   using 2 cores
   with MCE    : 0.961 secs.   0.932 secs.   using 3 cores
   with MCE    : 0.820 secs.   0.775 secs.   using 4 cores

On Linux, it takes at least 3 workers to run as fast as wc and grep combined, with grep reading from the FS cache.
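For reference, the wc and grep baseline timings above correspond to commands along these lines. This is a sketch, not the exact commands used: the search pattern and file path are assumptions taken from the Perl scripts in this post, and FILE is a hypothetical override for trying it on a smaller file.

```shell
# Assumed baseline commands; FILE may be overridden for a quick trial.
FILE=${FILE:-/home/mario/Dictionary2GB.txt}

wc -l < "$FILE"              # count lines
grep -c '123456$' "$FILE"    # count lines ending in 123456
```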

Below are the serial code and the MCE code, respectively.

use strict;
use warnings;

my $size = 24 * 1024 * 1024;
my ( $numlines, $occurrences ) = ( 0, 0 );

open my $fh, '<', '/home/mario/Dictionary2GB.txt' or die "$!";

while ( read( $fh, my $b, $size ) ) {
    # read up to the next newline so a line is never split across chunks
    $b .= <$fh> unless ( eof $fh );
    $numlines    += $b =~ tr/\n//;
    $occurrences += () = $b =~ /123456\r?$/mg;
}

close $fh;

print "Num lines  : $numlines\n";
print "Occurrences: $occurrences\n";

Using MCE for running on multiple cores.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );   # line count
my $counter2 = MCE::Shared->scalar( 0 );   # match count

mce_flow_f {
    chunk_size  => '24m',
    max_workers => 4,
    use_slurpio => 1,
}, sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines    = $$chunk_ref =~ tr/\n//;
    my $occurrences = () = $$chunk_ref =~ /123456\r?$/mg;
    $counter1->incrby( $numlines );
    $counter2->incrby( $occurrences );
}, "/home/mario/Dictionary2GB.txt";

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";

Kind regards, Mario.


In reply to Re^3: How to optimize a regex on a large file read line by line ? by marioroy
in thread How to optimize a regex on a large file read line by line ? by John FENDER
