in reply to How to optimize a regex on a large file read line by line ?

I'd appreciate it if someone would take this "big buffer" approach and adapt it to the test case and get timings for it. I'm stuck on this small tablet, so I can't test it myself.

http://ideone.com/LzaQI0

I don't even know how to paste it into this post, sorry
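
The gist of the approach, as a sketch only (not the exact code at the link; the chunk size is arbitrary, and the regex and file name are the ones from the test case): read a large chunk, extend it to the end of the current line, then count newlines and matches against the whole buffer at once.

  use strict; use warnings;

  my $file = 'Dictionary2GB.txt';
  open my $fh, '<', $file or die "open '$file': $!";

  my ( $numlines, $occurrences ) = ( 0, 0 );

  while ( read( $fh, my $buf, 24 * 1024 * 1024 ) ) {
      $buf .= <$fh> // '';                     # finish partial line
      $numlines += $buf =~ tr/\n//;            # newlines in this chunk
      my $n = () = $buf =~ /123456\r?$/mg;     # matches in this chunk
      $occurrences += $n;
  }

  print "Num lines  : $numlines\n";
  print "Occurrences: $occurrences\n";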


Re^2: How to optimize a regex on a large file read line by line ?
by marioroy (Prior) on Apr 18, 2016 at 05:21 UTC

    Update: Changed the chunk_size option from '1m' to '24m'. The time drops to 3.2 seconds via MCE with the FS cache purged ( sudo purge ) before running on a MacBook Pro laptop. Previously, this took 6.2 seconds with chunk_size => '1m'. The time is ~ 1 second if the file resides in the FS cache.

    Update: Added the 'm' modifier to the regex operation.

    Update: With the file not residing in the FS cache, the time is 7.8 seconds running serially and 6.2 seconds running on many cores for the ~ 2 GB plain text file. Once the file is in the FS cache, the time is 5.4 seconds serially and 0.9 seconds via MCE.

    Update: Unzipping the file meant that the file resided in the FS cache afterwards. One doesn't normally flush FS memory, but I meant to do so before running. I have already removed the zip and plain text files and did not run again. IO is fast when processing a file directly; the reason is that workers do not involve the manager process when reading.

    Anonymous Monk, the following is a parallel demonstration of the code at the ideone link. Yes, reading line by line is not necessary; that is what gives the 5x gain over the serial version. It is also several times faster than the previous parallel demonstrations.

    The parallel example below parses the ~ 2 GB plain text file in 0.9 seconds. The serial demonstration at the ideone link completes in 5.2 seconds. My laptop has 4 real cores and 4 hyper-threads. Seeing nearly 6x is really good; I did not expect that.

    use strict; use warnings;
    use MCE::Flow;
    use MCE::Shared;

    # shared counters, updated by all workers
    my $counter1 = MCE::Shared->scalar( 0 );
    my $counter2 = MCE::Shared->scalar( 0 );

    mce_flow_f {
        chunk_size  => '24m',
        max_workers => 8,
        use_slurpio => 1,
    },
    sub {
        my ( $mce, $chunk_ref, $chunk_id ) = @_;

        # count lines and matches in this chunk
        my $numlines   = $$chunk_ref =~ tr/\n//;
        my $occurances = () = $$chunk_ref =~ /123456\r?$/mg;

        $counter1->incrby( $numlines );
        $counter2->incrby( $occurances );

    }, "Dictionary2GB.txt";

    print "Num lines : ", $counter1->get(), "\n";
    print "Occurances: ", $counter2->get(), "\n";

      How do you handle a chunk that ends in the middle of the pattern? I did it by completing the partial line (see code line with comment "finish partial line").

        Yes, that is likely to happen when slurping a chunk. MCE handles that automatically by reading till the end of line.

      Thanks for the timings. If possible, would you please also get a time for the grep+wc on your machine, so we can tell how both of these solutions compare to it?

Re^2: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 18, 2016 at 01:59 UTC
    The code is incomplete because a match could span two chunks.

    You need to seek back by the length of the longest possible match (here 8) before reading the next chunk.

    Actually the correct number is something like min( p, m )

    with p = chunksize - pos

    and m = the length of the longest possible match
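
    A minimal sketch of that idea (untested; the chunk size and file name are just for illustration): keep an overlap of 8 bytes at the end of each chunk, only count matches that start before the overlap, and seek back so the overlap is re-read with full context in the next chunk.

      use strict; use warnings;
      use Fcntl qw(SEEK_CUR);

      my $maxlen = 8;                # longest possible match: "123456\r" plus the "\n"
      my $count  = 0;

      open my $fh, '<:raw', 'Dictionary2GB.txt' or die $!;

      while ( read( $fh, my $buf, 24 * 1024 * 1024 ) ) {
          my $last   = eof( $fh );
          my $cutoff = $last ? length( $buf ) : length( $buf ) - $maxlen;

          while ( $buf =~ /123456\r?$/mg ) {
              # a match starting inside the overlap is skipped here;
              # the next chunk sees it again with the full context after it
              $count++ if $-[0] < $cutoff;
          }
          seek( $fh, -$maxlen, SEEK_CUR ) unless $last;
      }

      print "Occurrences: $count\n";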

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      The match is always within a single line; that's the purpose of the line

      $_ .= <$fh> // '';

      It completes a partial line.

        Ahh! You are combining read with readline ...

         $_ .= <$fh> // ''; # finish partial line

        That's a good trick!

        (As long as a line doesn't become bigger than memory, but that's hardly the case here.)
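
        A tiny self-contained demo of the trick (made-up data in an in-memory filehandle): the fixed-size read can stop mid-line, and the appended readline always brings the buffer back to a line boundary.

         use strict; use warnings;

         my $data = "alpha\nbravo 123456\ncharlie\n";
         open my $fh, '<', \$data or die $!;     # in-memory filehandle

         while ( read( $fh, my $buf, 10 ) ) {    # deliberately tiny chunks
             $buf .= <$fh> // '';                # finish partial line
             print "chunk: ", join( ' | ', split /\n/, $buf ), "\n";
         }

        Each printed chunk ends on a complete line; if read() happens to stop exactly at a newline, the readline simply pulls in one extra whole line.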

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!