in reply to Re^2: How to optimize a regex on a large file read line by line ?
in thread How to optimize a regex on a large file read line by line ?

How do you grep line by line?

I suppose grep does the same as I suggested before: reading large chunks into memory and matching multiple lines at once.

Another option is to fork four children, each processing a quarter of the file, to use the full power of your machine.
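A rough sketch of that idea (file name and pattern are hypothetical; assumes a Unix-ish fork, and real code would have to handle quarters that start or end mid-line):

```perl
use strict;
use warnings;

# Split the file into four byte ranges and let each child scan its own quarter.
my $file    = 'myfile';
my $size    = -s $file;
my $quarter = int( $size / 4 );

for my $i ( 0 .. 3 ) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    next if $pid;    # parent: keep spawning children

    open my $fh, '<', $file or die "open: $!";
    seek $fh, $i * $quarter, 0;
    my $want = $i == 3 ? $size - 3 * $quarter : $quarter;
    read $fh, my $chunk, $want;

    # /m makes $ match before every newline inside the chunk
    my $count = () = $chunk =~ /12345$/mg;
    print "child $i: $count matches\n";
    exit 0;
}
wait() for 0 .. 3;    # reap all four children
```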

And btw using lexical variables declared with my should help a little too.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :)
Je suis Charlie!

Re^4: How to optimize a regex on a large file read line by line ?
by John FENDER (Acolyte) on Apr 16, 2016 at 16:16 UTC

    It's a good query.

    Currently I can't answer you, as I'd need to look into the grep source for that. I'm just grepping the file like this:

    grep "12345$" myfile

    Same for the count:

    wc -l myfile

    I have other Perl code that may do what you are suggesting: it first loads all the lines into memory, then greps them. But unfortunately, the result is worse than the line-by-line attempt (2.47s vs 8.33s). Here is the code used for this test (on a reduced set, 200 MB):

    open (FH, '<', "../Tests/10-million-combos.txt");
    print "Loading the file...\n";
    while (<FH>) { push (@_file_to_parse, $_); }
    print "Counting the file...\n";
    $NumberOfLine = @_file_to_parse;
    print "Searching 123456\$...\n";    # \$ so Perl doesn't interpolate $.
    @_result = grep { /123456$/ } @_file_to_parse;
    $NumberOfResult = @_result;
    print "$NumberOfResult - $NumberOfLine\n";
    close FH;
       while (<FH>) { push (@_file_to_parse, $_); }

      This can't work faster; as explained in the linked post, you have to read a chunk at once.

      And please look into use Benchmark and/or use Time::HiRes to learn how to measure performance. (I'm on mobile, so no code, sorry.)
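      For what it's worth, a minimal Time::HiRes sketch along those lines (file name and pattern are placeholders):

```perl
use strict;
use warnings;
use Time::HiRes qw(gettimeofday tv_interval);

my $file = 'myfile';    # hypothetical test file

# line by line
my $t0 = [gettimeofday];
open my $fh, '<', $file or die "open: $!";
my $lines = 0;
while (<$fh>) { $lines++ if /123456$/ }
printf "line by line: %d matches in %.3fs\n", $lines, tv_interval($t0);

# one big chunk, multi-line match (/m makes $ match at each newline)
$t0 = [gettimeofday];
open $fh, '<', $file or die "open: $!";
read $fh, my $chunk, -s $file;
my $chunked = () = $chunk =~ /123456$/mg;
printf "one chunk:    %d matches in %.3fs\n", $chunked, tv_interval($t0);
```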

      Btw, I have my doubts about which grep and wc you might be using on Windows.

      Update:

      Ah, indeed, GNU grep uses regexes by default for patterns, not fixed strings...
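      That difference is easy to check on a tiny sample file (file name hypothetical):

```shell
# build a tiny sample file
printf 'pass12345\n12345plus\nabc\n' > sample.txt

# by default $ is a regex anchor: only lines ENDING in 12345 match
grep -c '12345$' sample.txt     # prints 1 (only "pass12345")

# with -F the pattern is a fixed string, so "12345$" is taken literally
grep -cF '12345$' sample.txt    # prints 0
```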

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      Some rough timings to demonstrate how reading line by line hurts performance:

      ###### Timing grep (not fair bc exec takes time too)
      DB<163> $start=time; print `grep 123456\$ txt`; print time-$start
      123456
      ...               # shortened
      123456
      0.207661151885986

      ###### Reading and parsing a chunk from Perl not much slower
      DB<164> $start=time; open FH,"<txt"; read FH,$txt,100e6; print $txt =~ /(123456\n)/g; print time-$start
      123456
      ...
      123456
      0.257488012313843

      ###### Even reading a chunk takes already half the time
      DB<165> $start=time; print $txt =~ /(123456\n)/g; print time-$start
      123456
      ...
      123456
      0.116161108016968

      DB<166> $start=time; open FH,"<txt"; read FH,$txt,100e6; print time-$start
      0.124891042709351

      ####### Size of txt is 70 MB
      DB<167> length $txt
      => 70000080

      ###### READING LINE BY LINE IS A BOTTLENECK
      DB<168> $start=time; open FH,"<txt"; while ($txt=<FH>){ print $1 if $txt =~ /(123456\n)/g;} print time-$start
      123456
      ...
      123456
      16.3332719802856

      all done on a netbook with ubuntu.

      questions left?

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

      First of all, make sure the results agree: grep might not be taking $ as a regex meta symbol!

      Then, I seem to remember problems with end-of-line localization and/or Unicode.

      Please check the encoding for \n in your file and your Perl.
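      One way to check (file name hypothetical): read the first line in raw mode and look at its last bytes.

```perl
use strict;
use warnings;

# :raw disables the CRLF layer so we see the bytes as they are on disk
open my $fh, '<:raw', 'myfile' or die "open: $!";
my $line = <$fh>;
printf "last two bytes: %vd\n", substr( $line, -2 );
# 13.10 means CRLF (Windows line endings); a lone trailing 10 is plain LF
```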

      HTH! :)

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        All languages give the same count using the same regex (I speak for my own test). The EOLs are the usual Windows ones (13+10). grep takes the $ as it should, as the results are the same as in the other attempts. :)
Re^4: How to optimize a regex on a large file read line by line ?
by John FENDER (Acolyte) on Apr 16, 2016 at 18:25 UTC
    Interesting. Thanks !