in reply to Re: How to optimize a regex on a large file read line by line ?

Update: Shortened the code.

Hello again,

Slurping requires two regular expressions: one to break the chunk into actual lines and another for the query. Below, workers instead receive an array reference containing some number of lines. This runs slightly faster, possibly because only one regex is needed.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

# A piped open needs a command that writes the data to stdout.
# "unzip -p" is an assumption; the original line named only the zip file.
open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
    chunk_size => '1m', max_workers => 8,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines    = @{ $chunk_ref };   # each chunk is an array of lines
    my $occurrences = 0;

    for ( @{ $chunk_ref } ) {
        $occurrences++ if /123456\r/;
    }

    # Update the shared counters once per chunk, not once per line.
    $counter1->incrby( $numlines );
    $counter2->incrby( $occurrences );

}, $fh;

close $fh;

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";
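To illustrate the point about slurping versus an array of lines, here is a minimal core-Perl sketch without MCE. The sample chunk and the two helper names (count_slurped, count_lines) are made up for illustration, and counting newlines is only one of several ways to recover the line count from a slurped string.

```perl
use strict;
use warnings;

# Hypothetical chunk of CRLF-terminated lines, standing in for what a worker receives.
my $chunk = "abc123456\r\nxyz\r\n000123456\r\n123456789\r\n";

# Slurped form: the chunk is one string, so it is scanned twice.
sub count_slurped {
    my ($text) = @_;
    my $numlines    = () = $text =~ /\n/g;          # pass 1: count line breaks
    my $occurrences = () = $text =~ /123456\r/g;    # pass 2: the actual query
    return ( $numlines, $occurrences );
}

# Array form: the lines are already split, so only the query regex runs.
sub count_lines {
    my ($lines) = @_;
    my $numlines    = @{$lines};
    my $occurrences = grep { /123456\r/ } @{$lines};
    return ( $numlines, $occurrences );
}

my @lines = split /^/, $chunk;    # split into lines, keeping the terminators
printf "slurped: %d lines, %d matches\n", count_slurped($chunk);
printf "array  : %d lines, %d matches\n", count_lines( \@lines );
```

Both forms report 4 lines and 2 matches for this sample. The only difference is how many regex passes run over the data.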

And finally, the construction for reading the plain text file directly.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
    chunk_size => '1m', max_workers => 8,
},
sub {
    my ( $mce, $chunk_ref, $chunk_id ) = @_;
    my $numlines    = @{ $chunk_ref };   # each chunk is an array of lines
    my $occurrences = 0;

    for ( @{ $chunk_ref } ) {
        $occurrences++ if /123456\r/;
    }

    # Update the shared counters once per chunk, not once per line.
    $counter1->incrby( $numlines );
    $counter2->incrby( $occurrences );

}, "10-million-combos.txt";

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";

Re^3: How to optimize a regex on a large file read line by line ?
by John FENDER (Acolyte) on Apr 17, 2016 at 22:17 UTC
    Hello marioroy,

    It's impressive! I have tested your code successfully. Eventually, the Strawberry distribution works best on my system.

    "It even works well with the mixed file (multiple kinds of end-of-line markers)! Benchmarking the three methods, I found 32,53 and 33,07 for the two versions you kindly provided. My current code (which works only on a CR+LF or LF file) took 33,76."

    It's impressive to see all 8 CPU cores at 100% at the same time with your demo! But the results differ only slightly from those of my code, which doesn't load the cores like that at all. Strange!

    Very happy anyway; I'm now close to the best performance I can get on my laptop with Perl!

    Grep        : 10,71
    Java        : 25,95
    C#          : 30,05
    Perl        : 32,53
    C++         : 41,3
    PHP         : 52,31
    Free Pascal : 76,46
    Delphi 7    : 78,14
    VB.NET      : 100,15
    Python      : 315,13
    PowerShell  : 681,93
    VBS         : 1031,63
    Ruby        : Failed to parse the file correctly.
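On the mixed line endings mentioned in this post: one way to make a line-by-line query tolerant of both CR+LF and LF is to make the carriage return optional in the pattern. This is only a sketch with made-up sample lines and a hypothetical helper name (ends_with_target). It is not the code that was benchmarked, and it does not cover CR-only line endings.

```perl
use strict;
use warnings;

# Hypothetical sample lines: a match and a non-match for each line ending.
my @lines = ( "aaa123456\r\n", "bbb123456\n", "123456ccc\r\n", "ddd\n" );

# \r? tolerates an optional carriage return; $ matches before the final \n.
sub ends_with_target {
    my ($line) = @_;
    return $line =~ /123456\r?$/ ? 1 : 0;
}

my $occurrences = grep { ends_with_target($_) } @lines;
print "Occurrences: $occurrences\n";
```

With these four sample lines the count is 2: the CR+LF line and the LF line that end in 123456.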
        Hi! You have all my code and my numbers; I can't give more! On my machine, grep+wc is the fastest. And the line-by-line solution is also faster than putting all the lines in memory and then searching. Cheers.
      Look, I'd appreciate it if you'd stop posting benchmarks without showing the code.

      Our grep vs. Perl discussion alone showed that you are comparing apples with oranges.

      The only thing you are possibly benchmarking are your coding skills.

      Cheers Rolf
      (addicted to the Perl Programming Language and ☆☆☆☆ :)
      Je suis Charlie!

        That was neither the place for it nor the point of my post, but why not. I'm not sure it will interest anyone here, but here it is:

        C++
        int counter=0;
        int counter2=0;
        regex MyRegex("123456$");
        string line;   // was declared as "ligne" but used as "line"
        ifstream file("10-million-combos.txt");
        if (file.is_open())
        {
            while (getline(file,line))
            {
                ++counter;
                if (regex_search(line,MyRegex))
                {
                    ++counter2;
                }
            }
        }
        Pascal/Delphi
        RegexObj := TRegExpr.Create;
        RegexObj.Expression := '123456$';
        reset(tfIn);
        while not eof(tfIn) do
        begin
          readln(tfIn, s);
          if RegexObj.Exec(s) then counter2:=counter2+1;
          counter:=counter+1;
        end;
        PS
        $f = [System.IO.File]::OpenText("10-million-combos.txt")
        while (! $f.EndOfStream) {
            $line = $f.ReadLine();
            if ($line -match "123456$" ) { $counter +=1 }
            $counter2+=1
        }
        Py
        with open("10-million-combos.txt", encoding="cp850") as infile:
            for line in infile:
                counter2 += 1
                if re.search('123456$', line):
                    counter += 1
        Rb (does not give the right numbers)
        open("10-million-combos_LF.txt") do |content|
          content.each_line do |line|
            counter=counter+1
            if line.match(/123456$/)
              counter2 += 1
            end
          end
        end
        VB.net
        Dim mStreamReader As StreamReader = New StreamReader("10-million-combos.txt")
        line = mStreamReader.ReadLine()
        Do While (line IsNot Nothing)
            counter += 1
            If Regex.IsMatch(line, "123456$") Then counter2 += 1
            line = mStreamReader.ReadLine()
        Loop
        VBS
        Set objTextFile = objFSO.OpenTextFile("10-million-combos.txt", ForReading)
        Set objRegEx = CreateObject("VBScript.RegExp")
        objRegEx.Pattern = "123456$"
        Count=0
        Count2=0
        Do Until objTextFile.AtEndOfStream
            Count2=Count2+1
            strNextLine = objTextFile.ReadLine
            Set colMatches = objRegEx.Execute(strNextLine)
            Count=Count+colMatches.Count
        Loop
        C#
        Regex rgx = new Regex("123456$");
        int counter = 0;
        int counter2 = 0;
        using (StreamReader sr = new StreamReader(@"10-million-combos.txt"))
        {
            String line;
            while ((line = sr.ReadLine()) != null)
            {
                ++counter;
                if (rgx.IsMatch(line))
                {
                    ++counter2;
                }
            }
        }
        PHP
        $counter=0;
        $counter2=0;
        $handle = fopen("10-million-combos.txt", "r");
        while (($line = fgets($handle)) !== false)
        {
            ++$counter;
            if (preg_match('/123456\R$/',$line))
            {
                ++$counter2;
            }
        }
        Java
        String line;
        Pattern p = Pattern.compile("123456$");
        String fichier ="10-million-combos.txt";
        InputStream ips=new FileInputStream(fichier);   // was "file", which is not declared
        InputStreamReader ipsr=new InputStreamReader(ips);
        BufferedReader br=new BufferedReader(ipsr);
        while ((line=br.readLine())!=null){
            m = p.matcher(line);
            if (m.find()) {count2+=1;}
            count+=1;
        }
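For comparison with the snippets above, a sequential Perl version of the same read-a-line, test-a-regex loop might look like the sketch below. The Perl code that was actually benchmarked is not shown in this post, so this is an assumption about its general shape. The helper name (count_matches) and the three-line sample file are hypothetical stand-ins for 10-million-combos.txt.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Count all lines, and the lines ending in 123456, one line at a time.
sub count_matches {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    binmode $fh;    # keep \r as-is on every platform
    my ( $counter, $counter2 ) = ( 0, 0 );
    while ( my $line = <$fh> ) {
        ++$counter;
        ++$counter2 if $line =~ /123456\r?$/;
    }
    close $fh;
    return ( $counter, $counter2 );
}

# Small hypothetical sample standing in for the real input file.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
binmode $tmp;
print {$tmp} "user1:123456\r\n", "user2:abcdef\r\n", "user3:0123456\n";
close $tmp;

my ( $lines, $hits ) = count_matches($path);
print "Num lines  : $lines\n";
print "Occurrences: $hits\n";
```

For the sample file this reports 3 lines and 2 occurrences.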