in reply to Re^2: How to optimize a regex on a large file read line by line ?
in thread How to optimize a regex on a large file read line by line ?

Hello marioroy,

It's impressive! I've tested your code successfully. In the end, the Strawberry distribution works best on my system.

"It even works well with the mixed file (several kinds of line endings)! Benchmarking all three methods, I found 32.53 and 33.07 for the two codes you kindly provided. My current code (which works only on a CR+LF or LF file) took 33.76."

It's impressive to see all 8 CPU cores at 100% at the same time with your demo! But the results differ only slightly from those of my code, which doesn't load all the cores like that. Strange!
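For the mixed file mentioned above, here is a minimal Python sketch of one way to count lines ending in 123456 whatever the terminator is. It is not the code benchmarked in this thread; the function names and the choice to treat CR+LF, LF and a lone CR as terminators are my own assumptions.

```python
import re

# Assumption: a line ends at CR+LF, LF, a lone CR, or the end of the data,
# and the digits must sit immediately before that terminator.
PATTERN = re.compile(rb"123456(?=\r\n|\r|\n|\Z)")


def count_matches(data: bytes) -> int:
    """Count lines ending in 123456, whatever the line terminator."""
    return len(PATTERN.findall(data))


def count_lines(data: bytes) -> int:
    """Count lines; bytes.splitlines() splits on CR+LF, LF and lone CR."""
    return len(data.splitlines())
```

Working on bytes avoids any decoding step, so an unexpected code page in the input should not change the counts.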

Very happy anyway; I'm now close to the best performance I could get on my laptop with Perl!

. Grep : 10.71
. Java : 25.95
. C# : 30.05
. Perl : 32.53
. C++ : 41.3
. PHP : 52.31
. Free Pascal : 76.46
. Delphi 7 : 78.14
. VB.NET : 100.15
. Python : 315.13
. PowerShell : 681.93
. VBS : 1031.63
. Ruby : Failed to parse the file correctly.

Re^4: How to optimize a regex on a large file read line by line ?
by Anonymous Monk on Apr 18, 2016 at 16:37 UTC
      Hi! You have all my code and my numbers; I can't give more! On my machine, grep+wc is the fastest. The line-by-line solution is also faster than putting all the lines in memory and then searching. Cheers.

        marioroy has found that the bigbuffer method (found in http://ideone.com/LzaQI0) runs in 7.8 seconds on his machine, versus the 10.71 seconds you get for grep+wc. What's not known is the relative speed of the two machines. Unfortunately, my machines are in storage while I'm moving, and I'm stuck on this tiny tablet, so I can't run any benchmarks myself. So I ask again: please, can you adapt and run the above code on your machine so we can get accurate results for both the bigbuffer technique and grep+wc?
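        To make the idea concrete, here is a rough Python sketch of a big-buffer scan. It is not the code behind the ideone link; the function name, the 16 MiB default chunk size and the pattern are assumptions chosen for illustration. Each read is extended to the next line feed so that no line is split across two buffers, and it handles LF and CR+LF terminators only, not a lone CR.

```python
import re

# With re.MULTILINE, "$" matches just before each LF and at the end of the buffer;
# the optional \r lets CR+LF lines match too.
PATTERN = re.compile(rb"123456\r?$", re.MULTILINE)


def count_big_buffer(path, chunk_size=1 << 24):
    """Count lines ending in 123456 by scanning the file in large chunks."""
    matches = 0
    with open(path, "rb") as fh:
        while True:
            chunk = fh.read(chunk_size)
            if not chunk:
                break
            # Complete the line that the chunk boundary may have cut in two.
            chunk += fh.readline()
            matches += len(PATTERN.findall(chunk))
    return matches
```

        Calling count_big_buffer("10-million-combos.txt") would give a number to compare against grep+wc on the same machine; the chunk size is worth varying, since the best value depends on the hardware.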

Re^4: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 17, 2016 at 22:48 UTC
    Look, I'd appreciate it if you'd stop posting benchmarks without showing the code.

    Our grep vs. Perl discussion alone showed that you are comparing apples with oranges.

    The only thing you are possibly benchmarking is your coding skills.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      This was neither the place nor the purpose of my post, but why not. I'm not sure it will interest anyone here, but here it is:

      C++
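      // Fragment: needs <fstream>, <regex>, <string> and using namespace std.
      int counter = 0;   // lines read
      int counter2 = 0;  // lines matching the regex
      regex MyRegex("123456$");
      string line;
      ifstream file("10-million-combos.txt");
      if (file.is_open()) {
          while (getline(file, line)) {
              ++counter;
              if (regex_search(line, MyRegex)) {
                  ++counter2;
              }
          }
      }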
      Pascal/Delphi
      RegexObj := TRegExpr.Create;
      RegexObj.Expression := '123456$';
      reset(tfIn);
      while not eof(tfIn) do
      begin
        readln(tfIn, s);
        if RegexObj.Exec(s) then
          counter2 := counter2 + 1;
        counter := counter + 1;
      end;
      PS
      $f = [System.IO.File]::OpenText("10-million-combos.txt")
      while (! $f.EndOfStream) {
          $line = $f.ReadLine()
          if ($line -match "123456$") { $counter += 1 }   # matching lines
          $counter2 += 1                                   # lines read
      }
      Py
      import re

      counter = 0   # matching lines
      counter2 = 0  # lines read
      with open("10-million-combos.txt", encoding="cp850") as infile:
          for line in infile:
              counter2 += 1
              if re.search('123456$', line):
                  counter += 1
      Rb (does not give the right numbers)
      open("10-million-combos_LF.txt") do |content|
        content.each_line do |line|
          counter = counter + 1
          if line.match(/123456$/)
            counter2 += 1
          end
        end
      end
      VB.net
      Dim mStreamReader As StreamReader = New StreamReader("10-million-combos.txt")
      line = mStreamReader.ReadLine()
      Do While (line IsNot Nothing)
          counter += 1
          If Regex.IsMatch(line, "123456$") Then counter2 += 1
          line = mStreamReader.ReadLine()
      Loop
      VBS
      Set objTextFile = objFSO.OpenTextFile("10-million-combos.txt", ForReading)
      Set objRegEx = CreateObject("VBScript.RegExp")
      objRegEx.Pattern = "123456$"
      Count = 0
      Count2 = 0
      Do Until objTextFile.AtEndOfStream
          Count2 = Count2 + 1
          strNextLine = objTextFile.ReadLine
          Set colMatches = objRegEx.Execute(strNextLine)
          Count = Count + colMatches.Count
      Loop
      C#
      Regex rgx = new Regex("123456$");
      int counter = 0;
      int counter2 = 0;
      using (StreamReader sr = new StreamReader(@"10-million-combos.txt"))
      {
          String line;
          while ((line = sr.ReadLine()) != null)
          {
              ++counter;
              if (rgx.IsMatch(line))
              {
                  ++counter2;
              }
          }
      }
      PHP
      $counter = 0;
      $counter2 = 0;
      $handle = fopen("10-million-combos.txt", "r");
      while (($line = fgets($handle)) !== false) {
          ++$counter;
          if (preg_match('/123456\R$/', $line)) {
              ++$counter2;
          }
      }
      Java
      String line;
      Pattern p = Pattern.compile("123456$");
      String fichier = "10-million-combos.txt";
      InputStream ips = new FileInputStream(fichier);
      InputStreamReader ipsr = new InputStreamReader(ips);
      BufferedReader br = new BufferedReader(ipsr);
      while ((line = br.readLine()) != null) {
          Matcher m = p.matcher(line);
          if (m.find()) { count2 += 1; }
          count += 1;
      }