Re^2: How to optimize a regex on a large file read line by line ?

Update: Shortened the code.

Hello again,

Slurping requires two regular expressions: one to break the chunk into actual lines and another for the query. Below, workers instead receive an array reference containing some number of lines. This runs slightly faster, possibly because only one regex is needed.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

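# Assumption: the archive is streamed through the external "unzip -p" command,
# which must be on the PATH; the command string as posted named only the zip file.
open my $fh, "unzip -p 10-million-combos.zip |" or die "$!";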

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow {
   chunk_size => '1m', max_workers => 8,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my $numlines   = @{ $chunk_ref };
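   my $occurrences = 0;

   for ( @{ $chunk_ref } ) {
      # The pattern requires a CR right after the digits (CRLF line endings).
      $occurrences++ if /123456\r/;
   }

   # Update the shared counters once per chunk rather than once per line.
   $counter1->incrby( $numlines    );
   $counter2->incrby( $occurrences );

}, $fh;

close $fh;

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";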
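To make the "two regexes versus one" point concrete, here is a minimal sketch in core Perl, without MCE. The sample data and the helper names `count_slurped` and `count_lines` are hypothetical, and `split` merely stands in for whatever hands the worker its lines; this is an illustration of the idea, not the code that was benchmarked.

```perl
use strict;
use warnings;

# Hypothetical sample data standing in for one chunk of the file (CRLF endings).
my $chunk = join '', map { "$_\r\n" } qw( abc123456 123456abc xyz123456 nothing );

# Slurp style: the worker holds one big string, so it runs two regexes.
sub count_slurped {
    my ($text_ref) = @_;
    my $numlines    = () = $$text_ref =~ /\n/g;        # regex 1: find the line breaks
    my $occurrences = () = $$text_ref =~ /123456\r/g;  # regex 2: the query
    return ( $numlines, $occurrences );
}

# Array style: the lines arrive already separated, so only the query regex runs.
sub count_lines {
    my ($lines_ref) = @_;
    my $numlines    = @{ $lines_ref };
    my $occurrences = 0;
    for ( @{ $lines_ref } ) {
        $occurrences++ if /123456\r/;
    }
    return ( $numlines, $occurrences );
}

my @slurped = count_slurped( \$chunk );
my @lines   = split /^/, $chunk;   # stand-in for the line array a worker receives
my @by_line = count_lines( \@lines );

print "slurped: @slurped\n";
print "by line: @by_line\n";
```

Both calls report 4 lines and 2 matches on this sample; the difference is only in how much regex work the worker itself has to do.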

And finally, the same construction for reading the plain text file directly, using mce_flow_f.

use strict;
use warnings;

use MCE::Flow;
use MCE::Shared;

my $counter1 = MCE::Shared->scalar( 0 );
my $counter2 = MCE::Shared->scalar( 0 );

mce_flow_f {
   chunk_size => '1m', max_workers => 8,
},
sub {
   my ( $mce, $chunk_ref, $chunk_id ) = @_;
   my $numlines   = @{ $chunk_ref };
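   my $occurrences = 0;

   for ( @{ $chunk_ref } ) {
      # The pattern requires a CR right after the digits (CRLF line endings).
      $occurrences++ if /123456\r/;
   }

   # Update the shared counters once per chunk rather than once per line.
   $counter1->incrby( $numlines    );
   $counter2->incrby( $occurrences );

}, "10-million-combos.txt";

print "Num lines  : ", $counter1->get(), "\n";
print "Occurrences: ", $counter2->get(), "\n";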
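The pattern `/123456\r/` used above only matches when a carriage return follows the digits. As a hedged illustration of how that compares with an end-of-line anchor, the sketch below checks three patterns against one CRLF line and one LF line. The sample lines, the pattern labels, and the `matches` helper are hypothetical, and the tolerant `\r?$` variant is an alternative for comparison, not the pattern that was benchmarked.

```perl
use strict;
use warnings;

# Hypothetical sample lines, one per line-ending style.
my %sample = (
    crlf => "abc123456\r\n",
    lf   => "abc123456\n",
);

# Three candidate patterns: the one used above, a plain anchor, a tolerant variant.
my @patterns = (
    [ cr_only     => qr/123456\r/   ],
    [ dollar      => qr/123456$/    ],
    [ optional_cr => qr/123456\r?$/ ],
);

# Return 1 if the line matches the compiled pattern, 0 otherwise.
sub matches {
    my ( $line, $re ) = @_;
    return $line =~ $re ? 1 : 0;
}

for my $p (@patterns) {
    my ( $name, $re ) = @{ $p };
    for my $style ( sort keys %sample ) {
        printf "%-12s %-5s %s\n", $name, $style,
            matches( $sample{$style}, $re ) ? 'match' : 'no match';
    }
}
```

In Perl, `$` matches at the end of the string or just before a final newline, so the plain anchor misses CRLF lines, the CR-only pattern misses LF lines, and `\r?$` accepts both.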

Re^3: How to optimize a regex on a large file read line by line ? by John FENDER (Acolyte) on Apr 17, 2016 at 22:17 UTC
Hello marioroy,

It's impressive! I have tested your code successfully. In the end, the Strawberry distribution works best on my system. It even works well with the mixed file (multiple kinds of line endings)! Benchmarking all three methods, I found 32,53 and 33,07 for the two programs you kindly provided. My current code (which works only on a CR+LF or LF file) took 33,76.

It's impressive to see all 8 CPU cores go up to 100% at the same time with your demo! But the results differ only slightly from those of my code, which doesn't load all the cores like that. Strange! I'm very happy anyway: I'm now close to the best performance I can get on my laptop with Perl!

Grep        : 10,71
Java        : 25,95
C#          : 30,05
Perl        : 32,53
C++         : 41,3
PHP         : 52,31
Free Pascal : 76,46
Delphi 7    : 78,14
VB.NET      : 100,15
Python      : 315,13
PowerShell  : 681,93
VBS         : 1031,63
Ruby        : Failed to parse the file correctly.
Re^4: How to optimize a regex on a large file read line by line ? by Anonymous Monk on Apr 18, 2016 at 16:37 UTC
Hi John FENDER. It seems like you might have given up too early, before the cavalry arrived. See Re^2: How to optimize a regex on a large file read line by line ?. Looks like both are even faster than the grep+wc solution, though this would need to be confirmed on your machine.
Re^5: How to optimize a regex on a large file read line by line ? by John FENDER (Acolyte) on Apr 19, 2016 at 23:13 UTC
Hi! You have all my code and my numbers; I can't give more! On my machine, grep+wc is the fastest. The line-by-line solution is also faster than putting all the lines in memory and then searching. Cheers.
Re^6: How to optimize a regex on a large file read line by line ? by Anonymous Monk on Apr 20, 2016 at 20:48 UTC
Re^4: How to optimize a regex on a large file read line by line ? by LanX (Saint) on Apr 17, 2016 at 22:48 UTC
Look, I'd appreciate it if you'd stop posting benchmarks without showing the code. Our grep vs perl discussion alone showed that you are comparing apples with oranges. The only thing you are possibly benchmarking is your coding skills.

Cheers Rolf
(addicted to the Perl Programming Language and ☆☆☆☆ :) Je suis Charlie!)
Re^5: How to optimize a regex on a large file read line by line ? by John FENDER (Acolyte) on Apr 18, 2016 at 06:31 UTC
That was neither the place for it nor the point of my post, but why not. I'm not sure it will interest anyone here, but here it is:

C++:

int counter=0;
int counter2=0;
regex MyRegex("123456$");
string line;
ifstream file("10-million-combos.txt");
if (file.is_open())
{
    while (getline(file,line))
    {
        ++counter;
        if (regex_search(line,MyRegex))
        {
            ++counter2;
        }
    }
}

Pascal/Delphi:

RegexObj := TRegExpr.Create;
RegexObj.Expression := '123456$';
reset(tfIn);
while not eof(tfIn) do
begin
  readln(tfIn, s);
  if RegexObj.Exec(s) then counter2:=counter2+1;
  counter:=counter+1;
end;

PowerShell:

$f = [System.IO.File]::OpenText("10-million-combos.txt")
while (! $f.EndOfStream) {
    $line = $f.ReadLine();
    if ($line -match "123456$" ) { $counter +=1 }
    $counter2+=1
}

Python:

with open("10-million-combos.txt", encoding="cp850") as infile:
    for line in infile:
        counter2 += 1
        if re.search('123456$', line):
            counter += 1

Ruby (does not give the right numbers):

open("10-million-combos_LF.txt") do |content|
  content.each_line do |line|
    counter=counter+1
    if line.match(/123456$/)
      counter2 += 1
    end
  end
end

VB.NET:

Dim mStreamReader As StreamReader = New StreamReader("10-million-combos.txt")
line = mStreamReader.ReadLine()
Do While (line IsNot Nothing)
    counter += 1
    If Regex.IsMatch(line, "123456$") Then counter2 += 1
    line = mStreamReader.ReadLine()
Loop

VBS:

Set objTextFile = objFSO.OpenTextFile("10-million-combos.txt", ForReading)
Set objRegEx = CreateObject("VBScript.RegExp")
objRegEx.Pattern = "123456$"
Count=0
Count2=0
Do Until objTextFile.AtEndOfStream
    Count2=Count2+1
    strNextLine = objTextFile.ReadLine
    Set colMatches = objRegEx.Execute(strNextLine)
    Count=Count+colMatches.Count
Loop

C#:

Regex rgx = new Regex("123456$");
int counter = 0;
int counter2 = 0;
using (StreamReader sr = new StreamReader(@"10-million-combos.txt"))
{
    String line;
    while ((line = sr.ReadLine()) != null)
    {
        ++counter;
        if (rgx.IsMatch(line))
        {
            ++counter2;
        }
    }
}

PHP:

$counter=0;
$counter2=0;
$handle = fopen("10-million-combos.txt", "r");
while (($line = fgets($handle)) !== false)
{
    ++$counter;
    if(preg_match('/123456\R$/',$line))
    {
        ++$counter2;
    }
}

Java:

String line;
Pattern p = Pattern.compile("123456$");
String fichier ="10-million-combos.txt";
InputStream ips=new FileInputStream(fichier);
InputStreamReader ipsr=new InputStreamReader(ips);
BufferedReader br=new BufferedReader(ipsr);
while ((line=br.readLine())!=null){
    m = p.matcher(line);
    if (m.find()) {count2+=1;}
    count+=1;
}