Re^2: How to optimize a regex on a large file read line by line?
by John FENDER (Acolyte) on Apr 16, 2016 at 15:40 UTC
I'm currently hiding nothing :).
I have the latest ActiveState Perl installed on my machine (ActivePerl-5.22.1.2201-MSWin32-x64-299574).
I've uploaded to my FTP both files I used for my tests. I'm running Windows 10 Home Edition (it's my personal laptop, as I'm at home these days), on a quad core at 3.1 GHz with 16 GB of RAM.
To give you an idea, a grep + wc command gives me a result of 10 s; Java or C#, 30 s; C++, 48 s; PHP 7, 50 s; Ruby, 85 s; Python, 346 s; PowerShell, 682 s; VBScript, 1031 s; Free Pascal, 72.58 s; VB.NET, 100.63 s...
Maybe something related to the Perl distribution, you think? I will try with another distribution.
How do you grep line by line?
I suppose grep does the same as I suggested before: reading large chunks into memory and trying to match multiple lines at once.
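For the record, here is roughly what I had in mind: a rough, untested sketch that reads the file in 1 MB blocks and counts matching lines with a multiline regex (/m lets $ match before each embedded newline). The file name, the chunk size and the partial-line handling are my own assumptions; the pattern is the one from this thread.

#!perl
use strict;
use warnings;

my $file  = 'myfile';   # hypothetical file name
my $count = 0;
my $tail  = '';         # partial line carried over to the next chunk

open my $fh, '<', $file or die "open $file: $!";
while (read $fh, my $chunk, 1024 * 1024) {
    $chunk = $tail . $chunk;
    # keep everything after the last newline for the next round
    $chunk =~ s/([^\n]*)\z//;
    $tail = $1 // '';
    # count all matching lines in this block at once
    $count += () = $chunk =~ /123456$/mg;
}
$count++ if $tail =~ /123456$/;   # the file may not end with a newline
close $fh;
print "$count matching lines\n";

The point of the carry-over variable is that a line can straddle two chunks; stitching the leftover fragment onto the next block keeps every line intact for the regex.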
Another option is to fork four children, each processing a quarter of the file, to use the full power of your machine.
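And a rough, untested sketch of the fork idea: split the file into four byte ranges and let each child count only the lines that start inside its own range. The file name, the boundary handling and the output format are my assumptions, and it assumes Unix line endings (tell and length must agree); note also that on Windows, Perl emulates fork with threads, so the gain may differ.

#!perl
use strict;
use warnings;

my $file  = 'myfile';   # hypothetical file name
my $size  = -s $file or die "cannot stat $file";
my $parts = 4;

for my $i (0 .. $parts - 1) {
    my $from = int($size * $i / $parts);
    my $to   = ($i == $parts - 1) ? $size : int($size * ($i + 1) / $parts);
    next if fork;                        # parent: spawn the next child
    open my $fh, '<', $file or die "open $file: $!";
    if ($from > 0) {
        # land on a line boundary: unless the byte just before $from
        # is a newline, skip the partial line we seeked into
        seek $fh, $from - 1, 0;
        read $fh, my $byte, 1;
        scalar <$fh> if $byte ne "\n";
    }
    my $count = 0;
    while (my $line = <$fh>) {
        my $start = tell($fh) - length $line;
        last if $start >= $to;           # this line belongs to the next child
        $count++ if $line =~ /123456$/;
    }
    print "child $i ($from-$to): $count matches\n";
    exit 0;
}
wait for 1 .. $parts;                    # reap all children

Each child prints its own count; collecting them into a single total would need a pipe or some other shared channel back to the parent.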
And by the way, using lexical variables declared with my should help a little too.
That's a good question.
I can't answer you right now, as I'd need to look into the grep source for that. Currently I'm just grepping the file like this:
grep "12345$" myfile
Same for the count:
wc -l myfile
I have another Perl script that may do what you are suggesting: it first loads all the lines into memory, then greps them. But unfortunately, the result is worse than the line-by-line attempt (2.47 s vs 8.33 s). Here is the code used for this test (on a reduced set, 200 MB):
use strict;
use warnings;

my @_file_to_parse;
open my $fh, '<', '../Tests/10-million-combos.txt' or die "open: $!";
print "Loading the file...\n";
while (<$fh>) {
    push @_file_to_parse, $_;    # load every line into memory
}
print "Counting the file...\n";
my $NumberOfLine = @_file_to_parse;
print "Searching 123456\$...\n";    # \$ so $. is not interpolated
my @_result = grep { /123456$/ } @_file_to_parse;
my $NumberOfResult = @_result;
print "$NumberOfResult - $NumberOfLine\n";
close $fh;
But do you confirm that the processing time with Perl for the OP's code is in excess of 12 minutes? That's what would be shocking to me.
Someone else would have to advise about differences between distributions (I'm running Strawberry 5.14.4.1 for my tests (update: on Windows 7)), but I would be flabbergasted by such a performance difference.
Give a man a fish: <%-{-{-{-<
By the way, here is the full 2 GB dictionary I'm using for tests:
http://mab.to/tbT8VsPDm
Please give me your execution times with the same code and your platform; that would be interesting.
#!perl
use strict;
use warnings;

# create a 200-million-line test file if it doesn't exist yet
my $testfile = '200-million-combos.txt';
unless (-e $testfile){
    open my $out, '>', $testfile or die "$!";
    my $record = '890123456';
    for (1..200_000_000){
        print $out $record, "\n";
    }
    close $out;
}

# read it back line by line, counting lines and matches
my $counter1 = 0;
my $counter2 = 0;
my $t0 = time;
open my $fh, '<', $testfile or die "$!";
while (<$fh>) {
    ++$counter1;
    ++$counter2 if /123456$/;
}
close $fh;
my $dur = time - $t0;
print "$counter1 read, $counter2 matched in $dur secs\n";
poj
$ time ./script.pl dict.txt
Num. Line : 185866729 - Occ : 14900
real 0m39.453s
user 0m38.999s
sys 0m0.445s
$ perl -v
This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level
(with 3 registered patches, see perl -V for more detail)
Mac OS X 10.9.5, Intel Core i7 2.4 GHz, 16 GB RAM 1600 MHz DDR3
You can shave some time off by getting rid of $counter1 and using $. instead; a quick test took about 6 seconds less in my configuration.
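Something like this minimal, untested rewrite of the loop above is what I mean; $. is Perl's current input line number, and it has to be read before close, which resets it:

#!perl
use strict;
use warnings;

my $testfile = '200-million-combos.txt';
my $matches  = 0;

open my $fh, '<', $testfile or die "$!";
while (<$fh>) {
    ++$matches if /123456$/;
}
my $lines = $.;   # grab $. before close() resets it
close $fh;
print "$lines read, $matches matched\n";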
perl -ple'$_=reverse' <<<ti.xittelop@oivalf
I understood... but what did you say?