in reply to Re: How to optimize a regex on a large file read line by line ?
in thread How to optimize a regex on a large file read line by line ?

I'm currently hiding nothing :).

I have the latest ActiveState Perl installed on my machine (ActivePerl-5.22.1.2201-MSWin32-x64-299574).

I've uploaded to my FTP both files I used for my tests. I'm running Windows 10 Home edition (it's my personal laptop, as I'm at home these days), with a 3.1 GHz quad core and 16 GB of RAM.

To give you an idea, a grep + wc command gives me a result of 10 s; Java or C#, 30 s; C++, 48 s; PHP 7, 50 s; Ruby, 85 s; Python, 346 s; PowerShell, 682 s; VBS, 1031 s; Free Pascal, 72.58 s; VB.NET, 100.63 s...

Maybe it's something related to the Perl distribution, you think? I will try with another distribution.

Re^3: How to optimize a regex on a large file read line by line ?
by LanX (Saint) on Apr 16, 2016 at 15:47 UTC
    How do you grep line by line?

    I suppose grep does the same as I suggested before: reading large chunks into memory and trying to match multiple lines at once.

    Another option is to fork four children, each processing a quarter of the file, to use the full power of your machine.
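
    A rough, untested sketch of the idea (the file name, the pattern, and the byte-range bookkeeping are my assumptions; note that fork is emulated with threads on Windows, so mileage may vary):

    use strict;
    use warnings;

    my $file  = 'myfile';                 # hypothetical input file
    my $size  = -s $file;
    my $parts = 4;
    my $step  = int($size / $parts);

    for my $i (0 .. $parts - 1) {
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        next if $pid;                     # parent: go spawn the next child

        my $from = $i * $step;
        my $to   = $i == $parts - 1 ? $size : $from + $step;
        open my $fh, '<', $file or die $!;
        if ($from) {
            seek $fh, $from - 1, 0;
            read $fh, my $c, 1;
            <$fh> if $c ne "\n";          # finish a broken line; the previous child owns it
        }
        my $count = 0;
        while (tell($fh) < $to and defined(my $line = <$fh>)) {
            $count++ if $line =~ /123456$/;
        }
        print "child $i: $count matches\n";
        exit 0;
    }
    wait for 1 .. $parts;                 # reap all four children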

    And btw, using lexical variables declared with my should help a little, too.

    Cheers Rolf
    (addicted to the Perl Programming Language and ☆☆☆☆ :)
    Je suis Charlie!

      That's a good question.

      Currently I can't answer you, as I'd need to look into the grep source for that. At the moment I'm just grepping the file like this:

      grep "12345$" myfile

      Same for the count:

      wc -l myfile

      I have other Perl code that does maybe what you are suggesting: it first loads all the lines into memory, then greps them. But unfortunately, the result is worse than the line-by-line attempt (2.47 s vs 8.33 s). Here is the code used for this test (on a reduced set, 200 MB):

      open (FH, '<', "../Tests/10-million-combos.txt");
      print "Loading the file...\n";
      while (<FH>) { push (@_file_to_parse, $_); }    # slurp every line into memory
      print "Counting the file...\n";
      $NumberOfLine = @_file_to_parse;                # array in scalar context = line count
      print "Searching 123456\$...\n";                # \$ so that $. is not interpolated
      @_result = grep { /123456$/ } @_file_to_parse;
      $NumberOfResult = @_result;
      print "$NumberOfResult - $NumberOfLine\n";
      close FH;
         while (<FH>) { push (@_file_to_parse, $_); }

        This can't work faster; as explained in the linked post, you have to read a chunk at once.
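
        Roughly like this, for example (untested sketch; the file name and the 100 MB chunk size are assumptions, and LF line endings are assumed):

        use strict;
        use warnings;

        open my $fh, '<', 'myfile' or die $!;    # hypothetical file name
        my ($count, $buf, $tail) = (0, '', '');
        while (read $fh, $buf, 100e6) {
            $buf  = $tail . $buf;                # prepend the leftover partial line
            $buf  =~ s/([^\n]*)\z//;             # carry the trailing partial line over
            $tail = $1;
            $count += () = $buf =~ /123456\n/g;  # 123456 right before \n == /123456$/ per line
        }
        $count++ if $tail =~ /123456\z/;         # the last line may lack a newline
        print "$count matches\n";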

        And please look up use Benchmark and/or use Time::HiRes to learn how to measure performance. (I'm on mobile, so no code, sorry.)
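
        Update: the usual pattern looks roughly like this (the timed loop and the two subs are dummy stand-ins, not your actual code):

        use strict;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);
        use Benchmark qw(cmpthese);

        # wall-clock timing of a single run
        my $t0 = [gettimeofday];
        my $x = 0; $x += $_ for 1 .. 1_000_000;  # stand-in for the code under test
        printf "elapsed: %.3f s\n", tv_interval($t0);

        # comparing two implementations (negative count = run each for ~3 CPU seconds)
        cmpthese(-3, {
            anchored => sub { my @m = grep { /x$/   } ('zx') x 100 },
            greedy   => sub { my @m = grep { /.*x$/ } ('zx') x 100 },
        });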

        Btw, I have my doubts about what kind of grep and wc you might be using on Windows.

        update

        ah indeed, GNU grep uses regexes by default for patterns, not fixed strings...

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        Some rough timings to demonstrate how reading line by line damages performance:

        ###### Timing grep (not fair bc exec takes time too)
        DB<163> $start=time; print `grep 123456\$ txt`; print time-$start
        123456
        ...   # shortened
        123456
        0.207661151885986

        ###### Reading and parsing a chunk from Perl not much slower
        DB<164> $start=time; open FH,"<txt"; read FH,$txt,100e6; print $txt =~ /(123456\n)/g; print time-$start
        123456
        ...
        123456
        0.257488012313843

        ###### Even reading a chunk takes already half the time
        DB<165> $start=time; print $txt =~ /(123456\n)/g; print time-$start
        123456
        ...
        123456
        0.116161108016968

        DB<166> $start=time; open FH,"<txt"; read FH,$txt,100e6; print time-$start
        0.124891042709351

        ####### Size of txt is 70 MB
        DB<167> length $txt
        => 70000080

        ###### READING LINE BY LINE IS A BOTTLENECK
        DB<168> $start=time; open FH,"<txt"; while ($txt=<FH>){ print $1 if $txt =~ /(123456\n)/g;} print time-$start
        123456
        ...
        123456
        16.3332719802856

        All done on a netbook with Ubuntu.

        Questions left?

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

        First of all, make sure the result counts actually match; grep might not be taking $ as a regex meta symbol!!!

        Then, I seem to remember problems with end-of-line localization and/or Unicode.

        Please check the encoding for \n in your file and your Perl.
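
        A quick check like this (file name assumed) shows which line endings the file really has; with CRLF endings and no :crlf layer, a byte-level /123456$/ fails because of the \r:

        use strict;
        use warnings;

        open my $fh, '<:raw', 'myfile' or die $!;  # :raw disables any CRLF translation
        read $fh, my $sample, 65536;               # inspect the first 64 KB
        my $crlf = () = $sample =~ /\r\n/g;
        my $lf   = () = $sample =~ /(?<!\r)\n/g;
        print "CRLF endings: $crlf, bare LF endings: $lf\n";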

        HTH! :)

        Cheers Rolf
        (addicted to the Perl Programming Language and ☆☆☆☆ :)
        Je suis Charlie!

      Interesting. Thanks!
Re^3: How to optimize a regex on a large file read line by line ?
by AnomalousMonk (Archbishop) on Apr 16, 2016 at 15:51 UTC

    But do you confirm that the processing time with Perl for the OPed code is in excess of 12 minutes? That's what would be shocking to me.

    Someone else would have to advise about differences between distributions (I'm running Strawberry 5.14.4.1 for my tests (update: on Windows 7)), but I would be flabbergasted by such a performance difference.


    Give a man a fish:  <%-{-{-{-<

      I confirm the 12 minutes. I'm currently running tests with Dwimperl and Strawberry Perl for comparison, and I will also take a snapshot of the timings. Maybe I'm doing something wrong... I'm wondering if the 64-bit distribution is perhaps less optimized than the 32-bit one. I will add that to my todo list as well.

Re^3: How to optimize a regex on a large file read line by line ?
by John FENDER (Acolyte) on Apr 16, 2016 at 15:47 UTC

    By the way, here is the full 2 GB dict I'm using for tests:

    http://mab.to/tbT8VsPDm

    Please give me your execution times with the same code, and your platform; it's interesting.

      Please give me your execution times with the same code

      Using my own 200-million-record 2 GB file, it takes 25 secs to get a count of lines only, and 50 secs with the regex included. (Win 10, i5 3.3 GHz/8 GB, AS v5.16.1)

      #!perl
      use strict;
      my $testfile = '200-million-combos.txt';
      unless (-e $testfile){
        open OUT,'>',$testfile or die "$!";
        my $record = '890123456';
        for (1..200_000_000){
          print OUT $record."\n";
        }
        close OUT;
      }
      my $counter1 = 0;
      my $counter2 = 0;     # match count
      my $t0 = time;
      open FH, '<', $testfile or die "$!";
      while (<FH>) {
        ++$counter1;
        if (/123456$/){
          ++$counter2;
        }
      }
      close FH;
      my $dur = time-$t0;
      print "$counter1 read in $dur secs\n";
      poj
        Sounds good to my ears. Which distribution/version are you using?
      $ time ./script.pl dict.txt
      Num. Line : 185866729 - Occ : 14900

      real    0m39.453s
      user    0m38.999s
      sys     0m0.445s

      $ perl -v

      This is perl 5, version 16, subversion 2 (v5.16.2) built for darwin-thread-multi-2level
      (with 3 registered patches, see perl -V for more detail)
      Mac OS X 10.9.5, Intel Core i7 2.4 GHz, 16 GB RAM 1600 MHz DDR3

      You can shave some time off by getting rid of $counter1 and using $. instead; a quick test took about 6 seconds less in my configuration.
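
      Roughly like this (untested sketch; $. holds the line number of the last read on the handle, and it is reset when the handle is closed, so print it first):

      while (<FH>) {
          ++$counter2 if /123456$/;     # $counter2 as in the script above
      }
      print "$. read, $counter2 matched\n";
      close FH;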

      perl -ple'$_=reverse' <<<ti.xittelop@oivalf

      Io ho capito... ma tu che hai detto? (I understood... but what did you say?)
        So maybe it's an issue related to my Windows setup or the distro; I will try to find out why. Thanks.