Re^2: How to optimize a regex on a large file read line by line ?

"The predefined global variable $. does that for you"

I wasn't aware of this trick, thanks!
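For anyone else who did not know it, here is a minimal sketch of what $. gives you: the line number of the last line read from the most recently used filehandle, so no manual counter is needed. The sub name and the in-memory sample data are made up for illustration; they stand in for the real password file.

```perl
use strict;
use warnings;

# Return the line numbers of all lines matching $re, using $. instead of
# a hand-maintained counter.
sub matching_line_numbers {
    my ($fh, $re) = @_;
    my @hits;
    while (my $line = <$fh>) {
        chomp $line;
        push @hits, $. if $line =~ $re;   # $. is the current line number of $fh
    }
    return @hits;
}

# Small in-memory sample (an assumption, not the real data).
my $sample = "abc123456\nhunter2\n123456\n123456x\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @hits = matching_line_numbers($fh, qr/123456$/);
close $fh;

print "matched on lines: @hits\n";
```

Only the first and third sample lines end in 123456, so those are the line numbers reported.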

"Spoiler alert: your file "10-million-combos.txt" does not contain any lines that match /123456$/."

Ahem, it sounds like I did something wrong while zipping the file. The 19x MB file containing 10 million passwords has now been updated correctly. You will find 10000000 lines in it, 61466 of which match the regex 123456$.

"unzip -p 10-million-combos.txt.zip | perlscript"

Currently I'm working on the txt file only, but it's interesting. I ran your test like this:

    echo 1:%time%
    unzip -p 10-million-combos.zip | grep 123456$ | wc -l
    echo 2:%time%
    grep 123456$ 10-million-combos.txt  | wc -l
    echo 3:%time%
    pause

Result :

1:19:16:46,11
  61466
2:19:16:48,43
  61466
3:19:16:49,00

That is 0,57 s for the plain text file and 2,32 s for the piped zip file.
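The elapsed times are simply the differences between consecutive %time% stamps. Subtracting them by hand is easy to get slightly wrong, so here is a small sketch that does it in Perl. The helper names (stamp_to_cs, elapsed_cs, format_cs) are made up for this example, and it assumes the hh:mm:ss,cc format shown in the output above.

```perl
use strict;
use warnings;

# Convert a cmd.exe %time% stamp such as "19:16:46,11" into centiseconds
# since midnight. Accepts "," or "." as the decimal separator.
sub stamp_to_cs {
    my ($stamp) = @_;
    my ($h, $m, $s, $cs) = $stamp =~ /^\s*(\d{1,2}):(\d{2}):(\d{2})[.,](\d{2})\s*$/
        or die "Unrecognised time stamp: $stamp";
    return (($h * 60 + $m) * 60 + $s) * 100 + $cs;
}

# Elapsed centiseconds between two stamps, allowing for a run that
# crosses midnight.
sub elapsed_cs {
    my ($start, $end) = @_;
    my $diff = stamp_to_cs($end) - stamp_to_cs($start);
    $diff += 24 * 60 * 60 * 100 if $diff < 0;
    return $diff;
}

# Format centiseconds as "seconds,hundredths", matching the %time% style.
sub format_cs {
    my ($cs) = @_;
    return sprintf '%d,%02d', int($cs / 100), $cs % 100;
}

# The three stamps printed by the batch file above.
my @stamps = ('19:16:46,11', '19:16:48,43', '19:16:49,00');
printf "zip piped : %s s\n", format_cs(elapsed_cs(@stamps[0, 1]));
printf "plain text: %s s\n", format_cs(elapsed_cs(@stamps[1, 2]));
```

Working in integer centiseconds avoids floating-point rounding noise in the last digit.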

Now with your command line:

zip piped : 3,89

unzip -p "C:\Users\admin\Desktop\10-million-combos.zip" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"

plain text : 5,16

type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"

perl direct : 2,29

perl "demo.pl"
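demo.pl itself is not shown in this post. Purely as an illustration of what "perl direct" means here (the script opens the file itself instead of reading a pipe), a direct-read counter could look roughly like the sketch below. The sub name is hypothetical, and a small temporary file stands in for 10-million-combos.txt so the example runs anywhere.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Open the file directly and count the lines ending in 123456.
sub count_matches_in_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ /123456$/;   # $ also matches just before the trailing newline
    }
    close $fh;
    return $n;
}

# Stand-in data (an assumption): three lines, two of which match.
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "user1:123456\nuser2:letmein\nuser3:0123456\n";
close $tmp_fh;

my $count = count_matches_in_file($tmp_path);
print "$count\n";
```

With the real data you would pass the path of the txt file to count_matches_in_file instead of the temporary file.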

The fastest on my side remains direct access to the plain text file, using either grep or perl. It is surprising to see that perl reading from the unzip pipe is faster than perl reading the plain text file piped through "type"... The shell is strange sometimes...

"I was going to suggest using the gnu/*n*x "grep" command-line utility to get a performance baseline"

I'm using the one from the unix utils; I suppose it's the GNU one ported to Windows. --version gives me: grep (GNU grep) 2.4.2.

Now grep vs perl

echo %time%& grep 123456$ C:\Users\admin\Desktop\10-million-combos.txt | wc -l& echo %time%
echo %time%& type "C:\Users\admin\Desktop\10-million-combos.txt" | perl -ne "BEGIN{$n=0} $n++ if /123456$/; END{print $n}"& echo.&echo %time%
echo %time%& perl demo.pl& echo %time%

This gives me:

19:43:28,91/61466/19:43:29,51 for grep (0,6)
19:45:29,51/61466/19:45:34,71 for perl (5,2)
19:46:13,27/61466/19:46:15,47 for perl (direct) (2,2)

Replies are listed 'Best First'.
Re^3: How to optimize a regex on a large file read line by line ?
by graff (Chancellor) on Apr 18, 2016 at 09:02 UTC

Thanks for showing your comparison of the unzip pipeline vs. reading uncompressed text. I had said that the former would be faster (because of less reading from disk), but without actually testing it. (I think I must have encountered at least a couple situations in the past where some process finished more quickly if I read compressed data from disk, rather than uncompressed, but I don't know what may have been different in those cases.)

Having now tested it for this situation (multiple times in quick succession to check for consistency), the difference in timing was negligible or slightly favoring reading the uncompressed file, so it seems my initial idea about the role of disk access was wrong: either it really doesn't make any difference, or else whatever difference it makes is washed out by the added overhead of the extra unzip process and/or the pipeline itself.

(The perl one-liner was still faster than the compiled "grep" utility on my machine, but YMMV - different machines will have different versions / compilations of both Perl and grep.)
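If you want to repeat the compressed-vs-uncompressed comparison from inside a single script, one option is to let Perl start the unzip process itself with a piped open. This is only a sketch: the sub name is made up, a child perl stands in for "unzip -p" so the example runs without the archive, and on Windows the list form of piped open may need a reasonably recent perl.

```perl
use strict;
use warnings;

# Run an external command and count the lines of its output matching $re.
# The list form of open passes the arguments without going through the shell.
sub count_matches_from_command {
    my ($re, @cmd) = @_;
    open my $fh, '-|', @cmd or die "Cannot run '@cmd': $!";
    my $n = 0;
    while (my $line = <$fh>) {
        $n++ if $line =~ $re;
    }
    close $fh or warn "'@cmd' exited with status $?\n";
    return $n;
}

# Real use would look like this (archive name taken from the thread):
#   my $n = count_matches_from_command(qr/123456$/,
#                                      'unzip', '-p', '10-million-combos.zip');

# Stand-in producer (an assumption): a child perl that prints three lines,
# two of which end in 123456.
my @producer = ($^X, '-e', 'print "a:123456\n", "b:letmein\n", "c:0123456\n"');
my $n = count_matches_from_command(qr/123456$/, @producer);
print "$n\n";
```

This still pays for the extra unzip process and the pipe, so it should behave like the command-line pipeline; it just keeps the whole test in one file.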
