PerlMonks  

Re^2: grep for lines containg two variables

by l3v3l (Monk)
on Dec 07, 2005 at 23:35 UTC ( #515071=note )


in reply to Re: grep for lines containg two variables
in thread grep for lines containg two variables

It seems that lookahead grep is the slightly faster solution. Here is the test script I used:
#!/usr/bin/perl -w
# usage : ./this_script.pl < input_file > captured_benchmarks
use strict;
use Benchmark;

my @data=<>;
my (@res1,@res2,@res3);

timethese (100000000, {
    grep_and => q{
        @res1 = grep /GGGGGACACCTTCTCTCTCT/ && /RH_MEa0001bG06/, @data;
    },
    double_grep => q{
        @res2 = grep /GGGGGACACCTTCTCTCTCT/,grep /RH_MEa0001bG06/,@data;
    },
    lookahead_grep => q{
        @res3 = grep /^(?=.*GGGGGACACCTTCTCTCTCT)(?=.*RH_MEa0001bG06)/,@data;
    }
} );
... and the results
Benchmark: timing 100000000 iterations of double_grep, grep_and, lookahead_grep...
   double_grep : 27 wallclock secs (26.98 usr + 0.00 sys = 26.98 CPU) @ 3705899.79/s (n=100000000)
      grep_and : 24 wallclock secs (23.05 usr + 0.00 sys = 23.05 CPU) @ 4338959.52/s (n=100000000)
lookahead_grep : 24 wallclock secs (22.83 usr + 0.00 sys = 22.83 CPU) @ 4380585.25/s (n=100000000)

Re^3: grep for lines containg two variables
by ikegami (Patriarch) on Dec 08, 2005 at 01:09 UTC

    I wish I had a computer that executed 4380585.25 greps per second...

    Your test is useless. The @data, $string1 and $string2 used by the test are always empty or undef. I fixed it up below. Note the use of sub { ... } instead of q{ ... }. Subs close over my variables, while the string is evaled in a different scope where the my variables don't exist.

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw( cmpthese );

    my $string1 = qr/tr/;
    my $string2 = qr/e/;
    my @data = do { open(my $fh, '<', $0) or die; <$fh> };

    cmpthese (-3, {
        grep_and => sub {
            my @r = grep /$string1/ && /$string2/, @data;
            return @r;
        },
        double_grep => sub {
            my @r = grep /$string1/, grep /$string2/, @data;
            return @r;
        },
        lookahead => sub {
            my @r = grep /^(?=.*$string1)(?=.*$string2)/, @data;
            return @r;
        }
    });
    outputs
                   Rate   lookahead double_grep    grep_and
    lookahead    8114/s          --        -52%        -62%
    double_grep 16986/s        109%          --        -21%
    grep_and    21483/s        165%         26%          --

    The contents of @data (the lines of the script itself) are probably not all that representative, so the figures aren't perfect, but they give a pretty good idea.
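
Before comparing speeds, it can help to confirm that the three formulations really select the same lines. The following is only a minimal sketch: the sample lines in @data are hypothetical and made up for illustration, while the two patterns are the ones used in the benchmark above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Same patterns as the benchmark above.
my $string1 = qr/tr/;
my $string2 = qr/e/;

# Hypothetical sample lines, made up for this sketch.
my @data = (
    "strict mode\n",   # has both "tr" and "e"
    "string\n",        # has "tr" only
    "hello\n",         # has "e" only
    "abc\n",           # has neither
    "e before tr\n",   # has both, in the opposite order
);

# The three formulations from the thread.
my @grep_and    = grep /$string1/ && /$string2/, @data;
my @double_grep = grep /$string1/, grep /$string2/, @data;
my @lookahead   = grep /^(?=.*$string1)(?=.*$string2)/, @data;

# Compare the selected lines as joined strings.
my $same = join('', @grep_and) eq join('', @double_grep)
        && join('', @grep_and) eq join('', @lookahead);

print "matched lines: ", scalar(@grep_and), "\n";
print $same ? "all three agree\n" : "results differ\n";
```

If the three lists ever differ on real input, the timing comparison stops being meaningful, so a check like this is cheap insurance.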

      I am not sure I understand your points - so to clarify:

      listing above was posted as pseudo code and not the actual code run - for $string1 and $string2 in the test case I put the literal strings of interest in place (and have updated the node to illustrate). I used what I found interesting for my specific situation, and I left it generic when posting the code example above because I thought that would be more useful. I understand your point re: my $str..., and it now works properly if I use your qr update (and I get the same times). Is this not a useful or valid benchmark?

      my input file was:

      RH_MEa0001bA09_1 1253 871 10 GGAGAGGGGTCGAATTTCTC...
      RH_MEa0001bB03_1 553 104 12 GTCCGTTGCAACAAAAGTGA...
      RH_MEa0001bC11_1 1160 385 12 TGGGGTTGAAGAAAGGTTNG...
      RH_MEa0001bG06_1 710 14 18 Invalid starting position (14)
      RH_MEa0001bG06_2 710 34 10 GGGGGACACCTTCTCTCTCT...
      RH_MEa0001bG06_3 710 51 10 GGGGGACACCTTCTCTCTCT...
      etc

      since diff boxes have different performance depending on input_files, strings, mem., proc. etc - I guess it is better to list relative results instead of specifics. Is it just luck that your results confirm the general reason I posted, that lookahead grep is the fastest (accurate) solution?

        I am not sure I understand your points

        Change q{ to q{use strict; or q{print scalar @data; and you'll see. You've updated your node, but the problem is still there. @data is empty in the tests, because the tests are using our @main::data and not the my @data that holds the test file.
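
The scoping point above can be checked directly. This is a small sketch, not the code from either post: it assumes Benchmark's timeit, which accepts either a string or a code ref, and the variable names $seen_by_string and $seen_by_sub are hypothetical, introduced only to record what each form of the test code can see.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( timeit );

# A lexical array, like the my @data in the original script.
my @data = ("line one\n", "line two\n");

# Package variables (hypothetical names) so both forms can report back.
our ($seen_by_string, $seen_by_sub);

# String form: Benchmark compiles this text in its own scope, where the
# lexical @data above is not visible, so @data here means @main::data,
# which was never filled.
timeit(1, q{ $main::seen_by_string = scalar @data; });

# Sub form: the closure captures the lexical @data.
timeit(1, sub { $seen_by_sub = scalar @data; });

print "string code saw $seen_by_string lines\n";
print "sub code saw $seen_by_sub lines\n";
```

The string version runs against an empty array, which is why it looks so fast; the sub version sees the two lines that were actually loaded.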

        listing above was posted as pseudo code and not the actual code run

        That's rather silly.

        since diff boxes have different performance depending on input_files, strings, mem., proc. etc -

        Sorry, but your machine is not 540x faster than mine. Change q{ to sub { and you'll see.
