PerlMonks  

Re^3: grep for lines containg two variables

by ikegami (Patriarch)
on Dec 08, 2005 at 01:09 UTC ( [id://515096]=note )


in reply to Re^2: grep for lines containg two variables
in thread grep for lines containg two variables

I wish I had a computer that executed 4380585.25 greps per second...

Your test is useless. The @data, $string1 and $string2 seen by the benchmarked code are always empty or undef. I fixed it up below. Note the use of sub { ... } instead of q{ ... }. Subs close over my variables, while the string is eval'ed in a different scope where the my variables don't exist.
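As a side illustration of the scoping point (this is not the fixed benchmark, which follows), here is a minimal sketch. The helpers run_string and run_sub are hypothetical stand-ins for a module that compiles code outside the caller's lexical scope; they are not part of Benchmark, and the sample data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: compiles a string snippet here, where the caller's
# "my" variables are not in scope. Strict 'vars' is switched off so that an
# undeclared @data silently resolves to the (empty) package variable.
sub run_string {
    my ($code) = @_;
    my $sub = eval "no strict 'vars'; sub { $code }" or die $@;
    return scalar $sub->();
}

# Hypothetical helper: just calls a code ref that was compiled by the caller.
sub run_sub {
    my ($code) = @_;
    return scalar $code->();
}

# Declared after the helpers, so only code compiled below this line sees it.
my @data = ("tree\n", "rock\n", "letter\n");

my $from_string = run_string(q{ my @r = grep /tr/, @data; scalar @r });
my $from_sub    = run_sub(sub { my @r = grep /tr/, @data; scalar @r });

print "string snippet matched: $from_string\n";
print "closure matched:        $from_sub\n";
```

The string version greps an empty package array, so it has nothing to match, while the closure sees the lexical @data.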

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my $string1 = qr/tr/;
my $string2 = qr/e/;
my @data = do { open(my $fh, '<', $0) or die; <$fh> };

cmpthese(-3, {
    grep_and => sub {
        my @r = grep /$string1/ && /$string2/, @data;
        return @r;
    },
    double_grep => sub {
        my @r = grep /$string1/, grep /$string2/, @data;
        return @r;
    },
    lookahead => sub {
        my @r = grep /^(?=.*$string1)(?=.*$string2)/, @data;
        return @r;
    },
});
outputs
               Rate   lookahead double_grep    grep_and
lookahead    8114/s          --        -52%        -62%
double_grep 16986/s        109%          --        -21%
grep_and    21483/s        165%         26%          --

The contents of @data (the script's own source, read via $0) are probably not all that representative, so the figures aren't perfect, but they give a pretty good idea.
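Since a speed comparison only means something if the candidates select the same lines, a quick sanity check along these lines may be worth running first. This is a sketch only; the sample lines are made up for illustration and are not from the thread.

```perl
use strict;
use warnings;

my $string1 = qr/tr/;
my $string2 = qr/e/;

# Made-up sample lines, for illustration only.
my @data = ("tree\n", "street\n", "track\n", "eel\n", "metric\n");

# The three candidate filters from the benchmark above.
my @grep_and    = grep /$string1/ && /$string2/, @data;
my @double_grep = grep /$string1/, grep /$string2/, @data;
my @lookahead   = grep /^(?=.*$string1)(?=.*$string2)/, @data;

# Compare the selected lines, not just the counts.
my $same = "@grep_and" eq "@double_grep" && "@grep_and" eq "@lookahead";
print $same ? "all three agree\n" : "results differ\n";
```

Note that the lookahead form relies on each element being a single line, because . does not match a newline by default.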

Re^4: grep for lines containg two variables
by l3v3l (Monk) on Dec 08, 2005 at 17:54 UTC
    I am not sure I understand your points - so to clarify:

    listing above was posted as pseudo code and not the actual code run - for $string1 and $string2 in the test case I put the literal strings of interest in place (and have updated to illustrate). I used what I found interesting for my specific situation, and I left it generic when posting the code example above because I thought that would be more useful. I understand your point re: my $str..., and it now works properly if I use your qr update (and I get the same times). Is this not a useful or valid benchmark?

    my input file was:

    RH_MEa0001bA09_1 1253 871 10 GGAGAGGGGTCGAATTTCTC...
    RH_MEa0001bB03_1 553 104 12 GTCCGTTGCAACAAAAGTGA...
    RH_MEa0001bC11_1 1160 385 12 TGGGGTTGAAGAAAGGTTNG...
    RH_MEa0001bG06_1 710 14 18 Invalid starting position (14)
    RH_MEa0001bG06_2 710 34 10 GGGGGACACCTTCTCTCTCT...
    RH_MEa0001bG06_3 710 51 10 GGGGGACACCTTCTCTCTCT...
    etc

    since diff boxes have different performance depending on input_files, strings, mem., proc. etc - I guess it is better to list relative results instead of specifics ... is it just luck that your results confirm the general reason I posted, that lookahead grep is the fastest (accurate) solution?

      I am not sure I understand your points

      Change q{ to q{use strict; or q{print scalar @data; and you'll see. You've updated your node, but the problem is still there. @data is empty in the tests, because the tests are using our @main::data and not the my @data that holds the test file.

      listing above was posted as pseudo code and not the actual code run

      That's rather silly.

      since diff boxes have different performance depending on input_files, strings, mem., proc. etc -

      Sorry, but your machine is not 540x faster than mine. Change q{ to sub { and you'll see.
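The use strict experiment suggested above can be sketched roughly as follows. The helper compile_snippet is hypothetical and only mimics a module that compiles a string snippet with strict off and without access to the caller's my variables; it is not Benchmark's actual code, and the data is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: compiles a string snippet with strict 'vars' off and
# without access to the caller's lexicals. Returns the code ref (or undef)
# and any compile error.
sub compile_snippet {
    my ($code) = @_;
    no strict 'vars';
    my $sub = eval "sub { $code }";
    return ($sub, $@);
}

# Declared after the helper, so the helper's eval cannot see it.
my @data = ("one\n", "two\n");

# Without strict, @data quietly means the empty package variable @main::data.
my ($loose, $loose_err) = compile_snippet(q{ scalar @data });

# With strict added to the snippet, the same mistake becomes a compile error.
my ($strict_sub, $strict_err) = compile_snippet(q{ use strict; scalar @data });

# A real closure sees the lexical @data.
my $closure = sub { scalar @data };

print "loose snippet sees ", $loose->(), " lines\n";
print "strict snippet error: $strict_err";
print "closure sees ", $closure->(), " lines\n";
```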

        Right! thank you for the pointers/clarification - makes sense now!!!! and this is now valid: (?)
        #!/usr/bin/perl -w
        # usage : ./this_script.pl input_file > captured_benchmarks
        use strict;
        use Benchmark;

        my @data = do { open(my $fh, '<', $0) or die; <$fh> };

        timethese(1000000, {
            grep_and => sub {
                my @res1 = grep /GGGGGACACCTTCTCTCTCT/ && /RH_MEa0001bG06/, @data;
            },
            double_grep => sub {
                my @res2 = grep /GGGGGACACCTTCTCTCTCT/, grep /RH_MEa0001bG06/, @data;
            },
            lookahead_grep => sub {
                my @res3 = grep /^(?=.*GGGGGACACCTTCTCTCTCT)(?=.*RH_MEa0001bG06)/, @data;
            },
        });
        because I get the following now:
        Benchmark: timing 1000000 iterations of double_grep, grep_and, lookahead_grep...
        double_grep: 11 wallclock secs ( 9.06 usr +  0.00 sys =  9.06 CPU) @ 110350.92/s (n=1000000)
        grep_and:  8 wallclock secs ( 8.70 usr +  0.00 sys =  8.70 CPU) @ 114902.91/s (n=1000000)
        lookahead_grep: 21 wallclock secs (20.31 usr +  0.00 sys = 20.31 CPU) @ 49231.98/s (n=1000000)
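For relative figures rather than raw timings, the hash of results returned by timethese can be passed to cmpthese, which prints a comparison table. A minimal sketch, assuming made-up sample lines shaped like the input file above and a smaller, arbitrary iteration count:

```perl
use strict;
use warnings;
use Benchmark qw( timethese cmpthese );

# Made-up sample lines, for illustration only.
my @data = map { "RH_MEa0001bG06_$_ 710 34 10 GGGGGACACCTTCTCTCTCT...\n" } 1 .. 50;

# The 'none' style suppresses timethese's own report; the iteration count
# is arbitrary and kept small so the sketch runs quickly.
my $results = timethese(20000, {
    grep_and => sub {
        my @r = grep /GGGGGACACCTTCTCTCTCT/ && /RH_MEa0001bG06/, @data;
    },
    lookahead_grep => sub {
        my @r = grep /^(?=.*GGGGGACACCTTCTCTCTCT)(?=.*RH_MEa0001bG06)/, @data;
    },
}, 'none');

# Print the relative comparison table from the collected results.
cmpthese($results);
```

The rates will vary from machine to machine; only the percentages between rows are meant to be compared.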
