Re^2: grep for lines containg two variables

It seems that the lookahead grep is the slightly faster solution. Here is the test script I used:

#!/usr/bin/perl -w
# usage : ./this_script.pl < input_file > captured_benchmarks
use strict;
use Benchmark;
my @data=<>;
my (@res1,@res2,@res3);
timethese (100000000,
    { grep_and => q{
    @res1 = grep /GGGGGACACCTTCTCTCTCT/ && /RH_MEa0001bG06/, @data;
    },
    double_grep => q{
    @res2 = grep /GGGGGACACCTTCTCTCTCT/,grep /RH_MEa0001bG06/,@data;
    },
    lookahead_grep => q{
    @res3 = grep /^(?=.*GGGGGACACCTTCTCTCTCT)(?=.*RH_MEa0001bG06)/,@data;
    }
}
);

... and the results

Benchmark: timing 100000000 iterations of double_grep, grep_and, lookahead_grep...

double_grep: 27 wallclock secs (26.98 usr +  0.00 sys = 26.98 CPU) @ 3705899.79/s (n=100000000)

grep_and: 24 wallclock secs (23.05 usr +  0.00 sys = 23.05 CPU) @ 4338959.52/s (n=100000000)

lookahead_grep: 24 wallclock secs (22.83 usr +  0.00 sys = 22.83 CPU) @ 4380585.25/s (n=100000000)


Replies are listed 'Best First'.
Re^3: grep for lines containg two variables by ikegami (Patriarch) on Dec 08, 2005 at 01:09 UTC
I wish I had a computer that executed 4380585.25 greps per second... Your test is useless. The @data, $string1 and $string2 used by the test are always undef. I fixed it up below. Note the use of `sub { ... }` instead of `q{ ... }`. Subs capture `my` variables, while the string is `eval`ed in a different scope where the `my` variables don't exist.

#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( cmpthese );

my $string1 = qr/tr/;
my $string2 = qr/e/;
my @data = do { open(my $fh, '<', $0) or die; <$fh> };

cmpthese (-3, {
    grep_and => sub {
        my @r = grep /$string1/ && /$string2/, @data;
        return @r;
    },
    double_grep => sub {
        my @r = grep /$string1/, grep /$string2/, @data;
        return @r;
    },
    lookahead => sub {
        my @r = grep /^(?=.*$string1)(?=.*$string2)/, @data;
        return @r;
    }
});

outputs

               Rate   lookahead double_grep    grep_and
lookahead    8114/s          --        -52%        -62%
double_grep 16986/s        109%          --        -21%
grep_and    21483/s        165%         26%          --

The contents of @data are probably not all that good, so the figures aren't perfect, but they give a pretty good idea.
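The scoping difference described above can be checked directly. This is only a minimal sketch, assuming a stock Benchmark module; the sample contents of `@data` and the `$seen_by_string` / `$seen_by_sub` names are made up for illustration and are not part of the original test.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( timeit );

# Hypothetical stand-in for the lines read from the input file.
my @data = ("one\n", "two\n", "three\n");

# Package variables, so that both kinds of test code can report back.
our $seen_by_string = -1;
our $seen_by_sub    = -1;

# String code is compiled inside Benchmark, where the lexical @data
# is not in scope, so @data here means the (empty) @main::data.
timeit(1, q{ $main::seen_by_string = scalar @data; });

# A sub is compiled right here and closes over the lexical @data.
timeit(1, sub { $seen_by_sub = scalar @data; });

print "string code sees $seen_by_string element(s), ",
      "sub sees $seen_by_sub element(s)\n";
```

If the string version reports 0 elements while the sub reports 3, the string-based timings were measuring a grep over an empty list.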
Re^4: grep for lines containg two variables by l3v3l (Monk) on Dec 08, 2005 at 17:54 UTC
I am not sure I understand your points, so to clarify: the listing above was posted as pseudo code, not the actual code I ran. For $string1 and $string2 in the test case I put the literal strings of interest in place (and have updated the node to illustrate). I used what I found interesting for my specific situation, and left it generic when posting the code example above because I thought that would be more useful. I understand your point re: my $str..., and that now works properly if I use your qr update (and I get the same times). Is this not a useful or valid benchmark? My input file was:

RH_MEa0001bA09_1 1253 871 10 GGAGAGGGGTCGAATTTCTC...
RH_MEa0001bB03_1 553 104 12 GTCCGTTGCAACAAAAGTGA...
RH_MEa0001bC11_1 1160 385 12 TGGGGTTGAAGAAAGGTTNG...
RH_MEa0001bG06_1 710 14 18 Invalid starting position (14)
RH_MEa0001bG06_2 710 34 10 GGGGGACACCTTCTCTCTCT...
RH_MEa0001bG06_3 710 51 10 GGGGGACACCTTCTCTCTCT...
etc

Since different boxes have different performance depending on input files, strings, memory, processor etc., I guess it is better to list relative results instead of specifics ... is it just luck that your results confirm the general reason I posted, that lookahead grep is the fastest (accurate) solution?
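Before comparing speeds, it may be worth confirming that the three expressions select the same lines. The sketch below runs them once over the six sample records quoted above (with the `...` elisions kept as literal text); it checks agreement only and says nothing about timing.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample records copied from the post; the trailing "..." is literal here.
my @data = map { "$_\n" } (
    'RH_MEa0001bA09_1 1253 871 10 GGAGAGGGGTCGAATTTCTC...',
    'RH_MEa0001bB03_1 553 104 12 GTCCGTTGCAACAAAAGTGA...',
    'RH_MEa0001bC11_1 1160 385 12 TGGGGTTGAAGAAAGGTTNG...',
    'RH_MEa0001bG06_1 710 14 18 Invalid starting position (14)',
    'RH_MEa0001bG06_2 710 34 10 GGGGGACACCTTCTCTCTCT...',
    'RH_MEa0001bG06_3 710 51 10 GGGGGACACCTTCTCTCTCT...',
);

# The three variants from the benchmark, each applied once.
my @grep_and    = grep { /GGGGGACACCTTCTCTCTCT/ && /RH_MEa0001bG06/ } @data;
my @double_grep = grep { /GGGGGACACCTTCTCTCTCT/ }
                  grep { /RH_MEa0001bG06/ } @data;
my @lookahead   = grep { /^(?=.*GGGGGACACCTTCTCTCTCT)(?=.*RH_MEa0001bG06)/ } @data;

printf "grep_and=%d double_grep=%d lookahead=%d\n",
    scalar @grep_and, scalar @double_grep, scalar @lookahead;
print @grep_and;
```

On this sample only the `RH_MEa0001bG06_2` and `RH_MEa0001bG06_3` records contain both strings, so each variant should return those two lines.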
Re^5: grep for lines containg two variables by ikegami (Patriarch) on Dec 08, 2005 at 18:35 UTC
"I am not sure I understand your points"

Change `q{` to `q{use strict;` or `q{print scalar @data;` and you'll see. You've updated your node, but the problem is still there. `@data` is empty in the tests, because the tests are using `our @main::data` and not the `my @data` that holds the test file.

"listing above was posted as pseudo code and not the actual code run"

That's rather silly.

"since diff boxes have different performance depending on input_files, strings, mem., proc. etc -"

Sorry, but your machine is not 540x faster than mine. Change `q{` to `sub {` and you'll see.
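The `use strict` suggestion can be tried in isolation. The following is a rough sketch, assuming Benchmark croaks when string code fails to compile (which is what its `runloop` does in the versions I know of); the one-line `@data` and the `$ok` / `$error` names are placeholders for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Benchmark qw( timeit );

# Placeholder input; the real script reads this from a file.
my @data = ("RH_MEa0001bG06_2 710 34 10 GGGGGACACCTTCTCTCTCT...\n");

# With strictures enabled inside the string, the unqualified @data
# can no longer silently resolve to the package variable @main::data.
my $ok = eval {
    timeit(1, q{use strict; my @r = grep /GGGGGACACCTTCTCTCTCT/, @data;});
    1;
};
my $error = $ok ? '' : $@;

print $ok ? "string code compiled\n" : "string code failed:\n$error";
```

A compile failure mentioning `@data` shows that the string never had access to the `my @data` holding the file contents.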
Re^6: grep for lines containg two variables by l3v3l (Monk) on Dec 08, 2005 at 19:55 UTC

