PerlMonks  

Re: grep for lines containg two variables

by ikegami (Patriarch)
on Dec 07, 2005 at 16:29 UTC


in reply to grep for lines containg two variables

@interesting_lines = grep /$string1/, grep /$string2/, @log;

or

@interesting_lines = grep /$string1/ && /$string2/, @log;

or

@interesting_lines = grep /^(?=.*$string1)(?=.*$string2)/, @log;

OT Note: If your strings contain plain text rather than regex patterns, be sure to escape them using quotemeta or /\Q$string\E/. Better yet, use index instead of regexes in that case, since it's much faster.
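To illustrate the note above, here is a minimal sketch. The sample @log lines and the two search strings are made up for illustration; only the three filtering styles come from the post.

```perl
use strict;
use warnings;

# Hypothetical sample data: the search text contains a regex metacharacter ('.').
my @log = ("GET /a.b ok\n", "GET /aXb ok\n", "POST /a.b fail\n");
my ($string1, $string2) = ('a.b', 'ok');

# Unescaped: '.' matches any character, so "/aXb" is accepted too.
my @loose = grep /$string1/ && /$string2/, @log;

# Escaped with \Q...\E: the strings are matched as literal text.
my @quoted = grep /\Q$string1\E/ && /\Q$string2\E/, @log;

# index: plain substring search, no regex engine involved.
# index returns -1 when the substring is absent.
my @indexed = grep { index($_, $string1) >= 0 && index($_, $string2) >= 0 } @log;

printf "loose=%d quoted=%d indexed=%d\n",
    scalar @loose, scalar @quoted, scalar @indexed;
```

The unescaped version picks up the extra "/aXb" line, while the quoted and index versions agree with each other.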

Re^2: grep for lines containg two variables
by l3v3l (Monk) on Dec 07, 2005 at 23:35 UTC
    seems that lookahead grep is the slightly faster solution ... here is the test script I used:
    #!/usr/bin/perl -w
    # usage : ./this_script.pl < input_file > captured_benchmarks
    use strict;
    use Benchmark;
    my @data=<>;
    my (@res1,@res2,@res3);
    timethese (100000000, {
        grep_and => q{
            @res1 = grep /GGGGGACACCTTCTCTCTCT/ && /RH_MEa0001bG06/, @data;
        },
        double_grep => q{
            @res2 = grep /GGGGGACACCTTCTCTCTCT/,grep /RH_MEa0001bG06/,@data;
        },
        lookahead_grep => q{
            @res3 = grep /^(?=.*GGGGGACACCTTCTCTCTCT)(?=.*RH_MEa0001bG06)/,@data;
        }
    } );
    ... and the results
    Benchmark: timing 100000000 iterations of double_grep, grep_and, lookahead_grep...
    double_grep : 27 wallclock secs (26.98 usr + 0.00 sys = 26.98 CPU) @ 3705899.79/s (n=100000000)
    grep_and : 24 wallclock secs (23.05 usr + 0.00 sys = 23.05 CPU) @ 4338959.52/s (n=100000000)
    lookahead_grep : 24 wallclock secs (22.83 usr + 0.00 sys = 22.83 CPU) @ 4380585.25/s (n=100000000)

      I wish I had a computer that executed 4380585.25 greps per second...

      Your test is useless. The @data, $string1 and $string2 seen by the test are always empty or undef. I fixed it up below. Note the use of sub { ... } instead of q{ ... }. Subs close over my variables, while the string is evaled in a different scope where the my variables don't exist.

      #!/usr/bin/perl
      use strict;
      use warnings;
      use Benchmark qw( cmpthese );
      my $string1 = qr/tr/;
      my $string2 = qr/e/;
      my @data = do { open(my $fh, '<', $0) or die; <$fh> };
      cmpthese (-3, {
          grep_and => sub {
              my @r = grep /$string1/ && /$string2/, @data;
              return @r;
          },
          double_grep => sub {
              my @r = grep /$string1/, grep /$string2/, @data;
              return @r;
          },
          lookahead => sub {
              my @r = grep /^(?=.*$string1)(?=.*$string2)/, @data;
              return @r;
          }
      });
      outputs
                     Rate lookahead double_grep grep_and
      lookahead    8114/s        --        -52%     -62%
      double_grep 16986/s      109%          --     -21%
      grep_and    21483/s      165%         26%       --

      The contents of @data are probably not all that good, so the figures aren't perfect, but they give a pretty good idea.
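A stripped-down sketch of the q{ ... } versus sub { ... } scoping issue described in this reply. The Runner package is a hypothetical stand-in for a module that runs code handed to it; it is not Benchmark's actual implementation, only the same kind of string eval happening in a scope that cannot see the caller's my variables.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a module that runs code on the caller's behalf.
package Runner;

# Evaluates a code string here, where the caller's "my" variables are not in scope.
sub run_string {
    my ($code) = @_;
    no strict 'vars';    # let @data resolve to an (empty) package variable
    my @r = eval $code;
    return scalar @r;
}

# Calls a code ref; the closure carries the caller's lexicals with it.
sub run_sub {
    my ($code) = @_;
    my @r = $code->();
    return scalar @r;
}

package main;

# Declared after Runner's subs, so the string eval above cannot see it.
my @data = ("tree\n", "bush\n");

my $n_string = Runner::run_string(q{ grep /tr/, @data });
my $n_sub    = Runner::run_sub(sub { grep /tr/, @data });

print "string eval matched $n_string line(s), sub matched $n_sub line(s)\n";
```

The string version greps an empty array, so it measures next to nothing, while the closure sees the real @data.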

        I am not sure I understand your points - so to clarify:

        The listing above was posted as pseudocode, not the actual code I ran. In the test case I put the literal strings of interest in place of $string1 and $string2 (and have updated the post to illustrate). I used what I found interesting for my specific situation, and left it generic when posting the code example above because I thought that would be more useful. I understand your point re: my $str..., and it now works properly if I use your qr update (and I get the same times). Is this not a useful or valid benchmark?

        my input file was:

        RH_MEa0001bA09_1 1253 871 10 GGAGAGGGGTCGAATTTCTC...
        RH_MEa0001bB03_1 553 104 12 GTCCGTTGCAACAAAAGTGA...
        RH_MEa0001bC11_1 1160 385 12 TGGGGTTGAAGAAAGGTTNG...
        RH_MEa0001bG06_1 710 14 18 Invalid starting position (14)
        RH_MEa0001bG06_2 710 34 10 GGGGGACACCTTCTCTCTCT...
        RH_MEa0001bG06_3 710 51 10 GGGGGACACCTTCTCTCTCT...
        etc

        Since different boxes perform differently depending on input files, strings, memory, processor, etc., I guess it is better to list relative results instead of specifics. Is it just luck that your results confirm the general reason I posted, that lookahead grep is the fastest (accurate) solution?

Re^2: grep for lines containg two variables
by tweetiepooh (Hermit) on Dec 08, 2005 at 14:12 UTC
    Just a couple of queries...

    What if one string is contained in the other but should not be counted?
    Ok so add white space checks/breaks to the search strings.

    Do you need to lookahead on both strings? This seems to work.
    @interesting_lines = grep /(?=.*$string1).*$string2/,@log;
      What if one string is contained in the other but should not be counted?

      That's tricky. Do you really want that? I don't have the time right now to spend the effort on a what-if you won't use.

      Do you need to lookahead on both strings?

      No, it's optional on the last one. If you had 4 strings, you'd need the lookahead on the first three, but it would be optional on the fourth.
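A rough sketch of that rule for any number of strings: lookaheads for all but the last, then a plain match for the last one. The helper name all_of_regex and the sample lines are made up for illustration, and the strings are assumed to be literal text, so they are passed through quotemeta.

```perl
use strict;
use warnings;

# Hypothetical helper: builds one regex that requires every given string
# to appear somewhere in the line, in any order.
sub all_of_regex {
    my @strings = @_;
    die "need at least one string\n" unless @strings;

    # The last string needs no lookahead; it is matched directly.
    my $last = quotemeta pop @strings;

    # One anchored lookahead per remaining string.
    my $lookaheads = join '', map { '(?=.*' . quotemeta($_) . ')' } @strings;

    return qr/^${lookaheads}.*$last/;
}

my $re  = all_of_regex(qw( alpha beta gamma delta ));
my @log = ("delta gamma beta alpha\n", "alpha beta gamma\n");
my @interesting_lines = grep { $_ =~ $re } @log;

print scalar(@interesting_lines), " line(s) contain all four strings\n";
```

Only the first sample line contains all four strings, so it is the only one kept.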

        Nah I don't want any of it, just commenting.

        If you are looking for whole words then you could use spaces around your searches, but otherwise it would be a real pain, I agree.
      What if one string is contained in the other but should not be counted?

      How about

      @interesting_lines = grep /$string1.*$string2|$string2.*$string1/, @log;
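A small sketch of how this alternation differs from the lookahead version when one string is contained in the other. The strings 'foo' and 'foobar' and the sample lines are made up for illustration, and \Q...\E is added on the assumption that the strings are literal text.

```perl
use strict;
use warnings;

# Hypothetical case: $string1 is a substring of $string2.
my ($string1, $string2) = ('foo', 'foobar');
my @log = ("foobar\n", "foobar and foo\n", "foo only\n");

# Alternation: the two strings must occur as separate, non-overlapping matches.
my @either_order =
    grep /\Q$string1\E.*\Q$string2\E|\Q$string2\E.*\Q$string1\E/, @log;

# Lookaheads: each string is checked independently, so overlap is allowed.
my @lookahead = grep /^(?=.*\Q$string1\E)(?=.*\Q$string2\E)/, @log;

printf "either_order=%d lookahead=%d\n",
    scalar @either_order, scalar @lookahead;
```

The line containing only "foobar" satisfies both lookaheads but not the alternation, since the lone "foo" inside "foobar" is not counted twice.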
