After visiting this node I realized that I wasn't too sure when to use grep(1), Perl, or Perl's grep.
Here are the results of my experiment. First I created a data file:
#!/usr/bin/perl -w
use strict;

my $a;
my $i;
open( OUT, ">data.txt" );
for( $i = 0; $i < 1000000; $i++ ) {
    $a = rand( 126 );
    $a = int( $a );
    until( $a > 33 ) {
        $a = rand( 126 );
        $a = int( $a );
    }
    print OUT chr( $a );
    print OUT "\n" if( $i % 80 == 0);
}
close( OUT );
Then I made some tests using the Benchmark module:
#!/usr/bin/perl
use Benchmark;

sub child_grepping {
    $a = `/bin/grep 'S;Dvg&T?sBu=\@j4qkP&O' data.txt`;
}

sub inline_grepping {
    open( IN, "data.txt" );
    @lines = <IN>;
    close( IN );
    ($a) = grep /S;Dvg&T\?sBu=\@j4qkP&O/, @lines;
}

sub child_perlizing {
    my $command = "perl -e 'while( \$line = <> ) { if( \$line =~ ";
    $command .= "/S;Dvg&T\\\?sBu=\\\@j4qkP&O/) { print \$line; last; } }' ";
    $command .= '< data.txt';
    $a = `$command`;
}

sub inline_perlizing {
    my $line;
    open( IN, "data.txt" );
    while( $line = <IN> ) {
        if( $line =~ /S;Dvg&T\?sBu=\@j4qkP&O/ ) {
            $a = $line;
            last;
        }
    }
    close( IN );
}

timethese( 1000, {
    child_grepping   => 'child_grepping()',
    child_perlizing  => 'child_perlizing()',
    inline_grepping  => 'inline_grepping()',
    inline_perlizing => 'inline_perlizing()'
} );
And then when I ran my test I got some interesting results (this took a while btw ;)
Benchmark: timing 1000 iterations of child_grepping, child_perlizing, inline_grepping, inline_perlizing...
  child_grepping:    15 wallclock secs ( 0.17 usr  0.62 sys +  7.27 cusr  7.17 csys =  0.00 CPU)
  child_perlizing:   90 wallclock secs ( 0.17 usr  0.60 sys + 75.91 cusr  6.87 csys =  0.00 CPU)
  inline_grepping:  191 wallclock secs (177.97 usr +  7.64 sys = 185.61 CPU)
  inline_perlizing:  66 wallclock secs (59.68 usr +  1.19 sys = 60.87 CPU)
Can somebody help me analyze the results?

-- Dave

Replies are listed 'Best First'.
Re: Groking grep
by Trinary (Pilgrim) on Feb 06, 2001 at 23:59 UTC
    I'm not positive on the specifics of Benchmark, but since the first two (child_grepping and child_perlizing) start separate processes, it would make sense that their CPU time wouldn't be recorded by Benchmark. That's why child_grepping appears to be so much faster than the others. child_perlizing takes so long (even though it starts a child) because it starts a separate perl interpreter for each iteration, which is horrendously inefficient.

    I think what you should take away from this is that standard regexes are the most efficient way to match stuff in general; I only use grep in certain situations where it lends itself to simpler code.

    Enjoy

    Trinary

      I took the "cusr" and "csys" values as being the Child versions of the "usr" and "sys" values that inline_grepping and inline_perlizing generated.
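
For anyone who wants to check that reading: Perl's built-in times returns four values (user, system, children's user, children's system), and the child figures only include children that have already been waited for. The sketch below is not part of the original benchmark; the child's busy loop is an arbitrary assumption, there only to give the counters something to measure.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# times() returns CPU seconds for this process and its waited-for children:
# (user, system, children's user, children's system).
my @before = times;

# Run a child perl and wait for it. $^X is the path of the running perl;
# the list form of system() avoids any shell quoting.
# The loop size is an arbitrary assumption, just enough to burn some CPU.
my $status = system( $^X, '-e', 'my $n = 0; $n += $_ for 1 .. 2_000_000;' );

my @after = times;
my @label = ( 'usr', 'sys', 'cusr', 'csys' );

# Print how much each counter grew while the child ran.
printf "%-4s %.2f\n", $label[$_], $after[$_] - $before[$_] for 0 .. 3;
```

If the child's work shows up under cusr and csys rather than usr and sys, that matches the layout of the child_grepping and child_perlizing lines in the output above.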

      -- Dave
Re: Groking grep
by Anonymous Monk on Feb 07, 2001 at 02:07 UTC
    There is always a tradeoff between file open/close operations and spawning children.

    This will vary somewhat from computer to computer. On my own machine, I got nearly identical timings for the "child_grep" and the "inline_perl" routines.

    This may also vary depending on the data set. When I used a data.txt which was only one tenth the size, the "inline_perl" routine was actually five times faster than the "child_grep". When the data file got larger, the "child_grep" became the faster choice by a reasonable margin.

    A lot of this depends on where your computer's bottlenecks are. Is the data file on your local disk, or being served over NFS? (And are the NFS files being cached?) Are you low on memory? Is your processor too slow to keep up?

    I think you will also find as a general rule that "inline_grep" is slower than "inline_perl", especially if a match occurs early in the data set and you can jump out of the "inline_perl" loop early with a "last" command.
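
A minimal sketch of why the early exit matters, using made-up in-memory data rather than the real data.txt: grep tests every element of the list even after it has found a match, while a loop with "last" stops at the first hit. The counters and the "needle" line are illustrative assumptions only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical data: 1000 lines, with the only match on the third line.
my @lines = map { "line $_\n" } 1 .. 1000;
$lines[2] = "needle\n";

# grep evaluates the block for every element, even after the first hit.
my $grep_tests = 0;
my ($via_grep) = grep { $grep_tests++; /needle/ } @lines;

# An explicit loop can leave at the first hit with "last".
my $loop_tests = 0;
my $via_loop;
for my $line (@lines) {
    $loop_tests++;
    if ( $line =~ /needle/ ) {
        $via_loop = $line;
        last;
    }
}

print "grep tested $grep_tests lines, loop tested $loop_tests lines\n";
```

Here grep tests all 1000 lines and the loop tests 3. The inline_grepping sub in the original post also reads the whole file into an array before matching, which adds to the difference.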

Re: Groking grep
by dws (Chancellor) on Feb 06, 2001 at 23:56 UTC
    Have you verified that all of the subroutines return the expected result?

    Updated: That can be a rude thing to ask, but every once in a while I get tripped up by this one in my haste to rush forward.

    Update + 1: looks like Trinary nailed it below.

      I most certainly did, and they all do. Obviously you should only generate the data file once, and find a suitable piece of text in it to search for. Then add 'print "$a\n"' as the last line of each sub to make sure that all the subs work (which they should).
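
One way to automate that check, sketched here for the two inline variants only: have each sub return its match and compare the results before timing anything. The temp file and its three lines are a hypothetical stand-in for data.txt, and the search string is shortened from the one in the original post.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for data.txt: a small temp file with one matching line.
my ( $fh, $file ) = tempfile( UNLINK => 1 );
print {$fh} "abc\n", "xx S;Dvg&T?sBu yy\n", "def\n";
close $fh or die "close: $!";

my $re = qr/S;Dvg&T\?sBu/;

# Slurp the file, then take the first element grep returns.
sub inline_grepping {
    open my $in, '<', $file or die "open: $!";
    my @lines = <$in>;
    close $in;
    my ($hit) = grep { $_ =~ $re } @lines;
    return $hit;
}

# Read line by line and stop at the first match.
sub inline_perlizing {
    open my $in, '<', $file or die "open: $!";
    my $hit;
    while ( my $line = <$in> ) {
        if ( $line =~ $re ) { $hit = $line; last; }
    }
    close $in;
    return $hit;
}

my %result = (
    inline_grepping  => inline_grepping(),
    inline_perlizing => inline_perlizing(),
);
for my $name ( sort keys %result ) {
    printf "%-17s %s", $name, $result{$name} // "(no match)\n";
}
die "subs disagree\n"
    unless $result{inline_grepping} eq $result{inline_perlizing};
```

The child-process variants could be added to %result in the same way, provided grep and perl are available on the machine running the check.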

      -- Dave