gianni has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I would like to do pattern matching on a file (about 200 megabytes) and push into an array the matching lines, plus an arbitrary number of lines before and after each matching line.
sub1 takes 11 seconds.
sub3, which shells out to unix egrep, takes 1 second.
sub6 (ack) takes 50 seconds (it is faster if you don't use \b, \s, anchors, etc.).
ack run from the command line takes 15 seconds.
Is there any way to speed up sub1?
It seems that Perl's grep is much slower than the unix one.
thanks
use 5.014;
use strict;
use warnings;
use Time::HiRes qw(usleep ualarm gettimeofday tv_interval);
use List::MoreUtils qw(uniq);
###################################################
open FILE, '<', 'textMatchInAfile.txt' or die;
my $p  = '\bsala|che|relazione|di|questo|coso|^qui$';
my $mR = 1;    # print more rows before - after the matching
my @n  = <FILE>;
&sub1( $p, $mR, @n );    # suggestion: pass references instead
&sub3( $p, $mR );
###################################################
sub sub1 {    # this sub uses Perl's grep
    my $p  = $_[0];             # pattern
    my $mR = $_[1];             # more rows
    my @n  = @_[ 2 .. $#_ ];    # input file
    my $time = [gettimeofday];
    my @new  = grep { $n[$_] =~ /$p/ } 0 .. $#n;
    my @unique = map { @n[ $_ - $mR .. $_ + $mR ] }
        @new[ 0 + $mR .. $#new - $mR ];
    say "\n" . 'time sub1 perl grep: ' . tv_interval($time);
    @unique = uniq(@unique);
    say "sub 1 $#unique";
}
#############################################
sub sub3 {    # unix grep with color and line numbers
    my $p   = $_[0];
    my $mR  = $_[1];
    my $cmd = "grep -n -C $mR";    # with line numbers
    $p =~ s/\|/ /g;
    $p =~ s/\h+/" -e "/g;
    $p = ' -e "' . $p . '" ';
    say "cmd ===$cmd=== ss ===$p===";
    my @values;
    $values[0] = $p;
    $values[1] = ' ' . 'textMatchInAfile.txt';    # keep the leading space
    my $time   = [gettimeofday];
    my @valori = `$cmd @values` or die "system @values failed: $?";
    say 'sub3 egrep shell: ' . $#valori;
    say 'time sub3, matches found with egrep shell: ' . tv_interval($time);
    my @uniq_list = uniq(@valori);
}
#############################################
sub sub6 {    # perl ack
    my $p  = $_[0];    # pattern
    my $mR = $_[1];    # more rows
    my $time   = [gettimeofday];
    my @valori = qx(ack -C $mR "$p" textMatchInAfile.txt)
        or die "ack failed: $?";
    say 'number of values found with ack: ' . $#valori;
    say 'time sub6 ack: ' . tv_interval($time);
}
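One way to answer the OP's question in pure Perl: instead of slurping the whole 200 MB file and grepping over its indices as sub1 does, stream it line by line with a small look-behind buffer. The sketch below is my own rewrite, not code from the thread; the sub name grep_with_context is invented, while the open style and $mR parameter follow the OP's script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sketch of a streaming alternative to sub1: keep a sliding buffer of the
# last $mR lines rather than holding the whole file in memory.
sub grep_with_context {
    my ( $file, $pattern, $mR ) = @_;
    my $re = qr/$pattern/;    # compile the pattern once, outside the loop

    open my $fh, '<', $file or die "Cannot open $file: $!";

    my @before;       # up to $mR lines preceding the current one
    my $after = 0;    # trailing context lines still owed after a match
    my @out;

    while ( my $line = <$fh> ) {
        if ( $line =~ $re ) {
            push @out, splice @before;    # flush the look-behind buffer
            push @out, $line;             # the matching line itself
            $after = $mR;                 # start collecting trailing context
        }
        elsif ( $after > 0 ) {
            push @out, $line;
            $after--;
        }
        else {
            push @before, $line;
            shift @before if @before > $mR;    # keep the buffer at $mR lines
        }
    }
    close $fh;
    return @out;
}
```

Because context lines are consumed as they are emitted, overlapping matches never produce duplicates, so the uniq() pass becomes unnecessary, and $mR can be any size without changing the code.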

Replies are listed 'Best First'.
Re: push to an array lines matching a pattern, and the lines before and after the matching
by JavaFan (Canon) on Mar 02, 2012 at 21:40 UTC
    Your Perl solution slurps in the entire file, one line per element, copies the entire contents to @_, and then copies it once more into the lexical @n inside the sub.

    Unix grep doesn't do any of that.

    I would not be ashamed to call out to an external program. No need to reinvent the wheel, even if the wheel isn't found on CPAN.

      Perl's grep is the slow part... and yes, unix grep is 10 times faster.
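      A sketch of JavaFan's "call out to an external program" suggestion, done a little more safely than the backtick interpolation in the OP's sub3: the list form of a piped open passes the pattern to grep as a single argument, so the shell never re-parses it. The sub name external_grep is my own invention; -E selects egrep-style alternation and -C adds context lines.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Run grep -E -C $context $pattern $file and collect its output lines.
# The list form of open bypasses the shell, so metacharacters in the
# pattern (|, $, \b ...) need no quoting gymnastics.
sub external_grep {
    my ( $file, $pattern, $context ) = @_;
    open my $grep, '-|', 'grep', '-E', '-C', $context, $pattern, $file
        or die "Cannot run grep: $!";
    my @lines = <$grep>;
    close $grep;
    return @lines;
}
```

      Note that with -C, grep inserts "--" separator lines between non-adjacent match groups, which would need filtering out before any uniq() pass.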
Re: push to an array lines matching a pattern, and the lines before and after the matching
by ww (Archbishop) on Mar 02, 2012 at 23:54 UTC

    Re your desire to capture a match-line and two lines around it, pseudocode:

    open( my $fh, "<", "path/to/filename.txt" ) or die "Can't open file: $!";
    my $buffer = '';
    while ( my $line = <$fh> ) {
        if ( $line =~ /pattern/ ) {
            push @array, $buffer;    # the line before the match
            push @array, $line;      # the matching line
            my $next = <$fh>;        # read the line after the match
            push @array, $next if defined $next;
            # except for a match on the first or last line of the
            # file, we now have 3 lines captured
        }
        $buffer = $line;             # remember the previous line
    }
      Technically, I wouldn't be surprised if using seek to look behind in a sliding window is more efficient than buffering the last line.¹

      But it's hard to believe that there isn't already a full-featured unix-grep emulation in pure Perl, and that we are reinventing the wheel.

      Have to admit, I'm too lazy to search CPAN now ...

      Cheers Rolf

      UPDATE: ¹) and a regex even more.

        But it's hard to believe that there isn't already a full-featured unix-grep emulation in pure Perl, and that we are reinventing the wheel.
        It's called ack. And it is a reinvention of the wheel. ack's usefulness doesn't come from the fact that it's implemented in Perl, but from the fact that it has some features grep doesn't have, and some better defaults.

        In the OP's case, he's better off calling grep than ack (grep is expected to be faster, as it's written in C, not in Perl).

      In reply to "Re your desire to capture a match-line and two lines around it, pseudocode: ...":
      thanks, but it is too slow, and it doesn't easily allow changing the number of rows captured
        gianni...
        "...it is too slow"

        Yes, it may be, particularly since it's only pseudocode. So I'd be more confident about ruling out such an approach if I knew how you tested it. You may be -- even probably are -- right, but I'm not buying that until I see how you translated the pseudocode into compilable code... and what your benchmarks look like.

        Somewhat similarly, your statement that ack "is slower than the 'pure perl' implementation when using anchors or \b, \s, etc." is hard to credit. I see your timing code in sub6, but not the quantified results, nor any test results on the system (external) ack (see JavaFan's Re^3: push to an array lines matching a pattern, and the lines before and after the matching) or grep. It's all too easy to implement timings in a manner that gives misleading results.

        And as to "chang(ing) the number of rows captured," that requires additional buffers (or, if you're slurping the entire file into an array, an index of the array elements you've checked). The extra buffers, and moving the data through them, would certainly slow the process... but again, I'd like to see a benchmark. OTOH, that may be better solved by tweaking LanX's observation about using seek in a sliding window.