in reply to push to an array lines matching a pattern, and the lines before and after the matching

Re your desire to capture a match-line and two lines around it, pseudocode:

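my $file = 'path/to/filename.txt';
open( my $filehandle, '<', $file ) or die "Can't open $file: $!";
my ( @array, $buffer );
while ( my $line = <$filehandle> ) {
    if ( $line =~ /pattern/ ) {
        push @array, $buffer if defined $buffer;    # the line before the match
        push @array, $line;                         # the matching line
        $line = <$filehandle>;                      # read another $line
        push @array, $line if defined $line;        # push it too
        # and now, except for a match on the first line of a file,
        # or the last, we have 3 lines captured
    }
    $buffer = $line;    # remember this line, then do it all again...
}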

Re^2: push to an array lines matching a pattern, and the lines before and after the matching
by LanX (Saint) on Mar 03, 2012 at 00:04 UTC
    Technically, I wouldn't be surprised if using seek to look behind in a sliding window is more efficient than buffering the last line.¹

    But it's hard to believe that there isn't already a full-featured unix-grep emulation in pure Perl, and we are reinventing the wheel.

    Have to admit, I'm too lazy to search CPAN now ...

    Cheers Rolf

    UPDATE: ¹) and a regex even more.

      But it's hard to believe that there isn't already a full-featured unix-grep emulation in pure Perl, and we are reinventing the wheel.
      It's called ack. And it is a reinvention of the wheel. ack's usefulness doesn't come from the fact that it's implemented in Perl, but from having some features that grep doesn't have, and some better defaults.

      In the OP's case, he's better off calling grep than ack. (grep is expected to be faster, as it's written in C, not in Perl.)
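
For what it's worth, a minimal sketch of calling grep from Perl and collecting each match plus its surrounding lines. It assumes a grep that supports -C (GNU and BSD grep do); the temp file, its contents, the pattern HIT and the variable names are made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample file so the sketch is self-contained
my ( $tmp, $file ) = tempfile( UNLINK => 1 );
print {$tmp} "a\nb\nHIT\nc\nd\n";
close $tmp or die "Can't close $file: $!";

my $n = 1;    # lines of context on each side of a match

# List-form pipe open: no shell is involved, so the pattern needs no quoting
open( my $grep, '-|', 'grep', '-C', $n, '--', 'HIT', $file )
    or die "Can't run grep: $!";
my @context = <$grep>;
close $grep;    # grep exits with status 1 when nothing matched

print @context;    # the line before, the match, the line after
```

Changing $n changes the number of rows captured without touching any Perl buffering code.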

        Thanks! I remember now...

        Great name BTW, easy to guess! (sarcasm)

        At least man -k grep lists it.

        Cheers Rolf

        thanks for the ack suggestion, but it is slower than the "pure perl" implementation when using anchors or \b, \s, etc.
        sub sub6 {    # perl ack
            my $p  = $_[0];    # pattern
            my $mR = $_[1];    # more rows
            my $time = [gettimeofday];
            my @valori = qx(ack -C $mR "$p" textMatchInAfile.txt)
                or die "ack failed: $?";
            say 'number of values found with ack: ' . scalar @valori;
            say 'time sub6 ack: ' . tv_interval($time);
        }
Re^2: push to an array lines matching a pattern, and the lines before and after the matching
by gianni (Novice) on Mar 04, 2012 at 10:20 UTC
    in reply to
    "Re your desire to capture a match-line and two lines around it, pseudocode:
    ...
    thanks, but it is too slow and it doesn't make it easy to change the number of rows captured
      gianni...
      "...it is too slow"

      Yes, it may be, particularly since it's only pseudocode. So I'd be more confident about ruling out such an approach if I knew "how did you test it?" You may be -- even probably are -- right, but I'm not buying that until I see how you translated the pseudocode into compilable code... and what your benchmarks look like.

      Somewhat similarly, your statement that ack "is slower than the "pure perl" implementation when using anchors or \b,\s etc" is hard to credit. I see your timing code in "sub6," but not the quantified results, nor any test results on the system (external) ack (see JavaFan's Re^3: push to an array lines matching a pattern, and the lines before and after the matching) or grep. It's all too easy to implement timings in a manner that gives misleading results.
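
One way to get timings that are easier to trust is the core Benchmark module, running each candidate many times on identical data. A sketch only: the data, the pattern and the two subs below are invented for illustration and are not the code being discussed in this thread.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);

# Hypothetical test data: 50_000 lines, every tenth one matching
my @data = map { $_ % 10 ? "filler line $_\n" : "sala line $_\n" } 1 .. 50_000;
my $re   = qr/\bsala\b/;

# Both candidates search the same in-memory data with the same pattern
sub count_loop {
    my $hits = 0;
    for my $line (@data) { $hits++ if $line =~ $re }
    return $hits;
}

# Perl's builtin grep (not the external program), counted in scalar context
sub count_grep { return scalar grep { $_ =~ $re } @data }

# Check that the candidates agree before comparing their speed
my ( $loop_hits, $grep_hits ) = ( count_loop(), count_grep() );
die "results differ: $loop_hits vs $grep_hits" unless $loop_hits == $grep_hits;

# Run each candidate 20 times and print the timings
timethese( 20, { loop => \&count_loop, grep => \&count_grep } );
```

Verifying that both candidates return the same result first rules out the case where one of them is "faster" only because it does less work.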

      And as to " chang(ing) the number of rows captured," that requires additional buffers (or, if you're slurping the entire file into an array, an index of array elements you've checked.) The extra buffers, and moving the data thru them would certainly slow the process... but again, I'd like to see a benchmark. OTOH, that may be better solved by tweaking LanX's observation about using seek in a sliding window.
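
A rough sketch of the extra-buffer idea with an adjustable number of context rows. grep_context and its arguments are hypothetical names, not code from this thread, and it has not been benchmarked against anything.

```perl
use strict;
use warnings;

# Hypothetical helper: return every line matching $re plus up to $n lines
# of context on each side, without repeating lines when windows overlap.
sub grep_context {
    my ( $fh, $re, $n ) = @_;
    my ( @out, @before );
    my $after = 0;    # trailing context lines still owed
    while ( my $line = <$fh> ) {
        if ( $line =~ $re ) {
            push @out, @before, $line;    # buffered lines, then the match
            @before = ();
            $after  = $n;
        }
        elsif ( $after > 0 ) {
            push @out, $line;             # trailing context
            $after--;
        }
        else {
            push @before, $line;          # possible leading context
            shift @before while @before > $n;
        }
    }
    return @out;
}

# Usage with an in-memory file, so the sketch runs on its own
my $text = "a\nb\nHIT\nc\nd\ne\n";
open( my $fh, '<', \$text ) or die "Can't open in-memory file: $!";
my @lines = grep_context( $fh, qr/HIT/, 1 );
print @lines;    # the line before, the match, the line after
```

The @before buffer never holds more than $n lines, so memory use stays flat regardless of file size; only the number passed as the third argument changes when more rows are wanted.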

        sorry, I updated only the first post with the correct timings
        ack from the command line, takes 15 seconds
        this code, simpler than your pseudo code, takes 10 seconds
        use 5.014;
        use warnings;
        use Time::HiRes qw(gettimeofday tv_interval);

        my @array;
        # single-quoted, so \$ reaches the regex as an escaped (literal) dollar sign
        my $pattern = '\bsala|che|relazione|di|questo|coso|^qui\$';
        open( my $filehandle, '<', 'textMatchInAfile.txt' )
            or die "Can't open textMatchInAfile.txt: $!";
        my $time = [gettimeofday];
        while (<$filehandle>) {
            if ( $_ =~ /$pattern/ ) {
                push @array, $_;    # keep the matching line
            }
        }
        say 'time while' . tv_interval($time);