BUU has asked for the wisdom of the Perl Monks concerning the following question:

Basically I have a semi-large file (5 MB+) that contains output separated by newlines. What I want to do is get the last $n lines that match certain criteria.
My immediate thought was to use tail, in the spirit of not reinventing the wheel. But (as far as I can tell) tail only has an option to specify the number of lines from the very bottom. Thus if the $n lines weren't contained in the first sample (default 10 lines or what not) then I would have to specify a larger number of lines from the bottom, tail -n 20 and so on until I find the $n lines I want. While this is workable I suppose, probably in the form of:
    my @lines;
    while (@lines < $n) {
        @lines = grep /criteria/, split /\n/, `tail -n $i file.foo`;
        $i += $n;
    }
But the idea of having to reparse the same lines over and over (bottom 10, then bottom 20, and so on) kind of rankles. Perhaps it's not an overwhelming problem, but it bothers me =]. And of course I would have to make repeated calls to tail, perhaps 5 or more from a single invocation. This doesn't seem to bode well for efficiency.

My second thought was that I could do something along the lines of:
    my @file = <FILEHANDLE>;
    for (@file) {
        push @lines, $_ if /criteria/;
        last if @lines > $n;
    }
But of course slurping a semi-massive file into memory is going to incur even worse penalties than using tail.

Last and probably least, I could do something along the lines of tail -f and have a daemon that constantly watches the file in question and keeps some sort of database of the last 10 lines that match my criteria. This might be the most efficient of the three, but it seems vastly more complicated, in that I have to maintain the daemon, make sure it's running and configured properly, etc. Any thoughts?

Replies are listed 'Best First'.
Re: last $n lines that match a criteria
by Anonymous Monk on Nov 17, 2003 at 07:00 UTC
    #!/usr/bin/perl -w
    use strict;
    use File::ReadBackwards;

    my @lines = ();
    my $n = 10;
    my $elif = File::ReadBackwards->new('somefile') || die $!;
    while (defined(my $line = $elif->readline())) {
        unshift @lines, $line if $line =~ /criteria/;
        last if @lines >= $n;
    }
    print @lines;
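For anyone curious what reading a file backwards involves, here is a minimal sketch using only core Perl. It is not how File::ReadBackwards is implemented, just one way to illustrate the idea: seek to the end, read fixed-size blocks towards the start, and carry the possibly incomplete first line of each block over to the next read. The sub name `last_matches_backwards`, the 16-byte block size in the demo, and the temporary test file are all made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: scan $path from the end in $blocksize chunks and
# return the last $n lines matching $re, in file order.
sub last_matches_backwards {
    my ($path, $re, $n, $blocksize) = @_;
    $blocksize ||= 8192;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $pos  = (-s $fh) || 0;   # start at the end of the file
    my $tail = '';              # partial line carried over from the later block
    my @found;
    while ($pos > 0 && @found < $n) {
        my $len = $pos < $blocksize ? $pos : $blocksize;
        $pos -= $len;
        seek $fh, $pos, 0 or die "seek failed: $!";
        my $got = read($fh, my $buf, $len);
        die "short read on $path" unless defined $got && $got == $len;
        $buf .= $tail;
        my @lines = split /^/, $buf;            # split into lines, keeping newlines
        # Unless we are at the start of the file, the first piece may be
        # the end of a line that began in the previous block: hold it back.
        $tail = $pos > 0 ? shift @lines : '';
        for my $line (reverse @lines) {
            next unless $line =~ $re;
            unshift @found, $line;              # keep file order
            last if @found >= $n;
        }
    }
    close $fh;
    return @found;
}

# Demo: a small temp file, read with a tiny block size so that lines
# straddle block boundaries.
my ($tmp, $tmpname) = tempfile(UNLINK => 1);
for my $i (1 .. 40) {
    my $text = $i % 4 ? "noise $i\n" : "match $i\n";
    print {$tmp} $text;
}
close $tmp;

my @last = last_matches_backwards($tmpname, qr/^match/, 3, 16);
print @last;
```

The loop stops as soon as $n matches are found, so on a large log where matches are common near the end, only the last few blocks are ever read.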
Re: last $n lines that match a criteria
by Zaxo (Archbishop) on Nov 17, 2003 at 06:42 UTC

    Your second example gets the first $n, not the last. Both are pretty wasteful of resources. Here's one way to do it:

    {
        local $_;
        while (<FILEHANDLE>) {
            push @lines, $_ if /criteria/;
            shift @lines if @lines > $n;
        }
    }
    I've localized $_ since while (<>) {...} does not, but that is usually not necessary.
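The same sliding-window loop can be wrapped in a sub so it is easy to try out. This is only a sketch: the name `last_matches` and the in-memory test data are invented for the example, and it uses a lexical `$line` rather than `$_`, which sidesteps the localization question.

```perl
use strict;
use warnings;

# Hypothetical helper: return the last $n lines read from $fh that match $re.
sub last_matches {
    my ($fh, $re, $n) = @_;
    my @window;
    while (my $line = <$fh>) {
        next unless $line =~ $re;
        push @window, $line;
        shift @window if @window > $n;   # drop the oldest match
    }
    return @window;
}

# Demo on an in-memory filehandle so the sketch runs without a real log file.
my $data = join '', map { $_ % 3 ? "skip $_\n" : "hit $_\n" } 1 .. 20;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @last = last_matches($fh, qr/^hit/, 3);
close $fh;
print @last;
```

Memory use stays bounded by $n lines no matter how large the file is, although every line is still read and tested once.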

    After Compline,
    Zaxo

Re: last $n lines that match a criteria
by cleverett (Friar) on Nov 17, 2003 at 08:01 UTC
    Heh, what's CPAN good for if you don't use it?
    • Read file backwards until you have 10 instances
      #!/usr/bin/perl
      use strict;
      use File::ReadBackwards;

      my @instances = ();
      my $file = File::ReadBackwards->new("/some/log/file") or die $!;
      while (@instances < 10 and defined(my $line = $file->readline)) {
          # newest match first; use unshift instead to keep file order
          push @instances, $line if $line =~ m/criteria/;
      }
    • A daemon that tails the file (needs lots more to be a real daemon)
      #!/usr/bin/perl
      use strict;
      use File::Tail;

      my @instances = ();
      my $file = File::Tail->new("/some/log/file");
      while (defined(my $line = $file->read)) {
          if ($line =~ m/criteria/) {
              push @instances, $line;
              shift @instances if @instances > 10;    # drop the oldest match
          }
      }
Re: last $n lines that match a criteria
by davido (Cardinal) on Nov 17, 2003 at 07:08 UTC
    Here is a FIFO approach:

    my @lastmatches;
    my $keep = 5;
    while ( my $line = <FILEHANDLE> ) {
        next unless $line =~ /criteria/;
        push @lastmatches, $line;
        shift @lastmatches if @lastmatches > $keep;
    }

    I don't know if shift is "expensive" from a time-critical standpoint, but where the array is never more than five elements long, it probably isn't terribly inefficient to use it in this way. I've essentially created a FIFO list that won't grow larger than five elements. It does scale pretty well though, and passed my tests.

    Or there's this grep and list slice approach:

    my @lastfive = ( grep { /criteria/ } <FILEHANDLE> ) [ -5 .. -1 ];

    UPDATE: I created a 5 MB file and used the grep method along with a list slice to gather the last five matches. On the machine I tested it with, it took about 5 seconds to grep the file using a simple regex, and that on an old beat-up 266 MHz Pentium II notebook. Again, I'm not sure how time-critical the OP's needs are, and while I know the grep method is slower than the File::ReadBackwards method, it's pretty simple, and seems to work just fine as long as it's OK to take a few seconds per 5 MB file. Here's the test snippet:

    use strict;
    use warnings;

    # Create the 5mb file.
    my @alphabet = ( "A".."Z", "a".."z", " ", "\n");
    open OUTFILE, ">file.txt" or die;
    print OUTFILE $alphabet[ rand( @alphabet) ] for 1 .. (1024 * 1024 * 5);
    close OUTFILE;

    # Find the last five occurrences of 'abc'.
    print "Testing grep method:\n";
    open IN, "file.txt" or die;
    my @lastfive = ( grep { /abc/ } <IN> ) [-5 .. -1];
    close IN;
    my $count = 5;
    print $count--, ".: ", $_ foreach @lastfive;


    Dave


    "If I had my life to live over again, I'd be a plumber." -- Albert Einstein
      ...it probably isn't terribly inefficient to use it in this way

      But it *is* inefficient to read *every* line in the file, test *every* line against the regex, and push *every* matching line onto the array and shift all but $keep matching lines back off the array. File::ReadBackwards was designed for this kind of problem.

      In your updated second example you are in fact reading the entire file into memory (something the OP wanted to avoid), and creating the entire grep list in memory (at the same time), and then skimming the final N lines off that list. If the pattern occurs on every other line, you'll actually have the entire file plus half again in memory at once.

Re: last $n lines that match a criteria
by sgifford (Prior) on Nov 17, 2003 at 08:13 UTC

    An alternative to File::ReadBackwards would be the tac(1) command:

    NAME
           tac - concatenate and print files in reverse
    

    For example:

    open(ELIF, "tac file.foo |") or die "Couldn't tac file: $!\n";
    while (<ELIF>) {
        push @lines, $_ if /criteria/;    # newest match first
        last if @lines >= $n;
    }
    close(ELIF);
Re: last $n lines that match a criteria
by Roger (Parson) on Nov 17, 2003 at 07:30 UTC
    I am a lazy programmer. I would combine the Unix grep and tail utilities to do that kind of thing with a one-liner. ;-)

    my @lines = split /\n/, `grep criteria file.foo | tail -n $n`;
    Provided what you are searching for is not *too* complicated, of course.

Re: last $n lines that match a criteria
by jmcnamara (Monsignor) on Nov 17, 2003 at 08:54 UTC

    Here is a one-liner, change 5 and the match criteria to suit.
    perl -ne '$a[($i+=1)%=5] = $_ if /foo/; END{print @a[$i+1..@a,0..$i]}' file

    This reads all the way through the file so it will be less efficient than methods that read backwards through the file.
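Spelled out at full length, the ring-buffer idea behind the one-liner might look like the sketch below. It is not a literal expansion of the one-liner: the sub name `last_matches_ring` and the test data are invented, it counts matches in a separate variable, and it assumes $n is greater than zero.

```perl
use strict;
use warnings;

# Hypothetical helper: keep the last $n matching lines in a fixed-size ring.
# Assumes $n > 0.
sub last_matches_ring {
    my ($fh, $re, $n) = @_;
    my @ring;
    my $seen = 0;                       # number of matches so far
    while (my $line = <$fh>) {
        next unless $line =~ $re;
        $ring[ $seen++ % $n ] = $line;  # overwrite the oldest slot
    }
    return @ring if $seen <= $n;        # ring never wrapped: already in order
    my $oldest = $seen % $n;            # slot holding the oldest kept match
    return @ring[ $oldest .. $n - 1, 0 .. $oldest - 1 ];
}

# Demo on an in-memory filehandle: odd-numbered lines match.
my $data = join '', map { $_ % 2 ? "foo $_\n" : "bar $_\n" } 1 .. 14;
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @last = last_matches_ring($fh, qr/foo/, 5);
close $fh;
print @last;
```

Unlike the push/shift approach, nothing is ever removed from the array; each new match simply overwrites the slot of the oldest one, and the slice at the end puts the slots back into file order.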

    --
    John.

      I think you mean ..$#a, not ..@a. This works, too:

      perl -wpe'$a[($i+=1)%=5]=$_ if /foo/} for(@a[$i+1..$#a,0..$i]){' file

        No, I meant @a because it came from some golf code and that was one character shorter. :-)

        Here is the progression of code that I took it from (note this is for tail and not the last $n matching lines but the change is minor).

        # tail
        perl -ne '$i=$.%5; $a[$i]=$_; END{print @a[$i+1..$#a,0..$i]}' file
        perl -ne 'END{print@a[$i+1..$#a,0..$i]}$a[$i=$.%5]=$_' file
        perl -ne 'END{print@a[$i+1..@a,0..$i]}$a[$i=$.%5]=$_' file
        perl -pe '$a[$i=$.%5]=$_}{print@a[$i+1..@a,0..$i]' file
        perl -pe '$_[$==$.%5]=$_}{print@_[$=+1..@_,0..$=]' file

        Also, I left out -w on purpose for cases where there were fewer matches than $n. Try this:

        echo foo | perl -wpe 'your code here' file

        That is also why I was able to get away with @a.
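A small sketch of the difference between the two range endpoints, as I understand it: inside a range, @a is evaluated in scalar context and gives the element count, which is one past the last valid index, so the slice picks up one extra undefined element. Printing that element is harmless without -w but triggers an "uninitialized value" warning with it. The three-element array and the value of $i are made up for the demonstration.

```perl
use strict;
use warnings;

my @a = ("x\n", "y\n", "z\n");
my $i = 0;

my @with_count = @a[ $i + 1 .. @a,  0 .. $i ];   # indices 1, 2, 3, 0
my @with_last  = @a[ $i + 1 .. $#a, 0 .. $i ];   # indices 1, 2, 0

# Count the undefined elements in each slice instead of printing them,
# so this script itself runs without warnings.
my $undef_count = grep { !defined $_ } @with_count;
my $undef_last  = grep { !defined $_ } @with_last;

printf "..\@a  form: %d elements, %d undefined\n", scalar(@with_count), $undef_count;
printf "..\$#a form: %d elements, %d undefined\n", scalar(@with_last),  $undef_last;
```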

        --
        John.