
I wanted to check here and see if this seems like a good use-case for Perl

Certainly, text processing is one of the things Perl is great at, and AFAIK it was heavily inspired by awk. If you want to get started with Perl, there are lots of good places, like perlintro, Tutorials, http://learn.perl.org, and lots of books.

I'm not an awk expert, but I think this is a pretty direct translation of your program (the automated translator script that comes with Perl, a2p, is giving me strange results and I haven't looked into that yet):

#!/usr/bin/env perl
use warnings;
use strict;

my $timestamp;
my $entryline;
my $timestamp_regex = qr/^#([[:digit:]]{10})$/;
my $exclusion_regex = qr/^(?:ls?|man|cat)\b/;
my $state = "begin";

while (<>) {
    if ($state eq "begin") {
        if (/$timestamp_regex/) {
            $timestamp = $_;
            $state = "readtimestamp";
        }
        else {
            print;
            $state = "printedline";
        }
    }
    elsif ($state eq "printedline") {
        if (/$timestamp_regex/) {
            $timestamp = $_;
            $state = "readtimestamp";
        }
        else {
            print;
            $state = "printedline";
        }
    }
    elsif ($state eq "readtimestamp") {
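        # $timestamp holds the whole line ("#digits\n"), so skip the '#' before comparing numerically
        if (/$timestamp_regex/ && $1 >= substr($timestamp, 1)) {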
            $timestamp = $_;
            $state = "readtimestamp";
        }
        elsif (/$exclusion_regex/) {
            $entryline = $_;
            $state = "readentryline";
        }
        else {
            print $timestamp;
            print;
            $state = "printedline";
        }
    }
    elsif ($state eq "readentryline") {
        if (/$timestamp_regex/) {
            $entryline = "";
            $timestamp = $_;
            $state = "readtimestamp";
        }
        else {
            print $timestamp;
            print $entryline;
            print;
            $state = "printedline";
        }
    }
}

Note that I had to make a few tweaks to the regexes, and I'm not sure whether I translated the intent of "if ($0 ~ timestamp_regex && $0 >= timestamp)" correctly. (Update: $1 is a special variable that holds the text matched by the first set of capturing parentheses, see e.g. perlretut.) Also, your original code could be reduced a little: your states begin and printedline are identical, so you could drop begin and start in printedline, and you don't need to assign to state in every branch.
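To illustrate the reduction suggested above, here is a minimal sketch of the same state machine with the begin state dropped and state assignments only where the state actually changes. The sub name filter_history is hypothetical (it is not in the original program), the sub collects its output in a list instead of printing so it is easy to test, and the timestamp comparison uses the string operator ge on the whole line, which is an assumption about what the awk comparison was meant to do.

```perl
use warnings;
use strict;

# Hypothetical helper (not part of the original post): takes the history
# lines and returns the lines that the state machine would have printed.
sub filter_history {
    my @lines = @_;
    my $timestamp_regex = qr/^#([[:digit:]]{10})$/;
    my $exclusion_regex = qr/^(?:ls?|man|cat)\b/;
    my ($timestamp, $entryline, @out);
    my $state = "printedline";    # "begin" behaved identically, so start here

    for (@lines) {
        if ($state eq "printedline") {
            if (/$timestamp_regex/) {
                $timestamp = $_;
                $state = "readtimestamp";
            }
            else {
                push @out, $_;    # state stays "printedline"
            }
        }
        elsif ($state eq "readtimestamp") {
            # Assumption: string comparison of the whole line, as awk would do
            if (/$timestamp_regex/ && $_ ge $timestamp) {
                $timestamp = $_;  # state stays "readtimestamp"
            }
            elsif (/$exclusion_regex/) {
                $entryline = $_;
                $state = "readentryline";
            }
            else {
                push @out, $timestamp, $_;
                $state = "printedline";
            }
        }
        else {    # "readentryline"
            if (/$timestamp_regex/) {
                $timestamp = $_;
                $state = "readtimestamp";
            }
            else {
                push @out, $timestamp, $entryline, $_;
                $state = "printedline";
            }
        }
    }
    return @out;
}
```

To use it as a filter, something like print filter_history(<>); should behave like the longer version, with the same caveat that a trailing entry consisting of a single excluded command is dropped.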

But anyway, when you switch to Perl, you get lots of powerful tools at your disposal, and with them more ways to solve the same problem. While the above is certainly one way to write your code, one of Perl's mottos is There Is More Than One Way To Do It (TIMTOWTDI, "tim toady"), so here are two more ways to implement it. The first one is how I might have written it. (The trick with eof is just for a bit of code reduction; the alternative is to repeat the code that checks and prints @buf after the loop ends.) Disclaimer: I haven't run the following code through a whole lot of test cases, mostly just the samples you provided, so I may have missed some edge cases. My motivation for showing this code is not to say these versions are "better" but to demonstrate TIMTOWTDI and different ways of approaching the problem. (Update: Note I assume the format of the input file never varies from the format you showed, i.e. each entry is #timestamp\n followed by one or more lines.)

use warnings;
use strict;

my $exclre = qr/^(?:ls?|man|cat)\b/;

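my @buf;    # lines of the current entry, starting with its timestamp line
while (<>) {
    # A timestamp line (or the end of input) ends the current entry.
    if (/^#\d+$/ || eof()) {
        push @buf, $_ if eof();    # the last line still belongs to the entry
        # Keep entries with more than one command, or whose only command is not excluded.
        if (@buf>2 || @buf==2 && $buf[1]!~$exclre) {
            print @buf;
        }
        @buf=();
    }
    push @buf, $_;
}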
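For comparison, here is a rough sketch of the alternative mentioned above, where the check on @buf is repeated after the loop instead of using eof. The sub names keep_entry and filter_entries are hypothetical, and the code returns the kept lines rather than printing them so that it can be tested; it has been checked only against the kind of input described in the thread.

```perl
use warnings;
use strict;

my $exclre = qr/^(?:ls?|man|cat)\b/;

# Hypothetical helper: returns the buffered entry if it should be kept,
# otherwise an empty list.
sub keep_entry {
    my @buf = @_;
    return @buf if @buf > 2 || ( @buf == 2 && $buf[1] !~ $exclre );
    return;
}

# Hypothetical helper: takes all history lines, returns the lines to keep.
sub filter_entries {
    my ( @out, @buf );
    for my $line (@_) {
        if ( $line =~ /^#\d+$/ ) {    # a timestamp starts a new entry
            push @out, keep_entry(@buf);
            @buf = ();
        }
        push @buf, $line;
    }
    push @out, keep_entry(@buf);      # flush the last entry (replaces the eof trick)
    return @out;
}
```

Used as a filter this would be print filter_entries(<>);. The cost of avoiding eof is the extra sub (or a duplicated condition), which is why the version above folds both cases into the loop.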

Here's a solution that gets clever with the input record separator $/ and regular expressions (perhaps too clever, since it assumes that "\n#" won't occur in other places in the history file):

use warnings;
use strict;

my $exclre = qr/(?:ls?|man|cat)\b/;

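$/ = "\n#";    # read one whole entry per record instead of one line
while (<>) {
    s/\n#?\z//;    # strip the trailing separator (or the final newline)
    # Print the entry, restoring the '#' that the separator consumed,
    # unless it consists of a timestamp and a single excluded command.
    print /\A#/?():'#', $_, "\n"
        unless /\A#?\d+\n$exclre(?!.*\n)/;
}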
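To make the effect of the input record separator more concrete, here is a small sketch that only does the splitting step: it reads a string through an in-memory filehandle with $/ set to "\n#" and returns one record per history entry. The sub name split_records is hypothetical, and the same assumption applies as above, namely that "\n#" only ever occurs at the start of a timestamp line.

```perl
use warnings;
use strict;

# Hypothetical helper: split a history string into one record per entry.
sub split_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "\n#";    # a record ends at a newline followed by '#'
    my @records;
    while ( my $rec = <$fh> ) {
        $rec =~ s/\n#?\z//;                     # drop the separator or final newline
        $rec = "#$rec" unless $rec =~ /\A#/;    # restore the '#' the separator consumed
        push @records, $rec;
    }
    close $fh;
    return @records;
}
```

Each returned record is then a complete entry (timestamp line plus its command lines, without the trailing newline), which is what allows a single regex to decide whether the whole entry should be dropped.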

In reply to Re: Filtering certain multi-line patterns from a file by haukex
in thread Filtering certain multi-line patterns from a file by ivanbrennan
