ivanbrennan has asked for the wisdom of the Perl Monks concerning the following question:

I wrote a script to filter "uninteresting" commands (ls, cat, man) from my .bash_history, because I wanted them included in the current session's history but not persisted for future sessions (using Bash's HISTIGNORE variable would exclude them from both).

I've configured Bash to save multiline history entries with embedded newlines; each entry is preceded by a Unix-timestamp line, like:

#1501293767
foo() {
    echo foo
}
#1501293785
ls

I wanted to remove the "uninteresting" single-line entries, but keep all multiline entries. I figure if a command was complex enough to warrant multiple lines, it's worth remembering. So, for example, this entry should be removed:

#1501293785
cat afile

whereas this (somewhat contrived) entry should be kept:

#1501293785
cat afile | while read -r line; do
    echo "line: " $line
done

I implemented it as a finite-state machine using Awk, and was impressed with its performance. It processes a 50,000 line file in about 70 milliseconds. My .bash_history is unlikely to grow beyond 25,000 lines, so that's great, especially since I trigger this in the background when exiting the shell.

Nonetheless, I'm curious whether Perl might be a better tool for the job. The Awk code is not particularly elegant, and I've heard Perl is a performant scripting language. I've never written any though, so I wanted to check here and see if this seems like a good use-case for Perl.

I'm not necessarily asking how to translate this into Perl, though I'm open to doing so, but wondering if Perl offers other approaches to solving this problem.

A graph of the finite-state machine can be seen here: https://i.stack.imgur.com/fLG4K.png

For reference here's the Awk code:

BEGIN {
    timestamp = ""
    entryline = ""
    timestamp_regex = "^#[[:digit:]]{10}$"
    exclusion_regex = "^(ls?|man|cat)$"
    state = "begin"
}
{
    if (state == "begin") {
        if ($0 ~ timestamp_regex) {
            timestamp = $0
            state = "readtimestamp"
        } else {
            print
            state = "printedline"
        }
    } else if (state == "printedline") {
        if ($0 ~ timestamp_regex) {
            timestamp = $0
            state = "readtimestamp"
        } else {
            print
            state = "printedline"
        }
    } else if (state == "readtimestamp") {
        if ($0 ~ timestamp_regex && $0 >= timestamp) {
            timestamp = $0
            state = "readtimestamp"
        } else if ($1 ~ exclusion_regex) {
            entryline = $0
            state = "readentryline"
        } else {
            print timestamp
            print
            state = "printedline"
        }
    } else if (state == "readentryline") {
        if ($0 ~ timestamp_regex) {
            entryline = ""
            timestamp = $0
            state = "readtimestamp"
        } else {
            print timestamp
            print entryline
            print
            state = "printedline"
        }
    }
}

Re: Filtering certain multi-line patterns from a file
by haukex (Archbishop) on Jul 30, 2017 at 14:59 UTC
    I wanted to check here and see if this seems like a good use-case for Perl

    Certainly, text processing is one of the things Perl is great at, and AFAIK it was heavily inspired by awk. If you want to get started with Perl, there are lots of good resources, like perlintro, Tutorials, http://learn.perl.org, and lots of books.

    I'm not an awk expert, but I think this is a pretty direct translation of your program (the automated translator script that comes with Perl, a2p, is giving me strange results and I didn't look into that yet):
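    Here's my attempt at a fairly direct port (a sketch; I've kept the awk names and the string comparison, and merged the identical begin/printedline states):

```perl
use strict;
use warnings;

my $timestamp_regex = qr/^#\d{10}$/;
my $exclusion_regex = qr/^(?:ls?|man|cat)$/;
my $timestamp = "";
my $entryline = "";
my $state     = "begin";

while (my $line = <>) {
    chomp( my $l = $line );
    # awk's $1 is the first whitespace-separated field
    my ($first) = $l =~ /^\s*(\S+)/;
    $first = "" unless defined $first;

    if ($state eq "begin" or $state eq "printedline") {
        if ($l =~ $timestamp_regex) {
            $timestamp = $l;
            $state     = "readtimestamp";
        }
        else {
            print $line;
            $state = "printedline";
        }
    }
    elsif ($state eq "readtimestamp") {
        # string comparison, mirroring the awk original
        if ($l =~ $timestamp_regex and $l ge $timestamp) {
            $timestamp = $l;
        }
        elsif ($first =~ $exclusion_regex) {
            $entryline = $l;
            $state     = "readentryline";
        }
        else {
            print "$timestamp\n", $line;
            $state = "printedline";
        }
    }
    elsif ($state eq "readentryline") {
        if ($l =~ $timestamp_regex) {
            $entryline = "";
            $timestamp = $l;
            $state     = "readtimestamp";
        }
        else {
            print "$timestamp\n$entryline\n", $line;
            $state = "printedline";
        }
    }
}
```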

    Note I had to make a few tweaks to the regexes, and I'm not sure whether I translated the intent of "if ($0 ~ timestamp_regex && $0 >= timestamp)" correctly (Update: $1 is a special variable referring to the first set of capturing parentheses, see e.g. perlretut). Also, your original code could be reduced a little, note how your states begin and printedline are identical (you could drop begin and start in printedline), and you don't need to assign to state in every branch.

    But anyway, when you switch to Perl, you get lots of powerful tools at your disposal, and thereby more ways to solve the same problem. While the above is certainly one way to write your code, one of Perl's mottos is There Is More Than One Way To Do It (TIMTOWTDI, "tim toady"), so here are two more ways to implement the code. The first one is how I might have written it. (The trick with eof is just for a bit of code reduction; the alternative is to repeat the code that checks and prints @buf after the loop ends.)

    Disclaimer: I haven't run the following code through a whole lot of test cases, mostly just the samples you provided, so I may have missed some edge cases. My motivation for showing this code is not to say these ways are "better", but to demonstrate TIMTOWTDI and different ways of approaching the problem. (Update: Note I assume the format of the input file never varies from the format you showed, i.e. each entry is #timestamp\n followed by one or more lines.)

    use warnings;
    use strict;
    my $exclre = qr/^(?:ls?|man|cat)\b/;
    my @buf;
    while (<>) {
        if (/^#\d+$/ || eof()) {
            push @buf, $_ if eof();
            if ( @buf>2 || @buf==2 && $buf[1]!~$exclre ) {
                print @buf;
            }
            @buf = ();
        }
        push @buf, $_;
    }

    Here's a solution that gets clever with the input record separator $/ and regular expressions (perhaps too clever, since it assumes that "\n#" won't occur in other places in the history file):

    use warnings;
    use strict;
    my $exclre = qr/(?:ls?|man|cat)\b/;
    $/ = "\n#";
    while (<>) {
        s/\n#?\z//;
        print /\A#/ ? () : '#', $_, "\n"
            unless /\A#?\d+\n$exclre(?!.*\n)/;
    }

      Thanks for helping satisfy my curiosity! Your alternate solutions are amazingly concise. Storing the buffered lines in an array is an interesting idea, and now I'm curious to try the same in Awk.

      The "$0 >= timestamp" was my attempt at dealing with a timestamp line occurring immediately after another timestamp line, with no recognizable history entry between them. This could occur if the user manually typed in a line that matches the timestamp regex. It's a corner-case, and I sort of fudged it, saying, "as long as the second timestamp isn't less than the first, I'll assume it is a real timestamp and forget about the previous one."

      So I was relying on string comparison even though the intent was numeric comparison. That works here because both strings have the same "#" prefix followed by the same number of digits, so lexicographic order agrees with numeric order.
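      For example (the second pair of values is hypothetical, just to show why equal length matters):

```perl
use strict;
use warnings;

# Same length: string order agrees with numeric order.
print "#1501293785" ge "#1501293767" ? "ok\n" : "wrong\n";            # prints "ok"

# Different lengths would not: '9' gt '1' wins at the second character,
# even though 999999999 < 1000000000 numerically.
print "#999999999" ge "#1000000000" ? "string says yes\n" : "no\n";   # prints "string says yes"
```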

      I tested both of your alternate solutions out, and the only difference from my Awk script was how the @buf solution handled the above-mentioned corner-case (adjacent timestamp lines). It simply removed both adjacent timestamps, which is fine in my opinion.

      Again, thanks for the intro by example!

Re: Filtering certain multi-line patterns from a file
by haukex (Archbishop) on Jul 30, 2017 at 15:49 UTC
    It processes a 50,000 line file in about 70 milliseconds. ... I've heard Perl is a performant scripting language.

    With those kinds of execution times, personally I wouldn't even worry about it to begin with. But just to demonstrate that Perl isn't going to be a lot slower, here's an example benchmark from my system (somewhat simple, just average execution time over a couple of runs). The first row is your code (unchanged), and the following three rows are the three pieces of code I posted:

    Input file size:

    Solution          12 lines   120_000 lines   1_200_000 lines
    awk                   26ms            66ms             417ms
    awk to Perl           26ms           105ms             782ms
    First example         27ms            75ms             520ms
    Second example        27ms            72ms             450ms

    This is Perl 5.24.1 on Linux. As you can see, although awk might have a minor advantage, I don't think you have anything to worry about in terms of speed when it comes to your use case. If it ever came to be an issue, there are lots of ways to optimize code in Perl (e.g. Benchmark and profilers like Devel::NYTProf).
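    For instance, a quick Benchmark comparison takes only a few lines. The two subs here are stand-ins (counting newlines two ways); substitute the real alternatives being compared:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $entry = "cat afile | while read -r line; do\n  echo \"line: \" \$line\ndone\n";

# A negative COUNT means "run each sub for at least that many CPU seconds".
cmpthese(-2, {
    tr_count => sub { my $n = ($entry =~ tr/\n//) },     # count newlines with tr///
    capture  => sub { my $n = () = $entry =~ /\n/g },    # count newlines with a global match
});
```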

Re: Filtering certain multi-line patterns from a file
by karlgoethebier (Abbot) on Jul 31, 2017 at 13:35 UTC

    You could try it with Bash::History::Read.
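    For example (an untested sketch; the each_hist interface and the $main::PRINT flag are as I read them in the module's POD, so double-check against your installed version):

```perl
use strict;
use warnings;
use Bash::History::Read qw(each_hist);

# each_hist reads timestamped bash history from STDIN/ARGV and echoes
# each entry unless we clear $main::PRINT; $_ holds the entry's content.
each_hist {
    # drop single-line ls/l/man/cat entries, keep everything multiline
    $main::PRINT = 0
        if $_ !~ /\n./ && /\A(?:ls?|man|cat)\b/;
};
```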

    Best regards, Karl

    «The Crux of the Biscuit is the Apostrophe»

    perl -MCrypt::CBC -E 'say Crypt::CBC->new(-key=>'kgb',-cipher=>"Blowfish")->decrypt_hex($ENV{KARL});'