Removing all lines from an array which match against a second array

skx has asked for the wisdom of the Perl Monks concerning the following question:

Greetings fellow monks

I have a problem of efficiency in a little tool I'm writing.

Essentially I have two arrays, one containing lines from a logfile, as produced by syslog, and one containing regular expressions.

I wish to remove from the loglines array all lines which match any of the regular expressions in my list.

My current code looks like this:

my @REGEXPS = ( 
   "^\w{3} [ :0-9]{11} [._[:alnum:]-]+ pppd\[[0-9]+\]: (sent|rcvd) \[L
+CP EchoReq id=[[:alnum:]]+ magic=[ [:alnum:]]+\]$",
   "^\w{3} [ :0-9]{11} [._[:alnum:]-]+ ssh\(pam_[[:alnum:]]+\)\[[0-9]+
+\]: session opened for user [[:alnum:]-]+ by \(uid=[0-9]+\)$",
     );

my @KEEP = ();

foreach my $line ( @SYSLOG )
{
    my $match = 0;

    foreach my $r ( @REGEXPS )
    {    
       if ( $line =~ /$r/)
       { 
          $match = 1;
       }
    }

    if (! $match )
    {
    push @KEEP, $line;
    }
}
[download]

This is slow, because I'm testing each regular expression against each line, making the number of matchs N x N.

What are my alternatives?

I've considered moving the lines into a hash instead, and using delete upon them - but I'm not sure how much of a gain this would be.

I can't help thinking I should be using map, or grep for this - but I'm not entirely sure how to do this.

(Yes this is replicating the functionality of the logcheck tool - it's a prototype rewrite in perl I'm producing).

Steve
---
steve.org.uk

Comment on Removing all lines from an array which match against a second array Download Code

Replies are listed 'Best First'.
Re: Removing all lines from an array which match against a second array by broquaint (Abbot) on Sep 26, 2003 at 13:41 UTC
Just conglomerate your regexes into a single regex and `splice` out the undesired elements e.g `my $main_re = join '\|', map "(?:$_)", @REGEXPS; my $idx = 0; while($idx <= $#SYSLOG) { splice( @SYSLOG, $idx, 1 ) and $idx-- if $SYSLOG[$idx] =~ $main_re; $idx++; }` [download] That will will remove any elements in `@SYSLOG` that match any of the regexes in `@REGEXPS`. Or if you want a non-destructive version `my @KEEP = grep { not /$main_re/ } @SYSLOG;` [download] HTH `_________ broquaint` updated: fixed first (and now second) example to be not broken (thanks to merlyn) and simplified second example	[reply] [d/l] [select]
•Re: Re: Removing all lines from an array which match against a second array by merlyn (Sage) on Sep 26, 2003 at 14:15 UTC
The code `$SYSLOG[$_] =~ $main_re and splice @SYSLOG, $_, 1 for 0 .. $#SYSLOG;` [download] is broken. As soon as you've removed the first entry, your numbers will be off by one. Second entry, off by two, etc. Two ways to fix. Insert `reverse` after `for` (do them backwards), or just use a `grep`: `@SYSLOG = grep !/$main_re/, @SYSLOG;` [download] -- Randal L. Schwartz, Perl hacker Be sure to read my standard disclaimer if this is a reply.	[reply] [d/l] [select]
Re: Removing all lines from an array which match against a second array by tilly (Archbishop) on Sep 28, 2003 at 01:01 UTC
You can get a marginal improvement using named loop control to break processing ASAP: `LINE: foreach my $line (@SYSLOG) { foreach my $r (@REGEXPS) { next LINE if $line =~ /$r/; } push @KEEP, $line; }` [download] This does not address the scalability issue, but it will speed things up. If most of the lines are being rejected by the first few REs, it might speed it up a lot. Beyond that, if you had simpler regular expressions you could try tricks like trieing them together to stop the RE engine from doing so much redundant work. Unfortunately that will be hard with the examples that you gave. However you might take some of your complex REs which are closely related and find ways to combine them... Incidentally I see that a lot of the work being done by your REs looks like you are parsing the syslog format for various specific strings. If that is so, then you could parse each line into a small data structure and then make what you are currently doing by an RE match be doable in a simpler fashion. However trying to do anything fancy may lose on overhead. You pretty much have to try it and see. But in the end if you want to do lots of arbitrary checks against lots of arbitrary strings, lots of work will need to be done.	[reply] [d/l]