Re: String similarities and pattern matching

I don't know of any existing modules, but it's an interesting question, so I gave it a shot (rough draft!):

use strict;
use warnings;
use Data::Dumper::Simple;

my $stop = qr/[\s:]/;  # modify as appropriate!
my @lines;
my $pattern;

## test one..
@lines = split /\n/, <<'EOD';
Error 123 on SystemA file not found error
Error 123:on SystemB file not found error
Error 123 on SystemC file not found error
EOD

$pattern = get_pattern( \@lines );
print Dumper($pattern);

## test two..
@lines = split /\n/, <<'EOD';
Error 124 on User1:FileA no space left
Error 124 on User2:FileB no space left
Error 124 on User3:FileC no space left
EOD

$pattern = get_pattern( \@lines );
print Dumper($pattern);


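# Derive a template from similar lines: tokens that differ from the first
# line are replaced with numbered placeholders ($1, $2, ...).
sub get_pattern {
    my @lines = @{ +shift };    # copy, so the caller's array is untouched

    # the capturing group in split keeps the separators as tokens
    my @known_words = split /($stop)/, shift @lines;
    my @pattern     = @known_words;  # assume everything matches..
    my %modified    = ();            # indices already turned into placeholders
    my $count       = 0;             # last placeholder number handed out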


    LINE:
    foreach my $line (@lines) {
        my @words = split /($stop)/, $line;

        unless ( @words == @known_words ) {
            warn "ignoring (word count does not match first line): $line";
            next LINE;
        }

        WORD:
        foreach my $i ( 0 .. $#words ) {
            next WORD if $modified{$i};     # already noted this spot

            my $this = $words[$i];
            next WORD if $this =~ $stop;    # questionable.. are all stops 'equal'?

            my $that = $known_words[$i];
            next WORD if $this eq $that;    # everything looks ok so far

            $pattern[$i] = '$' . ++$count;
            $modified{$i}++;
        }
    }

    return join '', @pattern;
}

produces:

$pattern = 'Error 123 on $1 file not found error';
$pattern = 'Error 124 on $1:$2 no space left';

(updated to move 'my $count' to proper scope)


Replies are listed 'Best First'.

Re^2: String similarities and pattern matching
by Phalcon123 (Initiate) on Sep 08, 2006 at 11:43 UTC

This thread certainly received several very good replies. There is enough here to keep me busy investigating for a while.

This response by mreece seems to provide what I need. I am trying to do some event monitoring on syslog. I have an application that drops messages into syslog, and certain parts of those messages are variable. I am planning on running through the history and using Levenshtein matching to group the strings together based on similarity, then using this routine to develop patterns. Based on the patterns, I will be sending pages out to the proper support groups. For example, if I receive the "no space left" message, I will page the storage group to add space.
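For what it's worth, the grouping step could be sketched roughly like this. It is only an illustration: levenshtein() and group_lines() are hypothetical helper names, the distance is hand-rolled rather than taken from a CPAN module, and the threshold of 4 edits is an arbitrary assumption that would need tuning against real syslog data.

```perl
use strict;
use warnings;

# Classic dynamic-programming edit distance
# (insert, delete and substitute each cost 1).
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = 0 .. length $t;
    for my $i ( 1 .. length $s ) {
        my @cur = ($i);
        for my $j ( 1 .. length $t ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            my $best = $prev[ $j - 1 ] + $cost;                          # substitute
            $best = $prev[$j] + 1      if $prev[$j] + 1 < $best;         # delete
            $best = $cur[ $j - 1 ] + 1 if $cur[ $j - 1 ] + 1 < $best;    # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Greedy grouping (an assumption, not a proper clustering algorithm):
# a line joins the first group whose first member is within
# $max_distance edits, otherwise it starts a new group.
sub group_lines {
    my ( $max_distance, @lines ) = @_;
    my @groups;
    LINE:
    for my $line (@lines) {
        for my $group (@groups) {
            if ( levenshtein( $group->[0], $line ) <= $max_distance ) {
                push @$group, $line;
                next LINE;
            }
        }
        push @groups, [$line];
    }
    return @groups;
}

my @groups = group_lines(
    4,
    'Error 123 on SystemA file not found error',
    'Error 124 on User1:FileA no space left',
    'Error 123 on SystemB file not found error',
    'Error 124 on User2:FileB no space left',
);
print scalar(@groups), " groups\n";
```

Each resulting group (an array reference) could then be handed to get_pattern() as-is.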

This works perfectly for the examples I have pulled out so far. There are hundreds of messages I have to parse, so I will continue to test. So far, I have not found any instances where the number of words/tokens is different but may post an update that takes it into account (perhaps using a question mark to indicate an optional token). I may also output two strings, one as shown providing the person utilizing this with placeholder numbers and another with actual regexps to use.
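The second output string (an actual regexp) might be produced along these lines. This is a sketch under a few assumptions: pattern_to_regex() is a made-up name, each placeholder is taken to match one run of characters that are neither whitespace nor a colon (mirroring the $stop class used above), and the literal text is assumed never to contain a dollar sign followed by digits.

```perl
use strict;
use warnings;

# Turn a placeholder pattern such as 'Error 124 on $1:$2 no space left'
# into a compiled regex with one capture group per placeholder.
sub pattern_to_regex {
    my ($pattern) = @_;
    my $source = '';

    # The capturing group in split keeps the placeholders as separate pieces.
    for my $piece ( split /(\$\d+)/, $pattern ) {
        $source .= $piece =~ /\A\$\d+\z/
            ? '([^\s:]+)'           # variable part: anything up to the next stop char
            : quotemeta($piece);    # literal part: escaped so it matches itself
    }
    return qr/\A$source\z/;
}

my $re = pattern_to_regex('Error 124 on $1:$2 no space left');

if ( my ( $user, $file ) = 'Error 124 on User9:FileZ no space left' =~ $re ) {
    print "matched: user=$user file=$file\n";
}
```

A table of such regexes, each paired with a support group, would then be enough to decide who gets paged for an incoming message.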

Thank you for the help! I did not expect to receive such spot-on responses so quickly. Actually, I did not expect that I had worded my request well enough to get these responses.

Thanks
