in reply to String similarities and pattern matching

I don't know of any existing modules, but it is an interesting question, so I gave it a shot (rough draft!):
use strict;
use warnings;
use Data::Dumper::Simple;

my $stop = qr/[\s:]/;    # modify as appropriate!
my @lines;
my $pattern;

## test one..
@lines = split /\n/, <<'EOD';
Error 123 on SystemA file not found error
Error 123:on SystemB file not found error
Error 123 on SystemC file not found error
EOD
$pattern = get_pattern( \@lines );
print Dumper($pattern);

## test two..
@lines = split /\n/, <<'EOD';
Error 124 on User1:FileA no space left
Error 124 on User2:FileB no space left
Error 124 on User3:FileC no space left
EOD
$pattern = get_pattern( \@lines );
print Dumper($pattern);

sub get_pattern {
    my @lines       = @{ +shift };                      # this should take a copy..
    my @known_words = split /($stop)/, shift @lines;
    my @pattern     = @known_words;                     # assume everything matches..
    my %modified    = ();                               # track indices
    my $count       = 0;                                # $x incrementor

    LINE: foreach my $line (@lines) {
        my @words = split /($stop)/, $line;
        unless ( @words == @known_words ) {
            warn "ignoring (word count does not match first line): $line";
            next LINE;
        }
        WORD: foreach my $i ( 0 .. $#words ) {
            next WORD if $modified{$i};     # already noted this spot
            my $this = $words[$i];
            next WORD if $this =~ $stop;    # questionable.. are all stops 'equal'?
            my $that = $known_words[$i];
            next WORD if $this eq $that;    # everything looks ok so far
            $pattern[$i] = '$' . ++$count;
            $modified{$i}++;
        }
    }
    return join '', @pattern;
}
produces:
$pattern = 'Error 123 on $1 file not found error';
$pattern = 'Error 124 on $1:$2 no space left';
(updated to move 'my $count' to proper scope)

Re^2: String similarities and pattern matching
by Phalcon123 (Initiate) on Sep 08, 2006 at 11:43 UTC

    This thread certainly received several very good replies. There is enough here to keep me busy investigating for a while.

    This response by mreece seems to provide what I need. I am trying to do some event monitoring on syslog. I have an application that drops messages into syslog, parts of which are variable. I am planning to run through the history, use Levenshtein matching to group the strings by similarity, and then use this routine to develop patterns. Based on the patterns, I will send pages to the proper support groups. For example, if I receive the "no space left" message, I will page the storage group to add space.
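
A minimal sketch of the grouping step described above, written without any CPAN dependency. Everything here is an assumption for illustration: the names `levenshtein` and `group_lines` are hypothetical, the grouping is a simple greedy pass that compares each line only against the first member of each existing group, and the 0.1 ratio (allowed edits as a fraction of the longer string's length) is an arbitrary starting point that would need tuning against real syslog history.

```perl
use strict;
use warnings;

# Classic dynamic-programming edit distance; insert, delete and
# substitute each cost 1. Only two rows of the table are kept.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            my $best = $prev[ $j - 1 ] + $cost;                           # substitute or match
            $best = $prev[$j] + 1      if $prev[$j] + 1 < $best;          # delete
            $best = $cur[ $j - 1 ] + 1 if $cur[ $j - 1 ] + 1 < $best;     # insert
            push @cur, $best;
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Greedy grouping (an assumption, not a real clustering algorithm):
# a line joins the first group whose representative is close enough,
# otherwise it starts a new group.
sub group_lines {
    my ( $max_ratio, @lines ) = @_;
    my @groups;
    LINE: for my $line (@lines) {
        for my $group (@groups) {
            my $rep    = $group->[0];
            my $longer = length($rep) > length($line) ? length($rep) : length($line);
            if ( levenshtein( $rep, $line ) <= $max_ratio * $longer ) {
                push @$group, $line;
                next LINE;
            }
        }
        push @groups, [$line];
    }
    return @groups;
}

my @groups = group_lines(
    0.1,
    'Error 123 on SystemA file not found error',
    'Error 124 on User1:FileA no space left',
    'Error 123 on SystemB file not found error',
    'Error 124 on User2:FileB no space left',
);
printf "%d groups\n", scalar @groups;    # prints "2 groups"
```

Each resulting group (an array reference of lines) could then be passed to `get_pattern` from the parent post. Because the comparison is only against each group's first line, the result depends on input order; that trade-off keeps the sketch short.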

    This works perfectly for the examples I have pulled out so far. There are hundreds of messages I have to parse, so I will continue to test. So far, I have not found any instances where the number of words/tokens differs, but I may post an update that takes that into account (perhaps using a question mark to indicate an optional token). I may also output two strings: one as shown, giving the person using this placeholder numbers, and another with actual regexps to use.
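
The "second string with actual regexps" idea could look roughly like the sketch below. It is an illustration only: `pattern_to_regex` is a hypothetical helper, and it assumes a variable token is any run of characters outside the `$stop` class (`[\s:]`) from the parent post. It would also misread a log message that literally contains text such as `$1`.

```perl
use strict;
use warnings;

# Assumption: a variable token is a run of non-stop characters, i.e. the
# complement of the $stop class (qr/[\s:]/) used by get_pattern.
my $token = '([^\s:]+)';

# Hypothetical helper: turn a placeholder pattern such as
# 'Error 124 on $1:$2 no space left' into an anchored regex.
sub pattern_to_regex {
    my ($pattern) = @_;
    my $regex = '';

    # split with a capturing group keeps each '$N' as its own element
    for my $piece ( split /(\$\d+)/, $pattern ) {
        $regex .= $piece =~ /\A\$\d+\z/ ? $token : quotemeta($piece);
    }
    return qr/\A$regex\z/;
}

my $re = pattern_to_regex('Error 124 on $1:$2 no space left');
if ( 'Error 124 on User7:FileQ no space left' =~ $re ) {
    print "user=$1 file=$2\n";    # prints "user=User7 file=FileQ"
}
```

The literal parts go through `quotemeta` so that characters such as `.` or `(` in a message are matched verbatim rather than treated as regex syntax.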

    Thank you for the help! I did not expect to receive such spot-on responses so quickly. Actually, I did not expect that I had worded my request well enough to get these responses.

    Thanks