Phalcon123 has asked for the wisdom of the Perl Monks concerning the following question:

Hi.

I have a series of strings. I would like to compare them and determine a "match" string automagically.

For example:

Error 123 on SystemA file not found error
Error 123 on SystemB file not found error
Error 123 on SystemC file not found error

I would like to determine the differences (splitting on stopwords such as space, colon, etc) and get a pattern which matches such as:

Error 123 on $1 file not found error

I plan on running several strings through something such as Levenshtein to determine how close they are. If they are close enough, I'll try to get a pattern. If they are significantly different, obviously I won't even try.
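
A minimal sketch of the "close enough" check described above, written in plain Perl so it has no CPAN dependency. The `similarity` helper and the 0.8 threshold are assumptions made for illustration, not something stated in the question; the cut-off would need tuning against real data.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Classic dynamic-programming edit distance (insert, delete, substitute),
# keeping only the previous row of the table.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @s    = split //, $s;
    my @t    = split //, $t;
    my @prev = ( 0 .. @t );
    for my $i ( 1 .. @s ) {
        my @cur = ($i);
        for my $j ( 1 .. @t ) {
            my $cost = $s[ $i - 1 ] eq $t[ $j - 1 ] ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,              # deletion
                $cur[ $j - 1 ] + 1,         # insertion
                $prev[ $j - 1 ] + $cost,    # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: 1 means identical, 0 means nothing in common.
sub similarity {
    my ( $s, $t ) = @_;
    my $longest = max( length $s, length $t ) or return 1;
    return 1 - levenshtein( $s, $t ) / $longest;
}

my $THRESHOLD = 0.8;    # assumed cut-off; tune against real data

my @pair = (
    'Error 123 on SystemA file not found error',
    'Error 123 on SystemB file not found error',
);
printf "distance=%d similarity=%.2f\n",
    levenshtein(@pair), similarity(@pair);
print "close enough to try for a pattern\n"
    if similarity(@pair) >= $THRESHOLD;
```

For the two sample lines the distance is 1 over 41 characters, so they clear the assumed threshold easily.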

Another example to show multiple variables:

Error 124 on User1:FileA no space left
Error 124 on User2:FileB no space left
Error 124 on User3:FileC no space left

Error 124 on $1:$2 no space left

Does such a beastie already exist in CPAN? I have been unsuccessful in finding one.

Thanks

Re: String similarities and pattern matching
by GrandFather (Saint) on Sep 07, 2006 at 20:31 UTC
Re: String similarities and pattern matching
by planetscape (Chancellor) on Sep 08, 2006 at 03:27 UTC

    Just a thought:

    #! /usr/local/bin/perl -w
    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new->add(
        'Error 123 on SystemA file not found error',
        'Error 123 on SystemB file not found error',
        'Error 123 on SystemC file not found error'
    );
    print $ra->re;

    Produces:

    (?-xism:Error 123 on System[ABC] file not found error)

    And:

    #! /usr/local/bin/perl -w
    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new->add(
        'Error 124 on User1:FileA no space left',
        'Error 124 on User2:FileB no space left',
        'Error 124 on User3:FileC no space left'
    );
    print $ra->re;

    Outputs:

    (?-xism:Error 124 on User(?:1:FileA|2:FileB|3:FileC) no space left)

    See also:


    Regexp::Assemble
    grinder's scratchpad
    Why machine-generated solutions will never cease to amaze me

    HTH,

    planetscape
Re: String similarities and pattern matching
by mreece (Friar) on Sep 08, 2006 at 00:59 UTC
    i don't know of any existing modules, but it is an interesting question so i gave it a shot (rough draft!):
    use strict;
    use warnings;
    use Data::Dumper::Simple;

    my $stop = qr/[\s:]/;    # modify as appropriate!
    my @lines;
    my $pattern;

    ## test one..
    @lines = split /\n/, <<'EOD';
    Error 123 on SystemA file not found error
    Error 123:on SystemB file not found error
    Error 123 on SystemC file not found error
    EOD
    $pattern = get_pattern( \@lines );
    print Dumper($pattern);

    ## test two..
    @lines = split /\n/, <<'EOD';
    Error 124 on User1:FileA no space left
    Error 124 on User2:FileB no space left
    Error 124 on User3:FileC no space left
    EOD
    $pattern = get_pattern( \@lines );
    print Dumper($pattern);

    sub get_pattern {
        my @lines = @{ +shift };    # this should take a copy..
        my @known_words = split /($stop)/, shift @lines;
        my @pattern  = @known_words;    # assume everything matches..
        my %modified = ();              # track indices
        my $count    = 0;               # $x incrementor
        LINE:
        foreach my $line (@lines) {
            my @words = split /($stop)/, $line;
            unless ( @words == @known_words ) {
                warn "ignoring (word count does not match first line): $line";
                next LINE;
            }
            WORD:
            foreach my $i ( 0 .. $#words ) {
                next WORD if $modified{$i};     # already noted this spot
                my $this = $words[$i];
                next WORD if $this =~ $stop;    # questionable.. are all stops 'equal'?
                my $that = $known_words[$i];
                next WORD if $this eq $that;    # everything looks ok so far
                $pattern[$i] = '$' . ++$count;
                $modified{$i}++;
            }
        }
        return join '', @pattern;
    }
    produces:
    $pattern = 'Error 123 on $1 file not found error';
    $pattern = 'Error 124 on $1:$2 no space left';
    
    (updated to move 'my $count' to proper scope)

      This thread certainly received several very good replies. There is enough here to keep me busy investigating for a while.

      This response by mreece seems to provide what I need. I am trying to do some event monitoring on syslog. I have an application that drops messages into syslog with certain parts that are variable. I am planning on running through the history, using Levenshtein matching to group the strings together based on similarity, and then using this routine to develop patterns. Based on the patterns, I will be sending pages out to the proper support groups. For example, if I receive the "no space left" message, I will page it out to the storage group to add space.

      This works perfectly for the examples I have pulled out so far. There are hundreds of messages I have to parse, so I will continue to test. So far, I have not found any instances where the number of words/tokens is different but may post an update that takes it into account (perhaps using a question mark to indicate an optional token). I may also output two strings, one as shown providing the person utilizing this with placeholder numbers and another with actual regexps to use.
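
      The second output mentioned above (an actual regexp alongside the placeholder string) could be derived from the placeholder form along these lines. This is only a sketch: `template_to_regex` is a hypothetical name, and using `(\S+)` for every variable part is an assumption borrowed from the hand-written regexes elsewhere in this thread. The number in each `$N` is ignored; capture groups are simply numbered left to right.

```perl
use strict;
use warnings;

# Turn a placeholder template such as 'Error 124 on $1:$2 no space left'
# into a compiled, anchored regex. Literal text is escaped with quotemeta;
# each $N placeholder becomes a capture group.
# Assumption: (\S+) is a suitable pattern for every variable part.
sub template_to_regex {
    my ($template) = @_;
    my $body = join '',
        map { /\A\$\d+\z/ ? '(\S+)' : quotemeta }
        split /(\$\d+)/, $template;
    return qr/\A$body\z/;
}

my $re = template_to_regex('Error 124 on $1:$2 no space left');

if ( my @vars = 'Error 124 on User2:FileB no space left' =~ $re ) {
    print "captured: @vars\n";    # captured: User2 FileB
}
```

      Anchoring with `\A` and `\z` keeps a template from matching a longer message that merely contains it; drop the anchors if the syslog lines carry a timestamp or host prefix.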

      Thank you for the help! I did not expect that I would receive such spot-on responses so quickly. Actually, I did not expect I worded my request well enough to get these responses.

      Thanks

Re: String similarities and pattern matching
by BrowserUk (Patriarch) on Sep 07, 2006 at 20:45 UTC

    Crackers2 is correct: "I believe you're missing the OP's point."

    I did completely miss the OP's point.

    On short strings like these, the time it will take to determine the degree of similarity far outweighs the time taken to simply use the regex. I see no benefit in avoiding running the match.

    #! perl -slw
    use strict;

    my @regex = (
        qr[Error 123 on (\S+) file not found error],
        qr[Error 124 on (\S+):(\S+) no space left],
    );

    while( <DATA> ) {
        chomp;
        for my $regex ( @regex ) {
            print "'$_' ", ( $_ =~ $regex ? 'does    ' : 'does not' ), " match $regex";
        }
    }

    __DATA__
    Error 123 on SystemA file not found error
    Error 123 on SystemB file not found error
    Error 123 on SystemC file not found error
    Error 124 on User1:FileA no space left
    Error 124 on User2:FileB no space left
    Error 124 on User3:FileC no space left

    Outputs:

    c:\test>junk9
    'Error 123 on SystemA file not found error' does     match (?-xism:Error 123 on (\S+) file not found error)
    'Error 123 on SystemA file not found error' does not match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 123 on SystemB file not found error' does     match (?-xism:Error 123 on (\S+) file not found error)
    'Error 123 on SystemB file not found error' does not match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 123 on SystemC file not found error' does     match (?-xism:Error 123 on (\S+) file not found error)
    'Error 123 on SystemC file not found error' does not match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 124 on User1:FileA no space left' does not match (?-xism:Error 123 on (\S+) file not found error)
    'Error 124 on User1:FileA no space left' does     match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 124 on User2:FileB no space left' does not match (?-xism:Error 123 on (\S+) file not found error)
    'Error 124 on User2:FileB no space left' does     match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 124 on User3:FileC no space left' does not match (?-xism:Error 123 on (\S+) file not found error)
    'Error 124 on User3:FileC no space left' does     match (?-xism:Error 124 on (\S+):(\S+) no space left)

    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.

      I believe you're missing the OP's point. The regexes are not part of the given data; in fact they're the desired answer.

      I.e. given the data

      Error 123 on SystemA file not found error
      Error 123 on SystemB file not found error
      Error 123 on SystemC file not found error

      the task is to come up with a regex that matches these, using as few variables as possible.

      That's how I read it anyway

Re: String similarities and pattern matching
by graff (Chancellor) on Sep 08, 2006 at 00:59 UTC
    If I make a couple assumptions about your task and data, there's a fairly simple solution, which I'll show as a stand-alone script (conversion to an effective module is left as an exercise... ;).

    The assumptions are: (1) splitting lines on [\s:]+ will give a reasonable "parsing" for creating a regex template (though this may be easy to adjust); (2) the similarity among strings is always as shown in your examples, with every line having the same token count; (3) either you already have similar strings segregated according to their common patterns, or you can easily segregate them (e.g. grepping for a specific "Error NNN" from a larger list).

    If those assumptions work for you, the following script produces these regexes for your two sets of sample data:

    regex: ^(?-xism:Error\ 123\ on\ )(\w+)(?-xism:\ file\ not\ found\ error)$
    regex: ^(?-xism:Error\ 124\ on\ )(\w+)(?-xism:\:)(\w+)(?-xism:\ no\ space\ left)$
    Here's the script: