String similarities and pattern matching

Phalcon123 has asked for the wisdom of the Perl Monks concerning the following question:

Re: String similarities and pattern matching by GrandFather (Saint) on Sep 07, 2006 at 20:31 UTC
A Super Search on "fuzzy match string" garners quite a few hits. The first couple of threads are a good start: "Fuzzy matching of text strings" and "Fuzzy String matching with index?".

DWIM is Perl's answer to Gödel
Re: String similarities and pattern matching by planetscape (Chancellor) on Sep 08, 2006 at 03:27 UTC
Just a thought:

    #! /usr/local/bin/perl -w
    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new->add(
        'Error 123 on SystemA file not found error',
        'Error 123 on SystemB file not found error',
        'Error 123 on SystemC file not found error'
    );
    print $ra->re;

Produces:

    (?-xism:Error 123 on System[ABC] file not found error)

And:

    #! /usr/local/bin/perl -w
    use Regexp::Assemble;

    my $ra = Regexp::Assemble->new->add(
        'Error 124 on User1:FileA no space left',
        'Error 124 on User2:FileB no space left',
        'Error 124 on User3:FileC no space left'
    );
    print $ra->re;

Outputs:

    (?-xism:Error 124 on User(?:1:FileA|2:FileB|3:FileC) no space left)

See also: Regexp::Assemble; grinder's scratchpad; "Why machine-generated solutions will never cease to amaze me".

HTH, planetscape
Re: String similarities and pattern matching by mreece (Friar) on Sep 08, 2006 at 00:59 UTC
i don't know of any existing modules, but it is an interesting question so i gave it a shot (rough draft!):

    use strict;
    use warnings;
    use Data::Dumper::Simple;

    my $stop = qr/[\s:]/;    # modify as appropriate!
    my @lines;
    my $pattern;

    ## test one..
    @lines = split /\n/, <<'EOD';
    Error 123 on SystemA file not found error
    Error 123:on SystemB file not found error
    Error 123 on SystemC file not found error
    EOD
    $pattern = get_pattern( \@lines );
    print Dumper($pattern);

    ## test two..
    @lines = split /\n/, <<'EOD';
    Error 124 on User1:FileA no space left
    Error 124 on User2:FileB no space left
    Error 124 on User3:FileC no space left
    EOD
    $pattern = get_pattern( \@lines );
    print Dumper($pattern);

    sub get_pattern {
        my @lines       = @{ +shift };    # this should take a copy..
        my @known_words = split /($stop)/, shift @lines;
        my @pattern     = @known_words;   # assume everything matches..
        my %modified    = ();             # track indices
        my $count       = 0;              # $x incrementor

      LINE:
        foreach my $line (@lines) {
            my @words = split /($stop)/, $line;
            unless ( @words == @known_words ) {
                warn "ignoring (word count does not match first line): $line";
                next LINE;
            }
          WORD:
            foreach my $i ( 0 .. $#words ) {
                next WORD if $modified{$i};     # already noted this spot
                my $this = $words[$i];
                next WORD if $this =~ $stop;    # questionable.. are all stops 'equal'?
                my $that = $known_words[$i];
                next WORD if $this eq $that;    # everything looks ok so far
                $pattern[$i] = '$' . ++$count;
                $modified{$i}++;
            }
        }
        return join '', @pattern;
    }

produces:

    $pattern = 'Error 123 on $1 file not found error';
    $pattern = 'Error 124 on $1:$2 no space left';

(updated to move 'my $count' to proper scope)
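The placeholder strings above are meant for a human reader. A minimal sketch of turning one into a regex that can actually be run might look like the following. The helper name `placeholder_to_regex` is hypothetical and not part of mreece's draft, and the capture class `[^\s:]+` is an assumption chosen to mirror the `$stop` class used there.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the thread): convert a placeholder pattern
# such as 'Error 124 on $1:$2 no space left' into a compiled, anchored regex.
sub placeholder_to_regex {
    my ($pattern) = @_;
    my $re = '';

    # The capturing split keeps the $N placeholders as separate pieces.
    for my $part ( split /(\$\d+)/, $pattern ) {
        next unless length $part;
        $re .= $part =~ /\A\$\d+\z/
            ? '([^\s:]+)'          # placeholder: one token without stop characters
            : quotemeta($part);    # literal text: escape regex metacharacters
    }
    return qr/\A$re\z/;
}

my $re = placeholder_to_regex('Error 124 on $1:$2 no space left');
if ( my @vars = 'Error 124 on User2:FileB no space left' =~ $re ) {
    print "matched, variable parts: @vars\n";
}
```

Anchoring with `\A` and `\z` is a design choice here: it makes the regex match whole messages only. Drop the anchors if the message is embedded in a longer syslog line.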
Re^2: String similarities and pattern matching by Phalcon123 (Initiate) on Sep 08, 2006 at 11:43 UTC
This thread certainly received several very good replies. There is enough here to keep me busy investigating for a while. This response by mreece seems to provide what I need.

I am trying to do some event monitoring on syslog. I have an application that drops messages into syslog, and certain parts of those messages are variable. I am planning to run through the history, use Levenshtein matching to group the strings by similarity, and then use this routine to develop patterns. Based on the patterns, I will send pages out to the proper support groups. For example, if I receive the "no space left" message, I will page the storage group to add space.

This works perfectly for the examples I have pulled out so far. There are hundreds of messages I have to parse, so I will continue to test. So far, I have not found any instances where the number of words/tokens differs, but I may post an update that takes that into account (perhaps using a question mark to indicate an optional token). I may also output two strings: one as shown, giving the person using this placeholder numbers, and another with actual regexps to use.

Thank you for the help! I did not expect to receive such spot-on responses so quickly. Actually, I did not expect that I had worded my request well enough to get these responses.

Thanks
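The grouping step described above can be sketched in pure Perl as follows. This is an illustration under stated assumptions, not the poster's code: the function names `levenshtein` and `group_similar`, the greedy "compare against the first member of each group" strategy, and the 0.25 threshold are all hypothetical choices. CPAN modules such as Text::Levenshtein could replace the hand-written distance function.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Edit distance by dynamic programming, keeping only two rows of the table.
sub levenshtein {
    my ( $s, $t ) = @_;
    my @prev = ( 0 .. length($t) );
    for my $i ( 1 .. length($s) ) {
        my @cur = ($i);
        for my $j ( 1 .. length($t) ) {
            my $cost = substr( $s, $i - 1, 1 ) eq substr( $t, $j - 1, 1 ) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,              # delete a character from $s
                $cur[ $j - 1 ] + 1,         # insert a character into $s
                $prev[ $j - 1 ] + $cost,    # substitute (or keep) a character
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Greedy grouping: a line joins the first group whose first member is within
# $max_ratio * (length of the longer string) edits; otherwise it starts a group.
sub group_similar {
    my ( $max_ratio, @lines ) = @_;
    my @groups;
  LINE:
    for my $line (@lines) {
        for my $group (@groups) {
            my $rep   = $group->[0];
            my $limit = $max_ratio * max( length($line), length($rep) );
            if ( levenshtein( $line, $rep ) <= $limit ) {
                push @$group, $line;
                next LINE;
            }
        }
        push @groups, [$line];
    }
    return @groups;
}

my @groups = group_similar(
    0.25,    # assumed threshold; tune against real syslog history
    'Error 123 on SystemA file not found error',
    'Error 124 on User1:FileA no space left',
    'Error 123 on SystemB file not found error',
    'Error 124 on User2:FileB no space left',
);
print scalar(@$_), " line(s) like: $_->[0]\n" for @groups;
```

Each resulting group could then be fed to a pattern routine such as `get_pattern`. The greedy pass makes the result depend on input order, which may or may not matter for a given log.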
Re: String similarities and pattern matching by BrowserUk (Patriarch) on Sep 07, 2006 at 20:45 UTC
Crackers2 is correct ("I believe you're missing the OP's point."): I did completely miss the OP's point.

On short strings like these, the time it will take to determine the degree of similarity far outweighs the time taken to simply use the regex. I see no benefit in avoiding running the match.

    #! perl -slw
    use strict;

    my @regex = (
        qr[Error 123 on (\S+) file not found error],
        qr[Error 124 on (\S+):(\S+) no space left],
    );

    while( <DATA> ) {
        chomp;
        for my $regex ( @regex ) {
            print "'$_' ", ( $_ =~ $regex ? 'does ' : 'does not' ), " match $regex";
        }
    }

    __DATA__
    Error 123 on SystemA file not found error
    Error 123 on SystemB file not found error
    Error 123 on SystemC file not found error
    Error 124 on User1:FileA no space left
    Error 124 on User2:FileB no space left
    Error 124 on User3:FileC no space left

Outputs:

    c:\test>junk9
    'Error 123 on SystemA file not found error' does  match (?-xism:Error 123 on (\S+) file not found error)
    'Error 123 on SystemA file not found error' does not match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 123 on SystemB file not found error' does  match (?-xism:Error 123 on (\S+) file not found error)
    'Error 123 on SystemB file not found error' does not match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 123 on SystemC file not found error' does  match (?-xism:Error 123 on (\S+) file not found error)
    'Error 123 on SystemC file not found error' does not match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 124 on User1:FileA no space left' does not match (?-xism:Error 123 on (\S+) file not found error)
    'Error 124 on User1:FileA no space left' does  match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 124 on User2:FileB no space left' does not match (?-xism:Error 123 on (\S+) file not found error)
    'Error 124 on User2:FileB no space left' does  match (?-xism:Error 124 on (\S+):(\S+) no space left)
    'Error 124 on User3:FileC no space left' does not match (?-xism:Error 123 on (\S+) file not found error)
    'Error 124 on User3:FileC no space left' does  match (?-xism:Error 124 on (\S+):(\S+) no space left)

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal?
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
Re^2: String similarities and pattern matching by Crackers2 (Parson) on Sep 08, 2006 at 00:20 UTC
I believe you're missing the OP's point. The regexes are not part of the given data; in fact they're the desired answer. I.e. given the data

    Error 123 on SystemA file not found error
    Error 123 on SystemB file not found error
    Error 123 on SystemC file not found error

the task is to come up with a regex that matches these, using as few variable parts as possible.

That's how I read it anyway.
Re: String similarities and pattern matching by graff (Chancellor) on Sep 08, 2006 at 00:59 UTC
If I make a couple of assumptions about your task and data, there's a fairly simple solution, which I'll show as a stand-alone script (conversion to an effective module is left as an exercise... ;). The assumptions are:

(1) splitting lines on `[\s:]+` will give a reasonable "parsing" for creating a regex template (though this may be easy to adjust);
(2) the similarity among strings is always as shown in your examples, with every line having the same token count;
(3) either you already have similar strings segregated according to their common patterns, or you can easily segregate them (e.g. grepping for a specific "Error NNN" from a larger list).

If those assumptions work for you, the following script produces these regexes for your two sets of sample data:

    regex: ^(?-xism:Error\ 123\ on\ )(\w+)(?-xism:\ file\ not\ found\ error)$
    regex: ^(?-xism:Error\ 124\ on\ )(\w+)(?-xism:\:)(\w+)(?-xism:\ no\ space\ left)$

Here's the script: Read more... (2 kB)
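Since the script itself is not reproduced here, the following is a separate minimal sketch written only from the three assumptions listed above. It is not graff's script, and the function name `template_regex` is hypothetical. It splits each line on `[\s:]+` while keeping the separators, requires every line to have the same token count, escapes the positions where all lines agree, and puts `(\w+)` where they differ.

```perl
use strict;
use warnings;

# Hypothetical sketch (not graff's script): build one anchored regex from a
# list of lines that are already known to share a common pattern.
sub template_regex {
    my @lines = @_;

    # A capturing split keeps the separators as tokens of their own.
    my @rows  = map { [ split /([\s:]+)/, $_ ] } @lines;
    my $first = shift @rows;

    for my $row (@rows) {
        die "token count differs from first line\n" unless @$row == @$first;
    }

    my $re = '';
    for my $i ( 0 .. $#$first ) {
        # A position is constant when no other line disagrees with the first.
        my $same = !grep { $_->[$i] ne $first->[$i] } @rows;
        $re .= $same ? quotemeta( $first->[$i] ) : '(\w+)';
    }
    return qr/^$re$/;
}

my $re = template_regex(
    'Error 124 on User1:FileA no space left',
    'Error 124 on User2:FileB no space left',
    'Error 124 on User3:FileC no space left',
);
if ( my @vars = 'Error 124 on User7:FileQ no space left' =~ $re ) {
    print "variable parts: @vars\n";
}
```

One limitation worth noting: if lines differ in a separator position (for example a colon where another line has a space), that position also becomes `(\w+)`, which will not match either separator. Handling that case would need a separate rule for separator tokens.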