The assumptions are: (1) splitting lines on [\s:]+ will give a reasonable "parsing" for creating a regex template (though this may be easy to adjust); (2) the similarity among strings is always as shown in your examples, with every line having the same token count; (3) either you already have similar strings segregated according to their common patterns, or you can easily segregate them (e.g. grepping for a specifc "Error NNN" from a larger list).
If those assumptions work for you, the following script produces these regexes for your two sets of sample data:
Here's the script:regex: ^(?-xism:Error\ 123\ on\ )(\w+)(?-xism:\ file\ not\ found\ erro +r)$ regex: ^(?-xism:Error\ 124\ on\ )(\w+)(?-xism:\:)(\w+)(?-xism:\ no\ sp +ace\ left)$
(updated to remove an extraneous variable)#!/usr/bin/perl use strict; use warnings; my @tknhist; my %tkns_per_line; while (<>) { chomp; my @tkns = split /([\s:]+)/; $tkns_per_line{ scalar @tkns }++; for my $i ( 0 .. $#tkns ) { $tknhist[$i]{$tkns[$i]}++; } } if ( scalar keys %tkns_per_line > 1 ) { warn "Data lines have variable token counts:\n"; for my $len ( sort {$a<=>$b} keys %tkns_per_line ) { warn sprintf( "%8d lines have %3d tokens\n", $tkns_per_line{$len}, $len ); } die "This is not a situation we can deal with\n"; } my $template = '^'; # begin with "start of string" my $subtemplate = my $lastcond = ''; for my $i ( 0 .. $#tknhist ) { my @types = keys %{$tknhist[$i]}; if ( @types == 1 ) { $subtemplate .= $types[0]; $lastcond = 'matched'; } else { my $ch = ( $types[0] =~ /\w/ ) ? '\w' : '\W'; if ( $lastcond eq 'matched' ) { $template .= qr/\Q$subtemplate\E/; $lastcond = $subtemplate = ''; } $template .= join '', '(', $ch, '+)'; } } $template .= qr/\Q$subtemplate\E/ if ( $lastcond eq 'matched' ); $template .= '$'; # finish with "end of string" print "regex: $template\n";
In reply to Re: String similarities and pattern matching
by graff
in thread String similarities and pattern matching
by Phalcon123
| For: | Use: | ||
| & | & | ||
| < | < | ||
| > | > | ||
| [ | [ | ||
| ] | ] |