If I make a couple of assumptions about your task and data, there's a fairly simple solution, which I'll show as a stand-alone script (conversion to an effective module is left as an exercise... ;).

The assumptions are: (1) splitting lines on [\s:]+ gives a reasonable "parsing" for creating a regex template (though this should be easy to adjust); (2) the similarity among strings is always as shown in your examples, with every line having the same token count; (3) either you already have similar strings segregated according to their common patterns, or you can easily segregate them (e.g. by grepping for a specific "Error NNN" from a larger list).
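To illustrate assumption (1), here is a minimal sketch of what split returns when that pattern is wrapped in capturing parentheses, which is how the script uses it: the separators come back as tokens too, so they can be reproduced literally in the template. The sample line is made up for illustration (modeled on the "Error 124" pattern), not taken from your data.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample line, for illustration only.
my $line = 'Error 124 on host7:sda1 no space left';

# Capturing parentheses make split return the separators as well,
# so words and delimiters alternate in the token list.
my @tkns = split /([\s:]+)/, $line;

print scalar(@tkns), " tokens\n";
print join( '|', @tkns ), "\n";
```

Because the separators count as tokens, two lines only share a token count if they have the same number of words and the same number of separator runs, which is what assumption (2) relies on.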

If those assumptions work for you, the following script produces these regexes for your two sets of sample data:

regex: ^(?-xism:Error\ 123\ on\ )(\w+)(?-xism:\ file\ not\ found\ error)$

regex: ^(?-xism:Error\ 124\ on\ )(\w+)(?-xism:\:)(\w+)(?-xism:\ no\ space\ left)$

Here's the script:

#!/usr/bin/perl

use strict;
use warnings;

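my @tknhist;         # $tknhist[$i]{$token} = how often $token appears at position $i
my %tkns_per_line;   # token count => number of lines having that count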

while (<>) {
    chomp;
    my @tkns = split /([\s:]+)/;   # capturing parens keep the separators as tokens
    $tkns_per_line{ scalar @tkns }++;
    for my $i ( 0 .. $#tkns ) {
        $tknhist[$i]{$tkns[$i]}++;
    }
}

if ( scalar keys %tkns_per_line > 1 ) {
    warn "Data lines have variable token counts:\n";
    for my $len ( sort {$a<=>$b} keys %tkns_per_line ) {
        warn sprintf( "%8d lines have %3d tokens\n",
                      $tkns_per_line{$len}, $len );
    }
    die "This is not a situation we can deal with\n";
}

my $template = '^';  # begin with "start of string"
my $subtemplate = my $lastcond = '';

for my $i ( 0 .. $#tknhist ) {
    my @types = keys %{$tknhist[$i]};
    if ( @types == 1 ) {
        $subtemplate .= $types[0];
        $lastcond = 'matched';
    }
    else {
        # choose \w or \W based on one (arbitrary) value seen at this position
        my $ch = ( $types[0] =~ /\w/ ) ? '\w' : '\W';
        if ( $lastcond eq 'matched' ) {
            $template .= qr/\Q$subtemplate\E/;
            $lastcond = $subtemplate = '';
        }
        $template .= join '', '(', $ch, '+)';
    }
}
$template .= qr/\Q$subtemplate\E/ if ( $lastcond eq 'matched' );
$template .= '$';  # finish with "end of string"

print "regex: $template\n";

(updated to remove an extraneous variable)
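Once you have a template, applying it to pull out the variable fields is straightforward. The following is only a sketch: the regex is a hand-written equivalent of the second one shown above, with the (?-xism:...) wrappers dropped (they are just how qr// stringified on my perl, and newer perls stringify it differently), and the input line and variable names are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hand-written equivalent of the second generated regex, without the
# (?-xism:...) wrappers that appear when a qr// object is stringified.
my $re = qr/^Error\ 124\ on\ (\w+)\:(\w+)\ no\ space\ left$/;

# Hypothetical input line, for illustration only.
my $line = 'Error 124 on host7:sda1 no space left';

# In list context a successful match returns the captured groups.
my @fields = $line =~ $re;
if (@fields) {
    print "captured: @fields\n";
}
else {
    print "no match\n";
}
```

Each varying token position becomes one capture group, so the number of captured fields equals the number of positions where your sample lines differed.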

In reply to Re: String similarities and pattern matching by graff
in thread String similarities and pattern matching by Phalcon123
