Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hey all, I would like to match with regexes but tolerate a specific number of errors. So for instance, I would like this to be accepted:

    my $value = '23DX-245C';
    if ($value =~ /\d{2}\s{2}-\d{2}\s{1}/) {
        print "Matched\n";
    }

Because the difference between the pattern and the actual value is only one character (an inserted digit). I have one lead already from a CPAN search, but I was wondering if any monks had any experience with this kind of thing.

Replies are listed 'Best First'.
Re: Fuzzy Matching
by halley (Prior) on Aug 26, 2003 at 18:06 UTC
    What kinds of 'errors' are acceptable? Must the length and syntax match but the letters differ? Are missing characters acceptable? Are extra characters acceptable?

    Look for the Levenshtein Distance: Text::Levenshtein.
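A minimal sketch of how edit distance could be applied to the original question, under some assumptions. The pure-Perl edit_distance below stands in for a CPAN module so that the example is self-contained, and shape() is a hypothetical helper (not from any module) that collapses digits to 9 and letters to A, so a value can be compared against the template shape 99AA-99A. The template assumes the letter positions in the question were meant to be letters rather than whitespace.

```perl
use strict;
use warnings;
use List::Util qw(min);

# Dynamic-programming edit distance: insertions, deletions and
# substitutions each cost 1. Only two rows of the table are kept.
sub edit_distance {
    my ($s, $t) = @_;
    my @prev = (0 .. length($t));
    for my $i (1 .. length($s)) {
        my @cur = ($i);
        for my $j (1 .. length($t)) {
            my $cost = substr($s, $i - 1, 1) eq substr($t, $j - 1, 1) ? 0 : 1;
            push @cur, min(
                $prev[$j] + 1,          # deletion
                $cur[$j - 1] + 1,       # insertion
                $prev[$j - 1] + $cost,  # substitution or match
            );
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# Hypothetical helper: reduce a value to its "shape" so that any digit
# compares equal to any other digit, and likewise for letters.
sub shape {
    my ($value) = @_;
    (my $shape = $value) =~ tr/0-9/9/;   # every digit becomes 9
    $shape =~ tr/A-Za-z/A/;              # every letter becomes A
    return $shape;
}

# Assumed template: two digits, two letters, a hyphen, two digits, one letter.
my $template = '99AA-99A';
for my $value ('23DX-24C', '23DX-245C', '2X-245C') {
    printf "%-10s errors: %d\n", $value, edit_distance(shape($value), $template);
}
```

With these definitions the sample value 23DX-245C has the shape 99AA-999A, which is one insertion away from the template, so a threshold of 1 would accept it.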

    There's also a command-line tool called 'agrep' which may help, even if only to focus your question by reading their documentation and widen your search for other answers.

    --
    [ e d @ h a l l e y . c c ]

Re: Fuzzy Matching
by VSarkiss (Monsignor) on Aug 26, 2003 at 18:09 UTC

    If the general structure is the same, and only the lengths vary, you can use the "at least N but not more than M" construct in regular expressions:

    if ($value =~ /\d{1,3}\s{1,2}-\d{3,5}\s{1,4}/)
    This will match 1, 2, or 3 digits, then 1 or 2 whitespace, then a hyphen, then 3, 4, or 5 digits, then 1 to 4 whitespace characters. As an aside, in situations like this, the /x modifier is very handy, because it lets you put the comments right in your pattern:
    if ($value =~ /\d{1,3}   # 1 to 3 digits
                   \s{1,2}   # 1 or 2 whitespace
                   -         # exactly one hyphen
                   \d{3,5}   # 3, 4, or 5 digits
                   \s{1,4}   # 1 to 4 whitespace
                  /x)
    BTW, the pattern you've given doesn't match your sample data. Did you mean \w instead of \s?
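A rough sketch of the range approach applied to the sample data, assuming letters were intended where the question wrote \s. The bounds are one possible choice (each field allowed to be off by one character), and loosely_matches is a hypothetical helper name used only for this illustration.

```perl
use strict;
use warnings;

# Assumption: letters were intended where the question wrote \s.
# Each field may be off by one character in length; the last field
# is not allowed to drop below one letter here.
my $loose = qr/
    ^
    \d{1,3}         # expected 2 digits
    [A-Za-z]{1,3}   # expected 2 letters
    -               # exactly one hyphen
    \d{1,3}         # expected 2 digits
    [A-Za-z]{1,2}   # expected 1 letter
    $
/x;

# Hypothetical helper: returns 1 if the value fits the loosened template.
sub loosely_matches {
    my ($value) = @_;
    return $value =~ $loose ? 1 : 0;
}

print loosely_matches('23DX-245C') ? "Matched\n" : "No match\n";
```

Note that ranges like these accept several deviations at once (one per field) and do not report how many occurred, so they only approximate "at most N errors".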

Re: Fuzzy Matching
by CombatSquirrel (Hermit) on Aug 26, 2003 at 20:00 UTC
    I just had another idea. Depending on the size of your search space, you might be interested in the thread Regexp generating strings?. In a reply I gave a program that generates all matching strings for regexes with certain limitations. You could extend the program to suit your needs (e.g. add $regex =~ s/\\d/'(' . join('|', ('0' .. '9')) . ')'/eg;) and then calculate the minimal distance from your input to the possible matches.
    Hope this helped.
    CombatSquirrel.
    Entropy is the tendency of everything going to hell.
Re: Fuzzy Matching
by dragonchild (Archbishop) on Aug 26, 2003 at 17:53 UTC
    Use a lot of ?'s?

    ------
    We are the carpenters and bricklayers of the Information Age.

    The idea is a little like C++ templates, except not quite so brain-meltingly complicated. -- TheDamian, Exegesis 6

    Please remember that I'm crufty and crotchety. All opinions are purely mine and all code is untested, unless otherwise specified.

      Would that allow me to get a count of how many errors had been introduced, though?
        Untested:
        my $string = "abcd-ef1";
        my @matches = $string =~ /^(\w{2})?(\d)?(\w{2})?(-)?(\w{2})?(\d)?$/;
        my $errors = grep { !defined $_ } @matches;
        print "Errors is $errors\n";
        ----
        Errors is 1

        Basically, you're expecting everything to match. If a piece doesn't, that non-match counts as an error. You'll have to fiddle with it, I think, to get it to do exactly what you want, but that should give you a good start. (This is, of course, assuming that you have to (re)invent the wheel.)
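A sketch of the same idea taken down to single characters, assuming the template from the question with letters where it had \s. missing_chars is a hypothetical name; every template character gets its own optional group, so each group that took no part in the match counts as one dropped character.

```perl
use strict;
use warnings;

# Hypothetical helper: count template characters missing from $string.
# Assumed template: digit digit letter letter hyphen digit digit letter.
# Returns undef if the string cannot be matched at all.
sub missing_chars {
    my ($string) = @_;
    my @groups = $string =~ /^(\d)?(\d)?([A-Za-z])?([A-Za-z])?(-)?(\d)?(\d)?([A-Za-z])?$/
        or return undef;
    # Groups that did not participate in the match come back as undef.
    return scalar grep { !defined $_ } @groups;
}

for my $value ('23DX-24C', '2DX-24C', '3DX24C', '23DX-245C') {
    my $errors = missing_chars($value);
    printf "%-10s %s\n", $value, defined $errors ? "errors: $errors" : "no match";
}
```

This only detects dropped characters. A value with an extra character, such as the 23DX-245C sample from the question, is longer than the template allows, so it fails to match and the helper returns undef.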


Re: Fuzzy Matching
by CombatSquirrel (Hermit) on Aug 26, 2003 at 18:32 UTC
    How about
    my $value = '23DX-445C';
    if ($value =~ m!(\d+)(\w+)(-+)(\d+)(\w+)!) {
        # Expected lengths 2, 2, 1, 2, 1 are taken from the original pattern
        print "Difference: "
            . (abs(length($1) - 2) + abs(length($2) - 2) + abs(length($3) - 1)
               + abs(length($4) - 2) + abs(length($5) - 1))
            . "\n";
    }
    CombatSquirrel.
    Entropy is the tendency of everything going to hell.