Beefy Boxes and Bandwidth Generously Provided by pair Networks
"be consistent"
 
PerlMonks  

Edit distance between regular expression

by future.open (Initiate)
on Sep 08, 2008 at 05:49 UTC ( [id://709711]=perlquestion: print w/replies, xml ) Need Help??

future.open has asked for the wisdom of the Perl Monks concerning the following question:

I need to calculate edit distance~ between two regex's.
e.g distance between *cd* and abcdef should be 0.
distance between *cxd* and adcdef should be 1.
I am not sure about how to proceed.

~http://en.wikipedia.org/wiki/Levenshtein_distance

Here I am talking about glob....and the comparison is between a regex and a static string
  • Comment on Edit distance between regular expression

Replies are listed 'Best First'.
Re: Edit distance between regular expressions
by hawtin (Prior) on Sep 08, 2008 at 08:35 UTC

    First of all I think you actually want the distance from the regex to a fixed string, not between two regex. This is possibly a (slightly) easier problem.

    Secondly I suspect that the problem (at least for regex but maybe not for glob) is undecidable. Consider the matchs:

    /(ab|ac|ce).{3}(ge|ef|gh)/ aceabdef ageabdxf

    Clearly the regex matches the first string (i.e a distance of 0), but for the second you could validly be asking "How many modifications to the regex are required to create a match" or "How many inserts/deletes are needed on the string for the regex to match". These are of course different questions.

    If you think of the pattern space of all possible strings then each string is a single location while a regex is an area. It is quite legitimate to ask for the distance between two points, and that is easy to calculate. It is also possible to ask if a region encloses a point, that is does a string match a regex. The question you are asking is what is the distance from the nearest point in an area. Simple changes in the regex can have large effects on the area covered. The power of regex makes your question hard.

      Secondly I suspect that the problem (at least for regex but maybe not for glob) is undecidable.

      I don't think so, at least in the computer science sense of "undecidable". For every perl regex (that doesn't do evil stuff with code assertions) you can make it match any string with this transformation:

      m/($old_regex)?/

      (Matches the empty string and thus every possible string).

      Which is an edit distance of 3, if we count on a character base. So all you have to check is every possible regex with an edit distance of 1 or 2, which are still very many but finite. (Assuming this kind of transformation is actually allowed)

      I don't know if that leads to any kind of practical solution though, and I don't know if OP actually talks about regexes or globs.

Re: Edit distance between regular expressions
by moritz (Cardinal) on Sep 08, 2008 at 06:47 UTC
    e.g distance between *cd* and abcdef should be 0.

    I hope you're aware that *cd* isn't a valid regex? * is a quantifier, and can't stand at the start of a regex. Maybe you mean globs instead? (The wildcards that you use on the command line).

    Update: More importantly, do you want your problem to be solved for regex, or for globs?

    Another update: Do you want the edit distance between two globs/regex, or between one and a target string?

Re: Edit distance between regular expressions
by Anonymous Monk on Sep 08, 2008 at 06:35 UTC
Re: Edit distance between regular expression
by dHarry (Abbot) on Sep 09, 2008 at 08:21 UTC

    The mathematical concept of "distance" is quite broad but then again rigorously defined. In general the distance function (or metric) has to have certain properties. The Levenshtein Distance (LD) is a metric for measuring the "amount" of difference between two strings.

    In your case the question is different. You don’t compare two strings but a string and a glob. By definition this is not possible unless the glob results in one string only. Otherwise the LD is simply not defined.

    In the example you give: you suggest that LD(*cxd*,adcdef)=1. But this is only so when *cxd* expands into one string only which is one edit operation away from adcdef.

    Maybe you could generalize the distance function for your particular application. It might not be a proper distance function anymore but may still be useful depending on your particular problem. After expanding the glob into {s1, s2, …, sn} you could determine the LD for all those strings with the original string: {d1, d2, …, dn}. Depending on what meaning it has to you, I don’t know the particular application of it, you can then take the lowest, highest, average or whatever value that makes sense to you.

    I wonder what the particular application is you're after.

      Hi! then, do you think that the edit distance between two regular expressions is undecidable?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://709711]
Approved by lidden
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others sharing their wisdom with the Monastery: (5)
As of 2024-04-24 08:42 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found