Re: approximate regular expression

Some variations on a Text::LevenshteinXS distance theme. The substr version (d) might be best (i.e., fastest), but I haven't Benchmark-ed anything. All these solutions find overlapping matches. All examples are of distance 1 (Update: but other distances could be used). (Sorry for any line-wrap in the output.)

>perl -wMstrict -le
"use Text::LevenshteinXS qw(distance);
 ;;
 my $pattern = 'JEJE';
 my $string  = 'EJKJUJHJDJEJEJEDEJOJOJJJAHJHJSHJEFEJUJEJUJKIJS';
 ;;
 my @matches = match_near_a(1, $pattern, \$string);
 printf 'a: ';  printf q{'%s' at %2d  }, @$_ for @matches;
 print '';
 ;;
 @matches = match_near_b(1, $pattern, \$string);
 printf 'b: ';  printf qq{'$_'        } for @matches;
 print ''
 ;;
 @matches = match_near_c(1, $pattern, \$string);
 printf 'c: ';  printf q{'%s' at %2d  }, @$_ for @matches;
 print '';
 ;;
 @matches = match_near_d(1, $pattern, \$string);
 printf 'd: ';  printf q{'%s' at %2d  }, @$_ for @matches;
 ;;
 sub match_near_a {
   my ($dist, $p, $sr) = @_;
   ;;
   my $len = length $p;
   local our @m;
   use re 'eval';
   $$sr =~ m{
     (.{$len})
     (?{ push @m, [ $^N, $-[1] ] if distance($^N, $p) <= $dist })
     (*FAIL)
     }xms;
   return @m;
   }
 ;;
 sub match_near_b {
   my ($dist, $p, $sr) = @_;
   ;;
   my $len = length $p;
   return
     grep distance($_, $p) <= $dist,
       $$sr =~ m{ (?= (.{$len})) }xmsg;
   }
 ;;
 sub match_near_c {
   my ($dist, $p, $sr) = @_;
   ;;
   my $len = length $p;
   my @matches;
   distance($1, $p) <= $dist and push @matches, [ $1, $-[1] ]
     while $$sr =~ m{ (?= (.{$len})) }xmsg;
   return @matches;
   }
 ;;
 sub match_near_d {
   my ($dist, $p, $sr) = @_;
   ;;
   my $len = length $p;
   my $max_offset = length($$sr) - $len;
   return
     map {
       my $ss = substr $$sr, $_, $len;
       distance($ss, $p) <= $dist ? [ $ss, $_ ] : ();
       }
     0 .. $max_offset
     ;
   }
"
a: 'JDJE' at  7  'JEJE' at  9  'JEJE' at 11  'JEDE' at 13  'JEFE' at 3
+1  'JUJE' at 35  'JEJU' at 37

b: 'JDJE'        'JEJE'        'JEJE'        'JEDE'        'JEFE'     
+   'JUJE'        'JEJU'

c: 'JDJE' at  7  'JEJE' at  9  'JEJE' at 11  'JEDE' at 13  'JEFE' at 3
+1  'JUJE' at 35  'JEJU' at 37

d: 'JDJE' at  7  'JEJE' at  9  'JEJE' at 11  'JEDE' at 13  'JEFE' at 3
+1  'JUJE' at 35  'JEJU' at 37
[download]

Comment on Re: approximate regular expression Download Code

Replies are listed 'Best First'.
Re^2: approximate regular expression by jrblas (Initiate) on Mar 23, 2012 at 12:38 UTC
This is a very good option (I've tried as you specify and it works). However, I need to put in $pattern a regular expression. This regexp would vary both in nature and size. For instance: /^D{3,12}DJH^F{1,4}../ How could I explore for 'approximate matches to this regexp'?	[reply]
Re^3: approximate regular expression by AnomalousMonk (Archbishop) on Mar 23, 2012 at 15:42 UTC
As always, the Devil is in the details. Unless and until you can make clear (first to yourself, then to others) the meaning of the phrase "an approximate match to a regex", you may make but little progress. To me, for instance, the notion of a "regex match" already incorporates vague notions of fuzzyness and approximation. Just what does it mean to add approximation to approximation, or to measure its degree?	[reply]
Re^4: approximate regular expression by jrblas (Initiate) on Mar 25, 2012 at 07:18 UTC
Thanks for the suggestion Imagine, for instance, a regexp like this `$pattern="[^D]{2,3}FGR{2}..H";` There is not fuzzyness on this. From a list of strings, some will match and some will NOT match this regular expression. The question is: considering ONLY substitutions, allow a maximum of 2 mismatches when evaluating if the string matches or not to $pattern. This will result in the migration of some strings from the 'non-matching group' to the 'matching group'. How to do this in Perl?	[reply] [d/l]
Re^5: approximate regular expression by Anonymous Monk on Mar 25, 2012 at 14:48 UTC
Re^3: approximate regular expression by JavaFan (Canon) on Mar 23, 2012 at 12:45 UTC
Unless you write your regular expression engine, and use the plugin system to use it, regular expressions are really the wrong way to attack this.	[reply]