in reply to approximate regular expression

Some variations on a Text::LevenshteinXS distance theme. The substr version (d) might be best (i.e., fastest), but I haven't Benchmark-ed anything. All these solutions find overlapping matches. All examples are of distance 1 (Update: but other distances could be used). (Sorry for any line-wrap in the output.)

>perl -wMstrict -le "use Text::LevenshteinXS qw(distance); ;; my $pattern = 'JEJE'; my $string = 'EJKJUJHJDJEJEJEDEJOJOJJJAHJHJSHJEFEJUJEJUJKIJS'; ;; my @matches = match_near_a(1, $pattern, \$string); printf 'a: '; printf q{'%s' at %2d }, @$_ for @matches; print ''; ;; @matches = match_near_b(1, $pattern, \$string); printf 'b: '; printf qq{'$_' } for @matches; print '' ;; @matches = match_near_c(1, $pattern, \$string); printf 'c: '; printf q{'%s' at %2d }, @$_ for @matches; print ''; ;; @matches = match_near_d(1, $pattern, \$string); printf 'd: '; printf q{'%s' at %2d }, @$_ for @matches; ;; sub match_near_a { my ($dist, $p, $sr) = @_; ;; my $len = length $p; local our @m; use re 'eval'; $$sr =~ m{ (.{$len}) (?{ push @m, [ $^N, $-[1] ] if distance($^N, $p) <= $dist }) (*FAIL) }xms; return @m; } ;; sub match_near_b { my ($dist, $p, $sr) = @_; ;; my $len = length $p; return grep distance($_, $p) <= $dist, $$sr =~ m{ (?= (.{$len})) }xmsg; } ;; sub match_near_c { my ($dist, $p, $sr) = @_; ;; my $len = length $p; my @matches; distance($1, $p) <= $dist and push @matches, [ $1, $-[1] ] while $$sr =~ m{ (?= (.{$len})) }xmsg; return @matches; } ;; sub match_near_d { my ($dist, $p, $sr) = @_; ;; my $len = length $p; my $max_offset = length($$sr) - $len; return map { my $ss = substr $$sr, $_, $len; distance($ss, $p) <= $dist ? [ $ss, $_ ] : (); } 0 .. $max_offset ; } " a: 'JDJE' at 7 'JEJE' at 9 'JEJE' at 11 'JEDE' at 13 'JEFE' at 3 +1 'JUJE' at 35 'JEJU' at 37 b: 'JDJE' 'JEJE' 'JEJE' 'JEDE' 'JEFE' + 'JUJE' 'JEJU' c: 'JDJE' at 7 'JEJE' at 9 'JEJE' at 11 'JEDE' at 13 'JEFE' at 3 +1 'JUJE' at 35 'JEJU' at 37 d: 'JDJE' at 7 'JEJE' at 9 'JEJE' at 11 'JEDE' at 13 'JEFE' at 3 +1 'JUJE' at 35 'JEJU' at 37

Replies are listed 'Best First'.
Re^2: approximate regular expression
by jrblas (Initiate) on Mar 23, 2012 at 12:38 UTC

    This is a very good option (I've tried as you specify and it works). However, I need to put in $pattern a regular expression. This regexp would vary both in nature and size. For instance: /^D{3,12}DJH^F{1,4}../ How could I explore for 'approximate matches to this regexp'?

      As always, the Devil is in the details. Unless and until you can make clear (first to yourself, then to others) the meaning of the phrase "an approximate match to a regex", you may make but little progress. To me, for instance, the notion of a "regex match" already incorporates vague notions of fuzzyness and approximation. Just what does it mean to add approximation to approximation, or to measure its degree?

        Thanks for the suggestion

        Imagine, for instance, a regexp like this

        $pattern="[^D]{2,3}FGR{2}..H";

        There is not fuzzyness on this. From a list of strings, some will match and some will NOT match this regular expression. The question is: considering ONLY substitutions, allow a maximum of 2 mismatches when evaluating if the string matches or not to $pattern. This will result in the migration of some strings from the 'non-matching group' to the 'matching group'.

        How to do this in Perl?

      Unless you write your regular expression engine, and use the plugin system to use it, regular expressions are really the wrong way to attack this.