in reply to Nonrepeating characters in an RE

I'm somewhat surprised by the amount of confusion in this thread, and am starting to wonder if somehow it is me that is being stupid.

From the first paragraph I discern that BernieC has a template, and from that wants to derive a regular expression that will match things according to the pattern described by the template.

The second paragraph shows that from the template "abcdefa" the regular expression should look something like /(.)?????\1/, and he wants help filling in the middle part, where each '?' should be replaced by a regexp fragment that ensures it represents a distinct letter.

So the template forms a pattern that inputs should match, where each letter in the template should match some arbitrary character of the input - but the same template letter should match the same character each time, and distinct template letters should always match distinct characters.

Thus an input like "Mississippi" would match a template "abccbccbddb". Given the template "abccbccbddb" he wants to generate a regular expression that will match an input like "Mississippi".

That's what I get from reading the OP.

Replies are listed 'Best First'.
Re^2: Nonrepeating characters in an RE (updated)
by AnomalousMonk (Archbishop) on Aug 17, 2022 at 18:48 UTC

    An interesting statement of the problem. Here's a non-regex approach. (I think a regex approach would be so complex as to be more trouble than it's worth. (Update: No: LanX's regex solution here is IMHO quite simple and maintainable. It's also much faster!))

    Withdrawn code. Nothing really wrong with it. Just don't like it.

    Update: Niftier code. Avoids use of substr. Probably faster.

    Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 18:25:54 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = qw( neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000 ); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $tt = tokenize($template); printf "template '$template' (tokenized %s): \n", pp $tt; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; my $tw = tokenize($word); next WORD unless $tt eq $tw; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; INIT { # begin pre-initialized closure for tokenize() # @tokens array must not contain any false char. my @tokens = grep $_, map chr, 0 .. 0xff; # should be enough :) sub tokenize { my ($string, ) = @_; use warnings FATAL => qw(uninitialized); # guard $t out of range my $t = 0; # must not access beyond @tokens array my $toks; my %seen; $toks .= $seen{$_} ||= $tokens[$t++] for split '', $string; return $toks; # # also works. a bit slower, probably because it # # uses 2 intermediate arrays instead of 1. # my %seen; # return join '', map $seen{$_} ||= $tokens[$t++], split '', $string +; } } # end closure for tokenize() ^Z 354984 words in dictionary 7 scan templates template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&*?+*&*' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template '0000' (tokenized "\0\0\0\0"): 'mmmm' 'oooo' (2 words) 7 template scans. total time for all scans: 68 secs.
    Also run under Strawberry 5.30.3.1. (Runs about 16% faster under 5.30 for some reason.)

    (Kaitaka, lauhala and etcetera? They all seem to be real words, although not all English.)

    WRT the template "Mississippi": if a lower-case "m" had also been present in this template, it would have been treated as a separate character token. Maybe a little more template pre-processing is needed.


    Give a man a fish:  <%-{-{-{-<

Re^2: Nonrepeating characters in an RE
by LanX (Saint) on Aug 17, 2022 at 19:59 UTC
    TIMTOInterpretation (and I would never call you stupid ;-)

    But I like this one, so:

    use v5.12; use warnings; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re" if is_uniq(@matches); } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); $re .= $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } for split //, $template; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; }

    =>

    Match: Mississippi re: (.)(.)(.)\3\2\3\3\2(.)\4\2 Match: Mossossoppo re: (.)(.)(.)\3\2\3\3\2(.)\4\2

    if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a *FAIL

    Both are not canonical regexes but IMHO far better to maintain.

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

    update

    refactoring: better documentation and variable names

      > if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a *FAIL

      For completeness, here we go.

      But the need for use re 'eval'; surprised me, and might reduce the general usability.

      use v5.12; use warnings; use re 'eval'; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re"; } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); # not sure if that's better readable than before $re = join "", map { $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } } split //, $template; $re .= '(?(?{not is_uniq( @{^CAPTURE} ) }) (*FAIL) )'; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; }

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        Here's a variation of your all-in-one regex code here that's designed to run under Perl version 5.8. (I think it'll even run under 5.6, but I can't test this.) It's been tested under Strawberries 5.8.9.5 (32-bit) and 5.30.3.1 (64-bit).

        I was quite pleasantly surprised by its speed. However, the code runs about three times more slowly under 5.30 than under 5.8. I assume this is due to the many modifications made to the regex engine over the years to accommodate Unicode. Any comment on this would be of interest. I added an optimization to avoid the "uniqueness" test if the generated template regex has only one capture group. This is of trivial benefit, but was such low-hanging fruit that I thought it would be a shame not to enjoy it.

        Win8 Strawberry 5.8.9.5 (32) Sat 08/20/2022 10:53:36 C:\@Work\Perl\monks >perl use 5.008; # i think code should work with 5.6, but this is untested use strict; use warnings; use autodie; use List::MoreUtils qw(uniq); use Data::Dump qw(dd pp); # for debug # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = ('', ' ', qw( neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000 )); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $rx_template = templ2regex($template); $rx_template = qr{ \A $rx_template \z }xms; # add appropriate bou +nds # print "template '$template' $rx_template: \n"; <STDIN>; # for de +bug print "template '$template': \n"; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; next WORD unless $word =~ $rx_template; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; sub templ2regex { my ($template, ) = @_; # empty template never matches. # (what is proper match behavior of empty template?) return qr{ (?!) }xms if $template eq ''; my %t2g; # map template char => capture group number my $n_cg = 1; # number of current capture group my $rs = # regex string join ' ', map $t2g{$_} ? "\\$t2g{$_}" : scalar($t2g{$_} = $n_cg++, '(. +)'), split '', $template ; --$n_cg; # undo last post-increment: now true highest capture gro +up use re 'eval'; # print "=== \$template '$template' \$rs $rs \n"; <STDIN>; # for +debug return qr{ $rs }xms if $n_cg < 2; # no uniqueness test needed my $cap_vars = join ', ', map "\$$_", 1 .. $n_cg; # capture vars + active $rs .= "\n(?(?{ $n_cg != uniq $cap_vars }) (?!))"; # add uniquene +ss test # print "=== \$template '$template' \$rs $rs \n"; <STDIN>; # for +debug return qr{ $rs }xms; } # end sub templ2regex() ^Z 354984 words in dictionary 9 scan templates template '': (0 words) template ' ': '2' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' (27 words) template 'neocene': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&*?+*&*': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi': 'mississippi' (1 words) template 'abccbccbddb': 'mississippi' (1 words) template '0000': 'mmmm' 'oooo' (2 words) 9 template scans. total time for all scans: 4 secs.

        Update: Fixed dropped word/too many words in "... pleasantly surprised its speed. I notice, however, that the code ..." in second paragraph.


        Give a man a fish:  <%-{-{-{-<