in reply to Nonrepeating characters in an RE

that's tricky with a regex, because you need to hack a negative lookahead for each position

DB<130> @a = ("abc","aab","abb","aba") DB<131> for (@a) { say "$_ : $1,$2" if /^(.)(?!.*\1)(.)(?!.*\2)/ } abc : a,b DB<132>

There are advanced features for repeating patterns where you don't need to know the length in advance. (see perlretut )

But I'd rather prefer a simple split and %seen-hash solution, which is far easier to maintain.

update

Wait ... please try (and test!!!) this solution with relative backrefrences:

DB<142> @a = ("abc","aab","abb", "aba","abcd","abca" ) DB<143> for (@a) { say "$_" if /^ ( (.) (?!.*\g-1) )+ $/x } abc abcd DB<144>

update

still this regex solution will have O(n^2), while a %seen hash solution can do in O(n)

Cheers Rolf
(addicted to the Perl Programming Language :)
Wikisyntax for the Monastery

Replies are listed 'Best First'.
Re^2: Nonrepeating characters in an RE (updated)
by hv (Prior) on Aug 16, 2022 at 02:46 UTC

    I think it's not so bad doing it with a regexp if you're generating it programmatically. And dealing with it via split is easy when all letters must be distinct, but also gets more complex when you have to honour the template.

    I think the direct version for the given template would look something like the below, which doesn't look too hard to generate programmatically either for this template or for other templates more generally:

    m{ (.) (?!\1) (.) (?!\1|\2) (.) (?!\1|\2|\3) (.) (?!\1|\2|\3|\4) (.) (?!\1|\2|\3|\4|\5) (.) \1 }x

    For your update version, thanks - I've not noticed before that we have \g-1 (though I think I'd want to add the optional braces - it doesn't look remotely atomic to me without them). But I believe your example is also for "all distinct letters", it isn't clear to me how you'd use it for BernieC's templates.

    For performance: I'm assuming this is for matching something like English words in a dictionary, so the length is unlikely to be a problem - I'd look for a different approach if the templates were commonly going to have more than 10-20 distinct letters in them. There's a lot gained by the fact that matching a regexp is a single perl op: a split-based solution is likely to lose a lot more on the op overhead than it gains on the algorithmic complexity, if the target really is something like dictionary words.

      I think a hash approach is likely to be faster.

      If the regex approach is taken, a simpler regex would seem to be
          qr{ ([\Q$template\E]) .*? \1 }xms
      where $template is the template string. Any $string that matches against this regex has a repeated character of the template. Repeated characters in the string that are not in the template are ignored (I assume this is what BernieC wants). If there are repeated characters in the template, the repeats are effectively ignored. However, this qr// will not handle an empty-string template.


      Give a man a fish:  <%-{-{-{-<

      > OP > For example, my "template" might look like this: "abcdefa" and I already have the code that generates (.)?????\1. I can't figure how to make the "?"s say "these guys all have to be distinct".

      > But I believe your example is also for "all distinct letters", it isn't clear to me how you'd use it for BernieC's templates.

      OK taking under assumption, that what the OP meant was to only match strings build from a "template" with an anchored substring having the constraint of (certain?) unique letters, one can easily adapt the above regex solution.

      use v5.12; use warnings; use Data::Dump; #ddx my @words = map { "..X${_}X.." } <{a,b,c,d}{a,b,c,d}{a,b,c,d}>; my $class = "[abc]"; #$class = '.'; #uncomment to match d too my $len = 3; for (@words) { say "$_" if $_ =~ / X # start anchor ( ($class) (?! .{0,$len} \g{-1} # relative backreference .*? X ) ){$len} X # end anchor /x; }

      ..XabcX.. ..XacbX.. ..XbacX.. ..XbcaX.. ..XcabX.. ..XcbaX..

      Hope that's clearer now. :)

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      update

      added part

      .*? X # end anchor

      update

      replaced .*? with .{0,$len}

      NB: technically $len-1 would be better, but the end anchor makes this redundant

      update

      OK I agree that this is a mess. But I'd rather wait for clarifiation from the OP before suggesting better solutions.

      > But I believe your example is also for "all distinct letters", it isn't clear to me how you'd use it for BernieC's templates.

      I find the wording "template" confusing and thought it's about constructing a complicated regex by templates, i.e. like it's done with HTML.

      If "template" is supposed to mean character class $chars = "abc.." whose chars are never repeated anywhere in the string, a negated approach is probably the simplest

      $str !~ / ([$chars]) .* \1 /x

      edit

      use v5.12; use warnings; my @words = qw"abc aab abb aba abcd abca"; my $chars = "ad"; for (@words) { say "$_" if $_ !~ / ([$chars]) .* \1 /x }

      abc abb # NB: b wasn't in chars abcd

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

        When the OP talks about "a template that might look like abcdefa", I interpret that to represent a pattern that words should follow, as in a substitution cypher: the first and seventh letters should be the same, all other letters should be different from those and from each other. Thus it should match "suitors" and "realtor", but not "bracken" or "suffers" or "straits" or "albania".

        So the regexp translation becomes: for each letter in the template, if it has not been seen before introduce a new capture, and insist it doesn't match any of the previous captures; if it has been seen before, just match the appropriate previous capture:

        sub template_to_regexp { my($template) = @_; my $seen_count; my %seen_at; my $regexp = ''; for my $i (0 .. length($template) - 1) { my $chr = substr($template, $i, 1); my $seen = $seen_at{$chr}; if (defined $seen) { $regexp .= "\\$seen"; next; } # else it's a new template character $regexp .= sprintf '(?!%s)', join '|', map "\\$_", 1 .. $seen_coun +t if $seen_count; $seen_at{$chr} = ++$seen_count; $regexp .= '(.)'; } return qr{$regexp}; }
Re^2: Nonrepeating characters in an RE (updated)
by BernieC (Pilgrim) on Aug 16, 2022 at 00:59 UTC
    Thanks. I agree a seen-hash is the right thing: use the simple RE to do a first-cut and then a seen-hash to weed out the dupes. Thanks again...