Re^2: Regex: matching character which happens exactly once

> Just to make sure I understand the question, you're looking for a regex that does the equivalent of the following regex,

Yes, even without testing this looks right, your embedded Perl excludes all former characters in a character class.*

> Personally, I don't find this particular (??{ }) construct too terrible...

It's a theoretical question, I wouldn't use a regex anyway. It's about exploring the limits.

> but without (?{ }) or (??{ })? (Anything else that's not allowed?)

Basically yes.

The less cheating the better, ie a portable solution would be perfect. (Bigly doubt)

Using implementation details or experimental features less so. (Sad )

> Update: A bit of research on variable-width lookbehind:

Regarding the older discussions:

I don't see how \K could be applied here, the perldocs are IMHO wrong to claim it was generally allowing variable length look behinds.

IMHO does \K only apply for s/// .

Cheers Rolf
_{(addicted to the Perl Programming Language and ☆☆☆☆ :)

Je suis Charlie!}

*) that's the approach I would try to implement with recursive patterns

Comment on Re^2: Regex: matching character which happens exactly once

Replies are listed 'Best First'.
Re^3: Regex: matching character which happens exactly once by haukex (Archbishop) on Oct 23, 2017 at 19:26 UTC
I spent some more time playing with this, trying out different things, like recursion, or backtracking control verbs, similar to what you did here. I also tried inverting the condition, that is, match strings where each character is repeated at least once. Sadly, test cases like `"abacbc"` and `"abaxcbc"` leave me doubting whether it is possible at all with pure regexes, although now that I've said that some regex wizard will probably prove me wrong ;-) So despite not having achieved success yet, I thought I'd share one or two of the avenues I explored. One thing I tried was thinking about it as a prefix + single character + suffix. Defining the latter two is fairly easy, but the prefix is not, since for the test case `"abaxcbc"` I haven't yet found a way for either the prefix to "mark" the second `b` as "seen", or the suffix to look back and notice that `b` has been seen before. This is what I was meant regarding variable-width lookbehind, the `(SKIP)(FAIL)` trick shown by AnomalousMonk looked interesting, but does not seem applicable here. Another, simpler problem is that in a string like `"abccacb"`, the regex will easily misidentify the second `a` as the "single" character, since I couldn't find a way to look back - while I did manage to cover some of those cases with recursion, by doing stuff like `/ (?<r1>.)\g{r1}* (?&subpat) \g{r1}++ /`, I still couldn't cover all the cases. For example, I couldn't come up with a way to match `"abaxcbc"` against `/a.a/`, `/b.b/`, and `/c.c/` simultaneously and in a way that advanced the match position in a useful manner. For example, in `"abaxcbc"=~/b(.)b/`, `$1 eq 'axc'`, and while `cbc` could be matched with a lookahead, how does one determine that the `a` has been seen before? Alternatively, one could match `"abaxcbc"=~/ab(?=.b)a/`, but when you get to `cbc`, how do you determine that `b` has been seen before? (This is basically the same problem as mentioned above, no variable-width lookbehind.) Of course, I may have painted myself into a corner with all my (over)thinking, so now I'm really curious if there is a solution :-) (Update:* There is!) I also spent some time on generating test cases, which was an interesting little problem in itself. Although I am out of time for now, maybe something in the following is useful: #!/usr/bin/env perl use warnings; use strict; use feature 'say'; use Test::More; my $re = qr{ \A ## Reference Implementation ## (?<prefix> .* ) (?<single> (??{ length $+{prefix} ? '[^'.quotemeta($+{prefix}).']' : '.' }) ) (?<suffix> (?: (?! \g{single} ) . )* ) \z }msx; my $re_nope = qr{ \A ## one of my early failed attempts # the prefix contains only characters that repeat later on (?<prefix> (?: (?<rep> . ) (?= .* \g{rep} ) )* ) # the single character obviously can't be part of the prefix (?<single> (?! \g{rep} ) . ) # the suffix must not contain the "single" character (?<suffix> (?: (?! \g{single} ) . )* ) \z }msx; # run the regex against the test cases for my $len (1..7) { my $odo = deduped_odo( map ['a'..chr(ord('a')+$_-1)], 1..$len ); while (my @c = $odo->()) { my $str = join '', @c; my $expect = do { my %h; $h{$_}++ for @c; my %r = reverse %h; exists $r{1} }; is $str=~$re, $expect, $str.($expect?' =~':' !~').' re' # or, for regexes that match strings with no single chars: #is $str!~$re2, $expect, $str.($expect?' !~':' =~').' re2' or diag explain [ \%-, map [ $-[$_], eval "\$$_", $+[$_] ], 1..@+ ]; } } done_testing; say "FAIL: $_->{name}" # a shorter summary of failures for grep { !$_->{ok} } Test::More->builder->details; exit; # test case generation stuff follows sub odometer { # http://www.perlmonks.org/?node_id=1197785 my @w = map { [ 1, ref eq 'ARRAY' ? @$_ : $_ ] } @_; my $done; return sub { if ($done) { $done=0; return } my @cur = map {$$_[$$_[0]]} @w; for (my $i=$#w; $i>=0; $i--) { last if ++$w[$i][0]<@{$w[$i]}; $w[$i][0]=1; $done=1 unless $i; } return wantarray ? @cur : join '', @cur; } } sub deduped_odo { my $odo = odometer(@_); # this sequence can probably be generated directly... my %seen; return sub { while (1) { my @next = $odo->() or return; my $pat = do { my ($i,%h)=('A'); join '', map {$h{$_}//=$i++} @next }; if ( not $seen{$pat}++ ) { return wantarray ? @next : join '', @next } } } } [download] Update 2019-10-05: I've just released Algorithm::Odometer::Tiny!	[reply] [d/l] [select]
Re^4: Regex: matching character which happens exactly once by AnomalousMonk (Archbishop) on Oct 24, 2017 at 03:55 UTC
... the `(SKIP)(FAIL)` trick ... looked interesting, but does not seem applicable here. I, too, have been trying to press this trick into service for this application and all my efforts have ended in tears. ... how does one determine that [some character] has been seen before? The `(SKIP)(FAIL)` trick works with interpolations, variable alternations and variably-quantified atoms that are known at compile time. But as you noted, you can't know what characters you will encounter in an arbitrary string until run time, and then it's too late. Maybe that's why such impure practices as `(?{ ...code... })` were invented in the first place. ... doubting whether it is possible at all with pure regexes ... I'm slowly and reluctantly being driven to this conclusion myself — but I haven't yet given up all hope! (BTW: I don't consider a solution fully acceptable unless it can be adapted to extract all singleton characters from a string in one pass. All the "pure" solutions I've seen so far only seem to work for a single singleton per string. Oh, well...) Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]