I spent some more time playing with this, trying out different things, like recursion, or backtracking control verbs, similar to what you did here. I also tried inverting the condition, that is, match strings where each character is repeated at least once. Sadly, test cases like "abacbc" and "abaxcbc" leave me doubting whether it is possible at all with pure regexes, although now that I've said that some regex wizard will probably prove me wrong ;-)
So despite not having achieved success yet, I thought I'd share one or two of the avenues I explored. One thing I tried was thinking about it as a prefix + single character + suffix. Defining the latter two is fairly easy, but the prefix is not, since for the test case "abaxcbc" I haven't yet found a way for either the prefix to "mark" the second b as "seen", or the suffix to look back and notice that b has been seen before. This is what I was meant regarding variable-width lookbehind, the (*SKIP)(*FAIL) trick shown by AnomalousMonk looked interesting, but does not seem applicable here.
Another, simpler problem is that in a string like "abccacb", the regex will easily misidentify the second a as the "single" character, since I couldn't find a way to look back - while I did manage to cover some of those cases with recursion, by doing stuff like / (?<r1>.)\g{r1}* (?&subpat) \g{r1}++ /, I still couldn't cover all the cases. For example, I couldn't come up with a way to match "abaxcbc" against /a.*a/, /b.*b/, and /c.*c/ simultaneously and in a way that advanced the match position in a useful manner. For example, in "abaxcbc"=~/b(.*)b/, $1 eq 'axc', and while cbc could be matched with a lookahead, how does one determine that the a has been seen before? Alternatively, one could match "abaxcbc"=~/ab(?=.*b)a/, but when you get to cbc, how do you determine that b has been seen before? (This is basically the same problem as mentioned above, no variable-width lookbehind.)
Of course, I may have painted myself into a corner with all my (over)thinking, so now I'm really curious if there is a solution :-) (Update: There is!)
I also spent some time on generating test cases, which was an interesting little problem in itself. Although I am out of time for now, maybe something in the following is useful:
#!/usr/bin/env perl
use warnings;
use strict;
use feature 'say';
use Test::More;
my $re = qr{ \A ## Reference Implementation ##
(?<prefix> .* )
(?<single> (??{ length $+{prefix}
? '[^'.quotemeta($+{prefix}).']' : '.' }) )
(?<suffix> (?: (?! \g{single} ) . )* )
\z }msx;
my $re_nope = qr{ \A ## one of my early failed attempts
# the prefix contains only characters that repeat later on
(?<prefix> (?: (?<rep> . ) (?= .* \g{rep} ) )* )
# the single character obviously can't be part of the prefix
(?<single> (?! \g{rep} ) . )
# the suffix must not contain the "single" character
(?<suffix> (?: (?! \g{single} ) . )* )
\z }msx;
# run the regex against the test cases
for my $len (1..7) {
my $odo = deduped_odo( map ['a'..chr(ord('a')+$_-1)], 1..$len );
while (my @c = $odo->()) {
my $str = join '', @c;
my $expect = do { my %h; $h{$_}++ for @c;
my %r = reverse %h; exists $r{1} };
is $str=~$re, $expect, $str.($expect?' =~':' !~').' re'
# or, for regexes that match strings with no single chars:
#is $str!~$re2, $expect, $str.($expect?' !~':' =~').' re2'
or diag explain
[ \%-, map [ $-[$_], eval "\$$_", $+[$_] ], 1..@+ ];
}
}
done_testing;
say "FAIL: $_->{name}" # a shorter summary of failures
for grep { !$_->{ok} } Test::More->builder->details;
exit;
# test case generation stuff follows
sub odometer { # http://www.perlmonks.org/?node_id=1197785
my @w = map { [ 1, ref eq 'ARRAY' ? @$_ : $_ ] } @_;
my $done;
return sub {
if ($done) { $done=0; return }
my @cur = map {$$_[$$_[0]]} @w;
for (my $i=$#w; $i>=0; $i--) {
last if ++$w[$i][0]<@{$w[$i]};
$w[$i][0]=1;
$done=1 unless $i;
}
return wantarray ? @cur : join '', @cur;
}
}
sub deduped_odo {
my $odo = odometer(@_);
# this sequence can probably be generated directly...
my %seen;
return sub {
while (1) {
my @next = $odo->() or return;
my $pat = do { my ($i,%h)=('A');
join '', map {$h{$_}//=$i++} @next };
if ( not $seen{$pat}++ )
{ return wantarray ? @next : join '', @next }
}
}
}
Update 2019-10-05: I've just released Algorithm::Odometer::Tiny! | [reply] [d/l] [select] |
... the (*SKIP)(*FAIL) trick ... looked interesting, but does not seem applicable here.
I, too, have been trying to press this trick into service for this application and all my efforts have ended in tears.
... how does one determine that [some character] has been seen before?
The (*SKIP)(*FAIL) trick works with interpolations, variable alternations and variably-quantified atoms that are known at compile time. But as you noted, you can't know what characters you will encounter in an arbitrary string until run time, and then it's too late. Maybe that's why such impure practices as (?{ ...code... }) were invented in the first place.
... doubting whether it is possible at all with pure regexes ...
I'm slowly and reluctantly being driven to this conclusion myself — but I haven't yet given up all hope!
(BTW: I don't consider a solution fully acceptable unless it can be adapted to extract all singleton characters from a string in one pass. All the "pure" solutions I've seen so far only seem to work for a single singleton per string. Oh, well...)
Give a man a fish: <%-{-{-{-<
| [reply] [d/l] [select] |