Beefy Boxes and Bandwidth Generously Provided by pair Networks
Keep It Simple, Stupid
 
PerlMonks  

Nonrepeating characters in an RE

by BernieC (Pilgrim)
on Aug 15, 2022 at 23:32 UTC ( [id://11146148]=perlquestion: print w/replies, xml ) Need Help??

BernieC has asked for the wisdom of the Perl Monks concerning the following question:

I have an odd problem that's hurting my head: I'm trying to construct an RE that will only match if the letter in any position does *NOT* match any other character in the string. I'm constructing this RE with a perl program and building the RE from a template. It is the *template* that says "these letters should be distinct" and then I want to run through a few thousand words to pick out the words that "match".

For example, my "template" might look like this: "abcdefa" and I already have the code that generates (.)?????\1. I can't figure how to make the "?"s say "these guys all have to be distinct".

Replies are listed 'Best First'.
Re: Nonrepeating characters in an RE (updated)
by LanX (Saint) on Aug 16, 2022 at 00:33 UTC
    that's tricky with a regex, because you need to hack a negative lookahead for each position

    DB<130> @a = ("abc","aab","abb","aba") DB<131> for (@a) { say "$_ : $1,$2" if /^(.)(?!.*\1)(.)(?!.*\2)/ } abc : a,b DB<132>

    There are advanced features for repeating patterns where you don't need to know the length in advance. (see perlretut )

    But I'd rather prefer a simple split and %seen-hash solution, which is far easier to maintain.

    update

    Wait ... please try (and test!!!) this solution with relative backrefrences:

    DB<142> @a = ("abc","aab","abb", "aba","abcd","abca" ) DB<143> for (@a) { say "$_" if /^ ( (.) (?!.*\g-1) )+ $/x } abc abcd DB<144>

    update

    still this regex solution will have O(n^2), while a %seen hash solution can do in O(n)

    Cheers Rolf
    (addicted to the Perl Programming Language :)
    Wikisyntax for the Monastery

      I think it's not so bad doing it with a regexp if you're generating it programmatically. And dealing with it via split is easy when all letters must be distinct, but also gets more complex when you have to honour the template.

      I think the direct version for the given template would look something like the below, which doesn't look too hard to generate programmatically either for this template or for other templates more generally:

      m{ (.) (?!\1) (.) (?!\1|\2) (.) (?!\1|\2|\3) (.) (?!\1|\2|\3|\4) (.) (?!\1|\2|\3|\4|\5) (.) \1 }x

      For your update version, thanks - I've not noticed before that we have \g-1 (though I think I'd want to add the optional braces - it doesn't look remotely atomic to me without them). But I believe your example is also for "all distinct letters", it isn't clear to me how you'd use it for BernieC's templates.

      For performance: I'm assuming this is for matching something like English words in a dictionary, so the length is unlikely to be a problem - I'd look for a different approach if the templates were commonly going to have more than 10-20 distinct letters in them. There's a lot gained by the fact that matching a regexp is a single perl op: a split-based solution is likely to lose a lot more on the op overhead than it gains on the algorithmic complexity, if the target really is something like dictionary words.

        I think a hash approach is likely to be faster.

        If the regex approach is taken, a simpler regex would seem to be
            qr{ ([\Q$template\E]) .*? \1 }xms
        where $template is the template string. Any $string that matches against this regex has a repeated character of the template. Repeated characters in the string that are not in the template are ignored (I assume this is what BernieC wants). If there are repeated characters in the template, the repeats are effectively ignored. However, this qr// will not handle an empty-string template.


        Give a man a fish:  <%-{-{-{-<

        > OP > For example, my "template" might look like this: "abcdefa" and I already have the code that generates (.)?????\1. I can't figure how to make the "?"s say "these guys all have to be distinct".

        > But I believe your example is also for "all distinct letters", it isn't clear to me how you'd use it for BernieC's templates.

        OK taking under assumption, that what the OP meant was to only match strings build from a "template" with an anchored substring having the constraint of (certain?) unique letters, one can easily adapt the above regex solution.

        use v5.12; use warnings; use Data::Dump; #ddx my @words = map { "..X${_}X.." } <{a,b,c,d}{a,b,c,d}{a,b,c,d}>; my $class = "[abc]"; #$class = '.'; #uncomment to match d too my $len = 3; for (@words) { say "$_" if $_ =~ / X # start anchor ( ($class) (?! .{0,$len} \g{-1} # relative backreference .*? X ) ){$len} X # end anchor /x; }

        ..XabcX.. ..XacbX.. ..XbacX.. ..XbcaX.. ..XcabX.. ..XcbaX..

        Hope that's clearer now. :)

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

        update

        added part

        .*? X # end anchor

        update

        replaced .*? with .{0,$len}

        NB: technically $len-1 would be better, but the end anchor makes this redundant

        update

        OK I agree that this is a mess. But I'd rather wait for clarifiation from the OP before suggesting better solutions.

        > But I believe your example is also for "all distinct letters", it isn't clear to me how you'd use it for BernieC's templates.

        I find the wording "template" confusing and thought it's about constructing a complicated regex by templates, i.e. like it's done with HTML.

        If "template" is supposed to mean character class $chars = "abc.." whose chars are never repeated anywhere in the string, a negated approach is probably the simplest

        $str !~ / ([$chars]) .* \1 /x

        edit

        use v5.12; use warnings; my @words = qw"abc aab abb aba abcd abca"; my $chars = "ad"; for (@words) { say "$_" if $_ !~ / ([$chars]) .* \1 /x }

        abc abb # NB: b wasn't in chars abcd

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

      Thanks. I agree a seen-hash is the right thing: use the simple RE to do a first-cut and then a seen-hash to weed out the dupes. Thanks again...
Re: Nonrepeating characters in an RE
by kcott (Archbishop) on Aug 16, 2022 at 02:50 UTC

    G'day BernieC,

    I considered a solution using List::Util::uniqstr(). Unfortunately, your OP is somewhat ambiguous: you first say 'the *template* that says "these letters should be distinct"'; then you say 'my "template" might look like this: "abcdefa"' (where, with two 'a's, the letters are not distinct). I've provided two solutions; hopefully, one of them does what you want.

    [Caveat: uniqstr first appeared in List::Util v1.45; the first stable Perl version with that version of List::Util is v5.26; if you have an older version of Perl, you can get a newer version of List::Util from CPAN.]

    Solution 1 (pm_11146148_uniq_str_chars.pl):

    #!/usr/bin/env perl use strict; use warnings; use List::Util 1.45 'uniqstr'; use Test::More; my @tests = ( [abcdefa => 0], [abcdef => 1], [xyz => 1], [zzz => 0], ); plan tests => 0+@tests; for my $test (@tests) { my ($str, $exp) = @$test; is length($str) == uniqstr(split //, $str), !!$exp; }

    Output:

    $ ./pm_11146148_uniq_str_chars.pl 1..4 ok 1 ok 2 ok 3 ok 4

    Solution 2 (pm_11146148_uniq_str_chars_2.pl):

    #!/usr/bin/env perl use strict; use warnings; use List::Util 1.45 'uniqstr'; use Test::More; my @tests = ( ['abc', abcdefa => 0], ['def', abcdefa => 1], ['abc', abcdef => 1], ['xyz', xyz => 1], ['xyz', zzz => 0], ['xy', zzz => 1], ); plan tests => 0+@tests; for my $test (@tests) { my ($tmpl, $str, $exp) = @$test; my %tmpls = map +($_ => 1), split //, $tmpl; my @chars = split //, $str; my @tmpl_chars = grep exists $tmpls{$_}, @chars; is 0+@tmpl_chars == uniqstr(@tmpl_chars), !!$exp; }

    Output:

    $ ./pm_11146148_uniq_str_chars_2.pl 1..6 ok 1 ok 2 ok 3 ok 4 ok 5 ok 6

    — Ken

Re: Nonrepeating characters in an RE
by tybalt89 (Monsignor) on Aug 16, 2022 at 15:11 UTC

    Assuming you are working on something like a cryptogram problem, something like this:

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11146148 use warnings; my $template = 'abcdefa'; my $regex = ''; my %used; for my $char ( split //, $template ) { if( exists $used{$char} ) # seen before { $regex .= "\\$used{$char}"; } else # new different character { $regex .= join '', map("(?!\\$_)", values %used), "(.)"; $used{$char} = 1 + keys %used; } } my $qr = qr/^$regex$/; # the real regex use Data::Dump 'dd'; dd $qr; @ARGV = '/usr/share/dict/words'; /$qr/ and print, exit while <>; # show first match

    Outputs:

    qr/^(.)(?!\1)(.)(?!\2)(?!\1)(.)(?!\2)(?!\1)(?!\3)(.)(?!\4)(?!\3)(?!\2) +(?!\1)(.)(?!\5)(?!\2)(?!\1)(?!\3)(?!\4)(.)\1$/ algebra

    EDIT: a couple of newlines makes the regex easier to read (still same algorithm)

    #!/usr/bin/perl use strict; # https://perlmonks.org/?node_id=11146148 use warnings; my $template = 'abcdefa'; my $regex = ''; my %used; for my $char ( split //, $template ) { if( exists $used{$char} ) # seen before { $regex .= "\\$used{$char}\n"; } else # new different character { $regex .= join '', map("(?!\\$_)", sort values %used), "(.)\n"; $used{$char} = 1 + keys %used; } } my $qr = qr/^$regex$/x; # the real regex use Data::Dump 'dd'; dd $qr; @ARGV = '/usr/share/dict/words'; use List::AllUtils qw{ sample }; print sample 10, grep /$qr/, <>

    Outputs:

    qr/^(.) (?!\1)(.) (?!\1)(?!\2)(.) (?!\1)(?!\2)(?!\3)(.) (?!\1)(?!\2)(?!\3)(?!\4)(.) (?!\1)(?!\2)(?!\3)(?!\4)(?!\5)(.) \1 $/x swipe's enforce tugboat discard slogans sailors neutron shock's singles snipe's
Re: Nonrepeating characters in an RE
by hippo (Bishop) on Aug 16, 2022 at 16:24 UTC

    Hello BernieC. You might have noticed from the replies that there is some confusion about what it is you are trying to achieve. If you have a read through the excellent How to ask better questions using Test::More and sample data you will see how you can provide an SSCCE with tests to show unambiguously what it is that you require. That way folks won't spend time solving a problem which you don't have. :-)


    🦛

Re: Nonrepeating characters in an RE
by atcroft (Abbot) on Aug 16, 2022 at 03:36 UTC

    Interesting problem. Out of curiousity, I played around with the idea and got the following (ugly!) one-liner to work (reformatted for readability):

    $ perl -le ' my @arry = qw/ a b cd gefge hhi jjkk l m nn o /; print q{Originals: }, join q{ }, @arry; my @n_arry = grep { if ( $_ !~ m/(.).*?\g-1/ ) { $_; } } @arry; print q{Uniques: }, join q{ }, @n_arry; ' Originals: a b cd gefeg hhi jjk l m nn o Uniques: a b cd l m o $

    It can probably be cleaned up or made shorter (as I was just trying to get something working).

    I originally approached the idea of doing a split-sort-join on the original string to get an ordered string, testing that for duplicate characters, but after getting it working I realized the grep() was the meat of the code, and the rest could be eliminated.

    For anyone curious, here was my original code:

    Hope that helps.

    Here's a shorter value for a single string:

    $ perl -le ' my $str = shift; print q{Original: }, $str; print q{Unique: }, grep { if ( $_ !~ m/(.).*?\g-1/ ) { $_; }; } ( $str, ); ' foobar Original: foobar Unique: $ perl -le ' my $str = shift; print q{Original: }, $str; print q{Unique: }, grep { if ( $_ !~ m/(.).*?\g-1/ ) { $_; }; } ( $str, ); ' bar Original: bar Unique: bar

    Update: (2022-08-15) - Removed unnecessary '/g' and '/x' from the regex in the grep. Removed an errant line of text that was left while editing the response. Removed READMORE tags around first code snippet. Fixed punctuation error in response text. Added shorter code examples at end.

Re: Nonrepeating characters in an RE
by hv (Prior) on Aug 17, 2022 at 15:19 UTC

    I'm somewhat surprised by the amount of confusion in this thread, and am starting to wonder if somehow it is me that is being stupid.

    From the first paragraph I discern that BernieC has a template, and from that wants to derive a regular expression that will match things according to the pattern described by the template.

    The second paragraph shows that from the template "abcdefa" the regular expression should look something like /(.)?????\1/, and he wants help filling in the middle part, where each '?' should be replaced by a regexp fragment that ensures it represents a distinct letter.

    So the template forms a pattern that inputs should match, where each letter in the template should match some arbitrary character of the input - but the same template letter should match the same character each time, and distinct template letters should always match distinct characters.

    Thus an input like "Mississippi" would match a template "abccbccbddb". Given the template "abccbccbddb" he wants to generate a regular expression that will match an input like "Mississippi".

    That's what I get from reading the OP.

      An interesting statement of the problem. Here's a non-regex approach. (I think a regex approach would be so complex as to be more trouble than it's worth. (Update: No: LanX's regex solution here is IMHO quite simple and maintainable. It's also much faster!))

      Withdrawn code. Nothing really wrong with it. Just don't like it.

      Update: Niftier code. Avoids use of substr. Probably faster.

      Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 18:25:54 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = qw( neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000 ); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $tt = tokenize($template); printf "template '$template' (tokenized %s): \n", pp $tt; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; my $tw = tokenize($word); next WORD unless $tt eq $tw; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; INIT { # begin pre-initialized closure for tokenize() # @tokens array must not contain any false char. my @tokens = grep $_, map chr, 0 .. 0xff; # should be enough :) sub tokenize { my ($string, ) = @_; use warnings FATAL => qw(uninitialized); # guard $t out of range my $t = 0; # must not access beyond @tokens array my $toks; my %seen; $toks .= $seen{$_} ||= $tokens[$t++] for split '', $string; return $toks; # # also works. a bit slower, probably because it # # uses 2 intermediate arrays instead of 1. # my %seen; # return join '', map $seen{$_} ||= $tokens[$t++], split '', $string +; } } # end closure for tokenize() ^Z 354984 words in dictionary 7 scan templates template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&*?+*&*' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template '0000' (tokenized "\0\0\0\0"): 'mmmm' 'oooo' (2 words) 7 template scans. total time for all scans: 68 secs.
      Also run under Strawberry 5.30.3.1. (Runs about 16% faster under 5.30 for some reason.)

      (Kaitaka, lauhala and etcetera? They all seem to be real words, although not all English.)

      WRT the template "Mississippi": if a lower-case "m" had also been present in this template, it would have been treated as a separate character token. Maybe a little more template pre-processing is needed.


      Give a man a fish:  <%-{-{-{-<

      TIMTOInterpretation (and I would never call you stupid ;-)

      But I like this one, so:

      use v5.12; use warnings; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re" if is_uniq(@matches); } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); $re .= $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } for split //, $template; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; }

      =>

      Match: Mississippi re: (.)(.)(.)\3\2\3\3\2(.)\4\2 Match: Mossossoppo re: (.)(.)(.)\3\2\3\3\2(.)\4\2

      if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a *FAIL

      Both are not canonical regexes but IMHO far better to maintain.

      Cheers Rolf
      (addicted to the Perl Programming Language :)
      Wikisyntax for the Monastery

      update

      refactoring: better documentation and variable names

        > if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a *FAIL

        For completeness, here we go.

        But the need for use re 'eval'; surprised me, and might reduce the general usability.

        use v5.12; use warnings; use re 'eval'; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re"; } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); # not sure if that's better readable than before $re = join "", map { $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } } split //, $template; $re .= '(?(?{not is_uniq( @{^CAPTURE} ) }) (*FAIL) )'; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; }

        Cheers Rolf
        (addicted to the Perl Programming Language :)
        Wikisyntax for the Monastery

Re: Nonrepeating characters in an RE (updated)
by AnomalousMonk (Archbishop) on Aug 16, 2022 at 08:12 UTC

    Here are a couple of solutions that assume that in a string being tested for character repetition, the only characters that will be checked for repetition are those that appear in a "template". E.g., if the template is 'abc', the strings 'xxx' and 'xxxabcxxx' have no repetition, and the strings 'abca' and 'xaxax' do.

    Hash solution (likely the fastest, but I haven't Benchmark-ed):

    Regex approach: Also tested under Strawberry 5.30.3.1.

    Update: Oops... My first posting had some Frankenstein code, the sewn-together bits of a few ideas. The heart of the OPed build_rx_repeated() function was

    my ($rx) = map { length() ? qr{ ([\Q$_\E]) .*? \1 }xms : qr/(?!)/ } join '', split '', $template ; return $rx;
    I've calmed down a bit and the code's less crazy now.


    Give a man a fish:  <%-{-{-{-<

Re: Nonrepeating characters in an RE (updated)
by AnomalousMonk (Archbishop) on Aug 16, 2022 at 07:45 UTC

    Hi BernieC.

    I'm a bit confused about your requirements. The other replies seem to address the problem of detecting if a string has repeated characters. In this case, I don't understand the purpose of the "template" (with or without repeated characters in it).

    If the template is 'abc', does the string 'xxxabcxxx' have any repeated characters? I assume 'abca' does. How about 'xxaxxaxx', and if so, what are the specific repeats?

    IOW, I don't understand the relationship of the "template" to the strings presumably being tested against it. Can you clarify?

    Update: hippo has already made this point, but it bears repeating: Please see the articles How to ask better questions using Test::More and sample data and Short, Self-Contained, Correct Example for ways to clearly state a problem or a set of requirements associated with a problem so that others may easily and directly address your question. Please help us to help you.


    Give a man a fish:  <%-{-{-{-<

Re: Nonrepeating characters in an RE
by ikegami (Patriarch) on Aug 17, 2022 at 06:30 UTC

    Since you have a template and talk about generating a pattern,

    my %captures; my $pat = join "", map { if ( $captures{ $_ } ) { "\\$captures{$_}" } else { my $capture = 1 + keys( %captures ); $captures{ $_ } = $capture; if ( $capture == 1 ) { "(.)" } else { "(?!".( join "|", map { "\\$_" } 1 .. $capture-1 ).")(.)" } } } split //, $template; my $re = qr/^$pat\z/s;

    For

    my $template = "abcdefa";

    this generates

    ^ (.) (?!\1)(.) (?!\1|\2)(.) (?!\1|\2|\3)(.) (?!\1|\2|\3|\4)(.) (?!\1|\2|\3|\4|\5)(.) \1 \z

    (Line breaks added for readability.)

      This might be faster:

      ^ (.) (?: (?: \1 ) (*COMMIT) (*FAIL) | (.) ) (?: (?: \1 | \2 ) (*COMMIT) (*FAIL) | (.) ) (?: (?: \1 | \2 | \3 ) (*COMMIT) (*FAIL) | (.) ) (?: (?: \1 | \2 | \3 | \4 ) (*COMMIT) (*FAIL) | (.) ) (?: (?: \1 | \2 | \3 | \4 | \5 ) (*COMMIT) (*FAIL) | (.) ) \1 \z
Re: Nonrepeating characters in an RE
by rsFalse (Chaplain) on Oct 27, 2022 at 22:04 UTC
    Not REGEX approach, but with two hashes.
    #!/usr/bin/perl -l use warnings; use strict; my $template = 'neocene'; my $uniq = 4; my $len = length $template; my @dict; @dict = <>; # <-- 354984si.ngl chomp @dict; my @good = grep { my %occ1; while( $template =~ m/./g ){ my $key = $& . substr $_, $-[0], 1; $occ1{ $key } ++; } my %occ2; map $occ2{ $_ } ++, map { ( split '' )[ 1 ] } keys %occ1; $uniq == keys %occ1 and $uniq == keys %occ2; } grep $len == length, @dict; print for @good;
    OUTPUT:
    kaitaka lauhala metreme neocene tempete
    UPD.removed mistakenly added word from the output which is absent in the dictionary.
Re: Nonrepeating characters in an RE
by Anonymous Monk on Aug 17, 2022 at 08:47 UTC

    I'm possibly missing something since all the other responses are so complex, but here goes anyway:

    my @test = qw/abc aab abb aba/; for (@test) { say if not /(.).*?\1/; }

      What you and I and, it seems, everyone else are missing is a clear understanding of BernieC's requirements.

      What I understand (or imagine I understand) is that one must detect repetition only of a set of characters given in a "template" string. So if the template is 'abc', the string 'xxxabcxxx' has no repetition and the string 'aba' does. See Re^3: Nonrepeating characters in an RE (updated) and Re: Nonrepeating characters in an RE (updated) for my take on solutions to what I imagine BernieC's requirements to be.


      Give a man a fish:  <%-{-{-{-<

        Well, in that case...

        my @test = qw/xyzabcxxx xyzaabxxx xyzabbxxx xyzabaxxx/; for (@test) { # only check for 'abc' say if not /([abc]).*?\1/; }

        Still quite simple, isn't it?

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://11146148]
Approved by LanX
Front-paged by marto
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (4)
As of 2024-03-29 10:28 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found