Re: Nonrepeating characters in an RE

Replies are listed 'Best First'.
Re^2: Nonrepeating characters in an RE (updated) by AnomalousMonk (Archbishop) on Aug 17, 2022 at 18:48 UTC
An interesting statement of the problem. Here's a non-regex approach. (I think a regex approach would be so complex as to be more trouble than it's worth. (Update: No: LanX's regex solution here is IMHO quite simple and maintainable. It's also much faster!)) Withdrawn code. Nothing really wrong with it. Just don't like it. Read more... (3 kB) Update: Niftier code. Avoids use of substr. Probably faster. Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 18:25:54 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = qw( neocene abcdbab 0123101 &?+&* Mississippi abccbccbddb 0000 ); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $tt = tokenize($template); printf "template '$template' (tokenized %s): \n", pp $tt; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; my $tw = tokenize($word); next WORD unless $tt eq $tw; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; INIT { # begin pre-initialized closure for tokenize() # @tokens array must not contain any false char. my @tokens = grep $_, map chr, 0 .. 0xff; # should be enough :) sub tokenize { my ($string, ) = @_; use warnings FATAL => qw(uninitialized); # guard $t out of range my $t = 0; # must not access beyond @tokens array my $toks; my %seen; $toks .= $seen{$_} \|\|= $tokens[$t++] for split '', $string; return $toks; # # also works. a bit slower, probably because it # # uses 2 intermediate arrays instead of 1. # my %seen; # return join '', map $seen{$_} \|\|= $tokens[$t++], split '', $string +; } } # end closure for tokenize() ^Z 354984 words in dictionary 7 scan templates template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&?+&*' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template '0000' (tokenized "\0\0\0\0"): 'mmmm' 'oooo' (2 words) 7 template scans. total time for all scans: 68 secs. [download] Also run under Strawberry 5.30.3.1. (Runs about 16% faster under 5.30 for some reason.) (Kaitaka, lauhala and etcetera? They all seem to be real words, although not all English.) WRT the template "Mississippi": if a lower-case "m" had also been present in this template, it would have been treated as a separate character token. Maybe a little more template pre-processing is needed. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^2: Nonrepeating characters in an RE by LanX (Saint) on Aug 17, 2022 at 19:59 UTC
TIMTOInterpretation (and I would never call you stupid ;-) But I like this one, so: use v5.12; use warnings; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re" if is_uniq(@matches); } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); $re .= $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } for split //, $template; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; } [download] => `Match: Mississippi re: (.)(.)(.)\3\2\3\3\2(.)\4\2 Match: Mossossoppo re: (.)(.)(.)\3\2\3\3\2(.)\4\2` [download] if you really want to have it all in one regex, consider putting the uniq test into a `(?{...})` block which does a `*FAIL` Both are not canonical regexes but IMHO far better to maintain. Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery} update refactoring: better documentation and variable names	[reply] [d/l] [select]
Re^3: Nonrepeating characters in an RE by LanX (Saint) on Aug 18, 2022 at 18:30 UTC
> if you really want to have it all in one regex, consider putting the uniq test into a (?{...}) block which does a FAIL* For completeness, here we go. But the need for `use re 'eval';` surprised me, and might reduce the general usability. use v5.12; use warnings; use re 'eval'; my $re = templ2regex("Mississippi"); for my $inp (qw/Mississippi Mossossoppo Miiiiiiippi/) { if ( my @matches = ( $inp =~ /$re/ ) ) { say "Match: $inp re: $re"; } } # Mississippi -> (.)(.)(.)\3\2\3\3\2(.)\4\2 sub templ2regex { my ($template) = @_; my ( $re, %grp, $cnt ); # not sure if that's better readable than before $re = join "", map { $grp{$_} // do { $grp{$_} = "\\" . ++$cnt; "(.)" } } split //, $template; $re .= '(?(?{not is_uniq( @{^CAPTURE} ) }) (*FAIL) )'; return $re; } # @_ elements all distinct? sub is_uniq { my %uniq; @uniq{@_} = (); return @_ == keys %uniq; } [download] Cheers Rolf _{(addicted to the Perl Programming Language :) Wikisyntax for the Monastery}	[reply] [d/l] [select]
Re^4: Nonrepeating characters in an RE by AnomalousMonk (Archbishop) on Aug 20, 2022 at 19:17 UTC
Here's a variation of your all-in-one regex code here that's designed to run under Perl version 5.8. (I think it'll even run under 5.6, but I can't test this.) It's been tested under Strawberries 5.8.9.5 (32-bit) and 5.30.3.1 (64-bit). I was quite pleasantly surprised by its speed. However, the code runs about three times more slowly under 5.30 than under 5.8. I assume this is due to the many modifications made to the regex engine over the years to accommodate Unicode. Any comment on this would be of interest. I added an optimization to avoid the "uniqueness" test if the generated template regex has only one capture group. This is of trivial benefit, but was such low-hanging fruit that I thought it would be a shame not to enjoy it. Win8 Strawberry 5.8.9.5 (32) Sat 08/20/2022 10:53:36 C:\@Work\Perl\monks >perl use 5.008; # i think code should work with 5.6, but this is untested use strict; use warnings; use autodie; use List::MoreUtils qw(uniq); use Data::Dump qw(dd pp); # for debug # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = ('', ' ', qw( neocene abcdbab 0123101 &?+&* Mississippi abccbccbddb 0000 )); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $rx_template = templ2regex($template); $rx_template = qr{ \A $rx_template \z }xms; # add appropriate bou +nds # print "template '$template' $rx_template: \n"; <STDIN>; # for de +bug print "template '$template': \n"; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; next WORD unless $word =~ $rx_template; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; sub templ2regex { my ($template, ) = @_; # empty template never matches. # (what is proper match behavior of empty template?) return qr{ (?!) }xms if $template eq ''; my %t2g; # map template char => capture group number my $n_cg = 1; # number of current capture group my $rs = # regex string join ' ', map $t2g{$_} ? "\\$t2g{$_}" : scalar($t2g{$_} = $n_cg++, '(. +)'), split '', $template ; --$n_cg; # undo last post-increment: now true highest capture gro +up use re 'eval'; # print "=== \$template '$template' \$rs $rs \n"; <STDIN>; # for +debug return qr{ $rs }xms if $n_cg < 2; # no uniqueness test needed my $cap_vars = join ', ', map "\$$_", 1 .. $n_cg; # capture vars + active $rs .= "\n(?(?{ $n_cg != uniq $cap_vars }) (?!))"; # add uniquene +ss test # print "=== \$template '$template' \$rs $rs \n"; <STDIN>; # for +debug return qr{ $rs }xms; } # end sub templ2regex() ^Z 354984 words in dictionary 9 scan templates template '': (0 words) template ' ': '2' 'a' 'b' 'c' 'd' 'e' 'f' 'g' 'h' 'i' 'j' 'k' 'l' 'm' 'n' 'o' 'p' 'q' 'r' 's' 't' 'u' 'v' 'w' 'x' 'y' 'z' (27 words) template 'neocene': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&?+&': 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi': 'mississippi' (1 words) template 'abccbccbddb': 'mississippi' (1 words) template '0000': 'mmmm' 'oooo' (2 words) 9 template scans. total time for all scans: 4 secs. [download] Update:* Fixed dropped word/too many words in "... pleasantly surprised its speed. I notice, however, that the code ..." in second paragraph. Give a man a fish: `<%-{-{-{-<`	[reply] [d/l] [select]
Re^5: Nonrepeating characters in an RE (performance) by hv (Prior) on Aug 20, 2022 at 22:40 UTC
Re^6: Nonrepeating characters in an RE (performance) by LanX (Saint) on Aug 21, 2022 at 18:29 UTC
Re^5: Nonrepeating characters in an RE by LanX (Saint) on Aug 21, 2022 at 00:22 UTC
Re^6: Nonrepeating characters in an RE by hv (Prior) on Aug 21, 2022 at 21:10 UTC
Some notes below your chosen depth have not been shown here
Re^5: Nonrepeating characters in an RE by vr (Curate) on Aug 22, 2022 at 07:37 UTC
Re^6: Nonrepeating characters in an RE by rsFalse (Chaplain) on Oct 27, 2022 at 23:29 UTC

update