An interesting statement of the problem. Here's a non-regex approach. (I think a regex approach would be so complex as to be more trouble than it's worth. (Update: No: LanX's regex solution here is IMHO quite simple and maintainable. It's also much faster!))

Withdrawn code. Nothing really wrong with it. Just don't like it.

Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 14:20:47 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; use constant TOKENS => pack('C*', 0 .. 255); open my $fh, '<', DICTIONARY; TEMPLATE: for my $template (qw(neocene abcdbab Mississippi abccbccbddb)) { my $et = essence($template); printf "template '$template' (tokenized %s): \n", pp $et; my $count = 0; WORD: while (my $word = <$fh>) { chomp $word; my $ew = essence($word); next WORD unless $et eq $ew; print " '$word' \n"; ++$count; } # end while WORD printf " (%d words) \n", $count; seek $fh, 0, 0; # for next dictionary pass } # end for TEMPLATE close $fh; sub essence { my ($string, ) = @_; my $essence; my $t = 0; my %seen; $essence .= $seen{$_} ? $seen{$_} : ($seen{$_} = substr TOKENS, $t +++, 1) for split '', $string; return $essence; } ^Z template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words)

Update: Hmmm... On second thought...
In the essence() function, the statement
    $essence .= $seen{$_} ? ... : ... for split '', $string;
is buggy. If the tokens string TOKENS contains the equivalent of a '0' character which is ever actually used as a token, $seen{$_} will be evaluated as false if the same character is seen again and tokenization will be incorrect. I think a proper statement would be
    $essence .= exists $seen{$_} ? ... : ... for split '', $string;
but I haven't tested this.

Update: Niftier code. Avoids use of substr. Probably faster.

Win8 Strawberry 5.8.9.5 (32) Wed 08/17/2022 18:25:54 C:\@Work\Perl\monks >perl use strict; use warnings; use autodie; use Data::Dump qw(dd pp); # word-per-line dictionary. use constant DICTIONARY => 'C:/@Work/moby/mwords/354984si.ngl'; open my $fh, '<', DICTIONARY; chomp(my @dictionary = <$fh>); # remove all newlines close $fh; printf "%d words in dictionary \n", scalar @dictionary; my @templates = qw( neocene abcdbab 0123101 &*?+*&* Mississippi abccbccbddb 0000 ); printf "%d scan templates \n", scalar @templates; my $start = time; my $template_scans; TEMPLATE: for my $template (@templates) { my $tt = tokenize($template); printf "template '$template' (tokenized %s): \n", pp $tt; my $count = 0; WORD: for my $word (@dictionary) { chomp $word; my $tw = tokenize($word); next WORD unless $tt eq $tw; print " '$word' \n"; ++$count; } # end while WORD print " ($count words) \n"; ++$template_scans; } # end for TEMPLATE printf "%d template scans. total time for all scans: %d secs. \n", $template_scans, time - $start; INIT { # begin pre-initialized closure for tokenize() # @tokens array must not contain any false char. my @tokens = grep $_, map chr, 0 .. 0xff; # should be enough :) sub tokenize { my ($string, ) = @_; use warnings FATAL => qw(uninitialized); # guard $t out of range my $t = 0; # must not access beyond @tokens array my $toks; my %seen; $toks .= $seen{$_} ||= $tokens[$t++] for split '', $string; return $toks; # # also works. a bit slower, probably because it # # uses 2 intermediate arrays instead of 1. # my %seen; # return join '', map $seen{$_} ||= $tokens[$t++], split '', $string +; } } # end closure for tokenize() ^Z 354984 words in dictionary 7 scan templates template 'neocene' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'abcdbab' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '0123101' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template '&*?+*&*' (tokenized "\0\1\2\3\1\0\1"): 'kaitaka' 'lauhala' 'metreme' 'neocene' 'tempete' (5 words) template 'Mississippi' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template 'abccbccbddb' (tokenized "\0\1\2\2\1\2\2\1\3\3\1"): 'mississippi' (1 words) template '0000' (tokenized "\0\0\0\0"): 'mmmm' 'oooo' (2 words) 7 template scans. total time for all scans: 68 secs.
Also run under Strawberry 5.30.3.1. (Runs about 16% faster under 5.30 for some reason.)

(Kaitaka, lauhala and etcetera? They all seem to be real words, although not all English.)

WRT the template "Mississippi": if a lower-case "m" had also been present in this template, it would have been treated as a separate character token. Maybe a little more template pre-processing is needed.


Give a man a fish:  <%-{-{-{-<


In reply to Re^2: Nonrepeating characters in an RE (updated) by AnomalousMonk
in thread Nonrepeating characters in an RE by BernieC

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.