Beefy Boxes and Bandwidth Generously Provided by pair Networks
Think about Loose Coupling
 
PerlMonks  

Regex question once-only use of chars in a charset

by gje21c (Acolyte)
on May 15, 2011 at 06:19 UTC ( [id://904925]=perlquestion: print w/replies, xml ) Need Help??

gje21c has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks,

Immediate confession ... this is a Scrabble-related question of low importance ! And it's a pure regex question, but I've found the Perl Monks are the best for these things, just as Perl regex is certainly the best, I always use it, even from the command line.

I'd like to search a list of words (a dictionary list) to find those I can make with my letters. So, if I have AABCDEF I'd like to find all words with 1 to 7 of those letters in. I need a regexp like AABCDEF{1,7} but which only allows each letter to occur once, to mimic the fact that you have only one physical tablet per letter (stating the obvious here but, the plainer the context the better).

I hope this isn't obvious. I can't think of an option that does it, and intuitively that once-only using up of tokens is a counting thing quite alien to regex's pattern matching. But you never know.

ATdhvaannkcse, Greg E

Replies are listed 'Best First'.
Re: Regex question once-only use of chars in a charset
by CountZero (Bishop) on May 15, 2011 at 13:19 UTC
    I think it can be as simple as this:
    use Modern::Perl; open my $WORDLIST, '<', './wordlist.txt' or die $!; my $available = 'AABCDEF'; $available = join '?', sort split '', $available; $available .= '?'; while (<$WORDLIST>) { chomp; my $sorted = join '', sort split ''; say if $sorted =~ /^$available$/io; }
    Running this script with your 'AABCDEF' gives me the following results:
    abe ace aced baa bad bade be bead bed cab cad cade cafe dab dace deaf deb decaf fab facade face faced fad fade fed
    I use a 58,000 elements wordlist and it needed less than a few seconds to generate this result.

    CountZero

    A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      Cool! Sorting the word as well is a game changer! Then you just check off the letters in order.
        It is an old trick. Transform both sides of the comparison to a canonical form and the whole problems becomes much easier to solve.

        CountZero

        A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little or too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

      What's the o in say if $sorted =~ /^$available$/io;?
      I know i is for case-insensitive matching, but I can't find an o modifier in the documentation.

        The  /o modifier means "compile once" for a regex it modifies. Consider these examples:

        >perl -wMstrict -le "my $s = '1a2b3c'; ;; print qq{no /o}; for my $i (qw(3 2 1)) { print qq{matched '$1'} if $s =~ m{ ($i.) }xms; } ;; print qq{with /o}; for my $i (qw(3 2 1)) { print qq{matched '$1'} if $s =~ m{ ($i.) }xmso; } " no /o matched '3c' matched '2b' matched '1a' with /o matched '3c' matched '3c' matched '3c'

        The function of the  /o modifier has been generally replaced by the qr// regex object builder (see in perlop).

        I was a bit surprised not to see anything about  /o in perlre, but it is (briefly and obliquely) discussed in qr/STRING/msixpodual (5.14), and the following remains in perlretut (at least through 5.12):

        Optimizing pattern evaluation
        We pointed out earlier that variables in regexps are substituted before the regexp is evaluated:
        $pattern = 'Seuss'; while (<>) { print if /$pattern/; }
        This will print any lines containing the word "Seuss". It is not as efficient as it could be, however, because Perl has to re-evaluate (or compile) $pattern each time through the loop. If $pattern won't be changing over the lifetime of the script, we can add the "//o" modifier, which directs Perl to only perform variable substitutions once:
        #!/usr/bin/perl # Improved simple_grep $regexp = shift; while (<>) { print if /$regexp/o; # a good deal faster }
Re: Regex question once-only use of chars in a charset
by GrandFather (Saint) on May 15, 2011 at 07:04 UTC

    I doubt this fits what you have described, but it may be of interest:

    #!/usr/local/bin/perl use strict; use warnings; my @words = split /\s+/, <<WORDS; a alien all allows always an and are as best but can certainly command confession context counting dictionary does each even fact find for fo +und from have here hope i if immediate importance intuitively is just letter le +tters like list low make mimic monks my need never obvious occur of one only opti +on pattern per perl physical plainer pure question quite regex regexp search tabl +et that the these thing think this those to tokens up use using which with wor +ds you WORDS my %normLu; push @{$normLu{normWord ($_)}}, $_ for @words; my $match = buildRegex ('adeilln'); my @matches = map {@{$normLu{$_}}} grep {/^$match$/} keys %normLu; print "@matches\n"; sub normWord { return join '', sort split '', $_[0]; } sub buildRegex { my ($word) = @_; my %freq; ++$freq{$_} for split '', $word; return join '', map {"$_\{0,$freq{$_}\}"} sort keys %freq; }

    Prints:

    alien all and an a i
    True laziness is hard work
Re: Regex question once-only use of chars in a charset
by Anonymous Monk on May 15, 2011 at 06:45 UTC
    Use the various search options ( http://search.cpan.org,...) and find code to generate permutation as you want, and generate a list of words (I believe there is an example in Higher Order Perl).

    Then give this list to Regexp::Trie or Regexp::Assemble.

    It might be a roundabout way of doing it, and it might run a bit slow to generate the regex pattern, but it won't require learning anything :O and the regex will be optimized :D

Re: Regex question once-only use of chars in a charset
by JavaFan (Canon) on May 15, 2011 at 21:42 UTC
    Eh, why a regexp? This is what I use:
    #!/usr/bin/perl use 5.010; use strict; use warnings; use Getopt::Long; use List::Util 'sum'; my %tiles = qw [ A 1 B 3 C 3 D 2 E 1 F 4 G 2 H 4 I 1 J 8 K 5 L 1 M 3 N 1 O 1 P 3 Q 10 R 1 S 1 T 1 U 1 V 4 W 4 X 8 Y 4 Z 10 ]; die "No rack?\n" unless @ARGV; my ($rack, $blanks) = (@ARGV, 0); my $full = length($rack) + $blanks == 7; my %target; $target{$_}++ for split //, $rack; my @words = `cat /usr/share/dict/words`; chomp @words; my @good; my %discount; WORD: foreach my $word (@words) { next unless $word =~ /^[a-z]+$/; my %copy = %target; my $b = 0; my $d = 0; foreach my $c (split //, $word) { if (--$copy{$c} < 0) { $b++; $d += $tiles{uc $c}; $discount{$word} = $d; } next WORD if $b > $blanks; } push @good, $word; } my $tl = length($rack) + $blanks; @good = sort {$a->[1] <=> $b->[1]} map {[$_, sum (map {$tiles{uc $_}} split //) - ($discount{$_} || 0) + ($full && length($_) == $tl ? 50 : 0)]} @good; printf "%2d: %s\n", $_->[1], $_->[0] for @good; __END__
    Usage: ./program rack [nr-of-blanks], were rack are the non-blank tiles, and there's an optional number of blanks.
Re: Regex question once-only use of chars in a charset
by John M. Dlugosz (Monsignor) on May 15, 2011 at 10:32 UTC
    I think you can start with a simple regex that pulls those words from the main dictionary that contain only the allowed letters and are not too long. Make the regex a subset of the full test. But that will cull the list of a quarter million words down to a much smaller number, in an in-memory @list.

    Then apply the full analysis to only the entries on that list. You might employ subsequent passes with patters generated for the specific case: if you have 1 letter E, look for anything containing two E's and scratch them off, etc. You could generate such a case for each unique letter and then join them together with '|' to make a rejection test. Final pattern would be /e.*e|f.*f.*f|…/ to tell you if the candidate word has more than 1 E, or more than 2 Fs, etc.

    So you see the test for more than N of the same letter in your rack is N+1 copies separated by ".*".

Re: Regex question once-only use of chars in a charset
by gje21c (Acolyte) on May 20, 2011 at 22:23 UTC
    Thanks everyone.

    Those approaches will be very useful. I could have made it clearer I guess, that I can program my way to a solution, but was wondering if regex would have it built-in, so I could just use  "cat dictionary-file | grep /Z1[AABCDEF]/Z2where Z1 or Z2 is a magic modifer that forces each letter in the  [ ]set to be used only once. It looks like not. But you never know, there's a world of functionality in regexp and anything that feels like it could be in there, or would have been asked before, usually is in there.

    Well back to WordsWithFriends on iPhone ! The dictionary for that game has 173000 words and a vast number I have never heard of, not to mention English/US/other spellings etc. My friends just experiment with words until they find something it accepts and I need to fight back ! Not really fair but I need a few wins.

Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Node Status?
node history
Node Type: perlquestion [id://904925]
Approved by lidden
help
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others musing on the Monastery: (6)
As of 2024-03-29 00:05 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found