in reply to Turning A Problem Upside Down

Seems to me that you've fallen into a pattern of looking at things in terms of combinations, permutations & powersets et al.

The basic problem is simply a lookup problem, but as you point out, hashes don't work for this because of ordering. You could build a trie, but tries are not efficient to build with Perl's available data structures.

The alternative is to use a set of bitstrings to index the words containing each of the letters. The bitstring for 'a' contains a set bit at the offset corresponding to any word in the dictionary that contains an 'a'. Same for 'b', and so on.

To find all the words in the dictionary that contain only the given letters, you first OR together the bitstrings for the given letters, and then AND the result with the complement (NOT) of the bitstring for each of the remaining letters of the alphabet. You end up with a mask where each set bit corresponds to a compliant word in the dictionary.
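
Before the full code, a minimal sketch of just that masking step against a toy three-word dictionary (the tiny word list and variable names here are purely illustrative):

#! perl -slw
use strict;

# Toy dictionary: three words at bit offsets 0..2.
my @words = qw( fed freed sea );

# One bitstring per letter; bit $i is set if $words[$i] contains that letter.
my %index;
$index{ $_ } = chr(0) x 1 for 'a' .. 'z';    # one byte covers three words
for my $i ( 0 .. $#words ) {
    vec( $index{ $_ }, $i, 1 ) = 1 for split '', $words[ $i ];
}

# Given 'f', 'r', 'e', 'd': OR their bitstrings together, then AND with the
# complement of every other letter's bitstring to clear non-compliant words.
my @given = qw( f r e d );
my %given = map { $_ => 1 } @given;
my $mask  = chr(0) x 1;
$mask |=  $index{ $_ } for @given;
$mask &= ~$index{ $_ } for grep { !$given{ $_ } } 'a' .. 'z';

# 'fed' and 'freed' survive; 'sea' is cleared by the 's' and 'a' bitstrings.
vec( $mask, $_, 1 ) and print $words[ $_ ] for 0 .. $#words;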

Not sure how my crude implementation stacks up against yours, but it should compare favourably (assuming I understood the rules):

#! perl -slw
use strict;
use Data::Dump qw[ pp ];

sub uniq{ my %x; @x{@_} = (); keys %x }

my @words = do{ local *ARGV = ['words.txt']; <> };
chomp @words;
@words = grep length() > 2, @words;

my %index;
@index{ 'a' .. 'z' } = map chr(0) x int( ( @words + 8 )/8 ), 1 .. 26;

for my $iWords ( 0 .. $#words ) {
    for my $char ( sort uniq split '', $words[ $iWords ] ) {
        vec( $index{ $char }, $iWords, 1 ) = 1;
    }
}

while( chomp( my $given = <STDIN> ) ) {
    my @given = split '', $given;
    my @excludes = grep{ !(1+index $given, $_ ) } 'a'..'z';

    my $mask = chr(0) x int( ( @words + 8 )/8 );
    $mask |= $_ for @index{ @given };
    $mask &= ~ $index{ $_ } for @excludes;

    my $count = unpack "%32b*", $mask;
    print "Found $count words:\n";
    vec( $mask, $_, 1 ) and print $words[ $_ ] for 0 .. $#words;
    print "\n\n";
}
__END__
c:\test>790206 fred
Found 30 words:
deed deeded deer def defer deferred deffer ere err erred fed fee feed feeder
free freed freer red redder reed reef reefed reefer ref refer referee
refereed referred referrer reffed

Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
"Science is about questioning the status quo. Questioning authority".
In the absence of evidence, opinion is indistinguishable from prejudice.
RIP PCW It is as I've been saying!(Audio until 20090817)

Re^2: Turning A Problem Upside Down
by Limbic~Region (Chancellor) on Aug 22, 2009 at 13:54 UTC
    BrowserUk,
    assuming I understood the rules

    I was just rudely awoken on my day off by work (brain not yet engaged), but I don't think you have. That's probably my fault and not yours. The object of the game is in fact to find subsets of the given set of letters (which is probably why I was focused there). Let me adjust the rules to see if they make more sense, and I will give an example as well:

    Original: You are presented with 7 randomly selected characters (dups allowed). The object is to come up with as many words comprised of 3 or more of those characters within a certain time-frame.

    Updated: You are presented with 7 randomly selected characters which may contain duplicates. The object is to come up with as many words as possible, each at least 3 characters long and comprised entirely of a subset of the given characters, within a certain time-frame.

    Given: apetpxl

    # The following are all acceptable (albeit not the entire list)
    ape
    tap
    tape
    apple
    lap

    # The following are not acceptable because they contain letters not in given
    apples
    sex

    Cheers - L~R

      Ah! So, a final check is required to eliminate words that use too many of any given letter. It could probably be achieved more efficiently, but since it's only run on a very short list and it short-circuits...

      #! perl -slw
      use strict;
      use Data::Dump qw[ pp ];

      sub finalCheck{
          my( $poss, $given ) = @_;
          $given =~ s[$_][] or return for split '', $poss;
          return 1;
      }

      sub uniq{ my %x; @x{@_} = (); keys %x }

      my @words = do{ local *ARGV = ['words.txt']; <> };
      chomp @words;
      @words = grep length() > 2, @words;

      my %index;
      @index{ 'a' .. 'z' } = map chr(0) x int( ( @words + 8 )/8 ), 1 .. 26;

      for my $iWords ( 0 .. $#words ) {
          for my $char ( sort uniq split '', $words[ $iWords ] ) {
              vec( $index{ $char }, $iWords, 1 ) = 1;
          }
      }

      while( chomp( my $given = <STDIN> ) ) {
          my @given = split '', $given;
          my @excludes = grep{ !(1+index $given, $_ ) } 'a'..'z';

          my $mask = chr(0) x int( ( @words + 8 )/8 );
          $mask |= $_ for @index{ @given };
          $mask &= ~ $index{ $_ } for @excludes;

          vec( $mask, $_, 1 ) and finalCheck( $words[ $_ ], $given )
              and print $words[ $_ ] for 0 .. $#words;
          print "\n\n";
      }
      __END__
      fred
      def fed red ref

      apetpxl
      ale alp ape apex apple applet apt ate axe axle eat eta exalt lap lappet
      late latex lax lea leap leapt lept lepta let pal pale pap pat pate pea
      peal peat pelt pep pet petal plat plate plea pleat tale tap tape tax
      tea teal

      BTW: Your example helped, but it was badly chosen. The problem was never including words that contain characters not in the supplied list (your 's'), but rather words that contain too many of one or more of the given letters. E.g. the second 'e' in 'freed' given 'fred'.
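
      To illustrate with that exact case, a minimal standalone snippet reusing finalCheck() from the code above: 'freed' passes the bitstring mask because f, r, e and d are all given letters, but the final check rejects it when it fails to consume a second 'e' from 'fred'.

      #! perl -slw
      use strict;

      sub finalCheck{
          my( $poss, $given ) = @_;
          $given =~ s[$_][] or return for split '', $poss;
          return 1;
      }

      print "freed: ", finalCheck( 'freed', 'fred' ) ? 'ok' : 'rejected';   # rejected
      print "fred:  ", finalCheck( 'fred',  'fred' ) ? 'ok' : 'rejected';   # ok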


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        I am sorry my example still wasn't optimal. The dictionary I was using is the Scrabble Tournament Word List (TWL) as referenced in this somewhat related post. It comes in at just under 179,000 words. The Word Twister game only has two possible input lengths, but I only play the 7-letter version as it is the maximum. Of course, when I took up the quest of turning the problem on its head, I had already written a solver that could produce all the valid words 4 or 5 times over within the allocated period. My quest was to go beyond that and see what I could come up with if the dictionary and the length of the original input weren't so small. Again, this was just an academic exercise in exploration, and my solution still assumes that the dictionary is relatively static even if it could be much larger.

        Thanks again for your responses. I will review them tonight when I have peace and quiet.

        Cheers - L~R

Re^2: Turning A Problem Upside Down
by Limbic~Region (Chancellor) on Oct 16, 2010 at 22:38 UTC
    BrowserUk,
    Seems to me that you've fallen into a pattern of looking at things in terms of combinations, permutations & powersets et al.

    You are absolutely correct. I did look at the problem from many different sides but, with the subconscious objective of precomputing everything, they all turned out to be different variations on the same theme.

    You could build a trie, but tries are not efficient to build with Perl's available data structures.

    Assuming that the word list file will remain fairly static, and assuming it transforms into a data structure (trie or trie-like) small enough that it can stay memory-resident, this seems like a reasonable approach. Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

    The alternative is to use a set of bitstrings to index the words containing each of the letters.

    I have had a note to go back and figure out what you were doing here for over a year now. Today, I sat down to do just that. Would you mind reviewing what I have and correcting anything I got wrong? Note: I rewrote it in my own style as a mechanism for understanding it.

    Assuming I understood it correctly, there isn't a lot of room for optimization. Instead of recreating the zeroed bitstring 27 times, you could create it just once. finalCheck() could be inlined (or converted to Inline::C). It may be faster to skip candidate words that are longer than the input string. You could also use Storable the same way I did, to avoid reconstructing the data structure from scratch on every run. I feel silly that I didn't spend some time a year ago trying to properly understand this, as it is quite beautiful.
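
    For instance, a rough sketch of the Storable idea applied to your %index (the cache file name and the [words, index] layout are just illustrative, not your code): build the 26 bitstrings once, store them, and retrieve them on later runs instead of rescanning words.txt.

    #! perl -slw
    use strict;
    use Storable qw[ store retrieve ];

    sub uniq{ my %x; @x{@_} = (); keys %x }

    my $cache = 'index.stor';    # illustrative cache file name
    my( $words, $index );

    if( -e $cache ) {
        # Later runs: pull the prebuilt structure straight off disk.
        ( $words, $index ) = @{ retrieve( $cache ) };
    }
    else {
        # First run: build the per-letter bitstrings as before, then store them.
        my @words = do{ local *ARGV = ['words.txt']; <> };
        chomp @words;
        @words = grep length() > 2, @words;

        my %index;
        @index{ 'a' .. 'z' } = map chr(0) x int( ( @words + 8 )/8 ), 1 .. 26;
        for my $i ( 0 .. $#words ) {
            vec( $index{ $_ }, $i, 1 ) = 1 for uniq split '', $words[ $i ];
        }

        ( $words, $index ) = ( \@words, \%index );
        store( [ $words, $index ], $cache ) or die "Couldn't store index\n";
    }

    print "Index covers ", scalar @$words, " words";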

    Cheers - L~R

      Using the word list you linked to earlier (TWL06.txt) with 178,590 eligible words, I use just under 80MB with the following solution:

      You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes. I tested this by replacing your trie builder with:

      if (defined $opt{w}) {
          my @data;
          open(my $fh, '<', $opt{w}) or die "Unable to open '$opt{w}' for reading: $!";
          while (<$fh>) {
              chomp;
              next if length($_) < 3 || /[^a-zA-Z]/;
              $_ = lc($_);
              my $code = join('', 'push @{$data',
                  (map { $_-=97; "[$_]" } sort{ $a<=>$b } unpack 'C*', $_),
                  "}, '$_';");
              eval $code;
          }
          store(\@data, $opt{d}) or die "Can't store '%data' in '$opt{d}'\n";
      }

      And the resultant file is just 8MB rather than 80MB. I started to modify the rest of the code to use it, but then got lost and gave up; still, arrays should be quicker than hashes.

      Note: I rewrote it in my own style as a mechanism for understanding it.

      I do it myself all the time. (I hate what you did with it! But that's a (probably pointless) argument for another time. :)

      Assuming I understood it correctly

      You're spot on.

      there isn't a lot of room for optimizations.

      It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

      And if I feed it the top 7 characters ordered by frequency in the dictionary, esiarnt (which should be close to the worst case), it only takes 0.7 seconds to find the 243 compliant words, so there's not much reason to optimise there either.

      it is quite beautiful

      I think that Perl's bitwise logical operators are one of its best-kept secrets.

      They lend themselves to performing a huge variety of set-type operations very quickly, in a single operation, effectively 'in parallel'. They can perform a million SELECT-style operations on sets of 8000 items in 1.3 seconds:

      $a = ~( $b = chr(0) x 1e3 );

      $t = time;
      my $x = $a ^ $b for 1 .. 1e6;
      printf "Took %.6f seconds\n", time() - $t;;
      Took 1.297000 seconds

      Or a thousand such operations upon sets of 8 million items in 1.7 seconds:

      $a = ~( $b = chr(0) x 1e6 );

      $t = time;
      my $x = $a ^ $b for 1 .. 1e3;
      printf "Took %.6f seconds\n", time() - $t;;
      Took 1.645000 seconds

      And storage usage doesn't get much better than 1-bit per item. I've never actually looked to see how tightly coded the underlying opcodes are. I've never found the need to.

      Maybe I should one day though, as these tend to be the dark corners of the codebase that never get revisited. It's quite possible that they are currently written to operate byte by byte, which is generally far slower than dealing with data in register-sized chunks. With the advent of 64-bit registers, it is quite possible that they could be sped up significantly.


      Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
      "Science is about questioning the status quo. Questioning authority".
      In the absence of evidence, opinion is indistinguishable from prejudice.
        BrowserUk,
        You might be able to reduce that a bit and gain a little speed by using arrays instead of hashes.

        Actually, using an array takes up more space in memory (roughly the same on disk). This is because at a branch where the only valid node is 'z', a hash will insert one item whereas the array will allocate 26 slots (25 of which are undef). Also, the array code would have to be modified to:

        my $code = join('', 'push @{$data',
            (map { $_-=97; "[$_]" } sort{ $a<=>$b } unpack 'C*', $_),
            "[26]}, '$_';");
        This is because in an array the slots are positional and represent characters, so a special slot (much like the key 'words' in the hash) needs to be created to store the words.
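
        As a tiny, purely illustrative side-by-side of the two layouts (the %hdata/@adata names are mine), here is where the single word 'art' ends up in each:

        #! perl -slw
        use strict;
        use Data::Dump qw[ pp ];

        # Hash-based trie path: the word list hangs off the key 'words'.
        my %hdata;
        push @{ $hdata{a}{r}{t}{words} }, 'art';

        # Array-based trie path: letters map to slots 0..25 ('a' == 0) and the
        # word list hangs off the extra slot [26]; the unused slots stay undef.
        my @adata;
        push @{ $adata[0][17][19][26] }, 'art';    # a=0, r=17, t=19

        pp \%hdata;
        pp \@adata;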

        And the resultant file is just 8MB rather than 80MB. I started to modify the rest of the code to use it, but then got lost and gave up,

        As I indicated via /msg, on disk they are about the same, but the hash is about 77-78MB in memory. I am not sure if it was my code that got you lost or the translation from hash to array. In the event that it was my code, let me briefly explain. First, every word in the dictionary is split, sorted, and inserted into the nested hash:

        # split and sort
        art => push @{$data{a}{r}{t}{words}}, 'art';
        rat => push @{$data{a}{r}{t}{words}}, 'rat';
        tar => push @{$data{a}{r}{t}{words}}, 'tar';
        This allows us to walk the tree and know, from where we are, what letters can possibly form a new word. Next, the initial work stack is populated from the input string (also sorted):
        aeinrst => (
            $data->{a}, einrst
            $data->{e}, inrst
            $data->{i}, nrst
            $data->{n}, rst
            $data->{r}, st
        )
        We can stop there because words must be a minimum of 3 chars. Now, each time we take an item off the stack, we check to see if the branch we are on has any words and print them out. We then iterate over the currently unused letters and push any potential branches from our current position onto the stack:
        my $item = pop @work;
        my ($tree, $str) = @$item;    # $data->{a}, einrst
        push @work, [$data->{a}{e}, 'inrst'] if $data->{a}{e};
        push @work, [$data->{a}{i}, 'nrst' ] if $data->{a}{i};
        push @work, [$data->{a}{n}, 'rst'  ] if $data->{a}{n};
        push @work, [$data->{a}{r}, 'st'   ] if $data->{a}{r};
        push @work, [$data->{a}{s}, 't'    ] if $data->{a}{s};
        push @work, [$data->{a}{t}, ''     ] if $data->{a}{t};
        As you can see, I push items onto the stack that can't possibly lead to a valid word (too short). I thought about tracking length and short-circuiting like I do on the initial population, but it was just too fast as-is to bother with. There are other opportunities for short-circuiting due to duplicates in the input string, but they too didn't seem to be worth the effort.
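
        Putting those pieces together, a generalized sketch of the walk (variable names are mine; it assumes a $data hashref trie built as above and an input string that has already been sorted):

        my $input = 'aeinrst';                      # already sorted
        my @work;

        # Seed the stack: one [node, remaining-letters] pair per starting position,
        # stopping where fewer than two letters would remain (words need 3+ chars).
        for my $i ( 0 .. length($input) - 3 ) {
            my $char = substr $input, $i, 1;
            push @work, [ $data->{$char}, substr $input, $i + 1 ]
                if $data->{$char};
        }

        # Walk: print any words stored at the current node, then branch on each
        # of the remaining letters that actually exists in the trie at this point.
        while( my $item = pop @work ) {
            my( $tree, $str ) = @$item;
            print "$_\n" for @{ $tree->{words} || [] };
            for my $i ( 0 .. length($str) - 1 ) {
                my $char = substr $str, $i, 1;
                push @work, [ $tree->{$char}, substr $str, $i + 1 ]
                    if $tree->{$char};
            }
        }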

        I hate what you did with it!

        I was overly verbose in order to make sure if I didn't understand something, I could break it up into small pieces and comment each one. I doubt I would have been quite as terse as yours had I written it naturally, but I don't like my version much either.

        It only takes 5 seconds to build the index, so that doesn't seem to warrant the effort.

        Sure, but your comment (Not sure how my crude implementation stack up against yours, but it should compare favourably) from a year ago made me look to see if there was anything obvious. I have a knack for wasting time learning things the hard way. In this case, I was just thinking out loud.

        Cheers - L~R