Challenge: Letter Power

Tanktalus has asked for the wisdom of the Perl Monks concerning the following question:

My wife loves puzzles. Crosswords, logic problems, whatever. And she hates to be beat by them. So when she saw these Letter Power puzzles, she was really annoyed at how the authors of the puzzles always came up with way higher scores than her. So this last time, she sat down with crossword dictionaries, yellowpages, and other sources to come up with as many words as possible to fit the puzzle, and then tried to figure out what was the best score she could get - and that's when we decided a computer could plow through it faster than she could.

The puzzle is this: there are 10 categories (some have more, some have less, this one has 10). You get up to 8 letters per answer. And scoring is a simple algorithmic progression. Each time you use a letter, it scores one point more than last time you used that letter. So, for example, the first 'a' is 1 point, the second 'a' is 2 points, and so one. Obviously, if you can get a bunch of words using two or three of the same letter, that can add up over 10 answers. Think of, for example, "US States" as a category. Now think of "Alabama" - 4 a's!

The hard part is coming up with acceptable answers. Assuming you have acceptable answers, how would you find the best combination of answers to come together to give the best score?

I have an initial solution here, but when running it with the list of possibilities shown, I get an "Out of memory!" error. I have 4GB of RAM - and I see it went about 380MB into my swap as well. So it doesn't seem to work - I'm not sure why. Perhaps someone will find it entertaining to find a faster/smaller/more elegant/less buggy solution (one or more of these, or any other variation you want). I'm sure there is some sort of short-circuiting you could do if you weren't using Algorithm::Loop::NestedLoops.

Here's my second try:

#! /usr/bin/perl

use strict;
use warnings;

use Data::Dumper; # just for debugging...

use Algorithm::Loops qw(NestedLoops);
use List::Util;
use Time::HiRes qw(gettimeofday tv_interval);

my $t0 = [gettimeofday];

my @words = readwords(\*DATA);

# Calculate the letters in each - do this up front for speed?
my %letters_cache;
foreach my $inner (@words)
{
    foreach my $word (@$inner)
    {
        my %letters;
        foreach (split '', $word)
        {
            $letters{$_}++;
        }
        $letters_cache{$word} = \%letters;
    }
}

my @best_words;
my $best_score = 0;
my $iter = NestedLoops(\@words);

my $count = 0;
while (my @list = $iter->())
{
    ++$count;
    my @words = @list;
    my $score = calculate_score(@words);
    if ($score > $best_score)
    {
        $best_score = $score;
        @best_words = @list;
    }
    my $t_now = tv_interval($t0, [gettimeofday]);
    if ($count % 15 == 0)
    {
        local $|=1;
        print "$count\r";
    }
};

my $elapsed = tv_interval($t0, [gettimeofday]);
print "The best score is $best_score, using the words:\n", map { "\t$_
+\n" } @best_words;
print "Elapsed time: $elapsed seconds\n";

#print Dumper(\%letters_cache);

sub calculate_score
{
    my $total = 0;
    my %letters;
    foreach my $w (@_)
    {
        $w = $letters_cache{$w} unless ref $w;
        while (my ($l,$n) = each %$w)
        {
            $letters{$l} += $n;
        }
    }

    # add them up.
    our ($a,$b); # get rid of warning?
    List::Util::reduce { $a + calculate_letter_value($b) } 0, values %
+letters;
}

sub calculate_letter_value
{
    my $n = shift;
    ($n * ($n + 1)) / 2;
}

# slurp in all the possible words.
sub readwords
{
    my $fh = shift;
    unless (ref $fh)
    {
        require IO::File;
        $fh = IO::File->new($fh, 'r') || die "Can't open $fh for read:
+ $!";
    }
    local $_;
    my @words;
    while (<$fh>)
    {
        my ($num, $word) = split ' ', lc;
        push @{$words[$num-1]}, $word;
    }
    @words;
}

__DATA__
1 alabama
1 arkansas
1 alaska
1 delaware
1 hawaii
1 indiana
1 kansas
1 montana
1 delaware
2 sazerac
2 sangrita
3 radiator
3 gastank
3 engine
3 heater
3 fender
3 wheel
3 detent
3 battery
3 clutch
3 mirror
3 window
4 alloy
4 amalgam
4 vanadium
4 copper
4 steel
5 rutabaga
5 limabean
5 cress
5 carrot
5 sorrel
5 squash
5 cabbage
5 pepper
5 lettuce
5 beet
5 leek
5 celery
5 endive
5 rhubarb
5 parsnip
5 pumpkin
6 azalia
6 camellia
6 dahlia
6 gardenia
6 gentian
6 vervain
6 canna
6 hepatica
6 bluebell
6 anemone
6 oleander
7 lasagna
7 macaroni
7 pastina
7 gnocchi
7 tortelli
7 alfabeto
8 mascagni
8 britten
8 menotti
9 unamas
9 tacotime
9 pizzahut
9 tacobell
9 panago
9 tacomayo
9 edojapan
9 hardees
10 caramel
10 marzipan
10 taiglach
10 taffy
10 brittle
10 fondant
10 toffee
10 dragee
[download]

I'm re-running it, and it is steadily growing in memory consumption, but I really don't know why.

PS: this isn't even all the possible "reasonable" words she came up with ;-) The list if you want something that executes and comes up with an answer in almost no time and comes to a good answer quickly (beating the one in the book, by the way) is:

1 alabama
1 arkansas
2 sangrita
3 radiator
3 gastank
4 amalgam
4 vanadium
5 rutabaga
5 limabean
6 azalia
6 camellia
6 dahlia
6 gardenia
7 lasagna
7 macaroni
7 pastina
8 mascagni
9 unamas
10 caramel
10 marzipan
10 taiglach
[download]

which gives:

The best score is 474, using the words:
        alabama
        sangrita
        gastank
        amalgam
        rutabaga
        azalia
        lasagna
        mascagni
        unamas
        taiglach
[download]

Comment on Challenge: Letter Power Select or Download Code

Replies are listed 'Best First'.
Re: Challenge: Letter Power by sk (Curate) on Oct 12, 2005 at 07:23 UTC
Tanktalus, Just by looking at the structure of the problem, it appears to be NP-Complete. It has some similarities to MDKP (multi-dimensional knapsacks) but here the capacity is not present but the profit values are multidimensional and they are related to other objects in the bag. An exact solution will be out of the picture when the data becomes large and i am sure your wife would appreciate some kind of approximation if it can be done in couple of mins rather than the best solution which might take days!!! Here is my Mathematical formulation and at this point I am not aware of any MINLP (mixed-integer non linear prog solvers). Let i represent the index (num in your example) and j represent the actual word c represent all the letters ('A'..'Z') Let X[i,j] = { 1 if word (i,j) was chosen, 0 otherwise} The value of each word can be broken into characters/letters For all (i, j) - # you get the value only if the word is chosen V[i,j,c] = number of c's in the word (i,j) * X[i,j] For all c Let Value(c) = (sum {over all i,j} V[i,j,c])*((sum {over all i,j} V[ +i,j,c])+1)/2 Optimization problem =================== Max sum {over c} Value(c) # You can choose only one word from each index (num) subject to: sum {over all j} X[i,j] = 1 for all i. subject to: x[i,j] element of (0,1) for all i, j. [download] Obviously as you can see the objective is Non-linear and regular knapsack/LP approaches will not work. However the fuction appears to be quadartic and you might be able to get some good results by "relaxing" the binary assumption on the `X[i,j]`'s and use a non-linear search designed for quad-objectives. Here a simple heuristic that i can think of and can be improved a lot - 1. For each index/word pair store the number of repeats of each charac +ter. 2. INIT: Start with the index/word pair that has the highest value by +itself (i.e. assuming the value of other index/words are zero). 3. Choose the next index/word pair that increases the value the most ( +let's say your init value for 1/alabama, then you move to index = 2 a +nd look for the best next increase.) 4. Repeat step-3 until you exhaust all indices. 5. Now you have a local optimum based on your starting point. 6. EXCHANGE: Store the current objective, Go back to index = 1 and pic +k the second best word in index = 1. Now check if you could improve t +he previous solution by exchanging index = 2 with something new etc. 7. You can repeat this exercise for certain number of iterations and t +hen you stop with the best solution found thus far. NOTE that you can also use randomized approach to randomly pick an ele +ment to exchange and try to improve the solution. [download] This is just written top of my head without thinking about convergence. If i ever get to implement this i shall post the code. Hope someone can build on what i had in mind very interesting problem! cheers SK	[reply] [d/l] [select]
Re: Challenge: Letter Power by japhy (Canon) on Oct 12, 2005 at 05:33 UTC
Here's my code. It still has SEVERAL hours to run... #!/usr/bin/perl use strict; use warnings; my @text; while (<DATA>) { my ($q, $w) = split; push @{ $text[$q-1] }, $w; } my @best = calc(\@text); print "{@best}\n"; sub calc { my ($t) = @_; my @best = (0); iter($t, [(0) x 26], \@best, 0); return @best; } sub iter { my ($t, $tmp, $best, $i) = @_; if ($i == @$t) { my $s = 0; $s += $_ * ($_+1) / 2 for @$tmp[0..25]; @$best = ($s, @$tmp[26..$#$tmp]) if $s > $best->[0]; return; } for (@{ $t->[$i] }) { $tmp->[ord($_) - ord('a')]++ for split //; push @$tmp, $_; iter($t, $tmp, $best, $i+1); pop @$tmp; $tmp->[ord($_) - ord('a')]-- for split //; } } [download] I'm using your larger data set, of course. Jeff `japhy` Pinyan, P.L., P.M., P.O.D, X.S.: Perl, regex, and `perl` hacker How can we ever be the sold short or the cheated, we who for every service have long ago been overpaid? ~~ Meister Eckhart	[reply] [d/l]
Re: Challenge: Letter Power by dragonchild (Archbishop) on Oct 12, 2005 at 14:51 UTC
A few thoughts: The important thing to note about the game is that you only get one word per category, for a total of 10 words (in this case where you have 10 categories). sk is right in that this seems to be related to the knapsack and that the perfect solution is probably NP-complete. It's immediately clear that optimizing for a single letter is more likely to generate a better score than any other optimization. If you think about it, you're attempting to fill an 8x10 grid with items that have value based on how often that specific item has appeared. (however the items are chosen). Each of the 80 spots must be maximized, so if you can get another G in there, that's worth losing 5 F's (assuming you were optimizing for G and have at least 15-20 more of them than F's). When optimizing for a letter, you have to consider all words in a category that have the maximum number of that letter. You can skip all letters that don't have a word in each category with a minimum number of that letter. For the sample you gave, that minimum appears to be 2. Given that, one solution could be as follows: Read more... (2 kB) This uses your calculate_score() function slightly modified to not change its arguments. I also come up with your 474 list, but in about 0.1 seconds on a PIII-700 with 768M of RAM using the larger dataset you provided. With the full dataset (all 50 states, etc), the prunings I used will remove the best possible score, but I'll get close. My criteria for good software: Does it work? Can someone else come in, make a change, and be reasonably certain no bugs were introduced?	[reply] [d/l]
Re: Challenge: Letter Power (leak) by tye (Sage) on Oct 12, 2005 at 14:41 UTC
Sounds like a simple memory leak. The most likely culprit would be the XS-happy List::Util, IMO. Simply removing List::Util (and Timer::HiRes for obscure reasons) and running your code shows very little and flat memory usage. It would take 11 hours to run, but it doesn't leak for me. - tye	[reply]
Re: Challenge: Letter Power by BrowserUk (Patriarch) on Oct 13, 2005 at 09:24 UTC
My first attempt was brute force, unremarkable, found the 474 after processing 95% of the combinations and took just under 8 hours to run. My second finds the 474 in 0.15 seconds. `P:\test>499358-2 460 : [ alabama sangrita radiator amalgam rutabaga camellia lasagna ma +scagni tacomayo marzipan ] 472 : [ alabama sangrita radiator amalgam rutabaga azalia lasagna masc +agni tacomayo marzipan ] 474 : [ alabama sangrita radiator amalgam rutabaga azalia lasagna masc +agni tacomayo taiglach ] Elapsed: 0.015 seconds.` [download] See comments for the strategy used. #! perl -slw use strict; use Time::HiRes qw[ time ]; my $start = time; my @lookup = map{ ($_+1) * $_ /2 } 0 .. 80; sub calcScore { my( %chars, $score ); $chars{ $_ }++ for map{ split'' } @{ $_[0] }; $score += $lookup[ $_ ] for values %chars; return $score; } my @cats = map{ [ split ' ' ] } <DATA>; close DATA; ## Sort the words in each category by their individual scores (descend +ing) my %wordScores; @{ $cats[ $_ ] } = sort{ ( $wordScores{ $b } \|\|= calcScore [$b] ) <=> ( $wordScores{ $a } \|\|= calcScore [$a] ) } @{ $cats[ $_ ] } for 0 .. $#cats; ## form our guess from the highest scoring words in each category my @guess = map{ $cats[ $_ ][ 0 ] } 0 .. $#cats; my $best = calcScore \@guess; ## Loop over the categories swaping the other words in that cateory ## for the guessed word. If we get a better score, start again. LOOP: for my $iCat ( 0 .. $#cats ) { my $cat = $cats[ $iCat ]; for my $iWord ( 1 .. $#$cat ) { my $test = calcScore [ @guess[ 0 .. $iCat -1 ], $cat->[ $iWord ], @guess[ $iCat+1 .. $#guess ] ]; if( $test > $best ) { $best = $test; $guess[ $iCat ] = $cat->[ $iWord ]; print "$best : [ @guess ]"; redo LOOP; } } } printf "Elapsed: %.3f seconds.\n", time - $start; __DATA__ alabama arkansas alaska delaware hawaii indiana kansas montana delawar +e sazerac sangrita radiator gastank engine heater fender wheel detent battery clutch mirr +or window alloy amalgam vanadium copper steel rutabaga limabean cress carrot sorrel squash cabbage pepper lettuce be +et leek celery endive rhubarb parsnip pumpkin azalia camellia dahlia gardenia gentian vervain canna hepatica bluebel +l anemone oleander lasagna macaroni pastina gnocchi tortelli alfabeto mascagni britten menotti unamas tacotime pizzahut tacobell panago tacomayo edojapan hardees caramel marzipan taiglach taffy brittle fondant toffee dragee [download] Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. Lingua non convalesco, consenesco et abolesco. -- Rule 1 has a caveat! -- Who broke the cabal? "Science is about questioning the status quo. Questioning authority". The "good enough" maybe good enough for the now, and perfection maybe unobtainable, but that should not preclude us from striving for perfection, when time, circumstance or desire allow.	[reply] [d/l] [select]
Re^2: Challenge: Letter Power by Limbic~Region (Chancellor) on Apr 17, 2009 at 00:15 UTC
BrowserUk, I really like this heuristic approach. Take a best guess and then see if it can be refined to get a better score. I am thinking about posting a "Challenge: Letter Power Revisited" thread so I have been considering how to generate puzzles at random. I was able to find an example where your solution does not provide the best solution in just 3 tries: Read more... (3 kB) My heuristic solution is very similar to yours. Rather than picking the best individual score from each group, I calculate the letter frequency across all words. Then I select the word from each group that has the highest score based on that frequency analysis. I have also modified the refinement loop to limit the number of loops to handle the edge case where picking the best individual scores is the worst thing to do. `LOOP: for (1 .. $max_loops) { # your refinement loop if ($better) { # update best next LOOP; } # end of your refinement loop last; }` [download] This way you pick a "good enough" score in a reasonable time frame no matter what. I am now going to try and figure out if I can do a smart exhaustive search to guarantee best solution in a reasonable runtime (less than 2 hours). If I can, I will post a new thread comparing naive brute force, optimized brute force, heuristic, optimized heuristic. Cheers - L~R	[reply] [d/l] [select]
Re^3: Challenge: Letter Power by BrowserUk (Patriarch) on Apr 18, 2009 at 04:41 UTC
It can be show that a better solution exists (not necessarily best): `438 : [ eyesore sleekest steepen spindles baseless rockiest refiners sheered tethers exhausts ]` Yep! That's the problem with heuristic solutions. Unless you also run a full brute force on the same data, you can never be sure that you haven't found some local maxima. On the dataset above, your 438 is better that my 434, but neither are as good as: `441 : [ rendered spindles squires tethers rockiest eyesore sheered sleekest steepen refiners ]` [download] But even using all 4 cpus flat out, my brute force solution can just about manage 1/2 a million combinations/minute. Given there are 7.75e9 combinations to try, I'm projecting over 10 days to verify the heuristic for that one dataset. That's too much for that sake of an intellectual challenge with no useful purpose. Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error. "Science is about questioning the status quo. Questioning authority". In the absence of evidence, opinion is indistinguishable from prejudice. "Too many [] have been sedated by an oppressive environment of political correctness and risk aversion."	[reply] [d/l] [select]
Re^4: Challenge: Letter Power by Limbic~Region (Chancellor) on Apr 18, 2009 at 14:18 UTC
Re^5: Challenge: Letter Power by BrowserUk (Patriarch) on Apr 18, 2009 at 16:44 UTC
Some notes below your chosen depth have not been shown here
Re: Challenge: Letter Power by Cody Pendant (Prior) on Oct 12, 2005 at 06:48 UTC
I don't see how you're going to program Perl to come up with US States when US States are required by the question and flowers when flowers are required and so on. Given a list of flowers, the algorithm to find which scores highest in your game should be something like: `sub calculate_score { my $word = shift(); my $score = 0; my %letter_value = (); my @letters = split( '', $word ); foreach ( @letters ) { if ( defined( $letter_value{$_} ) ) { $letter_value{$_} = $letter_value{$_} * 2; $score += $letter_value{$_}; } else { $score += 1; $letter_value{$_} = 1; } } return $score; }` [download] Which assumes that each letter is worth 1 the first time you use it, 2 the second time, 4 the third and so on. it gives me a score of 18 for "alabama" and a score of 47 for "lillipilli". But as I say, I don't know how you get your lists of words in the first place. ($_='kkvvttuu bbooppuuiiffss qqffssmm iibbddllffss') =~y~b-v~a-z~s; print	[reply] [d/l]
Re: Challenge: Letter Power by Skeeve (Parson) on Oct 12, 2005 at 11:49 UTC
That's easy - if I understood it correct. Simply take all combinations of words (recursive task) and calculate the score. Just store the best score and the best solution found so far. Here is my implementation. Read more... (3 kB) s$$([},&%#}/&/]+}%&{});#$&&s&&$^X.($'^"%]=\&(\|?{%.+=%;.#_}\&"^"-+%*) +.}%:##%}={~=~:.")&e&&s""`$''`"e [download]	[reply] [d/l] [select]
Re: Challenge: Letter Power by Limbic~Region (Chancellor) on Apr 15, 2009 at 20:11 UTC
Tanktalus, I have been hunting the Monastery looking for interesting challenges to work on. This looked like it might be an interesting challenge but you limited it to a specific puzzle. I want to generalize things which means generating unique puzzles but I want to make sure I get things correct. Required Parameters: Number of categories (default = 10) Maximum length of word (default = 8) Minimum number of words that can fit a category (default=2) Maximum number of words that can fit a category (default=15) My approach would be to use a dictionary file to randomly select words to fit a category per the parameters above. It will be possible for the same word to fit multiple categories though the solution chosen may only use that word once. In this way, I can reproduce puzzles at will. I assume every letter in all words across the solution counts for points? Please let me know if I have this all correct. If I don't, please clarify. When I am finished I will post a "Challenge: Letter Power Revisited" post since this thread is three and a half years old. Cheers - L~R	[reply]
Re^2: Challenge: Letter Power by Tanktalus (Canon) on Apr 15, 2009 at 22:23 UTC
The part of the original puzzle that made it difficult for me to envision a computer solution to was where they give arbitrary topics. For example, "US States". Obviously, this has 51 possible answers (ok, really 50 - I'm including Canada :-P), a subset of which would be valid depending on the maximum length of the word. Programmatically solving that would be a challenge unto itself. Now, if you're going to generate categories yourself, you could define it to be solvable. In the original post, my wife had looked for answers in each category trying to maximise the count of 'a's, thinking that would get the best possible score. We don't know if 'e' would have been a better choice, since we didn't try to optimise that. However, a more general application could possibly handle many more answers by having categories with all possible answers in some sort of database. For example, if there is an on-line crossword dictionary, crossword clues make great categories, which is fitting since the original problem came from a puzzle book including crosswords, and then querying from said dictionary should be a relatively solvable problem. Of course, this blows "maximum words in a category" out of the water :-) Then again, even a crossword dictionary won't have every possible answer - just ones used in crosswords. I wouldn't expect a category of "historical English Monarch" to have every monarch ever in the dictionary... Your assumption about points is correct. Every letter in all words counts. Of course, you're free to define your own puzzle that is loosely based on the above. You could define your own categories as groups of words in each category (having overlapping words where they can only be used in one category does make it more interesting). Or, for example, you could try varying the score for each letter, especially since a computer is doing the addition anyway. For example, what happens if each letter counts for double the previous? How about each letter counts for its position in the Fibonacci sequence? Maybe each letter gets its own fibonacci sequence, and the last number for each letter counts (e.g., if there are 2 a's and 3 b's, the a's get 1 point as the second number in the sequence, and the b's get 2 points as the third number in the sequence)? Do these variations really make for any difference in tactic? I doubt it, but that'd be like a premature optimisation that'd need benchmarking :-)	[reply]
Re^3: Challenge: Letter Power by Limbic~Region (Chancellor) on Apr 15, 2009 at 22:32 UTC
Tanktalus, I agree that writing a program to find words that fit a category is a bigger challenge all unto itself. That's why I took your approach of assuming the words that fit were already known. My categories will just be cat 1 .. N. I just want variation to test the robustness of any possible solution. Using an existing database is also a possible approach but goes beyond the scope of what I am intending to do. Regarding the "maximum words in a category". That is a configurable parameter that I envision being used to bound a given puzzle so that an implementation has a realistic chance of solving for the "best" combination of words. On the other hand, if you are going to go with a heuristic approach then it may not matter - but being configurable allows the user to make that decision. Regarding making variations on the rules for the purposes of exploration - sure, but I wanted to make sure I had the basic idea since your original post left me more than a little confused. Now that I know I am on the right track I can think about it further. Thanks! Cheers - L~R	[reply]
Re: Challenge: Letter Power by EvanCarroll (Chaplain) on Oct 12, 2005 at 06:15 UTC
I don't understand this game at all. And I don't imagine I'm the only baffled by your explaination. Can you make a rule sheet and claify the game. Varaible ammnt of "catagories." Assumed a catagory is a set of random words. 10 "answers" 8 letters per "answer" I assume a catagory is a random list, such that it holds a set of X words. The scoring is then word frequency with the trick being linear growth per word? Writing this down helped me; however, I'm still confused, what does the term "answer" refer to the answer to the game, or the resultant set of your guess? Evan Carroll www.EvanCarroll.com	[reply]
Re^2: Challenge: Letter Power by Tanktalus (Canon) on Oct 12, 2005 at 14:54 UTC
The categories are presented. You need to come up with a word that is in the category that will help maximise your points. No word is allowed to be more than 8 letters. I did not include coming up with the word possibilities because that is a completely different problem which I'm not interested in solving in perl. (Maybe someone else is...) Example (which I'm making up now, with out seeing the original puzzle - my wife didn't show it to me): US State Car part Type of precipitation Member of NATO Olympic Games site In the kitchen Note that this is not the game that the words in my list are for, it's just the idea. Each category gets one answer, you need to see how high of a score you can get. An example answer is in the back of the book with a crazy-high total. One way to solve it is to come up with a list of answers for each category, and then see what combination of these answers gives the highest score, which is what we were doing here. My wife came up with the answers in my OP, and then I set about getting the computer to find the best option. In her game, there were ten categories. Not all games have the same number of categories.	[reply]
Re^2: Challenge: Letter Power by ambrus (Abbot) on Oct 12, 2005 at 10:29 UTC
I agree. I don't understand the description at all. I'd especially like to know what 8 letters per answer means.	[reply]