Problems? Is your data what you think it is? | |
PerlMonks |
comment on |
( [id://3333]=superdoc: print w/replies, xml ) | Need Help?? |
Task 2 of the current Perl Weekly Challenge is to generate a pangram from the English dictionary provided. “A pangram or holoalphabetic sentence is a sentence using every letter of a given alphabet at least once.”1 However, for this challenge the requirement that the words form a sentence is removed: a list of unrelated words is all that is required. The bonus challenge is to “Constrain or optimize for something interesting,” and the first suggestion given is: Shortest possible pangram (difficult) Ok, so just how difficult would it be to perform an exhaustive search? I don’t know the answer, but since the search space increases exponentially as words are added to the dictionary, a search for sequences (of varying lengths) drawn from a pool of 23,610 words (see below) would appear highly impractical. I haven’t attempted it. Before we proceed, it should be noted that the dictionary can be significantly reduced in size before the search begins. Where groups of words share the same set of letters, only one word from the group (the shortest) need be kept. For example, “abdicate,” “diabetic,” and “abdicated” all have the letter set {a, b, c, d, e, i, t}, so “abdicate” is retained but “diabetic” and “abdicated” are filtered out. This reduces the dictionary size from 39,172 to 23,610 — a saving of 15,562 words! As noted above, 23,610 is still far too many words for an exhaustive search. So we’re looking for a way to get a close-to-optimum result in a reasonable time. And as we’ve just noted, each dictionary word can be considered as a set of its component letters. So the task is to find a minimum set cover, a well-studied problem in combinatorics and computer science.2 And there is an algorithm that is known to give good results in polynomial time: the greedy algorithm: at each stage, choose the set that contains the largest number of uncovered elements. My initial implementation of this algorithm produced the following pangram: oversimplification ladybug hawk jazz equinox (40 letters) I wasn’t satisfied with this, so began looking for ways to improve the search. It occurred to me that words with rare letters should be preferred over words with common letters. This immediately suggested the Scrabble3 points system:
With this I was able to give each word a weighting based on its component letters, and prefer words with higher weightings. This produced the following pangram: sympathized quarterbacking reflexive jawbone (41 letters) — no improvement at all! The problem appears to be that the selected words contain too much deadweight — too many redundant letters. So would it help to adjust the weighting for each word by deducting points for redundant letters? I experimented with different values of $DUP_WEIGHT, the amount to deduct for each redundant letter. Here are the results:
Eureka! At 29 letters, the last pangram is only 3 letters above the theoretical best-possible result of 26. (For values of $DUP_WEIGHT above 4, the result does not change.) It should be emphasized that my improved results are not an improvement on the greedy algorithm. That algorithm applies to sets, whereas my search for pangrams is a search over words, which are essentially bags because they contain multiple instances of the same letters. For the implementation, I found the Set::Tiny module very useful. I highly recommend it for working with sets of strings. Wikipedia1 gives the following as a short pangram: Waltz, bad nymph, for quick jigs vex. (28 letters) Since all these words occur in the given dictionary, it would be possible for a search to improve on my best solution by at least one letter. Challenge: can anyone implement a search which finds this pangram (or a shorter one) within a reasonable time? Full code listings for my solution (in Perl and Raku) Cheers, 1“Pangram, ” Wikipedia, https://en.wikipedia.org/wiki/Pangram
In reply to Pangram challenge: greed and Scrabble by Athanasius
|
|