Beefy Boxes and Bandwidth Generously Provided by pair Networks
Problems? Is your data what you think it is?
 
PerlMonks  

comment on

( [id://3333]=superdoc: print w/replies, xml ) Need Help??

Task 2 of the current Perl Weekly Challenge is to generate a pangram from the English dictionary provided. “A pangram or holoalphabetic sentence is a sentence using every letter of a given alphabet at least once.”1 However, for this challenge the requirement that the words form a sentence is removed: a list of unrelated words is all that is required.

The bonus challenge is to “Constrain or optimize for something interesting,” and the first suggestion given is:

Shortest possible pangram (difficult)

Ok, so just how difficult would it be to perform an exhaustive search? I don’t know the answer, but since the search space increases exponentially as words are added to the dictionary, a search for sequences (of varying lengths) drawn from a pool of 23,610 words (see below) would appear highly impractical. I haven’t attempted it.

Before we proceed, it should be noted that the dictionary can be significantly reduced in size before the search begins. Where groups of words share the same set of letters, only one word from the group (the shortest) need be kept. For example, “abdicate,” “diabetic,” and “abdicated” all have the letter set {a, b, c, d, e, i, t}, so “abdicate” is retained but “diabetic” and “abdicated” are filtered out. This reduces the dictionary size from 39,172 to 23,610 — a saving of 15,562 words!

As noted above, 23,610 is still far too many words for an exhaustive search. So we’re looking for a way to get a close-to-optimum result in a reasonable time. And as we’ve just noted, each dictionary word can be considered as a set of its component letters. So the task is to find a minimum set cover, a well-studied problem in combinatorics and computer science.2 And there is an algorithm that is known to give good results in polynomial time: the greedy algorithm:

at each stage, choose the set that contains the largest number of uncovered elements.

My initial implementation of this algorithm produced the following pangram:

oversimplification ladybug hawk jazz equinox (40 letters)

I wasn’t satisfied with this, so began looking for ways to improve the search. It occurred to me that words with rare letters should be preferred over words with common letters. This immediately suggested the Scrabble3 points system:

const my %ALPHABET => ( a => 1, b => 3, c => 3, d => 2, e => 1, f => 4, g => 2, h => 4, i => 1, j => 8, k => 5, l => 1, m => 3, n => 1, o => 1, p => 3, q => 10, r => 1, s => 1, t => 1, u => 1, v => 4, w => 4, x => 8, y => 4, z => 10 );

With this I was able to give each word a weighting based on its component letters, and prefer words with higher weightings. This produced the following pangram:

sympathized quarterbacking reflexive jawbone (41 letters)

— no improvement at all! The problem appears to be that the selected words contain too much deadweight — too many redundant letters. So would it help to adjust the weighting for each word by deducting points for redundant letters? I experimented with different values of $DUP_WEIGHT, the amount to deduct for each redundant letter. Here are the results:

$DUP_WEIGHT Pangram Letter Count
1 sympathized quarterbacking jinx overflow 37
2 sympathized quacking overflow jinx be 33
3 sympathized quacking fox jaw verb all 32
4 sympathized unblock xv jaw qt frog 29

Eureka! At 29 letters, the last pangram is only 3 letters above the theoretical best-possible result of 26. (For values of $DUP_WEIGHT above 4, the result does not change.)

It should be emphasized that my improved results are not an improvement on the greedy algorithm. That algorithm applies to sets, whereas my search for pangrams is a search over words, which are essentially bags because they contain multiple instances of the same letters.

For the implementation, I found the Set::Tiny module very useful. I highly recommend it for working with sets of strings.

Wikipedia1 gives the following as a short pangram:

Waltz, bad nymph, for quick jigs vex. (28 letters)

Since all these words occur in the given dictionary, it would be possible for a search to improve on my best solution by at least one letter. Challenge: can anyone implement a search which finds this pangram (or a shorter one) within a reasonable time?

Full code listings for my solution (in Perl and Raku) will shortly be are available on the Weekly Challenge’s GitHub repository at https://github.com/manwar/perlweeklychallenge-club/tree/master/challenge-161/athanasius.

Cheers,

1“Pangram, ” Wikipedia, https://en.wikipedia.org/wiki/Pangram
2“Set cover problem,” Wikipedia, https://en.wikipedia.org/wiki/Set_cover_problem
3 “Scrabble letter distributions,” Wikipedia, https://en.wikipedia.org/wiki/Scrabble_letter_distributions#English

Athanasius <°(((><contra mundum Iustus alius egestas vitae, eros Piratica,


In reply to Pangram challenge: greed and Scrabble by Athanasius

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post; it's "PerlMonks-approved HTML":



  • Are you posting in the right place? Check out Where do I post X? to know for sure.
  • Posts may use any of the Perl Monks Approved HTML tags. Currently these include the following:
    <code> <a> <b> <big> <blockquote> <br /> <dd> <dl> <dt> <em> <font> <h1> <h2> <h3> <h4> <h5> <h6> <hr /> <i> <li> <nbsp> <ol> <p> <small> <strike> <strong> <sub> <sup> <table> <td> <th> <tr> <tt> <u> <ul>
  • Snippets of code should be wrapped in <code> tags not <pre> tags. In fact, <pre> tags should generally be avoided. If they must be used, extreme care should be taken to ensure that their contents do not have long lines (<70 chars), in order to prevent horizontal scrolling (and possible janitor intervention).
  • Want more info? How to link or How to display code and escape characters are good places to start.
Log In?
Username:
Password:

What's my password?
Create A New User
Domain Nodelet?
Chatterbox?
and the web crawler heard nothing...

How do I use this?Last hourOther CB clients
Other Users?
Others exploiting the Monastery: (8)
As of 2024-03-28 09:27 GMT
Sections?
Information?
Find Nodes?
Leftovers?
    Voting Booth?

    No recent polls found