Perl's pearls

Dear fellow Monks,
I would like to share with you some notes on software implementation. Some of the technical contents of this node have already been discussed in the Monastery, but I would like to present the issue in a more general context.
In his book "Programming Pearls", Jon Bentley describes an algorithm to find all anagrams in a dictionary. This is an interesting problem, because, until not long ago, CS textbooks used to present anagrams as a task for a brute force recursive attack, close to intractability. Bentley's refreshing view has shown us that several seemingly gigantic chores, when faced from a different angle, become almost routine.
A simplified description of this anagram method is a four steps task.
1. for each word in the dictionary, find an unique "signature" by sorting the word's letters.
2. print a space-separated list of the signature plus the word;
3. sort the list;
4. read into a new list, and put together the words with the same signature (adjacent within the list). They are the anagrams.
5. (not mentioned, but necessary) filter from the list the signatures with more than one word and sort them.

His implementation in "C" uses two programs (one to create the "signatures" and the second one to flatten two or more adjacent entries into a single one) plus the system sort. Even if he doesn't mention it, a third program is also needed, to filter off from the final list the words without anagram.
I wanted to test the implementation of the same algorithm in Perl. It is much simpler, resulting in one script of 15 lines (some Golf implementations can be seen at this node and this one). The external sorting plus the flattening program are replaced by inserting words in a hash, which deals very efficiently with the problem. No external filtering is needed. The final two steps, filtering and printing, are combined into one step only in Perl.
Of course it is not as fast as C, but the complexity of the Perl program is less than one third. The three C programs together amount to 50 lines of code, and the compiled results have to be put together with two calls to the system sort
./sign < dictionary | sort | ./squash | ./filter_anagrams |sort
The Perl script is simpler and more powerful than its C counterparts. If I wanted to implement the same handling of input/output as I have done in my Perl application it wouldn't be trivial. Having had my share of C programming, I know for a fact that doing in C what boils in those 15 lines of Perl can make me learn many new curses. :-)
Back to our anagrams in Perl. The principle is simple: a hash will create the candidates, using the sorted letters of each word as key (the "signature") so that two or more words with the same signature will end up together.

Here is the pseudocode

---------------------------
    create an empty words_hash
    while (input available)
        read one word
        windex = sorted lowercase word
        skip word duplicates
        add word to words_hash with key = windex

    foreach windex in words_hash
        remove unique words # = signatures with one word only
    
    sort words_hash

    foreach item in words_hash
        print anagrams
----------------------------
[download]

example: given this input file

    Abby
    abbot
    acne
    alert
    alter
    baby
    Baby
    BABY
    best
    bets
    cane
    later
    Unix
    UNIX
[download]

%words will become (notice that the duplicates in BABY, Baby and UNIX were skipped)

    key      values
    -----    ---------
    abby  => abby baby
    abbot => abbot
    acen  => acne cane
    best  => best bets
    inux  => unix
    aelrt => alert alter later
[download]

and then, skipping the unique items "abbot" and "inux", it will print (only values)

    
    abby baby
    acne cane
    best bets
    alert alter later
[download]

Enough introduction. Here is the actual working script

#!/usr/bin/perl -w
# usage: anagram.pl < wordlist
 
use strict;
my %words = ();                              # words index (Key: sorte
+d 
                                             # word. Value: word(s))
while (<>) {
    chomp;                                   # remove newline characte
+r
    $_ = lc $_;                              # converts to lowercase
    my $windex = pack "C*", 
        sort unpack "C*", $_;                # create index
    if (exists $words{$windex}) {             
        next if $words{$windex} =~ /\b$_\b/; # skip duplicates
        $words{$windex} .= " ";              # add separating space
    }
    $words{$windex} .= $_;                   # inserts a new word
}
print map {"$_\n"} sort grep {tr/ //} values %words;
[download]

And now, some Perl technical notes
(A) The last line of the script is roughly equivalent to the following

 my @anagrams =();
 foreach (values %words) {
    push @anagrams, $_ if tr/ //;
       # tr returns the number of processed characters, 
       # spaces in this case. Therefore, if spaces are > 0,
       # it means that there are at least two words, and 
       # we have some anagrams. If space is 0, the word 
       # is unique and we skip it.
       # grep returns all the items that satisfy the 
       # filtering condition (= are there any spaces?)
 }
 
 # map does the same as this foreach loop 
 foreach (sort @anagrams) {
        print "$_\n"
 }
[download]

(B) Rejecting duplicates could be only *slightly* faster if implemented with grep

    if (exists $words{$windex}) {             
        my $word = $_;
        next if grep {$_ eq $word} split / /, 
            $words{$windex} ;    # skip duplicates
        $words{$windex} .= " ";  # add separating space
    }
[download]

(C) Using an anonymous array as hash value, instead of a space-separated string, makes the program shorter, but (surprisingly) slower.

#!/usr/bin/perl -w
use strict;
my %words = (); 
while (<>) {
    chomp;
    my $word = lc $_;
    my $windex = pack "C*", sort unpack "C*", $word;
    next if (exists $words{$windex}) 
        and grep {$_ eq $word} @{$words{$windex}};
    push @{$words{$windex}}, $word;
}
print map {"@$_\n"} sort grep {scalar @$_ > 1} values %words;
[download]

After benchmarking, it seems that adding and fetching a string is less expensive than pushing and fetching (with interpolation) an array, while counting items is faster in arrays. I wouldn't go into more details here, since this node is already a long one.

(D) Creating the word signature with
pack "C*", sort unpack "C*", $_
is faster than using
join "", sort split //, $_;

(E) To make anagrams from words with accented characters (like in French and Italian,) tr should produce a better result than lc

    # French and Italian anagrams
    my $windex = $_;
    $windex =~ tr/噼殪蝙[A-Z]/aeeiou[a-z]/; # gives the same signature
+ 
                                            # to accented and 
                                            # unaccented vowels
    tr/廊商屹[A-Z]/噼殪蝙[a-z]/;            # converts to lowercase
    $windex = pack "C*", sort unpack "C*", $windex; # create index
[download]

Results
To obtain a significant sample, I merged the standard Unix "words" dictionary, the American and British English dictionaries from the GNU ispell package (eg: strings british.hash | perl -ne 'print if /^\w+$/') and removed duplicated words. The total was about 100_000 items. In less than 5 seconds (Linux on a PentiumIII 700) I got a list of more than 14_000 anagrams, from which I report an interesting excerpt:

abolitionism mobilisation 
acres cares races sacre scare
actualises casualties
aimless melissa
climaxes exclaim
collapse escallop
colonialist oscillation
detains instead sainted stained
donator odorant rotonda tornado 
endoscopic picosecond
enumeration mountaineer
friary rarify
license selenic silence
obscurantist subtractions
predicts scripted
striptease tapestries
shower whores
[download]

A design error
When I was comparing the relative efficiency of the algorithm in Perl and in C, I realized that the C program was finding about 80 anagrams less than the Perl one. I dumped both result lists to two files and I had a look at them with diff. The Perl scripts finds anagrams like
angrier earring rearing
arches chaser search
while the C program only gets
angrier earring
arches chaser

I cross-checked the C source code twice, and when I was convinced that the program was bug-free, I also checked the intermediate lists produced after the signing and sorting pass. I was even suspecting a bug in the GNU sort program (forgive me, RMS), but I dismissed it after a thorough test, comparing the result of different sorting algorithms in C and Perl. No bugs there either.
Then, I started thinking about the design, and eventually I found the real reason, which is in one assumption of the algorithm. The Perl implementation seems to be equivalent to the C one, but it is not. These 80 anagrams more are the result of better accuracy of the hash against the sort/squash approach.
The wrong assumption is that the sort program is sorting the signatures independently from its values. However, signatures and values are *in the same string*, thus the separating space and the beginning of the value words are evaluated into the sorting as well as any other character.
Using a hash, instead, guarantees that the signatures are independent from the associated values.
The C algorithm, then, although faster, is less accurate, and needs some re-thinking, using a supporting data structure or using a customized sorting algorithm (by the first item or by length and contents).
(Having some knowledge of the GNU sort utility, however, the problem would be solved by using sort -k 1,1 , which sorts by the first word only. But, as Hannibal Lecter would say, this is incidental).

A design improvement
I have mentioned several times that the C program is faster than my Perl script. But to be completely honest, I was not comparing the same thing. One of the reasons for which the script is slow is that I don't trust the input. My application is doing two things more than the C program: converting the input to lowercase and checking for duplicates, which were taken for granted in the original design.
What I should be comparing, instead, is a script that, given in input 'Unix' and 'UNIX', reports them as anagrams. But at the moment I'd rather wait two seconds more and enjoy an accurate output.

Conclusion
I didn't want to demonstrate that Perl is better than C (everybody in this site has already some ideas on this issue, I suspect) but I wanted to stress some useful aspects of software construction in Perl. The richness of features can greatly simplify the task of implementing a data manipulation algorithm. The difference in performance is compensated by the greater accuracy of the ouput and the ease of coding, which also accounts for ease of maintenance and adaptation to similar tasks (eg: other languages dictionaries).
I hope this exercise could be useful to somebody.

update.
petral found an efficiency improvement of the anagram algorythm.
A hint by TomK32 led to an interesting personal challenge, ie multiple words anagrams, whose first shot is here.

 _  _ _  _  
(_|| | |(_|><
 _|

Comment on Perl's pearls Select or Download Code

Replies are listed 'Best First'.
Re: Perl's pearls by merlyn (Sage) on Dec 31, 2001 at 21:13 UTC
I'd do the duplicate elimination in the outer loop: `while (<DATA>) { chomp; my $key = lc $_; next if $seen{$key}++; my $signature = join "", sort split //, $key; push @{$words{$signature}}, $_; } for (sort keys %words) { my @list = sort @{$words{$_}}; next unless @list > 1; print "@list\n"; } __END__ Abby abbot acne alert alter baby Baby BABY best bets cane later Unix UNIX` [download] -- Randal L. Schwartz, Perl hacker	[reply] [d/l]
Re: Re: Perl's pearls by gmax (Abbot) on Dec 31, 2001 at 21:48 UTC
Your script is very elegant, shorter and functionally equivalent. However, this simplicity comes at a price. The overhead of a second hash to eliminate duplicates is a heavy load on performance, especially after adding "use strict" and "-w" on top. I compared the execution times on a 102_759 words dictionary. `$ time perl gmax_anagram.pl < words \|wc 4.26user 0.07system 0:04.32elapsed 6427 14802 107910 $ time perl merlyn_anagram.pl < words \|wc 6.35user 0.10system 0:06.45elapsed 6427 14802 107910 (after adding use strict and -w) $ time perl merlyn_anagram.pl < words \|wc 7.44user 0.12system 0:07.55elapsed 6427 14802 107910` [download] Anyway, this approach is one more thing to know, if I want to privilege a shorter script in favor of a faster one. Thanks. _ _ _ _ (_\|\| \| \|(_\|>< _\|	[reply] [d/l]
Re: Re: Re: Perl's pearls by merlyn (Sage) on Dec 31, 2001 at 21:58 UTC
OK, let's get rid of that big hash, if you think that's where the trouble is: `WORD: while (<DATA>) { chomp; my $key = lc $_; my $signature = join "", sort split //, $key; my $list = $words{$signature} \|\|= []; lc $_ eq $key and next WORD for @$list; push @$list, $_; } for (sort keys %words) { my @list = sort @{$words{$_}}; next unless @list > 1; print "@list\n"; } __END__ Abby abbot acne alert alter baby Baby BABY best bets cane later Unix UNIX` [download] How's that do? -- Randal L. Schwartz, Perl hacker	[reply] [d/l]
Re: Re: Re: Re: Perl's pearls by gmax (Abbot) on Jan 02, 2002 at 11:05 UTC
Re: Re: Re: Perl's pearls by petral (Curate) on Jan 02, 2002 at 20:17 UTC
It seems like the main improvement/optimization would be not looping twice through the list of all words. Move all processing into the main loop: `my (%word, %gram); while (<>) { chomp; # $_ = lc $_; /[^a-z]/ and next; my $sig = pack "C", sort unpack "C", $_; if (exists $word{$sig}) { if (exists $gram{$sig}) { next if $gram{$sig} =~ /\b$_\b/; $gram{$sig} .= " $_"; # rare } else { next if $word{$sig} eq $_; $gram{$sig} = "$word{$sig} $_"; # rare } } else { $word{$sig} = $_; # mostly } } print join "\n", (sort values %gram), ''; # just output short list` [download] Only the first word of an anagram set is in both lists. Here's some more finds, mostly from the short OED from here ablest bleats stable tables adroitly dilatory idolatry angered derange enraged grandee grenade ascertain cartesian sectarian asleep elapse please aspirant partisan attentive tentative auctioned cautioned education canoe ocean comedian demoniac compile polemic covert vector danger gander garden deist diets edits idest sited tides emits items metis mites smite times emitter termite lapse leaps pales peals pleas nastily saintly obscurantist subtractions observe obverse verbose opt pot top opts post pots spot stop tops opus soup oy yo petrography typographer peripatetic precipitate present repents serpent presume supreme resin rinse risen siren salivated validates slitting stilting tiltings titlings tlingits views wives vowels wolves woodlark workload [download] p	[reply] [d/l] [select]
Re: Re: Re: Re: Perl's pearls by gmax (Abbot) on Jan 02, 2002 at 20:55 UTC
Re^2: Perl's pearls by dominick_t (Acolyte) on Oct 27, 2017 at 00:41 UTC
Might somebody on this thread be able to walk me through certain aspects of how this code is working? I understand the general principle, but I am missing some details, and I would like to understand it precisely. I have been reading all the documentation that I can, and am figuring it out in bits and pieces, but I need to understand the behavior of this program perfectly, as I want to write a program to accomplish a similar task.	[reply]
Re^3: Perl's pearls by roboticus (Chancellor) on Oct 27, 2017 at 04:16 UTC
dominick_t: I'd suggest running it under the debugger (perl -d script.pl) and examine the interesting bits and/or the bits you don't understand. The debugger isn't terribly difficult to use, and it's the easiest way to suss out the details you don't understand yet. `$ perl -d pm_135391.pl Loading DB routines from perl5db.pl version 1.49_001 Editor support available. Enter h or 'h h' for help, or 'man perldebug' for more help. main::(pm_135391.pl:1): while (<DATA>) { DB<1> n main::(pm_135391.pl:2): chomp; DB<1> p "<$_>" <Abby >` [download] So here, we started the program under the debugger, executed the first statement (loading $_ from <DATA>), then displayed the value using the p command. (I used "<$_>" to put delimiters around the value, so you can see that it has end-of-line characters in it. `DB<2> n main::(pm_135391.pl:3): my $key = lc $_; DB<2> n main::(pm_135391.pl:4): next if $seen{$key}++; DB<2> n main::(pm_135391.pl:5): my $signature = join "", sort split //, $key +; DB<2> x %seen 0 "abby \cM" 1 1` [download] Here, we executed the next three lines to chomp the value, make the $key value, and add $key to the %seen hash. Then I used the x command to display the %seen hash, so you can see what's in it. DB<3> n main::(pm_135391.pl:6): push @{$words{$signature}}, $_; DB<3> x $signature 0 "\cM abby" DB<4> h List/search source lines: Control script execution: l [ln\|sub] List source code T Stack trace - or . List previous/current line s [expr] Single step [in +expr] v [line] View around line n [expr] Next, steps over + subs f filename View source in file <CR/Enter> Repeat last n or + s /pattern/ ?patt? Search forw/backw r Return from subr +outine M Show module versions c [ln\|sub] Continue until p +osition Debugger controls: L List break/watch +/actions o [...] Set debugger options t [n] [expr] Toggle trace [m +ax depth] ][trace expr] <[<]\|{[{]\|>[>] [cmd] Do pre/post-prompt b [ln\|event\|sub] [cnd] Set b +reakpoint ! [N\|pat] Redo a previous command B ln\|* Delete a/all bre +akpoints H [-num] Display last num commands a [ln] cmd Do cmd before li +ne = [a val] Define/list an alias A ln\|* Delete a/all act +ions h [db_cmd] Get help on command w expr Add a watch expr +ession h h Complete help page W expr\|* Delete a/all wat +ch exprs \|[\|]db_cmd Send output to pager ![!] syscmd Run cmd in a sub +process q or ^D Quit R Attempt a restar +t Data Examination: expr Execute perl code, also see: s,n,t expr x\|m expr Evals expr in list context, dumps the result or lists + methods. p expr Print expression (uses script's current package). S [[!]pat] List subroutine names [not] matching pattern V [Pk [Vars]] List Variables in Package. Vars can be ~pattern or ! +pattern. X [Vars] Same as "V current_package [Vars]". i class inherita +nce tree. y [n [Vars]] List lexicals in higher scope <n>. Vars same as V. e Display thread id E Display all thread ids. For more help, type h cmd_letter, or run man perldebug for all docs. [download] Then, I decided I showed you enough ;^) and had it print the help page. Have fun with it! ...roboticus When your only tool is a hammer, all problems look like your thumb.	[reply] [d/l] [select]
Re: Perl's pearls by TomK32 (Monk) on Dec 31, 2001 at 15:34 UTC
It's a very cool script but what I'm missing is using two words for input or even getting a two-word-anagram like this example: `Jim Morrison -> Mr. Mojo Risin'` -- paco for president TomK32 - just a geek trying to change the world	[reply] [d/l]
Multiple words anagrams (challenge) Re: Perl's pearls by gmax (Abbot) on Jan 02, 2002 at 11:33 UTC
This is a real challenge (at least for me). Finding (multiple) words that are anagrams of a given phrase is much more complex than finding the simple anagrams. However, we can do something, at least to find up to three words corresponding to a given input. Read more... (7 kB)	[reply] [d/l] [select]
Re: Perl's pearls by stefp (Vicar) on Jan 02, 2002 at 00:10 UTC
Unique signatures can be put to all sort of good use. Suppose you want to regularly do a complete backup a farm of workstations. If they run the same OS, these workstations have a lot of files with identical content. So when you do a backup, you compare the MD5 signature of each file to see if it match signature of the files already stored. For each workstation, you also need to save the path of each file and all the meta information (ACLs, major-minor for devices...). Files in the same directory when sorted alphabetically share a "prefix string" so you can store only what changes from one file to the next; example: `/var/log/meesages /var/log/messages.1 # add .1 /var/log/meesages.2 # replace "1" by "2"` [download] But you are lazy and know this have been done before anc check out the source of slocate/updatedb. :) I delibaretely have overlooked many details. I just wanted to give an example where unique signatures could be used. -- stefp	[reply] [d/l]
Re: Perl's pearls by kbrannen (Beadle) on Jul 30, 2013 at 02:19 UTC
I was wanting something like this, so it was good to find. So how has 12 years changed things other than it still works? Tech note D is of interest: Creating the word signature with pack "C", sort unpack "C", $_ is faster than using join "", sort split //, $_; With Perl 5.14.2, the join/sort/split version is actually faster now, only by a few hundredths of a second on my machine, but it is faster. This is on 917K words. It also generates a correct signature, while the pack version generates a bad signature. Changing the last line to `foreach(sort(keys(%words))) { print "$_- $words{$_}\n"; }` allows us to compare: pack gives: deoab- abode adobe split gives: abdeo- abode adobe The pack version works because it orders letters in the same way, but it doesn't order them in alphabetical order like it's supposed to. I'm not sure if this is a result of changes to perl as it's advanced thru time, or if it was that way originally. I don't have a version of perl from back in 2001 to test anyway. I found the key generation error because I actually want the key to be available and need the letters to be in alphabetical order for another program. Thanks for sharing!	[reply] [d/l]

Back to Meditations