
### Perl's pearls

by gmax (Abbot)
 on Dec 31, 2001 at 11:41 UTC

Dear fellow Monks,
I would like to share with you some notes on software implementation. Some of the technical content of this node has already been discussed in the Monastery, but I would like to present the issue in a more general context.
In his book "Programming Pearls", Jon Bentley describes an algorithm to find all anagrams in a dictionary. This is an interesting problem, because, until not long ago, CS textbooks used to present anagrams as a task for a brute force recursive attack, close to intractability. Bentley's refreshing view has shown us that several seemingly gigantic chores, when faced from a different angle, become almost routine.
A simplified description of this anagram method is a four-step task:
1. for each word in the dictionary, find a unique "signature" by sorting the word's letters;
2. print a space-separated list of the signature plus the word;
3. sort the list;
4. read into a new list, and put together the words with the same signature (adjacent within the list): they are the anagrams;
5. (not mentioned, but necessary) filter from the list the signatures with more than one word and sort them.
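The signature idea in step 1 can be sketched in a few lines of Perl (a minimal sketch; the complete script appears later in this node):

```perl
use strict;
use warnings;

# A word's anagram signature: its lowercased letters in sorted order.
sub signature {
    my ($word) = @_;
    return join "", sort split //, lc $word;
}

# Two words are anagrams exactly when their signatures are equal.
print signature("alter"), "\n";   # prints "aelrt"
print signature("later"), "\n";   # prints "aelrt"
```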

His implementation in "C" uses two programs (one to create the "signatures" and a second one to flatten two or more adjacent entries into a single one) plus the system sort. Even though he doesn't mention it, a third program is also needed, to filter out from the final list the words without anagrams.
I wanted to test the implementation of the same algorithm in Perl. It is much simpler, resulting in one script of 15 lines (some Golf implementations can be seen at this node and this one). The external sort plus the flattening program are replaced by inserting words into a hash, which deals with the problem very efficiently. No external filtering is needed. The final two steps, filtering and printing, are combined into a single step in Perl.
Of course it is not as fast as C, but the complexity of the Perl program is less than one third of the C version's. The three C programs together amount to 50 lines of code, and the compiled results have to be put together with two calls to the system sort:

```
./sign < dictionary | sort | ./squash | ./filter_anagrams | sort
```
The Perl script is simpler and more powerful than its C counterparts. If I wanted to implement the same handling of input/output as I have done in my Perl application, it wouldn't be trivial. Having had my share of C programming, I know for a fact that doing in C what boils down to those 15 lines of Perl can make me learn many new curses. :-)
Back to our anagrams in Perl. The principle is simple: a hash will collect the candidates, using the sorted letters of each word as key (the "signature"), so that two or more words with the same signature will end up together.
Here is the pseudocode:

```
create an empty words_hash

while (input available)
    windex = sorted lowercase word
    skip word duplicates
    add word to words_hash with key = windex

foreach windex in words_hash
    remove unique words    # = signatures with one word only

sort words_hash

foreach item in words_hash
    print anagrams
```
example: given this input file

```
Abby
abbot
acne
alter
baby
Baby
BABY
best
bets
cane
later
Unix
UNIX
```
%words will become (notice that the duplicates Baby, BABY and UNIX were skipped):

```
key      values
-----    ---------
abby  => abby baby
abbot => abbot
acen  => acne cane
aelrt => alter later
best  => best bets
inux  => unix
```

and then, skipping the unique items "abbot" and "inux", it will print (values only):

```
abby baby
acne cane
alter later
best bets
```
Enough introduction. Here is the actual working script:

```
#!/usr/bin/perl -w
# usage: anagram.pl < wordlist

use strict;
my %words = ();             # words index (key: sorted word; value: word(s))
while (<>) {
    chomp;                  # remove newline character
    $_ = lc $_;             # convert to lowercase
    my $windex = pack "C*",
        sort unpack "C*", $_;                # create index
    if (exists $words{$windex}) {
        next if $words{$windex} =~ /\b$_\b/; # skip duplicates
        $words{$windex} .= " ";              # add separating space
    }
    $words{$windex} .= $_;                   # insert a new word
}
print map {"$_\n"} sort grep {tr/ //} values %words;
```

And now, some Perl technical notes
(A) The last line of the script is roughly equivalent to the following:

```
my @anagrams = ();
foreach (values %words) {
    push @anagrams, $_ if tr/ //;
    # tr returns the number of processed characters,
    # spaces in this case. Therefore, if spaces are > 0,
    # it means that there are at least two words, and
    # we have some anagrams. If spaces are 0, the word
    # is unique and we skip it.
    # grep returns all the items that satisfy the
    # filtering condition (= are there any spaces?)
}

# map does the same as this foreach loop
foreach (sort @anagrams) {
    print "$_\n";
}
```
(B) Rejecting duplicates could be only *slightly* faster if implemented with grep:

```
if (exists $words{$windex}) {
    my $word = $_;
    next if grep {$_ eq $word}
        split / /, $words{$windex};  # skip duplicates
    $words{$windex} .= " ";          # add separating space
}
```
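Such micro-differences can be measured with the standard Benchmark module; here is a sketch (the sample list and word are made up) comparing the two duplicate checks:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $list = "abode adobe";   # a sample hash value holding two words
my $word = "adobe";         # candidate word to test for duplication

# Sanity check: both tests must agree before comparing their speed.
die unless ($list =~ /\b$word\b/) == (grep { $_ eq $word } split / /, $list);

cmpthese(100_000, {
    regex => sub { my $dup = ($list =~ /\b$word\b/) },
    grep  => sub { my $dup = grep { $_ eq $word } split / /, $list },
});
```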
(C) Using an anonymous array as hash value, instead of a space-separated string, makes the program shorter, but (surprisingly) slower.

```
#!/usr/bin/perl -w
use strict;
my %words = ();
while (<>) {
    chomp;
    my $word = lc $_;
    my $windex = pack "C*", sort unpack "C*", $word;
    next if (exists $words{$windex})
        and grep {$_ eq $word} @{$words{$windex}};
    push @{$words{$windex}}, $word;
}
print map {"@$_\n"} sort grep {scalar @$_ > 1} values %words;
```
After benchmarking, it seems that appending to and fetching a string is less expensive than pushing onto and interpolating an array, while counting items is faster with arrays. I won't go into more details here, since this node is already a long one.
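The comparison can be reproduced with the Benchmark module; this sketch (with a made-up two-word anagram set) contrasts the two storage strategies:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my @set = qw(abode adobe);

cmpthese(50_000, {
    # hash value is a space-separated string
    string => sub {
        my %h;
        for my $w (@set) {
            $h{sig} .= " " if exists $h{sig};
            $h{sig} .= $w;
        }
        my $line = "$h{sig}\n";
    },
    # hash value is an anonymous array
    array => sub {
        my %h;
        push @{$h{sig}}, $_ for @set;
        my $line = "@{$h{sig}}\n";
    },
});
```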

(D) Creating the word signature with

```
pack "C*", sort unpack "C*", $_
```

is faster than using

```
join "", sort split //, $_;
```
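A quick Benchmark sketch of the two forms (the sample word is made up). Note that the two forms do not produce the same byte order, since the default sort compares the numeric codes from unpack as strings; each form is consistent with itself, though, which is all a signature needs:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $word = "enumeration";

cmpthese(100_000, {
    # sort the numeric byte codes (stringwise by default)
    pack_unpack => sub { my $sig = pack "C*", sort unpack "C*", $word },
    # sort the individual characters
    join_split  => sub { my $sig = join "", sort split //, $word },
});
```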

(E) To make anagrams from words with accented characters (as in French and Italian), tr should produce a better result than lc:

```
# French and Italian anagrams
my $windex = $_;
$windex =~ tr/ÀÈÉÌÒÙA-Z/àèéìòùa-z/;  # converts to lowercase
$windex =~ tr/àèéìòù/aeeiou/;        # gives the same signature to
                                     # accented and unaccented vowels
$windex = pack "C*", sort unpack "C*", $windex;  # create index
```
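On a modern perl, a more general alternative (a sketch, not part of the original node) is the core Unicode::Normalize module: decompose to NFD and delete the combining marks, so the same trick works for any accented letter, not just the vowels listed in the tr:

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Signature that ignores case and diacritics: decompose to NFD,
# drop combining marks (\p{Mn}), then sort the remaining letters.
sub accent_free_signature {
    my ($word) = @_;
    my $plain = NFD(lc $word);
    $plain =~ s/\p{Mn}//g;       # remove the accents
    return join "", sort split //, $plain;
}

# "\x{e9}" is e-acute, "\x{e2}" is a-circumflex; the words are
# made-up examples. Both reduce to the letters of "tape"/"pate".
print accent_free_signature("tap\x{e9}"), "\n";   # prints "aept"
```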
Results

To obtain a significant sample, I merged the standard Unix "words" dictionary and the American and British English dictionaries from the GNU ispell package (eg: strings british.hash | perl -ne 'print if /^\w+$/'), and removed duplicated words. The total was about 100_000 items. In less than 5 seconds (Linux on a Pentium III 700) I got a list of more than 14_000 anagrams, from which I report an interesting excerpt:

```
abolitionism mobilisation
acres cares races sacre scare
actualises casualties
aimless melissa
climaxes exclaim
collapse escallop
colonialist oscillation
endoscopic picosecond
enumeration mountaineer
friary rarify
obscurantist subtractions
predicts scripted
striptease tapestries
shower whores
```
A design error
When I was comparing the relative efficiency of the algorithm in Perl and in C, I realized that the C program was finding about 80 fewer anagrams than the Perl one. I dumped both result lists to two files and had a look at them with diff. The Perl script finds anagrams like

```
angrier earring rearing
arches chaser search
```

while the C program only gets

```
angrier earring
arches chaser
```

I cross-checked the C source code twice, and when I was convinced that the program was bug-free, I also checked the intermediate lists produced after the signing and sorting passes. I even suspected a bug in the GNU sort program (forgive me, RMS), but I dismissed the idea after a thorough test, comparing the results of different sorting algorithms in C and Perl. No bugs there either.
Then I started thinking about the design, and eventually I found the real reason, which lies in one assumption of the algorithm. The Perl implementation seems to be equivalent to the C one, but it is not. Those 80 extra anagrams are the result of the better accuracy of the hash compared with the sort/squash approach.
The wrong assumption is that the sort program is sorting the signatures independently of their values. However, signatures and values are *in the same string*, so the separating space and the beginning of the value words take part in the sorting just like any other character.
Using a hash, instead, guarantees that the signatures are independent of the associated values.
The C algorithm, then, although faster, is less accurate, and needs some re-thinking, using a supporting data structure or a customized sorting algorithm (by the first item, or by length and contents).
(For those with some knowledge of the GNU sort utility, the problem can be solved by using sort -k 1,1, which sorts by the first field only. But, as Hannibal Lecter would say, this is incidental.)
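The Perl analogue of sort -k 1,1 is a comparison on the first field only; a sketch with hand-built "signature word" lines:

```perl
use strict;
use warnings;

# "signature word" lines, as produced by the signing pass.
my @lines = (
    "aeginrr rearing",
    "aeginrr angrier",
    "aeginrr earring",
);

# Sort on the first (signature) field only, ignoring the word part,
# via a Schwartzian transform, so lines with equal signatures
# always stay adjacent.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ (split / /)[0], $_ ] } @lines;

print "$_\n" for @sorted;
```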

A design improvement
I have mentioned several times that the C program is faster than my Perl script. But to be completely honest, I was not comparing the same thing. One of the reasons the script is slow is that I don't trust the input. My application does two things more than the C program: it converts the input to lowercase and checks for duplicates, both of which were taken for granted in the original design.
What I should be comparing, instead, is a script that, given 'Unix' and 'UNIX' as input, reports them as anagrams. But at the moment I'd rather wait two seconds more and enjoy an accurate output.

Conclusion
I didn't want to demonstrate that Perl is better than C (everybody on this site already has some ideas on this issue, I suspect), but I wanted to stress some useful aspects of software construction in Perl. The richness of features can greatly simplify the task of implementing a data manipulation algorithm. The difference in performance is compensated by the greater accuracy of the output and the ease of coding, which also accounts for ease of maintenance and of adaptation to similar tasks (eg: dictionaries of other languages).
I hope this exercise can be useful to somebody.

Update:
petral found an efficiency improvement of the anagram algorithm.
A hint by TomK32 led to an interesting personal challenge, i.e. multiple-word anagrams, whose first shot is here.
``` _  _ _  _
(_|| | |(_|><
_|
```

Replies are listed 'Best First'.
Re: Perl's pearls
by merlyn (Sage) on Dec 31, 2001 at 21:13 UTC
I'd do the duplicate elimination in the outer loop:

```
while (<DATA>) {
    chomp;
    my $key = lc $_;
    next if $seen{$key}++;
    my $signature = join "", sort split //, $key;
    push @{$words{$signature}}, $_;
}
for (sort keys %words) {
    my @list = sort @{$words{$_}};
    next unless @list > 1;
    print "@list\n";
}
__END__
Abby
abbot
acne
alter
baby
Baby
BABY
best
bets
cane
later
Unix
UNIX
```
Your script is very elegant, shorter, and functionally equivalent.
However, this simplicity comes at a price. The overhead of a second hash to eliminate duplicates is a heavy load on performance, especially after adding "use strict" and "-w" on top.
I compared the execution times on a 102_759 word dictionary.

```
$ time perl gmax_anagram.pl < words | wc
4.26user 0.07system 0:04.32elapsed
   6427   14802  107910

$ time perl merlyn_anagram.pl < words | wc
6.35user 0.10system 0:06.45elapsed
   6427   14802  107910

(after adding use strict and -w)
$ time perl merlyn_anagram.pl < words | wc
7.44user 0.12system 0:07.55elapsed
   6427   14802  107910
```

Anyway, this approach is one more thing to know, if I ever want to favor a shorter script over a faster one.
Thanks.
``` _  _ _  _
(_|| | |(_|><
_|
```
OK, let's get rid of that big hash, if you think that's where the trouble is:

```
WORD: while (<DATA>) {
    chomp;
    my $key = lc $_;
    my $signature = join "", sort split //, $key;
    my $list = $words{$signature} ||= [];
    lc $_ eq $key and next WORD for @$list;
    push @$list, $_;
}
for (sort keys %words) {
    my @list = sort @{$words{$_}};
    next unless @list > 1;
    print "@list\n";
}
__END__
Abby
abbot
acne
alter
baby
Baby
BABY
best
bets
cane
later
Unix
UNIX
```
How's that do?
It seems like the main improvement/optimization would be not looping twice through the list of all words. Move *all* the processing into the main loop:

```
my (%word, %gram);

while (<>) {
    chomp;

#     $_ = lc $_;
    /[^a-z]/ and next;
    my $sig = pack "C*", sort unpack "C*", $_;

    if (exists $word{$sig}) {
        if (exists $gram{$sig}) {
            next if $gram{$sig} =~ /\b$_\b/;
            $gram{$sig} .= " $_";            # rare
        }
        else {
            next if $word{$sig} eq $_;
            $gram{$sig} = "$word{$sig} $_";  # rare
        }
    }
    else {
        $word{$sig} = $_;                    # mostly
    }
}

print join "\n", (sort values %gram), '';    # just output the short list
```
Only the first word of an anagram set is in both lists.
Here's some more finds, mostly from the short OED from here:

```
ablest bleats stable tables
ascertain cartesian sectarian
aspirant partisan
attentive tentative
auctioned cautioned education
canoe ocean
comedian demoniac
compile polemic
covert vector
danger gander garden
deist diets edits idest sited tides
emits items metis mites smite times
emitter termite
lapse leaps pales peals pleas
nastily saintly
obscurantist subtractions
observe obverse verbose
opt pot top
opts post pots spot stop tops
opus soup
oy yo
petrography typographer
peripatetic precipitate
present repents serpent
presume supreme
resin rinse risen siren
salivated validates
slitting stilting tiltings titlings tlingits
views wives
vowels wolves
```
p
Might somebody on this thread be able to walk me through certain aspects of how this code is working? I understand the general principle, but I am missing some details, and I would like to understand it precisely. I have been reading all the documentation that I can, and am figuring it out in bits and pieces, but I need to understand the behavior of this program perfectly, as I want to write a program to accomplish a similar task.

I'd suggest running it under the debugger (perl -d script.pl) and examine the interesting bits and/or the bits you don't understand. The debugger isn't terribly difficult to use, and it's the easiest way to suss out the details you don't understand yet.

```
$ perl -d pm_135391.pl
Editor support available.

Enter h or 'h h' for help, or 'man perldebug' for more help.

main::(pm_135391.pl:1): while (<DATA>) {
  DB<1> n
main::(pm_135391.pl:2):   chomp;
  DB<1> p "<$_>"
<Abby
>
```

So here, we started the program under the debugger, executed the first statement (loading $_ from <DATA>), then displayed the value using the p command. (I used "<$_>" to put delimiters around the value, so you can see that it has end-of-line characters in it.)

```
  DB<2> n
main::(pm_135391.pl:3):   my $key = lc $_;
  DB<2> n
main::(pm_135391.pl:4):   next if $seen{$key}++;
  DB<2> n
main::(pm_135391.pl:5):   my $signature = join "", sort split //, $key;
  DB<2> x %seen
0  "abby \cM"
1  1
```

Here, we executed the next three lines to chomp the value, make the $key value, and add $key to the %seen hash. Then I used the x command to display the %seen hash, so you can see what's in it.

```
  DB<3> n
main::(pm_135391.pl:6):   push @{$words{$signature}}, $_;
  DB<3> x $signature
0  "\cM abby"
  DB<4> h
List/search source lines:               Control script execution:
  l [ln|sub]  List source code            T           Stack trace
  - or .      List previous/current line  s [expr]    Single step [in expr]
  v [line]    View around line            n [expr]    Next, steps over subs
  f filename  View source in file         <CR/Enter>  Repeat last n or s
  /pattern/ ?patt?   Search forw/backw    r           Return from subroutine
  M           Show module versions        c [ln|sub]  Continue until position
Debugger controls:                        L           List break/watch/actions
  o [...]     Set debugger options        t [n] [expr] Toggle trace [max depth] ][trace expr]
  <[<]|{[{]|>[>] [cmd] Do pre/post-prompt b [ln|event|sub] [cnd] Set breakpoint
  ! [N|pat]   Redo a previous command     B ln|*      Delete a/all breakpoints
  H [-num]    Display last num commands   a [ln] cmd  Do cmd before line
  = [a val]   Define/list an alias        A ln|*      Delete a/all actions
  h [db_cmd]  Get help on command         w expr      Add a watch expression
  h h         Complete help page          W expr|*    Delete a/all watch exprs
  |[|]db_cmd  Send output to pager        ![!] syscmd Run cmd in a subprocess
  q or ^D     Quit                        R           Attempt a restart
Data Examination:     expr     Execute perl code, also see: s,n,t expr
  x|m expr       Evals expr in list context, dumps the result or lists methods.
  p expr         Print expression (uses script's current package).
  S [[!]pat]     List subroutine names [not] matching pattern
  V [Pk [Vars]]  List Variables in Package.  Vars can be ~pattern or !pattern.
  X [Vars]       Same as "V current_package [Vars]".  i class inheritance tree.
  y [n [Vars]]   List lexicals in higher scope <n>.  Vars same as V.
For more help, type h cmd_letter, or run man perldebug for all docs.
```
Then, I decided I showed you enough ;^) and had it print the help page. Have fun with it!

...roboticus

When your only tool is a hammer, all problems look like your thumb.

Re: Perl's pearls
by TomK32 (Monk) on Dec 31, 2001 at 15:34 UTC
It's a very cool script but what I'm missing is using two words for input or even getting a two-word-anagram like this example:
Jim Morrison -> Mr. Mojo Risin'

--
paco for president
TomK32 - just a geek trying to change the world

This is a real challenge (at least for me).
Finding (multiple) words that are anagrams of a given phrase is much more complex than finding the simple anagrams. However, we can do something, at least to find up to three words corresponding to a given input.
Re: Perl's pearls
by stefp (Vicar) on Jan 02, 2002 at 00:10 UTC
Unique signatures can be put to all sorts of good uses. Suppose you want to regularly do a complete backup of a farm of workstations. If they run the same OS, these workstations have a lot of files with identical content.

So when you do a backup, you compare the MD5 signature of each file to see if it matches the signature of the files already stored.
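A minimal sketch of that idea with the core Digest::MD5 module (the %seen bookkeeping and the return strings are made up for illustration):

```perl
use strict;
use warnings;
use Digest::MD5;

# content digest -> first path stored with that content
my %seen;

# Decide whether a file's content needs to be stored, or whether a
# reference to an already-stored identical file is enough.
sub store_or_reference {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    binmode $fh;
    my $digest = Digest::MD5->new->addfile($fh)->hexdigest;
    close $fh;
    if (exists $seen{$digest}) {
        return "reference to $seen{$digest}";
    }
    $seen{$digest} = $path;
    return "full copy";
}
```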

For each workstation, you also need to save the path of each file and all the meta information (ACLs, major-minor numbers for devices...). Files in the same directory, when sorted alphabetically, share a "prefix string", so you can store only what changes from one file to the next; example:

```
/var/log/messages.1
/var/log/messages.2  # replace "1" by "2"
```

But you are lazy and know this has been done before, and check out the source of slocate/updatedb. :)

I deliberately have overlooked many details. I just wanted to give an example where unique signatures could be used.

-- stefp

Re: Perl's pearls
by kbrannen (Beadle) on Jul 30, 2013 at 02:19 UTC

I wanted something like this, so it was good to find it. So, how have 12 years changed things, other than that it still works?

Tech note D is of interest:

```
Creating the word signature with
pack "C*", sort unpack "C*", $_
is faster than using
join "", sort split //, $_;
```

With Perl 5.14.2, the join/sort/split version is actually faster now, only by a few hundredths of a second on my machine, but it is faster. This is on 917K words. It also generates a correct signature, while the pack version generates a bad signature. Changing the last line to foreach (sort keys %words) { print "$_- $words{$_}\n"; } allows us to compare:
• pack gives: deoab- abode adobe
• split gives: abdeo- abode adobe
The pack version works because it orders letters in the same way every time, but it doesn't order them alphabetically like it's supposed to. I'm not sure if this is a result of changes to perl as it has advanced through time, or if it was that way originally. I don't have a version of perl from back in 2001 to test with anyway.
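As far as I can tell, it was that way from the start: without a comparison block, sort compares the byte codes returned by unpack as strings, so "100" sorts before "97". A numeric comparator gives the alphabetical signature; a small sketch:

```perl
use strict;
use warnings;

my $word = "abode";

# Default sort compares the codes (97, 98, 111, ...) as strings,
# so "100" comes before "97" and the letters end up as "deoab".
my $sig_default = pack "C*", sort unpack "C*", $word;

# A numeric comparator restores alphabetical order.
my $sig_numeric = pack "C*", sort { $a <=> $b } unpack "C*", $word;

print "$sig_default\n";   # prints "deoab"
print "$sig_numeric\n";   # prints "abdeo"
```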

I found the key generation error because I actually want the key to be available and need the letters to be in alphabetical order for another program.

Thanks for sharing!

Node Type: perlmeditation [id://135345]