lemnar has asked for the wisdom of the Perl Monks concerning the following question:

Anyone here have suggestions for computing Scrabble hooks (especially in terms of a speed increase)? I'm very much a Perl n00b, and when a friend of mine presented me with this text problem I thought of Perl and struggled through writing the script below, but it was much slower than hoped for. Any suggestions?

#!/usr/bin/perl
# the first argument encodes which stages to run (P = find possibles,
# C = check, S = sort), plus a direction (F = front, B = back) and the
# hook letter, e.g. "PCSFA"
$op = shift @ARGV;

# load a list of valid words
unless ($op =~ /^S$/) {
    open WORDLIST, shift @ARGV;
    while (<WORDLIST>) { chomp; push @words, $_; }
    close WORDLIST;
} else {
    while (<>) { push @correct, $_; }
}

# find a list of possible words from the list of valid words
if ($op =~ /^P(.{0,2})(.)(.+)$/) {
    ($otherop, $direction, $letter) = ($1, $2, $3);
    for (@words) {
        if ($direction eq "F") {
            if (/\b($letter)/) { push @possible, $'; }
        } elsif ($direction eq "B") {
            if (/($letter)\b/) { push @possible, $`; }
        }
    }
    if (!$otherop) { for (@possible) { print "$_\n"; } }
}

# check the list of possible words against the list of valid words
if ($op =~ /^(.?)C(S*)/) {
    $otherop = $2;
    if (!$1) { while (<>) { push @possible, $_; } }
    for $word (@words) {
        for (@possible) {
            if ($word eq $_) { push @correct, $_; }
        }
    }
    if (!$otherop) { for (@correct) { print "$_\n"; } }
}

# sort the list of correct words by size (small to large) and then abc
if ($op =~ /^(.*)S/) {
    if ($op =~ /S$/) { while (<>) { push @correct, $_; } }
    @sorted = sort { length $a <=> length $b or $a cmp $b } @correct;
    for (@sorted) { print "$_\n"; }
}


The problem is this: the input is a list of valid words, all letters in the same case, one word per line. The output is the words that remain valid when a given letter is removed from the beginning. E.g., a search for a front hook of A in a list containing AAH and AH would return AH, since AAH minus the A in front is AH, which is in the list. Expand this to 26 letters and 54 lists, and it took a significant amount of time to process a 1.6M list of valid words.

Any speed tips? Or is that much processing just going to take a long time? (I split the job up among 4x2GHz computers and it took about 8 hours to generate all of the lists.)

Source, inputs and outputs

Re: speedier text matching
by Zaxo (Archbishop) on Aug 07, 2004 at 00:49 UTC

    You're iterating through the wordlist too many times, giving you O(N**2) scaling or worse. Try making your wordlist the keys of a hash, which will speed up lookups a lot. Something like this:

    my %word;
    {
        open my $fh, '<', '/path/to/wordlist.txt' or die $!;
        chomp( my @words = <$fh> );
        close $fh or die $!;
        @word{@words} = ();
    }

    # ... then when it's time to look up a word,
    if ( exists $word{$foo} ) {
        # ...
    }
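
    For the front-hook search itself, that lookup replaces the pass over the wordlist. A minimal sketch, assuming %word is built as above and $letter holds the hook letter:

    for my $candidate (keys %word) {
        # a front hook of $letter: strip it, then check whether the rest is a word
        next unless $candidate =~ /^\Q$letter\E(.+)/;
        print "$1\n" if exists $word{$1};    # O(1) hash lookup instead of scanning @words
    }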

    After Compline,
    Zaxo

Re: speedier text matching
by Aristotle (Chancellor) on Aug 07, 2004 at 00:48 UTC

    If you go over the entire list for every word, it's just going to take a lot of time. You want a hash instead.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # the diamond operator is special:
    # supply filenames on the command line or provide input on STDIN and it just works
    my @word = <>;
    chomp @word;

    my %in_list;
    undef @in_list{ @word }; # now all entries from @word are keys in %in_list

    for ( @word ) {
        my @hook = grep { exists $in_list{ $_ } } (
            substr( $_, 1 ),
            substr( $_, 0, length( $_ ) - 1 ),
        );
        print "$_: @hook\n" if @hook;
    }

    You go through the list of words, generating the front- and tail-hook candidates for each, and then look up whether such a key exists in the hash.
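
    For instance, with a hypothetical word list of AAH, AH and AHA: stripping the first letter of AAH gives AH, and stripping the last letter of AHA also gives AH, both of which are in the list, so a run would print (hooks.pl being whatever you name the script above):

    $ printf 'AAH\nAH\nAHA\n' | perl hooks.pl
    AAH: AH
    AHA: AH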

    Makeshifts last the longest.

Re: speedier text matching
by ysth (Canon) on Aug 07, 2004 at 01:00 UTC
    Not a direct response to your question, but there was a Perl Quiz of the Week involving creating a spellchecker that would suggest word corrections. For my entry, I created a hash mapping each word-missing-one-character form to the corrected word, for each character of each word in a dictionary, and employed a number of tricks to get it to perform at a reasonable speed. See sub insertmisspelling if you are interested.
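
    A minimal sketch of that idea (the names %missing_one and @dictionary are illustrative, not ysth's actual code):

    # map each "word with one character deleted" form back to the
    # dictionary word(s) it could be a correction for
    my %missing_one;
    for my $word (@dictionary) {
        for my $i (0 .. length($word) - 1) {
            my $short = $word;
            substr($short, $i, 1) = '';    # delete the character at position $i
            push @{ $missing_one{$short} }, $word;
        }
    }

    # a misspelling that dropped one character then finds its
    # corrections with a single hash lookup:
    # my @suggestions = @{ $missing_one{$typo} || [] };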