lemnar has asked for the wisdom of the Perl Monks concerning the following question:

Anyone here have suggestions for computing Scrabble hooks (especially in terms of a speed increase)? I'm very much a Perl n00b, and when a friend of mine presented me with this text problem I thought of Perl and struggled through writing the script below, but it was much slower than hoped for. Any suggestions?

#!/usr/bin/perl
# the first argument encodes which stages to run (P = find possibles,
# C = check, S = sort), plus a direction (F = front, B = back) and the
# hook letter, e.g. "PCSFA"
$op = shift @ARGV;

# load a list of valid words
unless ($op =~ /^S$/) {
    open WORDLIST, shift @ARGV;
    while (<WORDLIST>) { chomp; push @words, $_; }
    close WORDLIST;
} else {
    while (<>) { push @correct, $_; }
}

# find a list of possible words from the list of valid words
if ($op =~ /^P(.{0,2})(.)(.+)$/) {
    ($otherop, $direction, $letter) = ($1, $2, $3);
    for (@words) {
        if ($direction eq "F") {
            if (/\b($letter)/) { push @possible, $'; }
        } elsif ($direction eq "B") {
            if (/($letter)\b/) { push @possible, $`; }
        }
    }
    if (!$otherop) { for (@possible) { print "$_\n"; } }
}

# check the list of possible words against the list of valid words
if ($op =~ /^(.?)C(S*)/) {
    $otherop = $2;
    if (!$1) { while (<>) { push @possible, $_; } }
    for $word (@words) {
        for (@possible) {
            if ($word eq $_) { push @correct, $_; }
        }
    }
    if (!$otherop) { for (@correct) { print "$_\n"; } }
}

# sort the list of correct words by size (small to large) and then abc
if ($op =~ /^(.*)S/) {
    if ($op =~ /S$/) { while (<>) { push @correct, $_; } }
    @sorted = sort { length $a <=> length $b or $a cmp $b } @correct;
    for (@sorted) { print "$_\n"; }
}


The problem is this: the input is a list of valid words, all letters in the same case, one word per line. The output is the words that remain valid when a given letter is removed from the beginning. E.g., a search for a front hook of A in a list containing AAH and AH would return AH, since AAH minus the A in front is AH, which is in the list. Expand this to 26 letters and 54 lists, and it took a significant amount of time to process a 1.6M list of valid words.

Any speed tips? Or is that much processing just going to take a long time? (I split the job up among 4x2GHz computers and it took about 8 hours to generate all of the lists.)

Source, inputs and outputs

Re: speedier text matching
by Zaxo (Archbishop) on Aug 07, 2004 at 00:49 UTC

    You're iterating through the wordlist too many times, giving you O(N**2) scaling or worse. Try making your wordlist the keys of a hash, which will speed up lookups a lot. Something like this:

    my %word;
    {
        open my $fh, '<', '/path/to/wordlist.txt' or die $!;
        chomp( my @words = <$fh> );
        close $fh or die $!;
        @word{@words} = ();
    }

    # ... then when it's time to look up a word,
    if ( exists $word{$foo} ) {
        # ...
    }
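
    For the front-hook search itself, that lookup replaces the pass over the wordlist. A minimal sketch, assuming %word is built as above and $letter holds the hook letter:

    for my $candidate (keys %word) {
        # a front hook of $letter: strip it, then check whether the rest is a word
        next unless $candidate =~ /^\Q$letter\E(.+)/;
        print "$1\n" if exists $word{$1};    # O(1) hash lookup instead of scanning @words
    }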

    After Compline,
    Zaxo

Re: speedier text matching
by Aristotle (Chancellor) on Aug 07, 2004 at 00:48 UTC

    If you go over the entire list for every word, it's just going to take a lot of time. You want a hash instead.

    #!/usr/bin/perl
    use strict;
    use warnings;

    # the diamond operator is special:
    # supply filenames on the command line or provide input on STDIN and it just works
    my @word = <>;
    chomp @word;

    my %in_list;
    undef @in_list{ @word }; # now all entries from @word are keys in %in_list

    for ( @word ) {
        my @hook = grep { exists $in_list{ $_ } } (
            substr( $_, 1 ),
            substr( $_, 0, length( $_ ) - 1 ),
        );
        print "$_: @hook\n" if @hook;
    }

    You go through the list of words, generating the front- and tail-hook candidates for each, and then look up whether such a key exists in the hash.
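
    For instance, with a hypothetical word list of AAH, AH and AHA: stripping the first letter of AAH gives AH, and stripping the last letter of AHA also gives AH, both of which are in the list, so a run would print (hooks.pl being whatever you name the script above):

    $ printf 'AAH\nAH\nAHA\n' | perl hooks.pl
    AAH: AH
    AHA: AH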

    Makeshifts last the longest.

Re: speedier text matching
by ysth (Canon) on Aug 07, 2004 at 01:00 UTC
    Not a direct response to your question, but there was a Perl Quiz of the Week involving creating a spellchecker that would suggest word corrections. For my entry, I created a hash mapping each word-missing-one-character form to the corrected word, for each character of each word in a dictionary, and employed a number of tricks to get it to perform at a reasonable speed. See sub insertmisspelling if you are interested.
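
    A minimal sketch of that idea (the names %missing_one and @dictionary are illustrative, not ysth's actual code):

    # map each "word with one character deleted" form back to the
    # dictionary word(s) it could be a correction for
    my %missing_one;
    for my $word (@dictionary) {
        for my $i (0 .. length($word) - 1) {
            my $short = $word;
            substr($short, $i, 1) = '';    # delete the character at position $i
            push @{ $missing_one{$short} }, $word;
        }
    }

    # a misspelling that dropped one character then finds its
    # corrections with a single hash lookup:
    # my @suggestions = @{ $missing_one{$typo} || [] };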