Re: Re: Fast wordlist lookup for game

it came out approximately 20 ms/word ... I seriously doubt that any DBM solution will beat that!

Well that sounds like a challenge :-)

I started with a dictionary file containing about 100,000 words; created a GDBM file keyed on first 3 letters of words; and checked each word from a document containing 10479 words.

Program to create index...

use strict;
use GDBM_File ;

my %hash;

tie %hash, 'GDBM_File', 'words.dbm', &GDBM_WRCREAT, 0640;

my $dict_file = '/usr/share/dict/longlist';

open DICT, "<$dict_file" or die "open($dict_file): $!";

my @words = ();
my $prefix = '';

while(<DICT>) {
  chomp;
  $_ = lc($_);
  if(substr($_, 0, 3) ne $prefix) {
    if(@words) {
      $hash{$prefix} = join(' ', @words);
      @words = ();
    }
    $prefix = substr($_, 0, 3);
  }
  push @words, $_;
}

untie %hash ;
[download]

If I tie the GDBM file once and check all the words the script runs in under a second which is (a) well under one millisecond per word and (b) probably not how you'd do it unless you had a dedicated wordcheck daemon

If I tie the GDBM file once before checking each word (and then untie it afterwards) then checking all the words takes 50 seconds. Which is less than 5 milliseconds per word.

use strict;
use Benchmark;
use GDBM_File ;

my $dbm_file = 'words.dbm';

my %hash;
$/ = undef;

my $data = <DATA>;
my @words = map {lc($_)} ($data =~ /(\w+)/g);
print scalar(@words), "\n";

timethis(1, 'one_tie()');
timethis(1, 'many_ties()');

sub one_tie {
  my($found, $not_found) = (0, 0);
  tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640;
  foreach (@words) {
    my $prefix = substr($_, 0, 3);
    if(exists($hash{$prefix})  and  $hash{$prefix} =~ /\b$_\b/) {
      $found++;
    }
    else {
      $not_found++;
    }
  }
  untie %hash ;
  print "Found: $found\nNot found: $not_found\n";
}


sub many_ties {
  my($found, $not_found) = (0, 0);
  foreach (@words) {
    tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640;
    my $prefix = substr($_, 0, 3);
    if(exists($hash{$prefix})  and  $hash{$prefix} =~ /\b$_\b/) {
      $found++;
    }
    else {
      $not_found++;
    }
    untie %hash ;
  }
  print "Found: $found\nNot found: $not_found\n";
}


__DATA__

insert output of 'perldoc perltoot' here :-)
[download]

Update: Of course you didn't mention what hardware you're running on - mines a 800Mhz PIII

Update 2: Added code for generating index in <readmore> tags.

Comment on Re: Re: Fast wordlist lookup for game Select or Download Code

Replies are listed 'Best First'.
Re: Re: Re: Fast wordlist lookup for game by BrowserUk (Patriarch) on Oct 30, 2002 at 04:55 UTC
Hmmm. Mine's a whimpy 233Mhz PII :^( Erm? Something like 20*233/800 = 5.825 plus some factor for greater efficiency of PIII v PII. To close too call without proper benchmarking I think (he clutches at a straw). It would be interesting to see how GDBM scales, I've no concept. to add a third char to the indexing of mine (maybe tomorrow). get them both running on the same box I must admit, when I said dbm, I was thinking RDMB (MySQL) rather (what I assume) is a B-tree DBM or similar. Never having used GDBM, I am very surprised at that performance. I guess I should keep my thoughts inward:^). Nice one++. Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy	[reply]
Re: Re: Re: Fast wordlist lookup for game by BrowserUk (Patriarch) on Oct 30, 2002 at 17:18 UTC
As I said I would last night, I went back and adapted my to original programs to use 3-character indexing instead of two, to bring it into line with yours and benefit from the smaller linear search. The result, using my original 26 "worst case" words, is a reduction in average search time by nearly an order of magnitude from 20ms/word to 2.7ms/word, which is kind of pleasing. If you can make your 10k test file available to me, or we can agree on a suitable publically available test file, we could try comparing apples with apples:^). I do love it when someone makes me work for my one of my boasts! Indexer code Read more... (967 Bytes) Wordcheck test program Read more... (797 Bytes) Results and 'benchmark'. Read more... (1272 Bytes) Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy	[reply] [d/l] [select]
Re: Re: Re: Re: Fast wordlist lookup for game by grantm (Parson) on Oct 30, 2002 at 20:17 UTC
I've updated my original post to include the code I used to generate the index. So you can use that with your dictionary word list. The 10,000 word document I used in my benchmark test was the output of `perldoc perltoot`. I'm definitely interested in your results, unfortunately I've got some real work to do so I can't play right now. The main reason I posted a DBM solution was that many people are unaware of this middle ground between storing your data in a flat text file or using a full-blown RDBMS solution. One advantage is the shallow learning curve - if you can use a hash then with 3 more lines of code (use, tie, untie) then you can use a DBM solution. Another advantage is the blazing speed - the various DBM solutions are coded in highly optimised 'C' and use efficient indexing algorithms that can perform better than more general solutions used in RDBMS products. I also had a rethink about one tie vs one tie per lookup. Assuming that the final solution was going to run under mod_perl, you probably would tie once when the script was first invoked. Kind of like using a persistent database connection with DBI. It would only be safe to keep the DBM file tied if it was not being written to. For the performance boost it would give, it might be worth batching any required updates and using the DBM file in readonly mode.	[reply]
Re: Re: Re: Re: Re: Fast wordlist lookup for game by BrowserUk (Patriarch) on Oct 30, 2002 at 22:29 UTC
With a couple of changes to my program, I ran it on the output of perldoc perltoot and got the following results: `C:\test>208899-2 junk.dat Of 10479 word-like things, 7803 where flagged as words and 2676 as not 10479 trials of _default (58.184s total), 5.552ms/trial` [download] Which if my ad-hoc conversion factor of 233/800 has any merit puts it at 1.61702ms/word which stands up pretty well given the that it has to re-open the file each time. Not having used any of the dbm varients, I was impressed by the efficiency and simplicity of use that your program demonstrated, and when I looked, found that dada's site has a Win32 port of GDBM which I am in the process of downloading now. Thanks greatly for bringing this to my attention. There are many things for which I think I will find this extrememly useful. Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy	[reply] [d/l]
Re*6: Fast wordlist lookup for game by grantm (Parson) on Oct 30, 2002 at 23:29 UTC