in reply to Fast wordlist lookup for game

You could take your original idea of splitting the list across files keyed by first letter one step further: split the words into subdirectories named for the first letter, then into files named for the second (so "apple" lands in ./a/p). This would vastly reduce the memory overhead of the lookup.

I did this with my own words file containing over 600,000 words, then wrote a small sub to read the appropriate file and an m// to check for the word's existence. In a thoroughly unscientific benchmark I used it to look up a word in the largest subset for each of the 26 first letters, and it came out at approximately 20 ms/word, which should be fast enough for almost any purpose.

The program to split the words file

my ($oldfirst, $oldsecond) = ('', '');

open WORDS, '<', $ARGV[0] or die $ARGV[0] . $! . $/;

while (<WORDS>) {
    chomp;
    next if length == 1;
    $_ = lc;
    my ($first, $second) = /^(.)(.).*/;
    $first  = './' . $first;
    $second = $first . '/' . (($second eq ' ') ? '_' : $second);
    if ( ($first ne $oldfirst) or ($second ne $oldsecond) ) {
        mkdir $first if not -e $first;
        open SUB, '>>', $second or die $second . $! . $/;
    }
    print SUB $_, $/;
    ($oldfirst, $oldsecond) = ($first, $second);
    print "\r$second:$_ ";
}

And the program for doing the lookup.

#! perl -sw
use strict;
use Benchmark::Timer;

sub IsWord {
    return if not @_ or length == 1;
    my ($word, $first, $second) = $_[0] =~ /^((.)(.).*)$/;
    local $_ = do { local (@ARGV, $/) = "./$first/$second"; <> };
    return /^$word$/im;
}

my $timer = Benchmark::Timer->new(skip => 0);

for (@ARGV) {
    $timer->start;
    my $IsWord = IsWord($_);
    $timer->stop;
    print $_, $IsWord ? ' is ' : ' is not ', 'a word', $/;
}

$timer->report;
print "@{[$timer->data('_default')]}", $/;
__END__

The results and benchmark

C:\test\WORDS>../208899-2 Anonymous Baltic Confirm Delta Entity Font Gregarious Hazzard Indigo Jalopy Kilo Lascivious Matter Notorious Ovulation Prosthetic Quinine Reading Semaphore Transendental Uniform Vivisection Weather Xenophobia Yesterday Zorro
Anonymous is a word
Baltic is a word
Confirm is a word
Delta is a word
Entity is a word
Font is a word
Gregarious is a word
Hazzard is a word
Indigo is a word
Jalopy is a word
Kilo is a word
Lascivious is a word
Matter is a word
Notorious is a word
Ovulation is a word
Prosthetic is a word
Quinine is a word
Reading is a word
Semaphore is a word
Transendental is not a word
Uniform is a word
Vivisection is a word
Weather is a word
Xenophobia is a word
Yesterday is a word
Zorro is a word
26 trials of _default (522ms total), 20.077ms/trial
0.01 0.01 0.041 0.02 0.02 0.01 0.01 0.02 0.02 0.01 0 0.01 0.02 0.02 0.01 0.08 0.01 0.03 0.04 0.07 0.041 0.01 0.01 0 0 0
C:\test\WORDS>

I seriously doubt that any DBM solution will beat that!


Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

Re: Re: Fast wordlist lookup for game
by grantm (Parson) on Oct 30, 2002 at 04:03 UTC
    it came out approximately 20 ms/word ... I seriously doubt that any DBM solution will beat that!

    Well that sounds like a challenge :-)

    I started with a dictionary file containing about 100,000 words, created a GDBM file keyed on the first 3 letters of each word, and checked each word from a document containing 10,479 words.

    Program to create index...
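
    A minimal sketch of such an indexer (assumptions: the dictionary arrives one word per line on STDIN, and words sharing a 3-letter prefix are stored space-separated so that the \b-bounded match in the benchmark below can find them):

    use strict;
    use GDBM_File;

    my %hash;
    tie %hash, 'GDBM_File', 'words.dbm', &GDBM_WRCREAT, 0640
        or die "Cannot tie words.dbm: $!\n";

    while (<>) {                            # one word per line on STDIN
        chomp;
        my $word   = lc $_;
        my $prefix = substr($word, 0, 3);   # key on the first 3 letters
        $hash{$prefix} .= " $word";         # append the word to its bucket
    }

    untie %hash;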

    If I tie the GDBM file once and check all the words, the script runs in under a second, which is (a) well under one millisecond per word and (b) probably not how you'd do it unless you had a dedicated wordcheck daemon.

    If instead I tie the GDBM file before checking each word (and untie it afterwards), then checking all the words takes 50 seconds - still less than 5 milliseconds per word.

    use strict;
    use Benchmark;
    use GDBM_File;

    my $dbm_file = 'words.dbm';
    my %hash;

    $/ = undef;
    my $data  = <DATA>;
    my @words = map { lc($_) } ($data =~ /(\w+)/g);
    print scalar(@words), "\n";

    timethis(1, 'one_tie()');
    timethis(1, 'many_ties()');

    sub one_tie {
        my ($found, $not_found) = (0, 0);
        tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640;
        foreach (@words) {
            my $prefix = substr($_, 0, 3);
            if (exists($hash{$prefix}) and $hash{$prefix} =~ /\b$_\b/) {
                $found++;
            }
            else {
                $not_found++;
            }
        }
        untie %hash;
        print "Found: $found\nNot found: $not_found\n";
    }

    sub many_ties {
        my ($found, $not_found) = (0, 0);
        foreach (@words) {
            tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640;
            my $prefix = substr($_, 0, 3);
            if (exists($hash{$prefix}) and $hash{$prefix} =~ /\b$_\b/) {
                $found++;
            }
            else {
                $not_found++;
            }
            untie %hash;
        }
        print "Found: $found\nNot found: $not_found\n";
    }

    __DATA__
    insert output of 'perldoc perltoot' here :-)

    Update: Of course you didn't mention what hardware you're running on - mine's an 800MHz PIII

    Update 2: Added code for generating index in <readmore> tags.

      Hmmm. Mine's a wimpy 233MHz PII :^(

      Erm? Something like 20 * 233/800 = 5.825 ms/word, plus some factor for the greater efficiency of the PIII vs the PII. Too close to call without proper benchmarking, I think (he clutches at a straw).

      It would be interesting to

      1. see how GDBM scales (I've no idea);
      2. add a third char to the indexing of mine (maybe tomorrow);
      3. get them both running on the same box.

      I must admit, when I said DBM, I was thinking of an RDBMS (MySQL) rather than what I assume is a B-tree DBM or similar. Never having used GDBM, I am very surprised at that performance. I guess I should keep my thoughts inward :^).

      Nice one++.


      Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

      As I said I would last night, I went back and adapted my two original programs to use 3-character indexing instead of two, to bring mine into line with yours and benefit from the smaller linear search.

      The result, using my original 26 "worst case" words, is a reduction in average search time of nearly an order of magnitude, from 20ms/word to 2.7ms/word, which is kind of pleasing.

      If you can make your 10k test file available to me, or we can agree on a suitable publicly available test file, we could try comparing apples with apples :^).

      I do love it when someone makes me work for one of my boasts!

      Indexer code
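
      A minimal sketch of the 3-character indexer, assuming the obvious extension of the 2-character scheme above - a directory per first letter, a subdirectory per second and a file per third (the '_' handling for short words is skipped here):

      use strict;

      open WORDS, '<', $ARGV[0] or die $ARGV[0] . $! . $/;

      while (<WORDS>) {
          chomp;
          next if length < 3;        # sketch: skip 1- and 2-letter words
          $_ = lc;
          my ($c1, $c2, $c3) = /^(.)(.)(.)/;
          my $dir = "./$c1/$c2";
          mkdir "./$c1" if not -e "./$c1";
          mkdir $dir    if not -e $dir;
          open SUB, '>>', "$dir/$c3" or die "$dir/$c3" . $! . $/;
          print SUB $_, $/;
          close SUB;
      }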

      Wordcheck test program
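
      And a matching sketch of the lookup: IsWord() from above, adapted to slurp ./$first/$second/$third:

      sub IsWord {
          return if not @_ or length $_[0] < 3;
          my ($word, $c1, $c2, $c3) = $_[0] =~ /^((.)(.)(.).*)$/;
          # slurp the one small file that could contain the word
          local $_ = do { local (@ARGV, $/) = "./$c1/$c2/$c3"; <> };
          return unless defined;
          return /^$word$/im;
      }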

      Results and 'benchmark'.


      Nah! You're thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

        I've updated my original post to include the code I used to generate the index. So you can use that with your dictionary word list.

        The 10,000 word document I used in my benchmark test was the output of `perldoc perltoot`.

        I'm definitely interested in your results; unfortunately I've got some real work to do so I can't play right now. The main reason I posted a DBM solution was that many people are unaware of this middle ground between storing your data in a flat text file and using a full-blown RDBMS solution. One advantage is the shallow learning curve - if you can use a hash, then with 3 more lines of code (use, tie, untie) you can use a DBM solution. Another advantage is the blazing speed - the various DBM implementations are coded in highly optimised C and use efficient indexing algorithms that can perform better than the more general solutions used in RDBMS products.
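
        For instance, a sketch against a hypothetical data.dbm file - the only non-hash lines are the use, the tie and the untie:

        use GDBM_File;                                         # line 1: use

        my %h;
        tie %h, 'GDBM_File', 'data.dbm', &GDBM_WRCREAT, 0640;  # line 2: tie
        $h{apple} = 1;                 # ordinary hash ops, stored on disk
        print "found\n" if exists $h{apple};
        untie %h;                                              # line 3: untie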

        I also had a rethink about a single tie vs one tie per lookup. Assuming the final solution was going to run under mod_perl, you would probably tie once when the script was first invoked - kind of like using a persistent database connection with DBI. It would only be safe to keep the DBM file tied if it was not being written to. For the performance boost it would give, it might be worth batching any required updates and using the DBM file in read-only mode.
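
        A sketch of that tie-once pattern, using GDBM_READER to open the file read-only (the is_word() helper is illustrative, not from the code above):

        use GDBM_File;

        # tie once at startup, read-only, and reuse for every lookup
        my %words;
        tie %words, 'GDBM_File', 'words.dbm', &GDBM_READER, 0640
            or die "Cannot tie words.dbm: $!\n";

        sub is_word {
            my $word   = lc shift;
            my $prefix = substr($word, 0, 3);
            return exists $words{$prefix}
                && $words{$prefix} =~ /\b\Q$word\E\b/;
        }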