in reply to Re: Re: Fast wordlist lookup for game
in thread Fast wordlist lookup for game

As I said I would last night, I went back and adapted my to original programs to use 3-character indexing instead of two, to bring it into line with yours and benefit from the smaller linear search.

The result, using my original 26 "worst case" words, is a reduction in average search time by nearly an order of magnitude from 20ms/word to 2.7ms/word, which is kind of pleasing.

If you can make your 10k test file available to me, or we can agree on a suitable publically available test file, we could try comparing apples with apples:^).

I do love it when someone makes me work for my one of my boasts!

Indexer code

#! perl -sw use strict; my ($oldfirst, $oldc2n3) = ( '', '' ); open WORDS, '<', $ARGV[0] or die $ARGV[0] . $1 . $/; while(<WORDS>) { chomp; $_ = lc; my ($first, $c2n3) = /^(.)(..?)?.*/; $first = './' . $first; $c2n3 = substr($c2n3.'__',0,2) if !defined $c2n3 or length($c2n3) +< 2; $c2n3 = $first . '/' . $c2n3; if ( ($first ne $oldfirst) or ($c2n3 ne $oldc2n3) ) { mkdir $first if not -e $first; open SUB, '>>', $c2n3 or die $c2n3 . $! . $/; } print SUB $_, $/; ($oldfirst, $oldc2n3) = ($first, $c2n3); print "\r$c2n3:$_ "; }

Wordcheck test program

#! perl -sw use strict; use Benchmark::Timer; sub IsWord { return if not @_ or length == 1; my ($word, $first, $c2n3) = $_[0] =~ /^((.)(..?)?.*)$/; local $_ = do{ local (@ARGV,$/) = "./$first/$c2n3"; <> }; return /^$word$/im; } my $timer = Benchmark::Timer->new(skip=>0); for (@ARGV) { $timer->start; my $IsWord = IsWord( $_ ); $timer->stop; print $_, $IsWord ? ' is ' : ' is not ', 'a word', $/; } $timer->report; print "@{[$timer->data('_default')]}", $/; __END__

Results and 'benchmark'.

C:\test\WORDS>../208899-2 Anonymous Baltic Confirm Delta Entity Font G +regarious Hazzard Indigo Jalopy Kilo Lascivious Matter Notorious Ovul +ation Prosthetic Quinine Reading Semaphore Transendental Uniform Vivisection Weather Xenophobia Yesterday Zorro Anonymous is a word Baltic is a word Confirm is a word Delta is a word Entity is a word Font is a word Gregarious is a word Hazzard is a word Indigo is a word Jalopy is a word Kilo is a word Lascivious is a word Matter is a word Notorious is a word Ovulation is a word Prosthetic is a word Quinine is a word Reading is a word Semaphore is a word Transendental is not a word Uniform is a word Vivisection is a word Weather is a word Xenophobia is a word Yesterday is a word Zorro is a word 26 trials of _default (70ms total), 2.692ms/trial 0 0 0 0 0 0.01 0 0 0.01 0 0 0 0 0 0 0.01 0.01 0 0.01 0.01 0 0 0 0 0.01 + 0 C:\test\WORDS>

Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

Replies are listed 'Best First'.
Re: Re: Re: Re: Fast wordlist lookup for game
by grantm (Parson) on Oct 30, 2002 at 20:17 UTC

    I've updated my original post to include the code I used to generate the index. So you can use that with your dictionary word list.

    The 10,000 word document I used in my benchmark test was the output of `perldoc perltoot`.

    I'm definitely interested in your results, unfortunately I've got some real work to do so I can't play right now. The main reason I posted a DBM solution was that many people are unaware of this middle ground between storing your data in a flat text file or using a full-blown RDBMS solution. One advantage is the shallow learning curve - if you can use a hash then with 3 more lines of code (use, tie, untie) then you can use a DBM solution. Another advantage is the blazing speed - the various DBM solutions are coded in highly optimised 'C' and use efficient indexing algorithms that can perform better than more general solutions used in RDBMS products.

    I also had a rethink about one tie vs one tie per lookup. Assuming that the final solution was going to run under mod_perl, you probably would tie once when the script was first invoked. Kind of like using a persistent database connection with DBI. It would only be safe to keep the DBM file tied if it was not being written to. For the performance boost it would give, it might be worth batching any required updates and using the DBM file in readonly mode.

      With a couple of changes to my program, I ran it on the output of perldoc perltoot and got the following results:

      C:\test>208899-2 junk.dat Of 10479 word-like things, 7803 where flagged as words and 2676 as not 10479 trials of _default (58.184s total), 5.552ms/trial

      Which if my ad-hoc conversion factor of *233/800 has any merit puts it at 1.61702ms/word which stands up pretty well given the that it has to re-open the file each time.

      Not having used any of the *dbm varients, I was impressed by the efficiency and simplicity of use that your program demonstrated, and when I looked, found that dada's site has a Win32 port of GDBM which I am in the process of downloading now.

      Thanks greatly for bringing this to my attention. There are many things for which I think I will find this extrememly useful.


      Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

        Further thoughts:

        • I suspect there's not too much merit in your *233/800 factor - those numbers are just raw processor Mhz numbers, many cycles will be wasted waiting for main memory and man, many more if disk IO is required.
        • You could also use DB_File which is available via PPM from ActiveState and I believe uses the same API
        • SDBM_File is a core perl module which also uses the same API, but has significant limitations on allowable size of data
        • See also MLDBM which allows you to store complex data structures in a DBM hash