Re: Re: Re: Fast wordlist lookup for game

As I said I would last night, I went back and adapted my to original programs to use 3-character indexing instead of two, to bring it into line with yours and benefit from the smaller linear search.

The result, using my original 26 "worst case" words, is a reduction in average search time by nearly an order of magnitude from 20ms/word to 2.7ms/word, which is kind of pleasing.

If you can make your 10k test file available to me, or we can agree on a suitable publically available test file, we could try comparing apples with apples:^).

I do love it when someone makes me work for my one of my boasts!

Indexer code

#! perl -sw
use strict;

my ($oldfirst, $oldc2n3) = ( '', '' );

open WORDS, '<',  $ARGV[0] or die $ARGV[0] . $1 . $/;
while(<WORDS>) {
    chomp;
    $_ = lc;
    my ($first, $c2n3) = /^(.)(..?)?.*/;
    $first = './' . $first;
    $c2n3 = substr($c2n3.'__',0,2) if !defined $c2n3 or length($c2n3) 
+< 2;
    $c2n3  = $first . '/' . $c2n3;

    if ( ($first ne $oldfirst) or ($c2n3 ne $oldc2n3) ) {
        mkdir $first if not -e $first;
        open SUB, '>>', $c2n3 or die $c2n3 . $! . $/;
    }
    print SUB $_, $/;
    ($oldfirst, $oldc2n3) = ($first, $c2n3);
    print "\r$c2n3:$_              ";
}
[download]

Wordcheck test program

#! perl -sw
use strict;
use Benchmark::Timer;

sub IsWord {
    return if not @_ or length == 1;
    my ($word, $first, $c2n3) = $_[0] =~ /^((.)(..?)?.*)$/;
    local $_ = do{ local (@ARGV,$/) = "./$first/$c2n3"; <> };
    return /^$word$/im;
}

my $timer = Benchmark::Timer->new(skip=>0);

for (@ARGV) {
    $timer->start;
    my $IsWord = IsWord( $_ );
    $timer->stop;
    print $_, $IsWord ? ' is ' : ' is not ', 'a word', $/;
}

$timer->report;
print "@{[$timer->data('_default')]}", $/;

__END__
[download]

Results and 'benchmark'.

C:\test\WORDS>../208899-2 Anonymous Baltic Confirm Delta Entity Font G
+regarious Hazzard Indigo Jalopy Kilo Lascivious Matter Notorious Ovul
+ation 
Prosthetic Quinine Reading Semaphore Transendental Uniform
Vivisection Weather Xenophobia Yesterday Zorro

Anonymous is a word
Baltic is a word
Confirm is a word
Delta is a word
Entity is a word
Font is a word
Gregarious is a word
Hazzard is a word
Indigo is a word
Jalopy is a word
Kilo is a word
Lascivious is a word
Matter is a word
Notorious is a word
Ovulation is a word
Prosthetic is a word
Quinine is a word
Reading is a word
Semaphore is a word
Transendental is not a word
Uniform is a word
Vivisection is a word
Weather is a word
Xenophobia is a word
Yesterday is a word
Zorro is a word
26 trials of _default (70ms total), 2.692ms/trial
0 0 0 0 0 0.01 0 0 0.01 0 0 0 0 0 0 0.01 0.01 0 0.01 0.01 0 0 0 0 0.01
+ 0

C:\test\WORDS>
[download]

Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy

Comment on Re: Re: Re: Fast wordlist lookup for game Select or Download Code

Replies are listed 'Best First'.
Re: Re: Re: Re: Fast wordlist lookup for game by grantm (Parson) on Oct 30, 2002 at 20:17 UTC
I've updated my original post to include the code I used to generate the index. So you can use that with your dictionary word list. The 10,000 word document I used in my benchmark test was the output of `perldoc perltoot`. I'm definitely interested in your results, unfortunately I've got some real work to do so I can't play right now. The main reason I posted a DBM solution was that many people are unaware of this middle ground between storing your data in a flat text file or using a full-blown RDBMS solution. One advantage is the shallow learning curve - if you can use a hash then with 3 more lines of code (use, tie, untie) then you can use a DBM solution. Another advantage is the blazing speed - the various DBM solutions are coded in highly optimised 'C' and use efficient indexing algorithms that can perform better than more general solutions used in RDBMS products. I also had a rethink about one tie vs one tie per lookup. Assuming that the final solution was going to run under mod_perl, you probably would tie once when the script was first invoked. Kind of like using a persistent database connection with DBI. It would only be safe to keep the DBM file tied if it was not being written to. For the performance boost it would give, it might be worth batching any required updates and using the DBM file in readonly mode.	[reply]
Re: Re: Re: Re: Re: Fast wordlist lookup for game by BrowserUk (Patriarch) on Oct 30, 2002 at 22:29 UTC
With a couple of changes to my program, I ran it on the output of perldoc perltoot and got the following results: `C:\test>208899-2 junk.dat Of 10479 word-like things, 7803 where flagged as words and 2676 as not 10479 trials of _default (58.184s total), 5.552ms/trial` [download] Which if my ad-hoc conversion factor of 233/800 has any merit puts it at 1.61702ms/word which stands up pretty well given the that it has to re-open the file each time. Not having used any of the dbm varients, I was impressed by the efficiency and simplicity of use that your program demonstrated, and when I looked, found that dada's site has a Win32 port of GDBM which I am in the process of downloading now. Thanks greatly for bringing this to my attention. There are many things for which I think I will find this extrememly useful. Nah! Your thinking of Simon Templar, originally played by Roger Moore and later by Ian Ogilvy	[reply] [d/l]
Re*6: Fast wordlist lookup for game by grantm (Parson) on Oct 30, 2002 at 23:29 UTC
Further thoughts: I suspect there's not too much merit in your *233/800 factor - those numbers are just raw processor Mhz numbers, many cycles will be wasted waiting for main memory and man, many more if disk IO is required. You could also use DB_File which is available via PPM from ActiveState and I believe uses the same API SDBM_File is a core perl module which also uses the same API, but has significant limitations on allowable size of data See also MLDBM which allows you to store complex data structures in a DBM hash	[reply]