comment on

it came out approximately 20 ms/word ... I seriously doubt that any DBM solution will beat that!

Well that sounds like a challenge :-)

I started with a dictionary file containing about 100,000 words; created a GDBM file keyed on first 3 letters of words; and checked each word from a document containing 10479 words.

Program to create index...

use strict;
use GDBM_File ;

my %hash;

tie %hash, 'GDBM_File', 'words.dbm', &GDBM_WRCREAT, 0640;

my $dict_file = '/usr/share/dict/longlist';

open DICT, "<$dict_file" or die "open($dict_file): $!";

my @words = ();
my $prefix = '';

while(<DICT>) {
  chomp;
  $_ = lc($_);
  if(substr($_, 0, 3) ne $prefix) {
    if(@words) {
      $hash{$prefix} = join(' ', @words);
      @words = ();
    }
    $prefix = substr($_, 0, 3);
  }
  push @words, $_;
}

untie %hash ;
[download]

If I tie the GDBM file once and check all the words the script runs in under a second which is (a) well under one millisecond per word and (b) probably not how you'd do it unless you had a dedicated wordcheck daemon

If I tie the GDBM file once before checking each word (and then untie it afterwards) then checking all the words takes 50 seconds. Which is less than 5 milliseconds per word.

use strict;
use Benchmark;
use GDBM_File ;

my $dbm_file = 'words.dbm';

my %hash;
$/ = undef;

my $data = <DATA>;
my @words = map {lc($_)} ($data =~ /(\w+)/g);
print scalar(@words), "\n";

timethis(1, 'one_tie()');
timethis(1, 'many_ties()');

sub one_tie {
  my($found, $not_found) = (0, 0);
  tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640;
  foreach (@words) {
    my $prefix = substr($_, 0, 3);
    if(exists($hash{$prefix})  and  $hash{$prefix} =~ /\b$_\b/) {
      $found++;
    }
    else {
      $not_found++;
    }
  }
  untie %hash ;
  print "Found: $found\nNot found: $not_found\n";
}


sub many_ties {
  my($found, $not_found) = (0, 0);
  foreach (@words) {
    tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640;
    my $prefix = substr($_, 0, 3);
    if(exists($hash{$prefix})  and  $hash{$prefix} =~ /\b$_\b/) {
      $found++;
    }
    else {
      $not_found++;
    }
    untie %hash ;
  }
  print "Found: $found\nNot found: $not_found\n";
}


__DATA__

insert output of 'perldoc perltoot' here :-)
[download]

Update: Of course you didn't mention what hardware you're running on - mines a 800Mhz PIII

Update 2: Added code for generating index in <readmore> tags.

In reply to Re: Re: Fast wordlist lookup for game by grantm
in thread Fast wordlist lookup for game by jerrygarciuh

Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!

Titles consisting of a single word are discouraged, and in most cases are disallowed outright.

Read Where should I post X? if you're not absolutely sure you're posting in the right place.

Please read these before you post! —

Posts may use any of the Perl Monks Approved HTML tags:

a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr

You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)

	For:		Use:
	&		`&`
	<		`<`
	>		`>`
	[		`[`
	]		`]`

Link using PerlMonks shortcuts! What shortcuts can I use for linking?

See Writeup Formatting Tips and other pages linked from there for more info.