it came out approximately 20 ms/word ... I seriously doubt that any DBM solution will beat that!

Well that sounds like a challenge :-)

I started with a dictionary file containing about 100,000 words; created a GDBM file keyed on first 3 letters of words; and checked each word from a document containing 10479 words.

Program to create index...

use strict; use GDBM_File ; my %hash; tie %hash, 'GDBM_File', 'words.dbm', &GDBM_WRCREAT, 0640; my $dict_file = '/usr/share/dict/longlist'; open DICT, "<$dict_file" or die "open($dict_file): $!"; my @words = (); my $prefix = ''; while(<DICT>) { chomp; $_ = lc($_); if(substr($_, 0, 3) ne $prefix) { if(@words) { $hash{$prefix} = join(' ', @words); @words = (); } $prefix = substr($_, 0, 3); } push @words, $_; } untie %hash ;

If I tie the GDBM file once and check all the words the script runs in under a second which is (a) well under one millisecond per word and (b) probably not how you'd do it unless you had a dedicated wordcheck daemon

If I tie the GDBM file once before checking each word (and then untie it afterwards) then checking all the words takes 50 seconds. Which is less than 5 milliseconds per word.

use strict; use Benchmark; use GDBM_File ; my $dbm_file = 'words.dbm'; my %hash; $/ = undef; my $data = <DATA>; my @words = map {lc($_)} ($data =~ /(\w+)/g); print scalar(@words), "\n"; timethis(1, 'one_tie()'); timethis(1, 'many_ties()'); sub one_tie { my($found, $not_found) = (0, 0); tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640; foreach (@words) { my $prefix = substr($_, 0, 3); if(exists($hash{$prefix}) and $hash{$prefix} =~ /\b$_\b/) { $found++; } else { $not_found++; } } untie %hash ; print "Found: $found\nNot found: $not_found\n"; } sub many_ties { my($found, $not_found) = (0, 0); foreach (@words) { tie %hash, 'GDBM_File', $dbm_file, &GDBM_WRCREAT, 0640; my $prefix = substr($_, 0, 3); if(exists($hash{$prefix}) and $hash{$prefix} =~ /\b$_\b/) { $found++; } else { $not_found++; } untie %hash ; } print "Found: $found\nNot found: $not_found\n"; } __DATA__ insert output of 'perldoc perltoot' here :-)

Update: Of course you didn't mention what hardware you're running on - mines a 800Mhz PIII

Update 2: Added code for generating index in <readmore> tags.


In reply to Re: Re: Fast wordlist lookup for game by grantm
in thread Fast wordlist lookup for game by jerrygarciuh

Title:
Use:  <p> text here (a paragraph) </p>
and:  <code> code here </code>
to format your post, it's "PerlMonks-approved HTML":



  • Posts are HTML formatted. Put <p> </p> tags around your paragraphs. Put <code> </code> tags around your code and data!
  • Titles consisting of a single word are discouraged, and in most cases are disallowed outright.
  • Read Where should I post X? if you're not absolutely sure you're posting in the right place.
  • Please read these before you post! —
  • Posts may use any of the Perl Monks Approved HTML tags:
    a, abbr, b, big, blockquote, br, caption, center, col, colgroup, dd, del, details, div, dl, dt, em, font, h1, h2, h3, h4, h5, h6, hr, i, ins, li, ol, p, pre, readmore, small, span, spoiler, strike, strong, sub, summary, sup, table, tbody, td, tfoot, th, thead, tr, tt, u, ul, wbr
  • You may need to use entities for some characters, as follows. (Exception: Within code tags, you can put the characters literally.)
            For:     Use:
    & &amp;
    < &lt;
    > &gt;
    [ &#91;
    ] &#93;
  • Link using PerlMonks shortcuts! What shortcuts can I use for linking?
  • See Writeup Formatting Tips and other pages linked from there for more info.