Hey everyone,

Deep within a Perl program I'm writing, I need to implement a spellcheck dictionary (not one that returns suggestions, just one that says whether a word is spelled correctly). Rather than use something external, I thought I'd write it in Perl: I'm already there, it should be pretty simple, and I'll have more control over it, so I can (ideally) make sure it's fast.

My dictionary is constant and has about 100,000 words in it, and I'm spell-checking about one terabyte of text, so I want it to be as fast as possible. Is a very simple hash and if ($dictionary{$word}) going to be the fastest way? I ran such a dictionary over a gigabyte of text on my computer, checking every word, and it took five minutes in total. At that rate the whole 1 TB corpus would take about 3.5 days, on top of the several days' worth of processing I already need to do. My dissertation is due in about two weeks, so this is valuable time...

I have plenty of ideas, but I'm newish to Perl, and I don't want to waste time benchmarking 10 different approaches if half of them are stupid and I've missed the best approach anyway. Is using exists any faster than checking the value? Would it be faster to have 26 separate hashes, one for each letter of the alphabet, arranged in an array, hash, subroutine, whatever? Since my dictionary is constant, are there any simple ways to make perfect (or closer to perfect) hash functions in Perl? What about using a tree? Any other clever ideas?
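
For the first of those questions (exists versus checking the value), a minimal sketch of how the two could be timed against each other with the core Benchmark module is below. The word list, the sample input, and the sub names are hypothetical stand-ins, not my real data, and the relative timings will depend on the Perl build and the machine, so treat it as a starting point rather than an answer.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical stand-in for the real 100,000-word list.
my @words = qw(apple banana cherry dissertation perl);

# Build the lookup hash once; storing 1 keeps both tests below equivalent.
my %dictionary;
@dictionary{@words} = (1) x @words;

# Hypothetical sample input: a mix of known and unknown words.
my @sample = qw(apple perl qwzx banana teh cherry);

# Count misspellings by testing the truth of the stored value.
sub count_misses_value {
    my $misses = 0;
    for my $word (@sample) {
        $misses++ unless $dictionary{$word};
    }
    return $misses;
}

# Count misspellings with exists, which only checks that the key is present.
sub count_misses_exists {
    my $misses = 0;
    for my $word (@sample) {
        $misses++ unless exists $dictionary{$word};
    }
    return $misses;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, {
    value  => \&count_misses_value,
    exists => \&count_misses_exists,
});
```

Swapping in the real dictionary and a slice of the real corpus for @words and @sample should show whether the difference is large enough to matter next to the cost of reading and tokenising the text.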

Alright, I think that covers it. Any advice is very much appreciated! Thank you.


In reply to Advice for optimizing lookup speed in gigantic hashes by tobek
