tobek has asked for the wisdom of the Perl Monks concerning the following question:

Hey everyone,

Deep within a Perl program I'm writing I need to implement a spellcheck dictionary (not one that returns suggestions, just one that says if something is spelled correctly or not). Rather than use something external I thought I'd make it in Perl, because I'm already there, it should be pretty simple, and I'll have more control over it so I can make sure it's fast (ideally).

My dictionary is constant and has about 100,000 words in it, and I'm spell-checking about one terabyte of text, so I want it to be as fast as possible. Is a very simple hash and if ($dictionary{$word}) going to be the fastest way? I ran such a dictionary over a gigabyte of text on my computer, checking every word, and it took five minutes in total. This works out to about 3.5 days on the whole 1 TB corpus, and that will have to be on top of the several days worth of processing I already need to do. My dissertation is due in about two weeks, so this is valuable time...

I have plenty of ideas, but I'm newish to Perl and I don't want to waste time benchmarking 10 different approaches if half of them are stupid and I've missed the best approach anyway. Is using exists any faster than checking the value? Would it be faster to have 26 separate hashes for each letter of the alphabet, arranged in an array, hash, subroutine, whatever? Since my dictionary is constant, are there any simple ways to make perfect (or closer to perfect) hash functions in Perl? What about using a tree? Any other clever ideas?

Alright, I think that covers it. Any advice is very much appreciated! Thank you.

Replies are listed 'Best First'.
Re: Advice for optimizing lookup speed in gigantic hashes
by moritz (Cardinal) on Aug 23, 2011 at 05:49 UTC
Re: Advice for optimizing lookup speed in gigantic hashes
by BrowserUk (Patriarch) on Aug 23, 2011 at 02:35 UTC
    I ran such a dictionary over a gigabyte of text on my computer, and it took five minutes to check every word. This works out to about 3.5 days on the whole 1 TB corpus,

    It seems unlikely that your slow timings are due to the hash lookup. Far more likely to be down to how you are breaking your data into words?


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
      Whoops! That sure was ambiguous! It didn't mean that it took 5 minutes to check every word. I meant: checking every word, it took five minutes to process a gig of text. I guess I could have just said "It took five minutes to process a gig of text." I've updated my question =)

        By way of demonstration. Looking up 9000 words in a hash is 6X faster than splitting those same 9000 words out of a string:

        $s = 'the quick brown fox jumps over the lazy dog' x 1000;; ++$h{ $_ } for qw[ the quick brown fox jumps over the lazy];; cmpthese -1,{ a => q[ my @words = split ' ', $s;], b => q[ my $n=0; $h{$_} && ++$n for (qw[ the quick brown fox jumps over the lazy dog ]) x 1000; ] };; Rate a b a 95.4/s -- -85% b 644/s 575% --

        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
        It didn't mean that it took 5 minutes to check every word.

        I understood you.

        My point was that it inevitably will take longer to split a line of input into words than it does to look those words up in a hash. So, if the overall time taken is too long, you should be looking at how to split the words rather than how to look them up once you;ve split them.


        Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
        "Science is about questioning the status quo. Questioning authority".
        In the absence of evidence, opinion is indistinguishable from prejudice.
    A reply falls below the community's threshold of quality. You may see it by logging in.