If you were doing multiple lookups per run of the program, then I would expect storing the full lexicon in a hash to help significantly, since the program would only need to read the full file once, and 8M isn't really all that much memory these days. If you're only doing one lookup per run, though, it will probably make things slower, since it would always have to read the whole file rather than stopping as soon as it finds a match.
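To make the trade-off concrete, here is a minimal sketch of both approaches. It is in Python rather than Perl, with a set standing in for the hash, and it assumes a lexicon file with one word per line. The function names `load_lexicon` and `scan_for_word` are made up for illustration.

```python
def load_lexicon(path):
    """Read the whole file once and keep every word in memory.

    Pays the full read cost up front; each later lookup is a hash probe.
    """
    with open(path, encoding="utf-8") as f:
        return {line.rstrip("\n") for line in f}


def scan_for_word(path, word):
    """Scan the file line by line and stop at the first match.

    No up-front cost, but every lookup rereads the file from the start.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.rstrip("\n") == word:
                return True  # early exit: the rest of the file is never read
    return False
```

With many lookups per run you would call `load_lexicon` once and test `word in lexicon` each time. With a single lookup per run, `scan_for_word` reads on average only half the file when the word is present.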
The earlier comment regarding spell/grammar checkers was spot-on. If you can find any information on how they function, it would probably be highly relevant to your problem.
For a more general solution, a database seems to me like your best bet, whether a 'real' database (Postgres, MySQL, etc.) or just a tied/dbm hash.
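As a rough illustration of the dbm route, the sketch below uses Python's standard `dbm` module, which plays much the same role as a tied dbm hash in Perl. It assumes one word per line in the source file. The helper names `build_dbm` and `in_dbm` are hypothetical, and the stored value `b"1"` is just a placeholder since only the keys matter here.

```python
import dbm


def build_dbm(dict_path, db_path):
    """One-time conversion: store every word as a key in an on-disk dbm file."""
    # Flag "n" always creates a new, empty database.
    with open(dict_path, encoding="utf-8") as src, dbm.open(db_path, "n") as db:
        for line in src:
            word = line.strip()
            if word:
                db[word.encode("utf-8")] = b"1"  # value is an arbitrary placeholder


def in_dbm(db_path, word):
    """Look a word up without loading the whole lexicon into memory."""
    with dbm.open(db_path, "r") as db:
        return word.encode("utf-8") in db
```

The build step costs about as much as one full read of the text file, but after that a single lookup per run no longer has to scan or load the entire lexicon.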
If you really need to work directly off of a plain text file for some reason, you could index it to get at least some of the improvement that a database would bring. Sort the text file (it's probably already sorted, being a dictionary, but I mention it just to be sure), then build a separate index file containing the offset in the dictionary of the first word beginning with each letter. To look up a word, seek to the offset for its first letter, read and process lines from there, and stop when you hit a line that starts with a different letter. That way you never search through words that start with the wrong letter, which effectively reduces your dictionary size substantially.
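A minimal sketch of that first-letter index follows, again in Python. It assumes a sorted file of lowercase ASCII words, one per line. The JSON format for the index file and the names `build_letter_index` and `lookup` are my own choices for illustration, not anything prescribed above.

```python
import json


def build_letter_index(dict_path, index_path):
    """Record the byte offset of the first word starting with each letter."""
    index = {}
    with open(dict_path, "rb") as f:  # binary mode so offsets are exact bytes
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break  # end of file
            if not line.strip():
                continue  # skip blank lines
            first = chr(line[0])
            if first not in index:
                index[first] = offset
    with open(index_path, "w") as out:
        json.dump(index, out)
    return index


def lookup(dict_path, index_path, word):
    """Seek to the word's first-letter section and scan only that section."""
    with open(index_path) as f:
        index = json.load(f)
    if not word or word[0] not in index:
        return False  # no words start with this letter
    target = word.encode("utf-8")
    with open(dict_path, "rb") as f:
        f.seek(index[word[0]])
        for line in f:
            if chr(line[0]) != word[0]:
                break  # reached the next letter's section
            if line.rstrip(b"\r\n") == target:
                return True
    return False
```

Because the file is sorted, you could also stop as soon as a line compares greater than the target word, or index on the first two letters to narrow each section further, at the cost of a larger index file.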