Re^2: reading dictionary file -> morphological analyser

If you are doing multiple lookups per run of the program, then I would expect storing the full lexicon in a hash to help out significantly, since the file only needs to be read once and 8MB isn't really all that much memory these days. If you're only doing one lookup per run, though, it will probably make things slower, since it would always need to read in the full file rather than stopping as soon as it finds a match.
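To make the trade-off concrete, here is a rough sketch in Python (the same shape applies to a hash in any language). The one-entry-per-line, tab-separated `word<TAB>analysis` layout and the names `load_lexicon` and `scan_for` are my own assumptions for illustration, not anything taken from your actual file.

```python
import io

def load_lexicon(lines):
    """Read every entry into a dict: one full pass, then fast lookups."""
    lexicon = {}
    for line in lines:
        # Assumed layout: word, a tab, then the analysis.
        word, _, analysis = line.rstrip("\n").partition("\t")
        lexicon[word] = analysis
    return lexicon

def scan_for(lines, target):
    """Single lookup: stop reading as soon as the word turns up."""
    for line in lines:
        word, _, analysis = line.rstrip("\n").partition("\t")
        if word == target:
            return analysis
    return None

# Tiny in-memory stand-in for the dictionary file.
SAMPLE = "apple\tN\nbake\tV\ncat\tN\n"

lexicon = load_lexicon(io.StringIO(SAMPLE))      # pays the full read once
many = [lexicon.get(w) for w in ("cat", "apple", "dog")]
single = scan_for(io.StringIO(SAMPLE), "apple")  # stops at the first line
```

With many lookups the cost of `load_lexicon` is spread across all of them; with a single lookup, `scan_for` reads about half the file on average when the word is present.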

The earlier comment regarding spell/grammar checkers was spot-on. If you can find any information on how they function, it would probably be highly relevant to your problem.

For a more general solution, a database seems to me like your best bet, whether a 'real' database (Postgres, MySQL, etc.) or just a tied/dbm hash.
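As a rough illustration of the dbm route, here is a sketch using Python's standard `dbm` module, which plays roughly the role a tied hash would. The function names, the sample entries, and the file name `lexicon` are made up for the example; the point is only that the database is built once and each later run opens it and fetches one key without reading the whole lexicon.

```python
import dbm
import os
import tempfile

def build_db(path, entries):
    """One-time step: store each (word, analysis) pair on disk."""
    with dbm.open(path, "c") as db:
        for word, analysis in entries:
            # dbm stores bytes, so encode explicitly.
            db[word.encode("utf-8")] = analysis.encode("utf-8")

def lookup(path, word):
    """Per-run step: open the database read-only and fetch one key."""
    with dbm.open(path, "r") as db:
        value = db.get(word.encode("utf-8"))
        return value.decode("utf-8") if value is not None else None

# Demo in a throwaway directory; a real setup would keep the file around.
with tempfile.TemporaryDirectory() as tmp:
    db_path = os.path.join(tmp, "lexicon")
    build_db(db_path, [("apple", "N"), ("bake", "V"), ("cat", "N")])
    found = lookup(db_path, "bake")
    missing = lookup(db_path, "dog")
```

Which on-disk format you get depends on the dbm backend available on your system, so treat the file as something to rebuild from the text dictionary rather than to ship around.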

If you really need to work directly off of a plain text file for some reason, you could index it to get at least some of the improvement that a database would bring. Sort the text file (it's probably already sorted, being a dictionary, but I mention it just to be sure) and then build a separate index file containing the offset in the dictionary of the first word beginning with each letter. To look a word up, seek to the offset for its first letter, read and process lines from there, and stop when you hit a line that starts with a different letter. That way you never search through words that start with the wrong letter, which effectively reduces your dictionary size substantially.
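The first-letter index described above could be sketched like this. Several things here are assumptions on my part: the file is sorted, uses tab-separated `word<TAB>analysis` lines, and is read in binary mode so that offsets are plain byte positions (which also means the "letter" is really the first byte of each line). For brevity the index is kept in an in-memory dict; writing it out to a separate index file is left out.

```python
import os
import tempfile

def build_index(dict_path):
    """Map each first byte to the offset of the first line starting with it."""
    index = {}
    with open(dict_path, "rb") as f:
        while True:
            offset = f.tell()
            line = f.readline()
            if not line:
                break
            # Only the first occurrence of each initial is recorded.
            index.setdefault(line[:1], offset)
    return index

def lookup(dict_path, index, word):
    """Seek to the word's initial and scan only that block of lines."""
    target = word.encode("utf-8")
    first = target[:1]
    if first not in index:
        return None
    with open(dict_path, "rb") as f:
        f.seek(index[first])
        for line in f:
            if line[:1] != first:
                break  # left the block for this initial: word is absent
            entry, _, analysis = line.rstrip(b"\n").partition(b"\t")
            if entry == target:
                return analysis.decode("utf-8")
    return None

# Demo on a tiny sorted file in a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "dict.txt")
    with open(path, "wb") as out:
        out.write(b"apple\tN\nask\tV\nbake\tV\ncat\tN\ncut\tV\n")
    idx = build_index(path)
    hit = lookup(path, idx, "cut")
    miss = lookup(path, idx, "bat")
```

If the blocks for common letters are still too large, the same idea extends to indexing on the first two letters.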
