If you were doing multiple lookups per run of the program, then I would expect storing the full lexicon in a hash to help out significantly, since it would only need to read in the full file once and 8M isn't really all that much memory these days. If you're only doing one lookup per run, though, it will probably make things slower, since it would always need to read in the full file rather than stopping once it finds a match.
The earlier comment regarding spell/grammar checkers was spot-on. If you can find any information on how they function, it would probably be highly relevant to your problem.
For a more general solution, a database seems to me like your best bet, whether a 'real' database (Postgres, MySQL, etc.) or just a tied/dbm hash.
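As a rough illustration of the tied/dbm option, here is a minimal sketch using SDBM_File, which ships with Perl. The file location, the sub names build_lexicon and in_lexicon, and the sample words are all made up for the example. SDBM also limits the size of each key/value pair, which should not matter for single words, but DB_File or a real database may suit you better.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl;                    # O_RDONLY, O_RDWR, O_CREAT
use SDBM_File;                # dbm implementation bundled with Perl
use File::Temp qw(tempdir);

# Hypothetical location for the dbm files; a real program would use a fixed path.
my $dir    = tempdir(CLEANUP => 1);
my $dbfile = "$dir/lexicon";

# One-time build step: load the word list into the on-disk hash.
sub build_lexicon {
    my ($file, @words) = @_;
    tie my %lex, 'SDBM_File', $file, O_RDWR | O_CREAT, 0666
        or die "Cannot tie $file: $!";
    $lex{$_} = 1 for @words;
    untie %lex;
}

# Per-run lookup: open the existing dbm files and test a single key,
# without reading the whole lexicon into memory.
sub in_lexicon {
    my ($file, $word) = @_;
    tie my %lex, 'SDBM_File', $file, O_RDONLY, 0666
        or die "Cannot tie $file: $!";
    my $found = exists $lex{$word} ? 1 : 0;
    untie %lex;
    return $found;
}

# Sample words only; the real build would read the full lexicon file.
build_lexicon($dbfile, qw(foo prefoo bar baz));

for my $input (qw(foo fooed prefoo)) {
    printf "%-7s %s\n", $input,
        in_lexicon($dbfile, $input) ? 'found' : 'not found';
}
```

The build step only has to run when the lexicon changes, so a single lookup per run no longer pays for reading the whole file.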
If you really need to work directly off of a plain text file for some reason, you could index it to get at least some of the improvement that a database would bring. Sort the text file (it's probably already sorted, being a dictionary, but I mention it just to be sure) and then build a separate index file containing the offset in the dictionary of the first word beginning with each letter. By seeking to that position before reading and processing lines, and stopping when you hit a line that starts with a different letter, you avoid searching through any words that start with the wrong letter, which effectively reduces your dictionary size substantially.
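The per-letter index could look something like the sketch below. It assumes a sorted dictionary with one word per line. For brevity it keeps the index in an in-memory hash rather than writing it to a separate index file, and the sub names build_index and lookup and the sample words are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Build the index: byte offset of the first word for each initial letter.
# Assumes the dictionary is sorted, one word per line.
sub build_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Cannot open $path: $!";
    my %offset;
    my $pos = tell($fh);                  # offset of the line about to be read
    while (my $line = <$fh>) {
        my $letter = substr($line, 0, 1);
        $offset{$letter} = $pos unless exists $offset{$letter};
        $pos = tell($fh);
    }
    close $fh;
    return \%offset;
}

# Look up a word by seeking to its letter's block and scanning only that block.
sub lookup {
    my ($path, $index, $word) = @_;
    my $letter = substr($word, 0, 1);
    return 0 unless exists $index->{$letter};   # no words with this letter
    open my $fh, '<', $path or die "Cannot open $path: $!";
    seek($fh, $index->{$letter}, 0) or die "Cannot seek in $path: $!";
    my $found = 0;
    while (my $line = <$fh>) {
        chomp $line;
        last if substr($line, 0, 1) ne $letter; # left this letter's block
        if ($line eq $word) { $found = 1; last }
    }
    close $fh;
    return $found;
}

# Small stand-in dictionary so the sketch runs on its own.
my ($out, $dict) = tempfile(UNLINK => 1);
print {$out} "$_\n" for qw(bar baz foo fooed prefoo);
close $out;

my $index = build_index($dict);
for my $input (qw(foo fooen prefoo)) {
    printf "%-7s %s\n", $input,
        lookup($dict, $index, $input) ? 'found' : 'not found';
}
```

To match the description above more closely, the offsets could be written out once to a small index file and read back at startup, so the dictionary itself is only scanned in full when it changes.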
Just to elaborate a bit on the hash approach, here's a very simple
example of how you would populate the hash and then look up some value(s):
#!/usr/bin/perl
use strict;
use warnings;

my %dict; # the hash
while (my $line = <DATA>) {
    chomp $line;
    $dict{$line}++; # instead of ++ you could also assign some value...
}
my @inputs = qw(
    foo
    fooed
    fooen
    prefoo
    postfoo
);
for my $input (@inputs) {
    print "found '$input' in lexicon\n" if exists $dict{$input};
}
# I'm using the special DATA filehandle here to be able to inline it...
# That would be your DICTE handle supplying all the precomputed 385090 lines
__DATA__
foo
prefoo
bar
baz
...
BTW, am I understanding you correctly, that what you mean by 'analyse'
is essentially to check whether some given $input is found in the lexicon?