in reply to reading dictionary file -> morphological analyser

salutations, and thank you for the responses. almut: the computer has 1GB of memory and a 2.4GHz processor. we have already tested both ways: conjugating on the fly and reading from a lexicon that contains all the inflected forms. we want the analyser to analyse only one input, the $input variable, using either of those two approaches (or any other one that anyone may suggest).

note: we have already tested it with one language's dictionary, which has 16008 roots. for each root, the irregular base used for conjugating is given, as well as the grammatical class (so the analyser knows whether to conjugate or decline). this root dictionary is 499 kilobytes and 16008 lines, one word per line. verbs have 3 persons, 2 numbers and 8 tenses (thus 48 forms), and conjugation is, basically, adding a simple ending depending on person, number and tense/mood. nouns have 8 cases and 2 numbers (thus 16 forms); they take some changes in the root, which are handled by a substitution regular expression. the whole lexicon, including roots and inflected forms, is 8 megabytes and 385090 lines.

as both you and dk suggested "hashes", we will look into what a hash is and how it is used, and see whether we can use one in our case. if hashes work (and if we understood your suggestion correctly), we will post the result here. any more suggestions, or any more information needed? thank you in advance.
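to give an idea, here is a very simplified sketch of what we mean by conjugating on the fly; the roots, the endings table and the noun rule below are invented placeholders, not the real data from our dictionary:

use strict;
use warnings;

# Invented verb endings, keyed by person.number.tense (placeholders only).
my %verb_endings = (
    '1.sg.pres' => 'o',
    '2.sg.pres' => 'as',
    '3.sg.pres' => 'at',
);

# Verbs: attach the ending for the requested person/number/tense to the root.
sub conjugate {
    my ($root, $person, $number, $tense) = @_;
    return $root . $verb_endings{"$person.$number.$tense"};
}

# Nouns: adjust the root with a substitution regex, then add the case ending.
sub decline {
    my ($root, $case_ending) = @_;
    $root =~ s/a$/e/;    # made-up root change
    return $root . $case_ending;
}

print conjugate('cant', 3, 'sg', 'pres'), "\n";   # prints "cantat"
print decline('luna', 'is'), "\n";                # root change fires: prints "luneis"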

Re^2: reading dictionary file -> morphological analyser
by dsheroh (Monsignor) on Jul 17, 2007 at 15:31 UTC
    If you were doing multiple lookups per run of the program, then I would expect storing the full lexicon in a hash to help significantly, since the program would only need to read the full file once, and 8M isn't really all that much memory these days. If you're only doing one lookup per run, though, it will probably make things slower, since it would always have to read the full file rather than stopping once it finds a match.

    The earlier comment regarding spell/grammar checkers was spot-on. If you can find any information on how they function, it would probably be highly relevant to your problem.

    For a more general solution, it seems to me that a database would be your best bet, whether a 'real' database (Postgres, MySQL, etc.) or just a tied/dbm hash.
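    Just to sketch the tied-hash variant (the file name lexicon.db, and the assumption that the lexicon has been loaded into it once beforehand, are mine for illustration), a single-lookup run could look something like this:

use strict;
use warnings;
use Fcntl;      # supplies O_RDONLY
use DB_File;    # ties a hash to an on-disk DB file

# Assumes the 385090-line lexicon has already been loaded once into
# 'lexicon.db' (e.g. by a separate one-off script that ties the file
# read-write and does $dict{$_}++ for every chomped line).
my %dict;
tie %dict, 'DB_File', 'lexicon.db', O_RDONLY, 0644, $DB_HASH
    or die "Cannot open lexicon.db: $!";

my $input = shift @ARGV;
print "found '$input' in lexicon\n" if exists $dict{$input};

untie %dict;

    The point is that the dbm file is built only once; every run after that does a single disk-backed key lookup instead of scanning the 8M text file.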

    If you really need to work directly off of a plain text file for some reason, you could index it to get at least some of the improvement that a database would bring: Sort the text file (it's probably already sorted, being a dictionary, but I mention it just to be sure) and then build a separate index file containing the offset in the dictionary for the first word beginning with each letter. By seeking to that position in the file before reading and processing lines and stopping when you hit a line that starts with a different letter, you can avoid searching through any words that start with the wrong letter, effectively reducing your dictionary size substantially.
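    A rough sketch of that indexing idea, assuming the sorted lexicon is in lexicon.txt and a companion index file lexicon.idx holds one "letter byte-offset" pair per line (both file names and the index format are invented for illustration):

use strict;
use warnings;

my $input = shift @ARGV;
my $first = substr($input, 0, 1);

# Read the small index file: one "letter byte-offset" pair per line.
open my $idx, '<', 'lexicon.idx' or die "Cannot open lexicon.idx: $!";
my %offset;
while (my $line = <$idx>) {
    my ($letter, $pos) = split ' ', $line;
    $offset{$letter} = $pos;
}
close $idx;

# Jump straight to the words starting with the same letter as the input.
open my $dict, '<', 'lexicon.txt' or die "Cannot open lexicon.txt: $!";
my $start = exists $offset{$first} ? $offset{$first} : 0;
seek $dict, $start, 0;

while (my $line = <$dict>) {
    chomp $line;
    last if substr($line, 0, 1) ne $first;    # left that letter's section: stop
    if ($line eq $input) {
        print "found '$input' in lexicon\n";
        last;
    }
}
close $dict;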

Re^2: reading dictionary file -> morphological analyser
by almut (Canon) on Jul 17, 2007 at 15:56 UTC

    Just to elaborate a bit on the hash approach, here's a very simple example of how you would populate the hash and then look up some value(s):

#!/usr/bin/perl

my %dict;    # the hash

while (my $line = <DATA>) {
    chomp $line;
    $dict{$line}++;    # instead of ++ you could also assign some value...
}

my @inputs = qw( foo fooed fooen prefoo postfoo );

for my $input (@inputs) {
    print "found '$input' in lexicon\n" if exists $dict{$input};
}

# I'm using the special DATA filehandle here to be able to inline it...
# That would be your DICTE handle supplying all the precomputed 385090 lines

__DATA__
foo
prefoo
bar
baz
...

    BTW, am I understanding you correctly, that what you mean by 'analyse' is essentially to check whether some given $input is found in the lexicon?