pc2 has asked for the wisdom of the Perl Monks concerning the following question:

salutations, we are new to this site, and it appears to be very good. we are also very new to perl (we have been using it for about a week). here is the problem: we want to build a morphological analyser for a language (which language doesn't matter here, and neither does how we obtained the dictionary). we use a dictionary which is plain text: each line represents one word, and each line holds the grammatical data of that single word, separated by ';' (semicolon). the idea is: the dictionary (for example, dict.txt) is opened on a filehandle, DICTE. each line is then read in a while loop and split on semicolons to extract the grammatical information. each word and word form of the dictionary is then compared to the user's input. here is a sample of the code (keep in mind that $lang holds a word of the dictionary, that $irreg and $clss hold, respectively, its irregular forms and its grammatical class: verb, noun, etc., and that $input is the word to be analysed):
open DICTE, "dict.txt" or die "Cannot open dict.txt: $!";
if (length($input) > 0) {
    print "<p><b>" . $input . "</b></p>";
    while (<DICTE>) {
        chomp;
        # get the grammatical information stored in one line of the dictionary
        ($english, $lang, $irreg, $clss) = split(/;/, $_);
        # if the input equals the word in the dictionary, print it along with its translation
        if ($input eq $lang) { print "<p>$english - $lang, $clss</p>"; }
    }
}
we have already tested code similar to the one above (besides checking the word itself, it also splits the $irreg variable on commas to check whether any of the irregular forms of the word equals the user input). besides the "if ($input eq $lang)" check, it also contains an engine that conjugates $lang in all tenses and moods of the language, to check whether the input equals a conjugated form. for that, it calls a conjugating function, like this:
open DICTE, "dict.txt" or die "Cannot open dict.txt: $!";
if (length($input) > 0) {
    print "<p><b>" . $input . "</b></p>";
    while (<DICTE>) {
        chomp;
        # get the grammatical information stored in one line of the dictionary
        ($english, $lang, $irreg, $clss) = split(/;/, $_);
        # if the input equals the word in the dictionary, print it along with its translation
        if ($input eq $lang) { print "<p>$english - $lang, $clss</p>"; }
        # if the input equals a conjugated form, print it;
        # conj("$word;$tense;$person;$number") is a function that conjugates
        # the verb, given those specific pieces of information
        if (conj("$lang;present;1;singular") eq $input) { print "<p>$english - $lang, $clss</p>"; }
        if (conj("$lang;present;2;singular") eq $input) { print "<p>$english - $lang, $clss</p>"; }
        # ... and so on, for every tense, person and number.
    }
}
great, so far. the problem with this code is that, as the number of possible conjugated forms grows, and if it is also necessary to check for prefixes or suffixes, the analysis takes very long when the dictionary is big (specifically, 16008 words). so, we tried it in 2 ways: 1. for every word in the dictionary, inflect it in all possible ways and compare each form with the user input (the code displayed above); 2. the dictionary already contains all the conjugated and declined forms, so the analyser just compares each line with the user input, with no need to decline/conjugate each word. the problem with the first one is that it gets too slow when the analyser itself has to conjugate each word before comparing it, since it is an inflectional language with many inflected forms. the problem with the second one is that it gets too slow when the lexicon (root words plus inflected forms) is very big (like 385090 lines). we are concentrating on the second technique. it seems that the more information one line holds, the slower the reading of each line becomes. so, which do you prefer: many lines containing little information each, or fewer lines containing a lot of information each? any thoughts on this? do you know any other way to make a faster analyser? thank you in advance, Paulo Marcos Durand Segal & Claudio Marcos Durand Segal.

Re: reading dictionary file -> morphological analyser
by almut (Canon) on Jul 17, 2007 at 14:06 UTC
    ...so, which do you prefer: many lines containing little information each, or fewer lines containing a lot of information each?

    To be able to give any useful advice, it would help to know the exact context that this program is going to be used in. For example, are you intending to call it once for every input item? How many input items are you planning to look up? Etc.

    Also, as to speed, it's hard to make a precise prediction without knowing all the details (such as hardware being used, complexity of the conjugations, etc.). However, my gut feeling is that precomputing everything will be marginally faster, in particular if you need to compare more than one input item against the full set of words. Generally, if your machine has fast I/O, slurping in the whole precomputed dictionary might not be a problem, while, if you have a fast processor, computing the conjugations / inflections on-the-fly might still be faster after all...   If I were you, I would just benchmark the different approaches, and go with whatever turns out to be faster.
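    For the benchmarking itself, the core Benchmark module does most of the work. A minimal sketch (the two subs are only placeholders standing in for your two strategies, not real lookup code):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use Benchmark qw(cmpthese);

    # placeholder subs -- drop your "conjugate on the fly" and
    # "scan the precomputed lexicon" lookups in here
    sub lookup_on_the_fly  { my $x = 0; $x += $_ for 1 .. 1_000; }
    sub lookup_precomputed { my $x = 0; $x += $_ for 1 .. 500;   }

    # run each variant for at least 5 CPU seconds and print a comparison
    cmpthese( -5, {
        on_the_fly  => \&lookup_on_the_fly,
        precomputed => \&lookup_precomputed,
    } );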

    In case you need to do many lookups, it's probably a good idea to use a hash to store all the words (as keys).  If the words aren't unique, i.e. if conjugations of different base words may lead to the same final form (and if you don't want to lose the info where the final form originated from), you might want to accumulate that info in the hash elements' values... In addition to using a hash, it might help to read in the whole data structure once, and keep the process running persistently. I.e., you might consider using some client/server architecture.
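    To make the accumulation idea concrete, here's a rough sketch, assuming the precomputed lexicon keeps the English gloss, the inflected form and the class on each line (the exact field layout is just an assumption):

    #!/usr/bin/perl
    use strict;
    use warnings;

    my %lexicon;
    open my $dict, '<', 'lexicon.txt' or die "Cannot open lexicon.txt: $!";
    while (my $line = <$dict>) {
        chomp $line;
        my ($english, $form, $clss) = split /;/, $line;
        # different base words may yield the same inflected form,
        # so accumulate every origin under that form
        push @{ $lexicon{$form} }, "$english ($clss)";
    }
    close $dict;

    my $input = 'someword';
    if (exists $lexicon{$input}) {
        print "$input: $_\n" for @{ $lexicon{$input} };
    }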

    Generally, the whole thing very much sounds like the problem a typical spell/grammar checker has to solve (if I understood your task correctly, that is). I'm personally no expert in such algorithms, but I'm sure many people have put much thought into optimising these things... In other words, that's where I would start looking for publications, etc. (e.g. scientific ones).   Good luck! And welcome here, BTW.

Re: reading dictionary file -> morphological analyser
by dk (Chaplain) on Jul 17, 2007 at 14:14 UTC
    Scanning every line will never scale. What you need is text indexing (see, for example, discussions 310891 and 249393), or simply to load the whole, fully inflected lexicon into a hash, so lookups become very cheap (at the expense of memory, though). If memory then becomes a problem, you might consider stuffing the lexicon into a database.
Re: reading dictionary file -> morphological analyser
by BrowserUk (Patriarch) on Jul 17, 2007 at 19:57 UTC

    Using a hash will speed things up, but unless you are using some persistent environment (mod_perl etc), then reloading the hash every time will still be quite slow.

    If moving your dictionary into a database is not immediately possible for any reason, then you can speed things up a lot by simply splitting it into several files based on the first letter, so that you have 26 files named dict_A.txt, dict_B.txt, etc. That would (on average) make your program take less than 5% of the time it currently takes:

    if (length($input) > 0) {
        print "<p><b>" . $input . "</b></p>";
        my $firstLetter = uc substr( $input, 0, 1 );
        open DICTE, "dict_" . $firstLetter . ".txt" or die $!;
        while (<DICTE>) {
            chomp;
            ($english, $lang, $irreg, $clss) = split(/;/, $_);
            if ($input eq $lang) {
                print "<p>$english - $lang, $clss</p>";
            }
            if (conj("$lang;present;1;singular") eq $input) {
                print "<p>$english - $lang, $clss</p>";
            }
            if (conj("$lang;present;2;singular") eq $input) {
                print "<p>$english - $lang, $clss</p>";
            }
        }
    }

    And if that is still too slow, then you can divide the dictionary into smaller pieces by using the first two letters: dict_AA.txt, dict_AB.txt, etc., which (on average) should reduce the search time to ~0.2% of the current time.

    Though there are probably few words beginning with ZZ... or JJ..., etc., so the speed-up won't be exactly evenly distributed.

    Moving your data into a simple tied DB like BerkeleyDB or similar, is a better long term solution, but this might help you out while you get to grips with Perl, hashes, modules and the rest.
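    For what it's worth, here's a minimal sketch of the tied-DB idea using the core DB_File module (the file names and the field layout are assumptions). The build step runs once; afterwards each lookup only touches the pages it needs instead of rescanning the whole dictionary:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DB_File;
    use Fcntl qw(O_CREAT O_RDWR);

    # one-time build step: copy the flat file into a disk-based hash
    tie my %dict, 'DB_File', 'dict.db', O_CREAT|O_RDWR, 0666, $DB_HASH
        or die "Cannot tie dict.db: $!";

    open my $src, '<', 'dict.txt' or die "Cannot open dict.txt: $!";
    while (<$src>) {
        chomp;
        my ($english, $lang, $irreg, $clss) = split /;/;
        $dict{$lang} = "$english;$irreg;$clss";
    }
    close $src;

    # later lookups need only the tie, not a full scan
    my $input = 'someword';
    print "$input => $dict{$input}\n" if exists $dict{$input};

    untie %dict;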


    Examine what is said, not who speaks -- Silence betokens consent -- Love the truth but pardon error.
    "Science is about questioning the status quo. Questioning authority".
    In the absence of evidence, opinion is indistinguishable from prejudice.
Re: reading dictionary file -> morphological analyser
by pc2 (Beadle) on Jul 17, 2007 at 15:08 UTC
    salutations, thank you for the responses. almut: the computer has 1GB of memory and a 2.4GHz processor. we have already tested both ways: conjugating on the fly, and reading from a lexicon with all the inflected forms. we want the analyser to analyse only one input, the $input variable, by either of those two approaches (or any other one that anyone may suggest). note: we have already tested it with a language's dictionary which has 16008 roots. for each root, the irregular base for conjugation is given, as well as the grammatical class (thus the analyser knows whether to conjugate or decline). this root dictionary has 499 kilobytes and 16008 lines (words). there are 3 persons, 2 numbers and 8 tenses (thus 48 forms) for verbs (conjugation is, basically, adding a simple ending depending on person, number and tense/mood), and 8 cases and 2 numbers (thus 16 forms) for nouns; nouns take some changes in the root, which are accomplished by a substitution regular expression. the whole lexicon has 8 megabytes and 385090 lines, including roots and inflected forms. as both you and dk suggested hashes, we will look into what they are and how they are used, and see if we can use them in our case. in case hashes work (and in case we understood your suggestion correctly), we will post the result here. any more suggestions, or any more information needed? thank you in advance.
      If you were doing multiple lookups per run of the program, then I would expect storing the full lexicon in a hash to help out significantly, since it would only need to read in the full file once and 8M isn't really all that much memory these days. If you're only doing one lookup per run, though, it will probably make things slower, since it would always need to read in the full file rather than stopping once it finds a match.

      The earlier comment regarding spell/grammar checkers was spot-on. If you can find any information on how they function, it would probably be highly relevant to your problem.

      For more general solutions, this seems to me like a database would be your best bet, whether a 'real' database (Postgres, MySQL, etc.) or just a tied/dbm hash.

      If you really need to work directly off of a plain text file for some reason, you could index it to get at least some of the improvement that a database would bring: Sort the text file (it's probably already sorted, being a dictionary, but I mention it just to be sure) and then build a separate index file containing the offset in the dictionary for the first word beginning with each letter. By seeking to that position in the file before reading and processing lines and stopping when you hit a line that starts with a different letter, you can avoid searching through any words that start with the wrong letter, effectively reducing your dictionary size substantially.
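      A rough sketch of that index-and-seek idea, assuming the dictionary is sorted on the inflected form and that form is the second ';'-separated field (in a real script the %offset map would be written to the separate index file once, rather than rebuilt on every run as it is here):

      #!/usr/bin/perl
      use strict;
      use warnings;

      # pass 1: record the byte offset of the first line for each letter
      my %offset;
      open my $dict, '<', 'dict.txt' or die "Cannot open dict.txt: $!";
      my $pos = tell $dict;
      while (my $line = <$dict>) {
          my $letter = lc substr( (split /;/, $line)[1], 0, 1 );
          $offset{$letter} = $pos unless exists $offset{$letter};
          $pos = tell $dict;
      }

      # lookup: seek straight to the block for the input's first letter
      my $input  = 'someword';
      my $letter = lc substr $input, 0, 1;
      if ( defined $offset{$letter} ) {
          seek $dict, $offset{$letter}, 0;
          while (my $line = <$dict>) {
              chomp $line;
              my ($english, $lang, $irreg, $clss) = split /;/, $line;
              last if lc substr( $lang, 0, 1 ) ne $letter;   # left the block
              print "$english - $lang, $clss\n" if $lang eq $input;
          }
      }
      close $dict;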

      Just to elaborate a bit on the hash approach, here's a very simple example of how you would populate the hash and then look up some value(s):

      #!/usr/bin/perl
      my %dict;   # the hash

      while (my $line = <DATA>) {
          chomp $line;
          $dict{$line}++;   # instead of ++ you could also assign some value...
      }

      my @inputs = qw( foo fooed fooen prefoo postfoo );

      for my $input (@inputs) {
          print "found '$input' in lexicon\n" if exists $dict{$input};
      }

      # I'm using the special DATA filehandle here to be able to inline it...
      # That would be your DICTE handle supplying all the precomputed 385090 lines
      __DATA__
      foo
      prefoo
      bar
      baz
      ...

      BTW, am I understanding you correctly, that what you mean by 'analyse' is essentially to check whether some given $input is found in the lexicon?

Re: reading dictionary file -> morphological analyser
by dogz007 (Scribe) on Jul 17, 2007 at 18:52 UTC
    From the context of the first block of code in your post, I assume that your code is acting as a CGI (or equivalent) script, which means you already have a web server environment. If this is indeed the case, you may benefit from a more server-like solution to your dictionary lookup problem. Your current flat-file dictionary must be read entirely each time it is accessed (or at least read until the desired entry is found). If instead you stored it using a relational database manager such as MySQL (recommended only because it is free and I don't know much about any others), then your lookup times should decrease dramatically. Then simply use the Perl DBI interface to query the database.
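    In case it helps, here is a bare-bones sketch of such a lookup through DBI (the connection details, table name 'lexicon' and its columns are made up for illustration):

    #!/usr/bin/perl
    use strict;
    use warnings;
    use DBI;

    # hypothetical table 'lexicon' with columns english, form, clss
    my $dbh = DBI->connect( 'DBI:mysql:database=dict;host=localhost',
                            'user', 'password', { RaiseError => 1 } );

    my $sth = $dbh->prepare('SELECT english, clss FROM lexicon WHERE form = ?');

    my $input = 'someword';
    $sth->execute($input);
    while ( my ($english, $clss) = $sth->fetchrow_array ) {
        print "<p>$english - $input, $clss</p>";
    }

    $dbh->disconnect;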
Re: reading dictionary file -> morphological analyser
by jhourcle (Prior) on Jul 18, 2007 at 13:58 UTC
    the problem with this code is that, as the number of possible conjugated forms grows, and if it is also necessary to check for prefixes or suffixes, the analysis takes very long when the dictionary is big (specifically, 16008 words).

    You may wish to look into a concept called 'stemming'. Basically, as the program reads in terms, it removes prefixes, suffixes, numeric quantifiers, etc., to get just the 'stem' or base of the word. You then compare just the base word, rather than all of the possible permutations.

    This may also reduce the number of terms that you're tracking in your dictionary. As you're trying to be language agnostic, this may be more difficult, as the rules for stemming will be dependent upon the language. Also, English is a notoriously difficult language to do this in, as its terms come from multiple languages, and tend to keep the rules of the original language.
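    Just to illustrate the idea (this is not a real stemmer, and the endings are invented for the example): strip known endings first, then look up only what remains:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # invented endings, purely for illustration -- a real stemmer would
    # encode the actual affix rules of the language in question
    my @endings = qw( ed en ing s );

    sub stem {
        my ($word) = @_;
        for my $ending (@endings) {
            return $1 if $word =~ /^(.+)\Q$ending\E$/;
        }
        return $word;   # no known ending: the word is its own stem
    }

    my %roots = map { $_ => 1 } qw( foo bar walk );   # toy root dictionary
    for my $input (qw( walked walking foos bar )) {
        my $stem = stem($input);
        print "$input -> $stem", $roots{$stem} ? " (found)\n" : " (not found)\n";
    }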

Re: reading dictionary file -> morphological analyser
by Anonymous Monk on Jul 18, 2007 at 11:40 UTC
    salutations, thank you for the responses. based on the responses we got, we tried several approaches: we tried to create a Microsoft Access database and access it via Win32::OLE, but it didn't seem to be convenient (because we would have to convert the txt to mdb whenever the dictionary was altered), and we also tried other ways, like putting a line of the dict file into the hash only when it matches the user input (but that wouldn't be very convenient for multiple-word inputs, which we plan to implement). so, we concentrated on the hash approach, which seemed to be the best one, since retrieving the hash value for a given key is very fast, even for a file with a very large number of lines. the only problem was generating the hash from the dict file, line by line. so, we found the "Storable" module (is this name correct?). this way, the hash only needs to be generated once, stored in a file by the store() function, and then retrieved from that file:
    #!c:\perl\bin\perl
    use Storable;   # use this for calling the hash storing/retrieving functions.

    my %dict;   # the hash

    while (my $line = <DATA>) {
        chomp $line;
        $dict{$line}++;   # instead of ++ you could also assign some value...
    }

    store(\%dict, "hash.txt");           # store the hash in the file.
    %dict = %{ retrieve("hash.txt") };   # retrieve the hash from the file "hash.txt".
                                         # thus, it only needs to be generated once.

    my @inputs = qw( foo fooed fooen prefoo postfoo );

    for my $input (@inputs) {
        print "found '$input' in lexicon\n" if exists $dict{$input};
    }

    __DATA__
    foo
    prefoo
    bar
    baz
    by using the very fast hash, retrieved from a previously generated file, it works much faster. so, again, thank you for the responses; they were very helpful for finding a good approach to our problem. note: we are not professional linguists nor professional programmers (we can't work yet, anyway, because of our age); we do this just because we like it (both the programming part and the linguistic part). by the way, we are from Brazil, and we are twins (which explains the "we"s). salutations.
      just to inform: the message above by "Anonymous Monk" is ours; we had just forgotten to log in when we sent it.
Re: reading dictionary file -> morphological analyser
by pc2 (Beadle) on Jul 18, 2007 at 19:54 UTC
    salutations. jhourcle, we couldn't use the stemming technique because the language in question involves root changes and many irregular forms. BrowserUk, dividing the dictionary might be a good solution, but it would not be very convenient for the way we are planning to process the user input (since there would be multi-word inputs). we implemented a solution using Berkeley DB:
    #!c:/perl/bin/perl.exe
    print "Content-type: text/html\n\n";

    use BerkeleyDB;

    my $filename = "dict";
    my $db = new BerkeleyDB::Hash
                 -Filename => $filename,
                 -Flags    => DB_CREATE
        or die "Cannot open file $filename: $! $BerkeleyDB::Error\n";
    getting the value by a given key is done by:
    $db->db_get("word", $lang); #the word is stored in the $lang variable.
    and it works very fast. so, in that DICTE filehandle code, instead of storing values in a %dict hash, we store them in the database with $db->db_put($english, $lang). this database creation only has to be done when the dictionary changes, but it is very slow. so, now the question is: is there a command-line Windows program that can convert flat-file databases to Berkeley DB databases? thank you in advance.
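    for reference, the bulk-load loop described above might look roughly like this (assuming the same english;lang;irreg;clss layout as in the first post); as noted, it only has to run when the dictionary changes:

    #!/usr/bin/perl
    use strict;
    use warnings;
    use BerkeleyDB;

    my $db = BerkeleyDB::Hash->new( -Filename => 'dict.db',
                                    -Flags    => DB_CREATE )
        or die "Cannot open dict.db: $! $BerkeleyDB::Error\n";

    open my $dict, '<', 'dict.txt' or die "Cannot open dict.txt: $!";
    while (my $line = <$dict>) {
        chomp $line;
        my ($english, $lang, $irreg, $clss) = split /;/, $line;
        $db->db_put( $english, $lang );   # key on the English word, as described above
    }
    close $dict;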
      Part of the utility of a database is that you can do updates to it, rather than regenerating it every time there's a change. If your language changes, say from Spanish to Russian, then yes, you'd have to swap out the database. If you're working in one language, though, I would suspect most of the words stay the same, and you'll only be making additions and corrections over time.

      I would suggest learning to leverage the advantages that using a database gives you. Whether you stick with Berkeley DB or use something SQL-centered, you should be looking to do updates to the database instead of regenerating it on a regular basis.
Re: reading dictionary file -> morphological analyser
by pc2 (Beadle) on Jul 20, 2007 at 23:32 UTC
    salutations. we really wanted to stick with the BerkeleyDB solution, and we found a way to do it: we generate the dict file in TXT format and then convert it to a Berkeley DB database by running the db_load utility at the command line. for this, besides the BerkeleyDB module for Perl, we also installed the Windows installer of Berkeley DB from http://www.oracle.com/technology/software/products/berkeley-db/index.html. then, we discovered that the command (at the command line)
    db_load -c duplicates=1 -T -t hash -f dict.txt dict.db
    converts "dict.txt" (with keys and values separated by a newline character, and each pair of lines being a record) to the BerkeleyDB database "dict.db", allowing duplicate keys. this solution turned out to work great. thank you for all the help.