Salutations. We are new to this site, and it appears to be very good. We are also very new to Perl (we have been using it for about a week). Here is our question: we want to write a morphological analyser for a language (which language it is does not matter here, nor does how we obtained the dictionary). We use a dictionary in plain text: each line represents one word and holds that word's grammatical data, separated by semicolons (';'). The idea is this: the dictionary (for example, dict.txt) is opened on a filehandle, DICTE. Each line is then read in a while loop and split on the semicolons to extract the grammatical information, so that each word and word form in the dictionary can be compared with the user's input. Here is some sample code (keep in mind that $lang is a word of the dictionary, $irreg and $clss are, respectively, its irregular forms and its grammatical class (verb, noun, etc.), and $input is the word to be analysed):
    open(DICTE, '<', 'dict.txt') or die "Cannot open dict.txt: $!";
    if (length($input) > 0) {
        print "<p><b>" . $input . "</b></p>";
        while (<DICTE>) {
            chomp;
            # Get the grammatical information stored in one line of the dictionary.
            ($english, $lang, $irreg, $clss) = split(/;/, $_);
            # If the input equals the word in the dictionary,
            # print it along with its translation.
            if ($input eq $lang) {
                print "<p>$english - $lang, $clss</p>";
            }
        }
    }
    close(DICTE);
We have already tested code similar to the above. Besides checking the word itself, it also splits the $irreg variable on commas to check whether any irregular form of the word equals the user's input. In addition to the "if ($input eq $lang)" check, it contains an engine that conjugates $lang in all tenses and moods of the language, to check whether the input equals a conjugated form. For that it calls a conjugating function, like this:
    open(DICTE, '<', 'dict.txt') or die "Cannot open dict.txt: $!";
    if (length($input) > 0) {
        print "<p><b>" . $input . "</b></p>";
        while (<DICTE>) {
            chomp;
            # Get the grammatical information stored in one line of the dictionary.
            ($english, $lang, $irreg, $clss) = split(/;/, $_);
            # If the input equals the word in the dictionary,
            # print it along with its translation.
            if ($input eq $lang) {
                print "<p>$english - $lang, $clss</p>";
            }
            # conj("$word;$tense;$person;$number") is a function that conjugates
            # the verb, given that specific information.
            # If the input equals a conjugated form, print it.
            if (conj("$lang;present;1;singular") eq $input) {
                print "<p>$english - $lang, $clss</p>";
            }
            if (conj("$lang;present;2;singular") eq $input) {
                print "<p>$english - $lang, $clss</p>";
            }
            # ... and so on for every tense, person and number.
        }
    }
    close(DICTE);
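The irregular-form check mentioned above (splitting $irreg on commas and comparing each piece with the input) is not shown in the sample. A minimal sketch of it might look like the following; matches_irregular() is a hypothetical helper name, and the dictionary line is made-up placeholder data in the english;lang;irreg;class layout:

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original code): returns 1 if $input
# equals one of the comma-separated forms in $irreg, otherwise 0.
sub matches_irregular {
    my ($input, $irreg) = @_;
    return 0 unless defined $irreg && length $irreg;
    for my $form (split /,/, $irreg) {
        $form =~ s/^\s+|\s+$//g;    # tolerate spaces around the commas
        return 1 if $form eq $input;
    }
    return 0;
}

# Made-up placeholder line: english;lang;irreg;class
my $line = "walk;foo;fie,fum;verb";
my ($english, $lang, $irreg, $clss) = split /;/, $line;

for my $input (qw(fum baz)) {
    if (matches_irregular($input, $irreg)) {
        print "<p>$english - $lang, $clss (irregular form: $input)</p>\n";
    }
}
```

Inside the while loop, this check would sit next to the "if ($input eq $lang)" test.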
So far, so good. The problem with this code is that, as the number of possible conjugated forms grows, and if it is also necessary to check for prefixes or suffixes, the analysis takes very long when the dictionary is big (specifically, 16008 words). So we tried it in two ways:

1. For every word in the dictionary, inflect it in all possible ways and compare each form with the user's input (the code displayed above).
2. The dictionary already contains all the conjugated and declined forms, so the analyser compares each line with the user's input, with no need to decline or conjugate each word.

The problem with the first approach is that it gets too slow when the analyser itself has to conjugate each word in order to compare it, since this is an inflectional language with many inflected forms. The problem with the second is that it gets too slow when the lexicon (root words plus inflected forms) is very big (about 385090 lines). We are concentrating on the second technique. It seems that the more information a line holds, the slower each line is to read.

So, what do you prefer: many lines containing little information each, or fewer lines containing a lot of information each? Any thoughts on this? Do you know any other way to make a faster analyser?

Thank you in advance,
Paulo Marcos Durand Segal & Claudio Marcos Durand Segal.
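For the second technique, one possible direction is to stop scanning the file for every input and instead read it once into a hash keyed by word form, so that each lookup is a single hash access. The sketch below is only an illustration under assumptions: it assumes the english;lang;irreg;class line layout from the sample code, assumes the lexicon fits in memory, and uses hypothetical names (build_index, lookup) and made-up placeholder data.

```perl
use strict;
use warnings;

# Hypothetical helper: read the dictionary once and map every surface form
# (the word itself plus its comma-separated irregular forms) to its entries.
sub build_index {
    my ($fh) = @_;
    my %index;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my ($english, $lang, $irreg, $clss) = split /;/, $line;
        next unless defined $lang;
        my @forms = ($lang);
        push @forms, split(/,/, $irreg) if defined $irreg && length $irreg;
        for my $form (@forms) {
            push @{ $index{$form} },
                { english => $english, lang => $lang, clss => $clss };
        }
    }
    return \%index;
}

# Hypothetical helper: return all entries whose form equals $input.
sub lookup {
    my ($index, $input) = @_;
    return @{ $index->{$input} || [] };
}

# Demo on an in-memory "file" with made-up placeholder data.
my $data = "walk;foo;fie,fum;verb\nhouse;bar;;noun\n";
open(my $fh, '<', \$data) or die "Cannot open in-memory data: $!";
my $index = build_index($fh);
close($fh);

for my $hit (lookup($index, 'fum')) {
    print "<p>$hit->{english} - $hit->{lang}, $hit->{clss}</p>\n";
}
```

The gain only appears when the index is built once and then queried many times. If the script starts fresh for every request, the file is still read in full each time, in which case an on-disk hash (for instance a DBM file) may be worth considering.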

In reply to reading dictionary file -> morphological analyser by pc2
