Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have questions of this type in a file (part-of-speech tagged): <s>/SYM Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN... ?/. </s>/SYM.
I extracted just the tags to make a formula (e.g. WP+VBZ+DT+NN+IN+DT+NN..) and stored it in a different file, to compare with some patterns (rules) I have in another file.

while ($question = <in>) {
    while ($question =~ /([a-zA-Z0-9.-_]+)\/([A-Z]+)/g) {
        $word = $1;
        $tag  = $2;
        print OUT "$tag" . "/";   # tags go to the formula file
        print "$word" . "/";      # words go to STDOUT
    }
    print OUT "\n";
}

1. How do I compare one file to another: line by line, or do I have to store all the lines in an array?
2. After a formula matches one of the patterns (rules), how do I retrieve just the values of the tags (WP -> Who...) and display them? Do I need a hash table? Until now I have not used any kind of array, just files.

Thanks. Tita

Replies are listed 'Best First'.
Re: How will I retrieve values from a POS-tagged question
by djantzen (Priest) on Jul 26, 2002 at 17:33 UTC

    To clarify: you've got a list of word/tag pairs, where the word may consist of alphanumeric characters, '.', '-', or '_', and the tag must be one or more uppercase letters. (Is '<s>/SYM' a valid combination too? What does it mean?) You want to extract the words and tags and write the tags out to a file for comparison to other tag (rules) files, and the comparison involves looking for an exact match of the tag sequence. Upon finding a match in tags, or sentence structure, you want to report the relevant sentence or sentences. Is this correct?

    If so, I think it's best done with a hash of arrays, where the keys in the hash are the sequence of tags for a particular sentence, and the values are arrays of the corresponding sentences that have that structure. You'll need to use arrays as the values because hashes guarantee the uniqueness of individual keys and in the instance of two grammatically identical sentences (according to your tag schema) the newer entry will overwrite the older.

    Take the file you are currently processing and create a hash like this:

    my %pairs = ();
    foreach my $tag_sequence (@sequences) {
        # $sentence is the sentence that produced $tag_sequence
        push @{ $pairs{$tag_sequence} }, $sentence;
    }

    Where $tag_sequence is the concatenation of tags for a given sentence, such as 'WP+VBZ+DT+NN+IN+DT+NN', @sequences is an array of such concatenations, and $sentence is 'Who is the author of the book?' (or 'Who+is+the+author+of+the+book+?', etc.). (Check out Dereferencing a hash reference to a Hash of Arrays for explanation of nested structures)
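
    As a fuller illustration of the hash of arrays described above, here is a minimal sketch that builds it straight from tagged lines. The helper name parse_tagged and the two sample lines are made up for the example, and it assumes that tags are plain uppercase letters (so the final ?/. is left out of the formula) and that the <s> and </s> markers should be skipped.

```perl
use strict;
use warnings;

# Hypothetical helper: split one tagged line into its tag formula and
# its plain words. Assumes tokens look like word/TAG with TAG in [A-Z]+.
sub parse_tagged {
    my ($line) = @_;
    my (@words, @tags);
    foreach my $token (split ' ', $line) {
        # Skip tokens whose tag is not uppercase letters (e.g. "?/.").
        my ($word, $tag) = $token =~ m{^(.+)/([A-Z]+)$} or next;
        # Skip the sentence boundary markers.
        next if $word eq '<s>' or $word eq '</s>';
        push @words, $word;
        push @tags,  $tag;
    }
    return (join('+', @tags), join(' ', @words));
}

# Illustrative input only; real data would come from the tagged file.
my @lines = (
    '<s>/SYM Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN ?/. </s>/SYM',
    '<s>/SYM Who/WP is/VBZ the/DT owner/NN of/IN the/DT car/NN ?/. </s>/SYM',
);

# tag formula => [ sentences with that structure ]
my %pairs;
foreach my $line (@lines) {
    my ($tag_sequence, $sentence) = parse_tagged($line);
    push @{ $pairs{$tag_sequence} }, $sentence;
}

print "$_ => ", join(' | ', @{ $pairs{$_} }), "\n" for sort keys %pairs;
```

    Both sample sentences share one formula, so they end up in the same array instead of the second overwriting the first.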

    When you do your comparison against another tag (rule) file, use each line you examine to do a key lookup in your hash. This will return a reference to an array of sentences that have that structure.

    while (my $rule = <in>) {
        chomp $rule;
        my $sentences = $pairs{$rule};   # array reference, or undef if nothing matches
        # Do what you want with the data returned
    }
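
    A slightly more complete sketch of that lookup, including the case where a rule matches nothing. The contents of %pairs and the rule lines are invented for the example, and the rules are read from an in-memory string only so that the snippet runs on its own; with real data you would open the rules file instead.

```perl
use strict;
use warnings;

# Illustrative data: tag formula => [ sentences with that structure ]
my %pairs = (
    'WP+VBZ+DT+NN+IN+DT+NN' => [
        'Who is the author of the book',
        'Who is the owner of the car',
    ],
);

# Stand-in for the rules file, one formula per line.
my $rules_text = "WP+VBZ+DT+NN+IN+DT+NN\nWRB+VBZ+DT+NN\n";
open my $in, '<', \$rules_text or die "Cannot open in-memory rules: $!";

my @report;
while (my $rule = <$in>) {
    chomp $rule;
    next unless length $rule;          # ignore blank lines
    if (my $sentences = $pairs{$rule}) {
        # $sentences is a reference to the array of matching sentences
        push @report, "$rule: $_" for @$sentences;
    }
    else {
        push @report, "$rule: no match";
    }
}
close $in;

print "$_\n" for @report;
```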

    Hope that helps.

      Thank you, Fever. You got it right! I want to retrieve the sentences (questions) and also their possible answers.

      Sorry I did not mention it before, but the file with my rules (patterns) should be associated with another file holding one or more possible answers. Thus, when a formula matches a rule, I have to retrieve the sentences (questions) and the answer(s) associated with that rule.

      I will still use a hash of arrays, won't I? The only thing is that I will have duplicate keys if there is more than one answer for a rule. Can I have duplicates? Or will I have, for one key, all the answers on one line separated by a delimiter?

      Thanks again.
      Tita

        Sorry I did not mention it before, but the file with my rules (patterns) should be associated with another file holding one or more possible answers. Thus, when a formula matches a rule, I have to retrieve the sentences (questions) and the answer(s) associated with that rule.

        Now, only the question is described by the rule, right? Are the question and answer to be stored in the same file or in different ones? For the moment I'll assume a total of two files, one for rules and one for question/answer pairs.

        Okay, let's break the problem into two parts: data storage and data manipulation.

        For storage your options are rather open, with the restriction that you need a trustworthy correlation between a rule in one file and the sentences which the rule describes in another. Thus, for any data set such as <s>/SYM Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN... ?/. </s>/SYM, you'll have two files each containing different subsets of the data, namely, tags and sentences. This will work fine as long as your files don't get tampered with, because you'll be depending on the order in which data appears to know which question properly belongs to each rule. If you're concerned about this, you could supply an index for each entry so that rule 0 corresponds to question/answer pair 0 in your other file. This is still far from unbreakable, but it's a little better. (As an aside, consider looking at something like DB_File if your data collection is going to get large.)
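
        To make the index idea concrete, here is a rough sketch. It assumes a line layout of index, tab, payload in both files; that layout, the variable names, and the placeholder answers are all invented for the example. Because the join goes through the index, the order of lines in either file no longer matters.

```perl
use strict;
use warnings;

# Hypothetical file contents: "index<TAB>payload" on every line.
my @rule_lines   = ("0\tWP+VBZ+DT+NN+IN+DT+NN", "1\tWRB+VBZ+DT+NN");
my @answer_lines = ("1\tanswer C", "0\tanswer A", "0\tanswer B");

# index => rule
my %rule_for;
foreach my $line (@rule_lines) {
    my ($index, $rule) = split /\t/, $line, 2;
    $rule_for{$index} = $rule;
}

# rule => [ answers ], joined through the shared index
my %answers_for;
foreach my $line (@answer_lines) {
    my ($index, $answer) = split /\t/, $line, 2;
    my $rule = $rule_for{$index};
    unless (defined $rule) {
        warn "No rule with index $index; skipping '$answer'\n";
        next;
    }
    push @{ $answers_for{$rule} }, $answer;
}

foreach my $rule (sort keys %answers_for) {
    print "$rule => ", join(' | ', @{ $answers_for{$rule} }), "\n";
}
```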

        Now as to the data structure for doing your actual look ups; yes, I still think a hash of arrays is a good place to start. You'll need the arrays to handle cases of multiple questions/answers per rule since hashes eliminate duplicate keys. Of course, in your text files you can have as many duplicate entries as you want because they're just text files! Probably you'll end up slurping both files into arrays and then combining them into a hash using some code along the lines I provided in my first post. Then as you run through a list of rules for which you wish to find question/answer pairs, you just have to do the hash lookup.
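
        One possible shape for the combined structure, as a sketch only: each rule maps to a small hash holding an array of questions and an array of answers, so several answers per rule never collide as duplicate keys. The key names 'questions' and 'answers', the lookup helper, and the sample data are assumptions made for the example.

```perl
use strict;
use warnings;

# Illustrative input, already slurped and parsed into [ rule, text ] pairs.
my @questions = (
    [ 'WP+VBZ+DT+NN+IN+DT+NN', 'Who is the author of the book' ],
    [ 'WP+VBZ+DT+NN+IN+DT+NN', 'Who is the owner of the car'   ],
);
my @answers = (
    [ 'WP+VBZ+DT+NN+IN+DT+NN', 'answer A' ],
    [ 'WP+VBZ+DT+NN+IN+DT+NN', 'answer B' ],
);

# rule => { questions => [...], answers => [...] }
my %by_rule;
push @{ $by_rule{ $_->[0] }{questions} }, $_->[1] for @questions;
push @{ $by_rule{ $_->[0] }{answers}   }, $_->[1] for @answers;

# Hypothetical helper: returns two array references (questions, answers),
# or an empty list when the rule is unknown.
sub lookup {
    my ($rule) = @_;
    my $entry = $by_rule{$rule} or return;
    return ( $entry->{questions} || [], $entry->{answers} || [] );
}

if (my ($qs, $as) = lookup('WP+VBZ+DT+NN+IN+DT+NN')) {
    print "Q: $_\n" for @$qs;
    print "A: $_\n" for @$as;
}
```

        Storing the answers in an array also answers the delimiter question: there is no need to pack them into one delimited string in memory, even if the text file stores them that way.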

        Good luck!

Re: How will I retrieve values from a POS-tagged question
by Tita (Initiate) on Jul 26, 2002 at 21:20 UTC
    By the way, "<s>/SYM"... "</s>/SYM" is just a format that surrounds each sentence (question) indicating where it starts and ends.

    Tita