in reply to How will I retrieve values from a POS-tagged question

To clarify: you've got a list of word/tag pairs, where the word may contain alphanumeric characters, '.', '-', or '_', and the tag must consist of uppercase letters. (Is '<s>/SYM' a valid combination too? What does it mean?) You want to extract the words and tags and write the tags out to a file for comparison with other tag (rule) files, and the comparison involves looking for an exact match of the tag sequence. Upon finding a match in tags, that is, in sentence structure, you want to report the relevant sentence or sentences. Is this correct?

If so, I think it's best done with a hash of arrays, where the keys in the hash are the sequences of tags for particular sentences, and the values are arrays of the corresponding sentences that have that structure. You'll need to use arrays as the values because hash keys are unique: if you stored a single sentence per key, then in the case of two grammatically identical sentences (according to your tag schema) the newer entry would overwrite the older.
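
As a minimal sketch of the difference (the hash names and the second sentence are made up for illustration), compare storing a plain string per key with pushing onto an array per key:

```perl
use strict;
use warnings;

my $tags = 'WP+VBZ+DT+NN+IN+DT+NN';

# Plain scalar values: the second assignment replaces the first.
my %flat;
$flat{$tags} = 'Who is the author of the book?';
$flat{$tags} = 'What is the name of the song?';

# Array-reference values: both sentences are kept under the same key.
my %pairs;
push @{ $pairs{$tags} }, 'Who is the author of the book?';
push @{ $pairs{$tags} }, 'What is the name of the song?';

print "flat:  $flat{$tags}\n";
print "pairs: ", scalar(@{ $pairs{$tags} }), " sentences\n";
```

Only the last sentence survives in %flat, while %pairs keeps both.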

Take the file you are currently processing and create a hash like this:

    my %pairs = ();
    foreach my $tag_sequence (@sequences) {
        # $sentence must hold the sentence that produced $tag_sequence
        push @{ $pairs{$tag_sequence} }, $sentence;
    }

Where $tag_sequence is the concatenation of tags for a given sentence, such as 'WP+VBZ+DT+NN+IN+DT+NN', @sequences is an array of such concatenations, and $sentence is 'Who is the author of the book?' (or 'Who+is+the+author+of+the+book+?', etc.). (See "Dereferencing a hash reference to a Hash of Arrays" for an explanation of nested structures.)
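
One way to get from a tagged line to a key and a sentence might look like the sketch below. It assumes one tagged sentence per line, tokens written as word/TAG, and that the <s> and </s> markers (tag SYM) should be dropped; @tagged_lines is a hypothetical stand-in for your input. Note that the final punctuation is kept as a token here, so the key ends in '+.'.

```perl
use strict;
use warnings;

# Hypothetical input: one tagged sentence per line, tokens written as word/TAG.
my @tagged_lines = (
    '<s>/SYM Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN ?/. </s>/SYM',
);

my %pairs;
foreach my $line (@tagged_lines) {
    my (@words, @tags);
    foreach my $token (split ' ', $line) {
        # Split on the last slash, so '</s>/SYM' gives word '</s>' and tag 'SYM'.
        my ($word, $tag) = $token =~ m{^(.*)/([^/]+)$} or next;
        next if $tag eq 'SYM';    # assumption: sentence markers are not wanted
        push @words, $word;
        push @tags,  $tag;
    }
    my $tag_sequence = join '+', @tags;
    my $sentence     = join ' ', @words;
    push @{ $pairs{$tag_sequence} }, $sentence;
}

foreach my $tag_sequence (sort keys %pairs) {
    print "$tag_sequence => $_\n" for @{ $pairs{$tag_sequence} };
}
```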

When you do your comparison against another tag (rule) file, use each line you examine to do a key lookup in your hash. This will return a reference to an array of sentences that have that structure.

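    while (my $rule = <IN>) {    # IN is the filehandle opened on the rule file
        chomp $rule;             # chomp modifies $rule in place; its return value is not the line
        my $sentences = $pairs{$rule};    # array reference, or undef if nothing matches
        # Do what you want with the data returned
    }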
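
Here is a self-contained sketch of that lookup. The contents of %pairs and the rule lines are invented, and an in-memory filehandle stands in for the real rule file so the example runs on its own.

```perl
use strict;
use warnings;

# Assumed to have been built earlier from the tagged file.
my %pairs = (
    'WP+VBZ+DT+NN+IN+DT+NN' => [ 'Who is the author of the book?' ],
);

# Stand-in for the rule file: an in-memory filehandle, one rule per line.
my $rule_text = "WP+VBZ+DT+NN+IN+DT+NN\nWRB+VBZ+DT+NN\n";
open my $in, '<', \$rule_text or die "Cannot open rules: $!";

my @report;
while (my $rule = <$in>) {
    chomp $rule;
    # No entry means no stored sentence has this structure.
    my $sentences = $pairs{$rule} or next;
    push @report, "$rule => $_" for @$sentences;
}
close $in;

print "$_\n" for @report;
```

The second rule has no entry in %pairs, so it is skipped rather than reported.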

Hope that helps.

Re: Re: How will I retrieve values from a POS-tagged question
by Anonymous Monk on Jul 26, 2002 at 20:56 UTC
    Thank you Fever. You got it right! I want to retrieve the sentences (questions) and also their possible answers.

    Sorry I did not mention it before, but the file with my rules (patterns) should be associated with another file containing one or more possible answers. Thus, when a Formula-Rule matches (is true), I have to retrieve the sentences (questions) and the answers associated with that rule.

    I will still use a hash of arrays, won't I? The only thing is that I will have duplicate keys if there is more than one answer for a rule. Can I have duplicates? Or, for one key, will I have the answers on one line, separated by a delimiter?

    Thanks again.
    Tita

      Sorry I did not mention it before, but the file with my rules (patterns) should be associated with another file containing one or more possible answers. Thus, when a Formula-Rule matches (is true), I have to retrieve the sentences (questions) and the answers associated with that rule.

      Now, only the question is described by the rule, right? Are the question and answer to be stored in the same file or in different ones? For the moment I'll assume a total of two files, one for rules and one for question/answer pairs.

      Okay, let's break the problem into two parts: data storage and data manipulation.

      For storage your options are rather open, with the restriction that you need a trustworthy correlation between a rule in one file and the sentences which that rule describes in the other. Thus, for any data set such as <s>/SYM Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN... ?/. </s>/SYM, you'll have two files, each containing a different subset of the data, namely tags and sentences. This will work fine as long as your files don't get tampered with, because you'll be depending on the order in which data appears to know which question belongs to which rule. If you're concerned about this, you could supply an index for each entry so that rule 0 corresponds to question/answer pair 0 in your other file. This is still far from unbreakable, but it's a little better. (As an aside, consider looking at something like DB_File if your data collection is going to get very large.)
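
      The index idea could be sketched like this. The record layout (index, a tab, then the data) and all names are assumptions for illustration, not a required format.

```perl
use strict;
use warnings;

# Hypothetical file contents, one record per line: index, a tab, then the data.
my @rule_lines = ("0\tWP+VBZ+DT+NN+IN+DT+NN", "1\tWRB+VBZ+DT+NN");
my @qa_lines   = ("1\tWhere is the library?", "0\tWho is the author of the book?");

# Key each record by its leading index, so line order no longer matters.
sub index_lines {
    my %by_id;
    foreach my $line (@_) {
        my ($id, $data) = split /\t/, $line, 2;
        $by_id{$id} = $data;
    }
    return %by_id;
}

my %rule_for = index_lines(@rule_lines);
my %qa_for   = index_lines(@qa_lines);

my %pairs;
foreach my $id (sort { $a <=> $b } keys %rule_for) {
    unless (exists $qa_for{$id}) {
        warn "No question/answer entry for rule $id\n";
        next;
    }
    push @{ $pairs{ $rule_for{$id} } }, $qa_for{$id};
}
```

      Even though the second file is out of order here, each question still ends up under the right rule.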

      Now, as to the data structure for doing your actual lookups: yes, I still think a hash of arrays is a good place to start. You'll need the arrays to handle cases of multiple questions/answers per rule, since a hash cannot hold duplicate keys. Of course, in your text files you can have as many duplicate entries as you want because they're just text files! Probably you'll end up slurping both files into arrays and then combining them into a hash using some code along the lines I provided in my first post. Then, as you run through a list of rules for which you wish to find question/answer pairs, you just have to do the hash lookup.
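
      A rough sketch of that combining step follows. It assumes entry N of the rules array goes with entry N of the question/answer array, and that each question/answer line holds the question, a tab, then the answers separated by '|'. That layout, the placeholder answers, and the variable names are all made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical contents of the two slurped files.
my @rules = ('WP+VBZ+DT+NN+IN+DT+NN', 'WP+VBZ+DT+NN+IN+DT+NN');
my @qa    = (
    "Who is the author of the book?\tanswer A|answer B",
    "What is the name of the song?\tanswer C",
);

my %pairs;
for my $i (0 .. $#rules) {
    my ($question, $answer_field) = split /\t/, $qa[$i], 2;
    my @answers = split /\|/, $answer_field;
    # Duplicate rules simply add another entry to the same array.
    push @{ $pairs{ $rules[$i] } }, { question => $question, answers => \@answers };
}

# Look up one rule and list every question and answer filed under it.
for my $entry (@{ $pairs{'WP+VBZ+DT+NN+IN+DT+NN'} || [] }) {
    print "$entry->{question}\n";
    print "    $_\n" for @{ $entry->{answers} };
}
```

      The rule appears twice in the input but only once as a key; both questions, each with its own list of answers, sit in the array behind it.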

      Good luck!