in reply to How will I retrieve values from a POS-tagged question
To clarify: you've got a list of word/tag pairs, where the word may contain any alphanumeric characters, or '.', '-', or '_', and the tag must consist of uppercase characters. (Is '<s>/SYM' a valid combination too? What does it mean?) You want to extract the words and tags and write the tags out to a file for comparison against other tag (rule) files, and the comparison involves looking for an exact match of the tag sequence. Upon finding a match in tags, that is, in sentence structure, you want to report the relevant sentence or sentences. Is this correct?
If so, I think it's best done with a hash of arrays, where the keys in the hash are the sequences of tags for particular sentences, and the values are arrays of the corresponding sentences that have that structure. You'll need to use arrays as the values because hash keys are unique: if you stored each sentence as a plain scalar value, then in the instance of two grammatically identical sentences (according to your tag schema) the newer entry would overwrite the older.
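To make the overwriting point concrete, here is a minimal sketch. The tag sequence and the two sentences are made up for illustration, and %flat is just a hypothetical name for the "plain scalar value" variant.

```perl
use strict;
use warnings;

# Two illustrative sentences that share the same (assumed) tag sequence.
my $tags      = 'WP+VBZ+DT+NN';
my @sentences = ('Who is the author', 'What is the title');

# Plain hash: each assignment replaces whatever was stored under the key.
my %flat;
$flat{$tags} = $_ for @sentences;

# Hash of arrays: each sentence is appended to the array under the key.
my %pairs;
push @{ $pairs{$tags} }, $_ for @sentences;

print "flat:  $flat{$tags}\n";
print "pairs: ", join(' | ', @{ $pairs{$tags} }), "\n";
```

Only the last sentence survives in %flat, while %pairs keeps both.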
Take the file you are currently processing and create a hash like this:
    my %pairs = ();
    foreach my $tag_sequence (@sequences) {
        # Append the sentence to the array stored under this tag sequence
        # (Perl creates the array on first use)
        push @{ $pairs{$tag_sequence} }, $sentence;
    }

Where $tag_sequence is the concatenation of tags for a given sentence, such as 'WP+VBZ+DT+NN+IN+DT+NN', @sequences is an array of such concatenations, and $sentence is the sentence that produced that sequence, such as 'Who is the author of the book?' (or 'Who+is+the+author+of+the+book+?', etc.). (Check out "Dereferencing a hash reference to a Hash of Arrays" for an explanation of nested structures.)
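A fuller sketch of building the hash straight from tagged input might look like the following. It assumes one sentence per line with whitespace-separated word/TAG tokens, which may not match your actual file; split_tagged is a hypothetical helper name, and the sample lines are invented.

```perl
use strict;
use warnings;

# Assumed input format: one sentence per line, tokens written as word/TAG.
my @lines = (
    'Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN',
    'Who/WP is/VBZ the/DT owner/NN of/IN the/DT car/NN',
    'Where/WRB is/VBZ Paris/NNP',
);

# Hypothetical helper: turn one tagged line into (sentence, tag sequence).
sub split_tagged {
    my ($line) = @_;
    my (@words, @tags);
    for my $token (split ' ', $line) {
        # Split on the last '/'; skip tokens that are not word/TAG shaped.
        my ($word, $tag) = $token =~ m{^(.+)/([^/]+)$} or next;
        push @words, $word;
        push @tags,  $tag;
    }
    return (join(' ', @words), join('+', @tags));
}

my %pairs;
for my $line (@lines) {
    my ($sentence, $tag_sequence) = split_tagged($line);
    push @{ $pairs{$tag_sequence} }, $sentence;
}

for my $key (sort keys %pairs) {
    print "$key => ", join(' | ', @{ $pairs{$key} }), "\n";
}
```

Joining the tags with '+' is only one choice of separator; anything that cannot occur inside a tag would do, as long as the rule file uses the same one.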
When you do your comparison against another tag (rule) file, use each line you examine to do a key lookup in your hash. This will return a reference to an array of sentences that have that structure.
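    while (my $rule = <IN>) {    # IN is the already-opened rule file
        chomp $rule;             # chomp returns a count, so strip the newline separately
        my $sentences = $pairs{$rule};    # array reference, or undef if nothing matches
        # Do what you want with the data returned
    }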
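As a rough sketch of the reporting step, the lookup plus dereference could go like this. The %pairs contents and the @rules list are stand-ins for your real data (in practice the rules would come from the rule file), and @report is a hypothetical name.

```perl
use strict;
use warnings;

# Stand-in for the hash of arrays built from the tagged file.
my %pairs = (
    'WP+VBZ+DT+NN' => ['Who is the author', 'What is the title'],
    'WRB+VBZ+NNP'  => ['Where is Paris'],
);

# Stand-in for the lines read from a rule file.
my @rules = ('WRB+VBZ+NNP', 'VB+DT+NN', 'WP+VBZ+DT+NN');

my @report;
for my $rule (@rules) {
    # Skip rules with no match; this also avoids creating empty entries
    # in %pairs when the array is dereferenced in the loop below.
    next unless exists $pairs{$rule};
    for my $sentence (@{ $pairs{$rule} }) {
        push @report, "$rule: $sentence";
    }
}

print "$_\n" for @report;
```

The exists check matters because dereferencing a missing key in a foreach loop would quietly add that key to the hash.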
Hope that helps.