To clarify: you've got a list of word/tag pairs, where the word may contain any alphanumeric character, '.', '-', or '_', and the tag must consist of uppercase letters. (Is '<s>/SYM' a valid combination too? What does it mean?) You want to extract the words and tags and write the tags out to a file for comparison against other tag (rules) files, where the comparison looks for an exact match of the tag sequence. Upon finding a match in tags, that is, in sentence structure, you want to report the relevant sentence or sentences. Is this correct?

If so, I think it's best done with a hash of arrays, where the keys in the hash are the sequence of tags for a particular sentence, and the values are arrays of the corresponding sentences that have that structure. You need arrays as the values because hash keys are unique: if two sentences are grammatically identical (according to your tag schema) and you store each sentence directly as the value, the newer entry will overwrite the older.
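To make the overwrite problem concrete, here is a minimal sketch comparing a plain hash with a hash of arrays. The second sentence and the names %plain and %hoa are made up for illustration.

```perl
use strict;
use warnings;

my $tags = 'WP+VBZ+DT+NN+IN+DT+NN';

# Plain hash: the second assignment replaces the first.
my %plain;
$plain{$tags} = 'Who is the author of the book?';
$plain{$tags} = 'What is the name of the river?';   # hypothetical second sentence

# Hash of arrays: both sentences are kept under the same key.
my %hoa;
push @{ $hoa{$tags} }, 'Who is the author of the book?';
push @{ $hoa{$tags} }, 'What is the name of the river?';

print "plain hash keeps: $plain{$tags}\n";
print "hash of arrays keeps: ", scalar @{ $hoa{$tags} }, " sentences\n";
```

Perl autovivifies the inner array on the first push, so there is no need to create it beforehand.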

Take the file you are currently processing and create a hash like this:

    my %pairs = ();
    foreach my $tag_sequence (@sequences) {
        # $sentence must be the sentence that produced this $tag_sequence
        push @{ $pairs{$tag_sequence} }, $sentence;
    }

Here $tag_sequence is the concatenation of tags for a given sentence, such as 'WP+VBZ+DT+NN+IN+DT+NN', @sequences is an array of such concatenations, and $sentence is 'Who is the author of the book?' (or 'Who+is+the+author+of+the+book+?', etc.). Note that the array to push onto is @{ $pairs{$tag_sequence} }, i.e. the array reference stored in %pairs under that key. (Check out "Dereferencing a hash reference to a Hash of Arrays" for an explanation of nested structures.)
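As a rough sketch of how the tag sequence and the sentence might both be derived in one pass, the following assumes each input line holds whitespace-separated word/TAG pairs. That input format, the sample lines, and the names @tagged_lines, @words and @tags are my assumptions, not something taken from your data.

```perl
use strict;
use warnings;

# Assumed input: one sentence per line, whitespace-separated word/TAG pairs.
my @tagged_lines = (
    'Who/WP is/VBZ the/DT author/NN of/IN the/DT book/NN',
    'What/WP is/VBZ the/DT name/NN of/IN the/DT river/NN',
);

my %pairs;
for my $line (@tagged_lines) {
    my (@words, @tags);
    for my $pair (split ' ', $line) {
        # Split on the last slash; skip anything that is not word/TAG.
        my ($word, $tag) = $pair =~ m{^(.*)/([^/]+)$} or next;
        push @words, $word;
        push @tags,  $tag;
    }
    my $tag_sequence = join '+', @tags;
    my $sentence     = join ' ', @words;
    push @{ $pairs{$tag_sequence} }, $sentence;
}

# Show each structure with the sentences that share it.
for my $tag_sequence (sort keys %pairs) {
    print "$tag_sequence\n";
    print "    $_\n" for @{ $pairs{$tag_sequence} };
}
```

Both sample lines share one structure, so they end up in the same array under a single key.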

When you do your comparison against another tag (rule) file, use each line you examine to do a key lookup in your hash. This will return a reference to an array of sentences that have that structure.

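    # $in is a filehandle opened on the rules file
    while (my $rule = <$in>) {
        chomp $rule;    # chomp returns a count, not the line, so call it separately
        my $sentences = $pairs{$rule};    # array ref, or undef if no sentence matches
        # Do what you want with the data returned
    }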
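A slightly fuller sketch of the lookup, under the assumption that the rules file has one tag sequence per line. The in-memory filehandle stands in for the real file, and the rule strings and the @matched array are invented for the example.

```perl
use strict;
use warnings;

my %pairs = (
    'WP+VBZ+DT+NN+IN+DT+NN' => [ 'Who is the author of the book?' ],
);

# Stand-in for the rules file: an in-memory filehandle with two made-up rules.
my $rules = "WP+VBZ+DT+NN+IN+DT+NN\nWRB+VBZ+DT+NN\n";
open my $in, '<', \$rules or die "Cannot open rules: $!";

my @matched;
while (my $rule = <$in>) {
    chomp $rule;
    # Skip rules that no sentence matches; exists avoids creating empty entries.
    next unless exists $pairs{$rule};
    for my $sentence (@{ $pairs{$rule} }) {
        print "$rule => $sentence\n";
        push @matched, $sentence;
    }
}
close $in;
```

Checking with exists before dereferencing keeps the loop safe under strict and avoids autovivifying a key for every rule that has no match.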

Hope that helps.


In reply to Re: How will I retrieve values from a POS-tagged question by djantzen
in thread How will I retrieve values from a POS-tagged question by Anonymous Monk
