I'm working on tagging a large linguistic corpus
Been there, done that. (Still there, doing it, in fact...)
What I need to do is add a tag around each line (<il> or <df> in the above cases) where the contents of the tag match the two character string at the head of each line:

<il> il yadayada <il>

Might you also happen to be somewhat new to markup languages (e.g. XML)? You may want to double-check what the goal is supposed to be. Many people doing linguistics research prefer to use real XML in their corpus data, and what you proposed is not real XML, even though it has something in common with it (the angle brackets).

There are two things you should consider (maybe ask others in your group/research community to get their suggestions):

  1. The tags you add should be paired like this:
    <tag> text content ... </tag>
    Note the slash character in the second tag that marks the end of the region -- that's required.

  2. If the initial "token" on each line is really a classifier (i.e. an annotation that someone has added to the corpus data, rather than part of the original spoken or written corpus content), then the XML tags ought to replace the classifier, rather than simply being placed around it.

On the second point, I could see wanting to leave the 2-letter code in the line, just to make sure you put the tags in the right way, but there are better ways to validate your process.

If I'm guessing right about what you really should be doing, your regex should just put angle brackets around the initial 2-character token, then make a copy of it at the end of the line with a slash added as needed. Something like this:

s{^(\w{2})(.*)}{<$1>$2 </$1>};
(I chose to use curlies around the regex and replacement, just so I wouldn't have to use a backslash-escape for the slash in the closing tag.)
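
To see the substitution in action, here is a minimal sketch that applies it to a couple of lines and does a cheap sanity check on the result. The sample lines, the tag_line helper, and the check itself are my own assumptions for illustration; your real data and codes may well differ.

```perl
use strict;
use warnings;

# Hypothetical sample lines; the real corpus and its codes may differ.
my @lines = (
    'il yadayada',
    'df some definition text',
);

# Hypothetical helper: apply the substitution to a copy of the line.
sub tag_line {
    my ($line) = @_;
    (my $tagged = $line) =~ s{^(\w{2})(.*)}{<$1>$2 </$1>};
    return $tagged;
}

for my $line (@lines) {
    # The 2-character code must stand alone as the first word.
    my ($code) = $line =~ /^(\w{2})\b/;
    my $tagged = tag_line($line);

    # Sanity check: opening and closing tags must both carry that code.
    warn "unexpected result for: $line\n"
        unless defined $code
            && $tagged =~ m{^<\Q$code\E>.*</\Q$code\E>$};

    print "$tagged\n";
}
```

With the two sample lines this prints "<il> yadayada </il>" and "<df> some definition text </df>". Note that the substitution by itself will also grab the first two characters of a longer word, which is why the check insists on a word boundary after the code and warns about anything else.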

(P.S.: Welcome to the Monastery!)


In reply to Re: tagging question by graff
in thread tagging question by bagerson
