> I'm working on tagging a large linguistic corpus
Been there, done that. (Still there, doing it, in fact...)
> What I need to do is add a tag around each line (<il> or <df> in the above cases) where the contents of the tag match the two character string at the head of each line:
>     <il> il yadayada <il>
Might you happen to be somewhat new to the area of markup languages (i.e. XML) also? You may want to double-check what the goal is supposed to be. Many people doing linguistic-related research would prefer to use real XML in their corpus data, and what you proposed is not real XML, despite having something in common with it (using angle brackets).
There are two things you should consider (maybe ask others in your group/research community to get their suggestions):
    <tag> text content ... </tag>
Note the slash character in the second tag that marks the end of the region -- that's required.
On the second point, I could see wanting to leave the 2-letter code in the line, just to make sure you put the tags in the right way, but there are better ways to validate your process.
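One possible way to validate the result without keeping the 2-letter code in the line is to check that every tagged line opens and closes with the same tag name. This is only a minimal sketch: the helper name is_well_tagged and the sample lines are made up for illustration, and a per-line pattern check like this is not a substitute for running a real XML parser over the whole file.

```perl
use strict;
use warnings;

# Hypothetical check: true when a line is wrapped in one matching pair
# of 2-character tags, e.g. <il> ... </il>.
sub is_well_tagged {
    my ($line) = @_;
    # \1 forces the closing tag name to repeat the opening tag name,
    # and the literal "</" requires the slash in the closing tag.
    return $line =~ m{^<(\w{2})>.*</\1>\s*$} ? 1 : 0;
}

# Sample lines invented for illustration; the second lacks the slash
# in its closing tag, the third closes with the wrong name.
for my $line ("<il> yadayada </il>", "<il> yadayada <il>", "<il> yadayada </df>") {
    printf "%-22s %s\n", $line, is_well_tagged($line) ? "ok" : "BAD";
}
```

Running a check like this over the tagged output flags any line where the substitution did not apply or where the closing tag is malformed.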
If I'm guessing right about what you really should be doing, your regex should just put angle brackets around the initial 2-character token, then make a copy of it at the end of the line with a slash added as needed. Something like this:
    s{^(\w{2})(.*)}{<$1>$2 </$1>};
(I chose to use curlies around the regex and replacement, just so I wouldn't have to use a backslash-escape for the slash in the closing tag.)
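To see what that substitution does, here is a small self-contained sketch that applies it to a couple of lines. The wrapper name tag_line and the sample input are assumptions for illustration, not part of the original question.

```perl
use strict;
use warnings;

# Hypothetical helper: wrap one corpus line in tags named after its
# leading 2-character code.
sub tag_line {
    my ($line) = @_;
    # Same substitution as above. $1 is the 2-character code, $2 is the
    # rest of the line; a line that does not start with two word
    # characters is returned unchanged.
    $line =~ s{^(\w{2})(.*)}{<$1>$2 </$1>};
    return $line;
}

# Sample lines invented for illustration.
print tag_line($_), "\n" for ("il yadayada", "df some definition");
```

Note that \w{2} matches the first two word characters of any line, so a line beginning with a longer word would also be tagged. If the corpus can contain such lines, a stricter pattern (for instance, requiring whitespace after the code) may be worth considering.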
(P.S.: Welcome to the Monastery!)
In reply to Re: tagging question by graff, in thread tagging question by bagerson