Whoever came up with that sort of format for marking part-of-speech in text data should learn about using a proper bracketing markup design instead. In any data set like the sample you showed, a simple slip-up in whitespace (adding or dropping a space character in the wrong place next to a "*", or heaven forbid, ending up with an odd number of "*" characters) could render the file unparsable and very difficult to fix.
XML would be worth looking into for this, or even just labeled parens, like "(STAN con la certeza absoluta) de .. que (VERB_COMPLEX no hay-e+) (SUBJ nadie) (LOC:ST en la casa)" -- anything like this would make the data easier to process, and less prone to simple mistakes that might cause catastrophic damage.
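To give a rough idea of why labeled parens are easier to work with, here is a minimal sketch in Python (not Perl, purely for illustration). The function name `parse_labeled`, the assumption that labels consist only of uppercase letters, "_" and ":", and the assumption that chunks do not nest are all mine, not part of any existing format; adjust them to whatever your data really uses.

```python
import re

# Hypothetical chunk pattern: "(LABEL text...)" with no nested parens.
# Assumption: labels are uppercase letters, "_" and ":" only.
CHUNK = re.compile(r"\(([A-Z_:]+)\s+([^()]*)\)")

def parse_labeled(line):
    """Return a list of (label, text) pairs; label is None for bare text."""
    result = []
    pos = 0
    for m in CHUNK.finditer(line):
        # Text between the previous chunk and this one is unlabeled.
        bare = line[pos:m.start()].strip()
        if bare:
            result.append((None, bare))
        result.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    tail = line[pos:].strip()
    if tail:
        result.append((None, tail))
    # A stray paren left in unlabeled text means a chunk was malformed.
    for label, text in result:
        if label is None and ("(" in text or ")" in text):
            raise ValueError(f"unbalanced or malformed chunk near: {text!r}")
    return result
```

Run on the example line above, this yields five pairs, with "de .. que" coming back as unlabeled text. A missing close paren raises an error that points at the offending text, rather than silently shifting every later label the way a dropped "*" would.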
(If your goal is to transform the data into some better format, this is an excellent idea, and I wish you the best of luck.)