in reply to regexp not greedy

Whoever came up with that format for marking part-of-speech in text data should learn about proper bracketing markup design. In any data set like the sample you showed, a simple slip-up in white space (adding or dropping a space character in the wrong place next to a "*") or, heaven forbid, an odd number of "*" characters could render the file unparsable and very difficult to fix.

XML would be worth looking into for this, or even just labeled parens, like "(STAN con la certeza absoluta) de .. que (VERB_COMPLEX no hay-e+) (SUBJ nadie) (LOC:ST en la casa)" -- anything like this would make the data easier to process, and less prone to simple mistakes that might cause catastrophic damage.
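To make the "easier to process" point concrete, here is a rough sketch of reading the labeled-paren form. It is in Python only for illustration, and it assumes things the example above does not spell out: labels are runs of uppercase letters, "_" and ":", chunks are never nested, and anything outside parens is unlabeled text. The names `CHUNK` and `parse_chunks` are made up for this sketch.

```python
import re

# Assumed chunk shape: "(LABEL words...)" with no nested parens.
# The label alphabet (A-Z, "_", ":") is a guess based on the one example line.
CHUNK = re.compile(r"\(([A-Z_:]+)\s+([^()]*)\)")

def parse_chunks(line):
    """Return (label, text) pairs; label is None for text outside parens."""
    chunks = []
    pos = 0
    for m in CHUNK.finditer(line):
        outside = line[pos:m.start()].strip()
        if outside:
            chunks.append((None, outside))
        chunks.append((m.group(1), m.group(2).strip()))
        pos = m.end()
    tail = line[pos:].strip()
    if tail:
        chunks.append((None, tail))
    # A stray paren in unlabeled text means a chunk was malformed,
    # so complain loudly instead of silently mis-parsing.
    for label, text in chunks:
        if label is None and ("(" in text or ")" in text):
            raise ValueError(f"unbalanced or unlabeled paren near: {text!r}")
    return chunks

if __name__ == "__main__":
    sample = ("(STAN con la certeza absoluta) de .. que "
              "(VERB_COMPLEX no hay-e+) (SUBJ nadie) (LOC:ST en la casa)")
    for label, text in parse_chunks(sample):
        print(label, "|", text)
```

Because the opening and closing delimiters are different characters, a dropped or extra paren shows up as a local error on that one line rather than quietly shifting the meaning of everything after it.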

(If your goal is to transform the data into some better format, this is an excellent idea, and I wish you the best of luck.)
