ivosan has asked for the wisdom of the Perl Monks concerning the following question:

Hi, I have to grab all instances of prepositional phrases in a corpus in lines that look like:

... *con la certeza absoluta*stan de .. que {no hay-e+} *nadie*s *en la casa*loc:st

The characters between * and *stan comprise a stance adverb, and verb complexes sit between curly brackets. The problem is that there is a subject (between * and *s) followed by the prepositional phrase (between * and *loc:st), and the following expression:

if ($line =~ /\s\*(.*?)\*loc/gimx ){
grabs NOT only the last prepositional phrase: *en la casa*loc:st, but also the subject together with it: *nadie*s *en la casa*loc:st

I am trying to make the regexp non-greedy with the "?" operator, but it doesn't seem to work. Dear Monks, what's wrong?

thank you,

ivosan

Replies are listed 'Best First'.
Re: regexp not greedy
by thedoe (Monk) on Dec 13, 2005 at 20:37 UTC

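    The ? in this context, .*?, makes the quantifier non-greedy: it matches as few characters as possible, stopping at the first place where the rest of the pattern can match. Leaving the ? out, .*, makes it greedy: it sucks up as many characters as possible, up to the last place where the rest of the pattern can match.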
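
A minimal sketch of that difference, using a made-up test string (not from the corpus) that contains two *loc markers:

```perl
use strict;
use warnings;

# Hypothetical test string with two "*loc" markers.
my $text = ' *a*loc *b*loc';

# Non-greedy: the capture stops at the first "*loc" it can reach.
my ($lazy) = $text =~ /\s\*(.*?)\*loc/;

# Greedy: the capture extends to the last "*loc" in the string.
my ($greedy) = $text =~ /\s\*(.*)\*loc/;

print "lazy:   $lazy\n";    # a
print "greedy: $greedy\n";  # a*loc *b
```

In both cases the match starts at the same place; only the end point differs.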

    This means your regex is finding the first * after a space, the *nadie, then looking from there until it finds *loc. The ? only limits where the match ends; it does not move the starting point any later. This is why your regex is including that entire portion of the string.

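    To fix this for your scenario, the regex \s\*([^\*]*)\*loc should work fine. It matches a whitespace character, followed by a *, then any number of characters that are NOT a *, followed by *loc. Because the captured text can never contain a *, the match cannot span more than one phrase.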
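
As a rough check, here are both patterns run against the sample line exactly as posted in the question. The variable names are only illustrative, and the flags from the original code are left out to keep the comparison simple:

```perl
use strict;
use warnings;

# Sample line copied from the question.
my $line = '... *con la certeza absoluta*stan de .. que {no hay-e+} *nadie*s *en la casa*loc:st';

# Non-greedy dot: the match starts at the leftmost " *" the engine can use
# and runs until the first "*loc", so it crosses the earlier phrases.
my ($lazy) = $line =~ /\s\*(.*?)\*loc/;

# Negated character class: the capture cannot contain "*", so it is
# confined to the single span immediately before "*loc".
my ($strict) = $line =~ /\s\*([^*]*)\*loc/;

print "lazy:   $lazy\n";
print "strict: $strict\n";   # en la casa
```

The * inside the character class does not need a backslash, although escaping it as in the regex above is harmless.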

    Buena Suerte!

      Dear Monk, thank you so much, that fixed it.
OT: corpus design (Re: regexp not greedy)
by graff (Chancellor) on Dec 14, 2005 at 04:47 UTC
    Whoever came up with that sort of format for marking part-of-speech in text data should learn about using a proper bracketing markup design instead. In any data set like the sample you showed, a simple slip-up in white space (adding or dropping a space character in the wrong place next to a "*", or heaven forbid, ending up with an odd number of "*"'s) could render the file unparsable and very difficult to fix.

    XML would be worth looking into for this, or even just labeled parens, like "(STAN con la certeza absoluta) de .. que (VERB_COMPLEX no hay-e+) (SUBJ nadie) (LOC:ST en la casa)" -- anything like this would make the data easier to process, and less prone to simple mistakes that might cause catastrophic damage.
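
To illustrate why the labeled-paren form is easier to process, here is a small sketch that pulls every labeled phrase out of the example above. It assumes the phrases are never nested and that a label contains no whitespace or parentheses; neither assumption comes from the original data.

```perl
use strict;
use warnings;

# Example line in the labeled-paren format suggested above.
my $line = '(STAN con la certeza absoluta) de .. que (VERB_COMPLEX no hay-e+) (SUBJ nadie) (LOC:ST en la casa)';

# Collect each "(LABEL text)" group; assumes no nested parentheses.
my @phrases;
while ($line =~ /\(([^\s()]+)\s+([^()]*)\)/g) {
    push @phrases, { label => $1, text => $2 };
}

# Keep only the locative phrases.
my @loc = map { $_->{text} } grep { $_->{label} =~ /^LOC/ } @phrases;

print "$_->{label}: $_->{text}\n" for @phrases;
print "locatives: @loc\n";   # locatives: en la casa
```

Each phrase carries its own opening and closing delimiter, so one pattern handles every phrase type, and the label can be tested separately from the text.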

    (If your goal is to transform the data into some better format, this is an excellent idea, and I wish you the best of luck.)