Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hello, this is my first posting to the keepers of wisdom, so I'll try to keep it brief. I've found that I need to stack a lot of regular expressions in order to force pattern matching to occur first on larger complex patterns, then on the smaller patterns that they are composed of. It seems to me that this is simply greedy matching, with the special circumstance that the largest patterns are made up of optional and obligatory combinations of other patterns that will match at least the minimal pattern. For instance:
$np1="(?:$det|$gen)";
$np2 ="(?:$adj|$num|$conj|$adv|$inf)";
$np3="(?:$np1*\s*($noun)*\s*$np2*\s*($noun)+\s*$adj*)";
used together in the following manner:
$NP = "(?:(?:$np1)*\s*$np2*(?:$np3)+)";
As I've mentioned, I want to match the longest patterns first but allow for matching on the smaller patterns, which is my reason for including Kleene stars for optional subpatterns. The problem that I'm having is that the optionality leads to matching the minimal patterns and never the optionally longer ones. My question, then, is whether I need to keep doing what I am doing now: explicitly matching the longest patterns, then the next longest, and so on down to the minimal patterns? I ask because the OR grouping from greatest coverage to least also seems to be missing longer patterns. So to sum up, I need to match long patterns composed of smaller patterns, where the long ones match first and then, failing that, the shorter ones match. If my question is overly simple, or my discussion of it unclear, I apologize in advance. Thanks,
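The behaviour described in the question can be reproduced on a toy input. The sketch below is illustrative only: the sample sentences and the simplified `$det`, `$adj`, and `$noun` patterns are assumptions, not the poster's real data. It shows that Perl's alternation takes the first alternative that succeeds at the leftmost possible starting position, rather than the longest one overall.

```perl
use strict;
use warnings;

# Toy tagged text in the same word/TAG style as the post (made-up example).
my $text = "the/DT big/JJ dog/NN";

# Simplified stand-ins for the poster's sub-patterns.
my $det  = qr{[A-Za-z]+/DT};
my $adj  = qr{[A-Za-z]+/JJ[RS]?};
my $noun = qr{[A-Za-z0-9._]+/NNS?};

# Shortest alternative listed first: the engine accepts it and stops.
my ($short_first) = $text =~ /($det|$det\s+$adj\s+$noun)/;

# Longest alternative listed first: now the full phrase is captured.
my ($long_first) = $text =~ /($det\s+$adj\s+$noun|$det)/;

# Leftmost still beats longest: at position 0 the long alternative fails,
# the short one succeeds, and the later, longer phrase is never tried.
my ($leftmost) = "a/DT x/VB the/DT big/JJ dog/NN" =~ /($det\s+$adj\s+$noun|$det)/;

print "$short_first\n";   # the/DT
print "$long_first\n";    # the/DT big/JJ dog/NN
print "$leftmost\n";      # a/DT
```

So ordering alternatives from longest to shortest helps only among matches that begin at the same position; an earlier, shorter match still wins over a later, longer one.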

Replies are listed 'Best First'.
(Ovid) Re: greedy and lazy
by Ovid (Cardinal) on Jul 25, 2000 at 00:02 UTC
    It appears that you were cut off in the middle of your post. Could you please repost? From what I can see, your regexes look very interesting, and I suspect from your choice of variable names that your target text is rather unpredictable, which makes the regexes more interesting still. I would very much like to take a look at what you are trying to accomplish.

    Also, wrapping your code in <CODE></CODE> tags will format it nicely:

    $np1="(?:$det|$gen)";
    $np2 ="(?:$adj|$num|$conj|$adv|$inf)";
    $np3="(?:$np1*\s*($noun)*\s*$np2*\s*($noun)+\s*$adj*)";
    Cheers,
    Ovid
      If you're interested in the actual patterns and all, here they are.
      $noun ="(?: *[A-Za-z0-9._]+\/NN[PS]*)";
      $det ="( *[A-Za-z]+\/DT)";
      $adj ="( *[A-Za-z]+\/JJ[RS]?)";
      $gen ="( *[A-Za-z]+\/POSS)";
      $adv="( *[A-Za-z\']+\/RB[RS]?)";
      $inf =" *to\/TO";
      $adv="( *[A-Za-z\']+\/RB[RS]?)";
      $np1="(?:$det|$gen)";
      $np2 ="(?:$adj|$num|$conj|$adv|$inf)";
      $np3="(?:$np1*\s*(?:$noun)*\s*$np2*\s*(?:$noun)+\s*$adj*)";
      $np4="((?:$noun)+\s*$np2+\s*(?:$noun)+)";
      $np5="(?:$np1*\s*$adj+\s*($noun)+)";
      # more complex noun and prep phrases
      $NP = "(?:(?:$np1)*\s*(?:$np3)+)";
      $NP1 = "(?:$np3)+\s*(?:$np2)\s*(?:$np3)+";
      $NP2 = "(?:(?:$np1)+\s*(?:$np3)+\s*(?:$np4)+)";
      $NP3 ="(?:$np1*\s*$noun+\s*[^INV]+\s*(?:$noun)+)";
      $NP4 ="$np1+\s*[^NV]+\s*$noun+";
      $nps= "(?:($NP1)|($NP2)|($NP3)|($NP4)|($NP))";
      $extnp="(?:($pro(?!\$))|($np5))";
      I'm basically trying to parse text into Noun, Verb, and Preposition Phrases. Nouns are the most troublesome at the moment, because although the individual patterns match what I want in test output, when they are OR'd together, their output is not always correct. Thanks
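One thing worth checking when patterns are assembled from double-quoted strings, as in the code above: Perl drops the backslash from escapes it does not recognise inside double quotes, so `"\s*"` is stored as `s*`. The sketch below uses made-up variable names (`$dq`, `$sq`, `$re`) to show the difference between double quotes, single quotes, and `qr//`. It may or may not explain the mismatches in the post, but it is cheap to rule out by printing the assembled pattern.

```perl
use strict;
use warnings;

# Double-quoted: "\s" is not a string escape, so the backslash is lost.
my $dq;
{
    no warnings;    # silence "Unrecognized escape \s passed through"
    $dq = "a\s*b";
}

# Single quotes and qr// keep the backslash intact.
my $sq = 'a\s*b';
my $re = qr/a\s*b/;

print "double-quoted: $dq\n";   # double-quoted: as*b
print "single-quoted: $sq\n";   # single-quoted: a\s*b

# The double-quoted version now means "a, any number of s, then b".
my $hit_dq = ("a   b" =~ /\A$dq\z/) ? 1 : 0;
my $hit_sq = ("a   b" =~ /\A$sq\z/) ? 1 : 0;
my $hit_re = ("a   b" =~ /\A$re\z/) ? 1 : 0;

print "dq=$hit_dq sq=$hit_sq qr=$hit_re\n";   # dq=0 sq=1 qr=1
```

Building the pieces with `qr//` also lets each sub-pattern be tested on its own before it is combined into a larger alternation.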
        Ouch.

        I've heard good things about Parse::RecDescent... perhaps writing a small grammar with it would be more fruitful than using regular expressions.