Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'd like to ask another regex question. Consider this string:

$x = 'ab cd cd EF ghi jkl';
$x =~ /cd (.*) ghi/;

$1 evaluates to "cd EF". This looks correct.

But:

$x =~ /cd (.*?) ghi/;

$1 evaluates to "cd EF", same as above. A strange result, as I expect it to be just "EF".

Why do matches against (.*) and (.*?) give exactly the same result? What needs to be done to get only the "EF" as a result?

In real life the EF is a random text that I want to extract.

Thank you.

Re: Regex problem - (non)greedy?
by ikegami (Patriarch) on Nov 14, 2013 at 14:18 UTC

    Why do matches against (.*) and (.*?) give exactly the same result?

    Because there's only one ghi.

        ...
        /c/ matches "c" (at position 3)
        /d/ matches "d"
        /.*/ matches " cd EF ghi jkl"
        /g/ fails -> backtrack
        /.*/ matches " cd EF ghi jk"
        /g/ fails -> backtrack
        /.*/ matches " cd EF ghi j"
        /g/ fails -> backtrack
        ...
        /.*/ matches " cd EF "
        /g/ matches "g"
        /h/ matches "h"
        /i/ matches "i" -> success

        ...
        /c/ matches "c" (at position 3)
        /d/ matches "d"
        /.*?/ matches ""
        /g/ fails -> backtrack
        /.*?/ matches " "
        /g/ fails -> backtrack
        /.*?/ matches " c"
        /g/ fails -> backtrack
        ...
        /.*?/ matches " cd EF "
        /g/ matches "g"
        /h/ matches "h"
        /i/ matches "i" -> success
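The two traces can be checked directly. A minimal sketch, using the question's own patterns (with the literal spaces) and the question's string; the variable names are only illustrative:

```perl
use strict;
use warnings;

my $x = 'ab cd cd EF ghi jkl';

# Greedy: (.*) grabs as much as possible, then backtracks to the last " ghi".
my ($greedy) = $x =~ /cd (.*) ghi/;

# Non-greedy: (.*?) grows one character at a time until " ghi" fits.
my ($lazy) = $x =~ /cd (.*?) ghi/;

# Both matches start at the first "cd", and the string contains only one
# "ghi", so both captures end at the same place.
printf "greedy: <%s>\n", $greedy;   # <cd EF>
printf "lazy:   <%s>\n", $lazy;     # <cd EF>
```

Greediness only decides where the match ends; it has no influence on where it starts.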

    What needs to be done to get only the "EF" as a result?

    (?s:(?!STRING).)* is to STRING as [^CHAR]* is to CHAR.

    /cd((?:(?!cd|ghi).)*)ghi/s

    Note: Unlike the trick of adding ^.* to the start of the pattern, this pattern can be used in other patterns.

    Note: I think you want "EF" instead of " EF ", but it's easier just to trim the whitespace afterwards.
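A minimal sketch of the negative-lookahead idiom applied to the question's string, with the whitespace trimmed in a second step. The capture group around the repeated part and the variable names are assumptions made for illustration:

```perl
use strict;
use warnings;

my $x = 'ab cd cd EF ghi jkl';

# (?:(?!cd|ghi).)* consumes one character at a time, but only while neither
# "cd" nor "ghi" starts at the current position. The first "cd" therefore
# cannot reach "ghi", and the match begins at the second "cd".
my ($raw) = $x =~ /cd((?:(?!cd|ghi).)*)ghi/s;

# Trim leading and trailing whitespace afterwards.
(my $trimmed = $raw) =~ s/^\s+|\s+$//g;

printf "raw:     <%s>\n", $raw;       # < EF >
printf "trimmed: <%s>\n", $trimmed;   # <EF>
```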

Re: Regex problem - (non)greedy?
by hdb (Monsignor) on Nov 14, 2013 at 14:29 UTC

    The non-greedy regex would still start at the first occurrence of 'cd' but then stop at the first 'ghi', whereas the greedy one would stop at the last 'ghi'. (In your example there is only one, so they are the same.)

    You can use this behavior to achieve the desired effect by adding something greedy before the expression of interest, like

    $x =~ /.*cd (.*) ghi/;

    Here the .* at the beginning would eat up everything up to the last occurrence of 'cd', which is what you want.
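A short sketch of this leading-.* approach on the question's string (the variable names are illustrative):

```perl
use strict;
use warnings;

my $x = 'ab cd cd EF ghi jkl';

# The leading .* is greedy: it first swallows the whole string, then gives
# characters back until "cd " can match, which happens at the LAST "cd".
my ($got) = $x =~ /.*cd (.*) ghi/;

printf "<%s>\n", $got;   # <EF>
```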

      I expect that doesn't give the intended results for any of the following:

          "cdghi"
          "cd ghi ghi"

      Furthermore, that pattern won't help if one wants to go on to handle either of the following:

          "cd XXX ghi cd YYY ghi"
          "cd XXX cd YYY ghi ZZZ ghi"

      By the way, you really want to add a leading ^ to that to speed things up greatly when the pattern doesn't match.

        I am not sure what the expected results are, but the actual results are:

            cdghi: (no match)
            cd ghi ghi: ghi
            cd abc ghi cd def ghi: def
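These results can be reproduced with a short loop. The sketch below also runs the pattern on one of the multi-delimiter strings from the objection above, where the greedy capture runs through to the last 'ghi'. The hash and variable names are illustrative:

```perl
use strict;
use warnings;

my @inputs = (
    'cdghi',
    'cd ghi ghi',
    'cd abc ghi cd def ghi',
    'cd XXX cd YYY ghi ZZZ ghi',
);

my %result;
for my $s (@inputs) {
    # In list context a failed match returns an empty list, leaving $got undef.
    my ($got) = $s =~ /.*cd (.*) ghi/;
    $result{$s} = $got;
    printf "%-28s => %s\n", $s, defined $got ? "<$got>" : '(no match)';
}
```

The last string yields "YYY ghi ZZZ", because the greedy (.*) stops only at the final " ghi".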
      I think

      $x =~ /.* cd (.*?) ghi/x

      is better, but I can't test it at the moment... :)

      Cheers Rolf

      ( addicted to the Perl Programming Language)

      update

      Moved ? into group...

      PS: ikegami msged me in the meantime thx :)

      update

       > ikegami says Re Re^2: Regex problem - (non)greedy? By the way, you really want to add a leading ^ to that to speed things up greatly when the pattern doesn't match.

      He's (normally) right! ++

      It might depend on how cleverly Perl optimizes the regex...
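One detail worth checking in the /x variant: under the /x flag, literal whitespace in the pattern is ignored, so the spaces around "cd" and "ghi" no longer match spaces in the string. A small sketch on the question's string (the variable names are illustrative):

```perl
use strict;
use warnings;

my $x = 'ab cd cd EF ghi jkl';

# With /x the pattern is effectively /.*cd(.*?)ghi/, so the surrounding
# spaces end up inside the capture.
my ($with_x) = $x =~ /.* cd (.*?) ghi/x;

# Explicit \s* outside the group keeps them out of the capture.
my ($tight) = $x =~ /.* cd \s* (.*?) \s* ghi/x;

printf "with /x: <%s>\n", $with_x;   # < EF >
printf "tight:   <%s>\n", $tight;    # <EF>
```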

Re: Regex problem - (non)greedy?
by vpbamberg (Initiate) on Nov 14, 2013 at 15:13 UTC

    Thank you for the answers. Please excuse that my login session expired before I sent my question, rendering it anonymous.

    My real-life problem is WebInject, where I need to extract a form variable from the response. WebInject captures almost the whole HTML source in the form tag.

    As I am using Webinject, post-processing the regex results is not possible...
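If post-processing is not available, the trimming can be moved into the pattern itself. The sketch below is plain Perl on the question's sample string; it assumes the tool accepts an ordinary Perl regex with a single capture group, which has not been verified against WebInject here:

```perl
use strict;
use warnings;

my $x = 'ab cd cd EF ghi jkl';

# \s* on both sides sits outside the capture group. The lazy *? lets the
# trailing \s* claim the whitespace in front of "ghi", and the (?!cd|ghi)
# lookahead stops the first "cd" from reaching across the second one.
my ($value) = $x =~ /cd\s*((?:(?!cd|ghi).)*?)\s*ghi/s;

printf "<%s>\n", $value;   # <EF>
```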

Re: Regex problem - (non)greedy?
by Laurent_R (Canon) on Nov 14, 2013 at 18:22 UTC

    The main point about regex matches is that the regex engine will always give you the first possible match: if you try to match the string "bbaababaaaa" with the regex /a+/ or /a+?/, the match will start on the first "a" of the string irrespective of greediness. With the greedy quantifier you'll get the first "aa", and with the non-greedy one "a". Using a greedy quantifier will still not lead to the "better" match at the end of the string: as soon as there is a possible match, there is no backtracking to look for "aaaa". Similarly, with a non-greedy quantifier, the regex engine will not give up the start of a match once it has found one. In other words, greediness should be understood as working "forward", or "to the right", only, not backward.
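A quick sketch of this example, recording both the matched text and the offset where the match starts (`$-[0]` is Perl's built-in start offset of the last successful match; the other names are illustrative):

```perl
use strict;
use warnings;

my $s = 'bbaababaaaa';

my ($greedy, $greedy_at, $lazy, $lazy_at);

# Greedy: takes the whole first run of "a" characters.
if ($s =~ /(a+)/)  { ($greedy, $greedy_at) = ($1, $-[0]) }

# Non-greedy: takes as little as possible, a single "a".
if ($s =~ /(a+?)/) { ($lazy, $lazy_at) = ($1, $-[0]) }

printf "greedy: <%s> at offset %d\n", $greedy, $greedy_at;   # <aa> at offset 2
printf "lazy:   <%s> at offset %d\n", $lazy,   $lazy_at;     # <a> at offset 2
```

Both matches start at offset 2; only the end of the match differs.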