bfdi533 has asked for the wisdom of the Perl Monks concerning the following question:

I recently had an issue that required a regex with the or ('|') operator, and I was confused by the output.

Here is some sample input:

this_line_'aaa|bbb|ccc|ddd|eee'_is_terminated
this_line_'aaa|bbb|ccc|ddd|eee'_is_terminated
this_line_'aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_...
this_line_'aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_...
this_line_'aaaaaa|bbbbbb|cccccc|dddddd|eeeeee...
this_line_'aaaaaa|bbbbbb|cccccc|dddddd|eeeeee...

I am using the following code to find and extract part of the line:

while (<>) {
    $inline = $_;
    # Capture the data after the opening quote and the terminator that follows it.
    ($data, $trail) = $inline =~ /this_line_'(.*)('_|\.\.\.)/;
    print $data . "\n";
}

What I am getting is the following output:

aaa|bbb|ccc|ddd|eee
aaa|bbb|ccc|ddd|eee
aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_
aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_
aaaaaa|bbbbbb|cccccc|dddddd|eeeeee
aaaaaa|bbbbbb|cccccc|dddddd|eeeeee

I understand the first two lines: each contains only one of the two terminators ('_), so only that alternative can match. I understand the last two lines for the same reason: they contain only the ... terminator.

However, I do not understand the middle two lines. I expected the second group, the one with the '|' in it, to be evaluated left to right, so that '_ would win. Instead, it appears to be evaluated right to left, or perhaps in some other order.

Can anyone help me to understand the priority of the conditions in a regex where the or ('|') operator is concerned?

Re: Priority of | regex operator
by eff_i_g (Curate) on Apr 26, 2006 at 22:54 UTC
    The (.*) is greedy. It first captures the rest of the line, and then the regex engine backtracks, giving back one character at a time, until the alternation in the parens can match. Moving backwards from the end of the line, the '...' is the first place where it can.

    You can use (.*?) to make the quantifier non-greedy, so that it stops at the first place where the alternation matches; that will give you what you want.
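    A minimal sketch of the difference, using one of the middle sample lines from the question. The variable names are only illustrative and are not part of the original code.

```perl
use strict;
use warnings;

# One of the "middle" sample lines from the question.
my $line = "this_line_'aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_...";

# Greedy: (.*) takes the rest of the line, then backtracks from the right
# until the second group can match, which first happens at the trailing '...'.
my ($greedy) = $line =~ /this_line_'(.*)('_|\.\.\.)/;

# Non-greedy: (.*?) starts empty and grows one character at a time,
# so the second group matches at the leftmost possible place, the "'_".
my ($lazy) = $line =~ /this_line_'(.*?)('_|\.\.\.)/;

print "greedy: $greedy\n";   # greedy: aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_
print "lazy:   $lazy\n";     # lazy:   aaaaa|bbbbb|ccccc|ddddd|eeeee
```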

    P.S. If you don't need the trailing part, use a non-capturing group, (?:...), instead of (...).
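    As a rough illustration of the P.S., here is the same match with a capturing and a non-capturing group, run against the first sample line. The array names are hypothetical.

```perl
use strict;
use warnings;

my $line = "this_line_'aaa|bbb|ccc|ddd|eee'_is_terminated";

# Capturing group: the match returns two values, the data and the terminator.
my @with_capture = $line =~ /this_line_'(.*?)('_|\.\.\.)/;

# Non-capturing group: (?:...) still groups the alternation,
# but only the data is returned.
my @without_capture = $line =~ /this_line_'(.*?)(?:'_|\.\.\.)/;

print scalar(@with_capture), " values captured: @with_capture\n";
print scalar(@without_capture), " value captured: @without_capture\n";
```

    With the non-capturing form there is no need for a throwaway variable such as $trail on the left-hand side.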

      Thank you very much for the reply. That is exactly the data that I needed to know.

      Obviously, I need to get smart about the use of ? in a regex, as I really do not understand it. But it works as expected now, and I will hit the books or the PODs and learn about it.

Re: Priority of | regex operator
by rhesa (Vicar) on Apr 26, 2006 at 23:03 UTC
    The * is greedy, so (.*) eats up as much of the string as it can. When it reaches the end of the string, it backtracks, one character at a time, to see if it can make the second group match, and the first time that happens is on the three dots. So it has nothing to do with the priority or the order of the | at all.

    Take a look at this input:

    this_line_'aaaaaa|bbbbbb|cccccc|dddddd|eeeee...'_
    That will result in
    aaaaaa|bbbbbb|cccccc|dddddd|eeeee...
    because the '_ is the first bit that makes the second group match successfully.
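    One way to check this is to run the greedy pattern with the alternatives written in both orders. This is only a sketch: the two inputs are a sample line from the question and the line above, and the variable names are made up for the example.

```perl
use strict;
use warnings;

my @lines = (
    "this_line_'aaaaa|bbbbb|ccccc|ddddd|eeeee'_is_...",
    "this_line_'aaaaaa|bbbbbb|cccccc|dddddd|eeeee...'_",
);

my @results;
for my $line (@lines) {
    # Same greedy pattern, with the two alternatives in either order.
    my ($data_a, $trail_a) = $line =~ /this_line_'(.*)('_|\.\.\.)/;
    my ($data_b, $trail_b) = $line =~ /this_line_'(.*)(\.\.\.|'_)/;

    push @results, [$data_a, $trail_a, $data_b, $trail_b];
    print "terminator: $trail_a / $trail_b  data: $data_a\n";
}
```

    For both inputs the two patterns pick the same terminator: whichever one sits furthest to the right, because that is the first one the backtracking (.*) runs into.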