in reply to qr() match order with multiple patterns

The problem is not that it is matching right to left. It is that the regex engine will find the earliest, i.e. leftmost, match that it can.

Notionally, it starts at the first character of the string and tries each of the alternations in your regex, left to right. If none of them match at the first character, then it moves to the next character and tries each alternation again, left to right.

The problem with your regex is that, whilst the longer subexpressions won't match early in the strings you posted in your other post, the shorter ones will. So it finds the shorter match before the longer match.

You can prove that it isn't the ordering of the alternations by reversing their ordering. You will still find the leftmost possible match first.
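A minimal sketch of that experiment. The two sub-patterns and the filename here are made up for illustration; they are not the ones from your post.

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, for illustration only.
my $long  = qr/\d{8}/;   # e.g. a YYYYMMDD-style run of digits
my $short = qr/\d{4}/;   # e.g. an MMDD-style run of digits

# Hypothetical filename with a short digit run before a long one.
my $name = 'rpt_0723_20030723.dat';

# Same alternation, both orderings.
my ($first_a) = $name =~ /($long|$short)/;
my ($first_b) = $name =~ /($short|$long)/;

print "long|short: $first_a\n";
print "short|long: $first_b\n";
```

Both orderings should report 0723, because that is the leftmost position at which any alternative can match; the order of the alternatives only matters among matches that start at the same position.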

It's not completely clear to me from the two examples you gave exactly which bits you want to match in the second example, but I think the fix is to make sure that the bit you are trying to match is bounded by anchors of some sort, maybe a non-digit char or the start or end of the string. Something like this:

my $Date = qr{(?:^|\D)($MMDDYYYY|$DDMMYYYY|$YYYYMMDD|$DDMMYY|$MMDDYY|$DDMM|$MMDD)(?:\D|$)};

You might want to use different anchors, and you might need to apply them to the individual parts of the composite rather than to the composite as a whole, but as I said, without a few more examples of inputs and desired outputs, your goal isn't exactly clear (to me :).
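To show the effect of the anchors, here is a rough sketch. The two sub-patterns and the filename are stand-ins I invented; substitute your own.

```perl
use strict;
use warnings;

# Hypothetical sub-patterns, for illustration only.
my $short = qr/\d{4}/;   # stands in for something like MMDD
my $long  = qr/\d{8}/;   # stands in for something like YYYYMMDD

# Hypothetical filename where the digits are not delimited on the left.
my $name = 'rpt20030723x.dat';

# Unanchored: the shorter alternative grabs the front of the digit run.
my ($loose) = $name =~ /($short|$long)/;

# Anchored: the match must be bounded by a non-digit or a string edge,
# so the engine has to backtrack into the longer alternative.
my ($bounded) = $name =~ /(?:^|\D)($short|$long)(?:\D|$)/;

print "unanchored: $loose\n";
print "anchored:   $bounded\n";
```

Here the unanchored version should capture 2003 and the anchored one 20030723. Note that the anchors consume the delimiting characters, so lookarounds such as (?<!\d) and (?!\d) may suit you better if you need to match repeatedly along the same string.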



Re: Re: qr() match order with multiple patterns
by gnu@perl (Pilgrim) on Jul 23, 2003 at 14:37 UTC
    Thanks for your input. After some more thinking (and info from this thread) I did realize that I was looking at the match the wrong way and that it worked as you have stated here.

    Essentially, what I am attempting to do is create a date recognition engine. We process about 100,000 files a day with a large variety of naming conventions. Each name contains a date in one format or another. Unfortunately, the programs that generate these files do not adhere to one date format or field delimiter.

    I plan on adding a little more to it in the form of rules, such as that the file date found cannot be in the future or too far in the past (1923, say). Since these files are handled many times by many programs, the mtime and ctime are not reliable methods for dating the files.
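    A rough sketch of what such a rule could look like. The sub name plausible_date and the 1970 cutoff are placeholders of mine, not something already decided, and the check does not validate the calendar (it would accept February 30).

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical cutoff for "too far in the past"; pick whatever suits the data.
my $EARLIEST = '19700101';

# True if the date is on or after the cutoff and not later than today.
# $today (YYYYMMDD) is optional and mainly there to make testing easy.
sub plausible_date {
    my ($year, $month, $day, $today) = @_;
    $today = strftime('%Y%m%d', localtime) unless defined $today;

    # Zero-padded YYYYMMDD strings compare correctly as plain strings.
    my $ymd = sprintf '%04d%02d%02d', $year, $month, $day;
    return $ymd ge $EARLIEST && $ymd le $today;
}

print plausible_date(2003, 7, 23) ? "plausible\n" : "rejected\n";
print plausible_date(1923, 1, 1)  ? "plausible\n" : "rejected\n";
```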

    Now that I have this part figured out, and know that it won't work the way I had thought, I might start another thread for suggestions on a date recognition engine.

    I was able to make this work easily by placing the patterns into an array and then iterating through the array in order of preference and stopping when I reach a suitable fit.
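    Roughly, the approach looks like the sketch below. The three patterns, the find_date name and the sample filename are simplified placeholders rather than my production list, and I have used lookarounds here to keep the digit runs bounded.

```perl
use strict;
use warnings;

# Hypothetical patterns, listed in order of preference (most specific first).
my @patterns = (
    [ YYYYMMDD => qr/(?<!\d)((?:19|20)\d\d)(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(?!\d)/ ],
    [ MMDDYYYY => qr/(?<!\d)(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])((?:19|20)\d\d)(?!\d)/ ],
    [ MMDD     => qr/(?<!\d)(0[1-9]|1[0-2])(0[1-9]|[12]\d|3[01])(?!\d)/ ],
);

# Return the name of the first pattern that matches, followed by its captures.
# Returns an empty list if nothing matches.
sub find_date {
    my ($filename) = @_;
    for my $entry (@patterns) {
        my ($format, $re) = @$entry;
        if (my @parts = $filename =~ $re) {
            return ($format, @parts);
        }
    }
    return;
}

my ($format, @parts) = find_date('rpt_0723_20030723.dat');
print "$format: @parts\n";
```

    Because each pattern gets the whole string to itself, the preferred format wins even when a less preferred one would match further to the left.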