PerlMonks  

qr() match order with multiple patterns

by gnu@perl (Pilgrim)
on Jul 23, 2003 at 13:32 UTC ( [id://277140]=perlquestion )

gnu@perl has asked for the wisdom of the Perl Monks concerning the following question:

Ok, here's a quick example of what I'm doing:
    my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
    my $month    = qr{0[1-9]|1[012]};
    my $fourYear = qr{2003};
    my $twoYear  = qr{03};

    my $MMDDYYYY = qr{$month$dom$fourYear};
    my $DDMMYYYY = qr{$dom$month$fourYear};
    my $YYYYMMDD = qr{$fourYear$month$dom};
    my $DDMMYY   = qr{$dom$month$twoYear};
    my $MMDDYY   = qr{$month$dom$twoYear};
    my $MMDD     = qr{($month$dom)};
    my $DDMM     = qr{$dom$month};
    my $Date     = qr{($MMDDYYYY|$DDMMYYYY|$YYYYMMDD|$DDMMYY|$MMDDYY|$DDMM|$MMDD)};

    # $! holds the system error message for a failed opendir
    opendir(DIR, ".") or die "Cannot open directory for reading: $!\n";
    my @files = grep { /^[^\.|\.\.]/ } readdir(DIR);
    for (@files) {
        unless ( $_ =~ /($Date)/g ) {
            no_match($_);
            next;
        }
        my $match = "${^N}:$`--$&--$'";
        print "In file -->$_<--\n";
        print "$1\n";
    }
My thought was that the pattern would be compared in the order it was compiled in $Date, so if $MMDDYYYY matched $_ in the example the matching should stop and report what was matched.

The problem is that for a given line with matches for both $MMDDYYYY and $DDMM the pattern for $DDMM is reported even though the text that matches $MMDDYYYY is first in the line.

If I change $Date to equal just $MMDDYYYY alone the test works correctly and if I change $Date to just $DDMM it returns the appropriate portion. The patterns in $Date are ordered in preference, but it is not returning in that fashion.

My main interest is in discovering whether the order in which qr() tries the patterns can differ from the order they were given when the pattern was compiled. It may also be that I have a regex problem; comments on both are welcome.

TIA, Chad.

Replies are listed 'Best First'.
Re: qr() match order with multiple patterns
by BrowserUk (Patriarch) on Jul 23, 2003 at 14:29 UTC

    The problem is not that it is matching right to left. It is that the regex engine will find the earliest, i.e. left-most, match that it can.

    Notionally, it starts at the first character of the string and tries each of the alternations in your regex, left to right. If none of them match at the first character, then it moves to the next character and tries each alternation again, left to right.

    The problem with your regex is that, whilst the longer subexpressions won't match early in the strings you posted in your other post, the shorter ones will. So it finds the shorter match before the longer match.

    You can prove that it isn't the ordering of the alternations by reversing their ordering. You will still find the leftmost possible match first.
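
    The reversal test can be sketched as below. This is a cut-down illustration, not the original script: only two of the formats are included, the sample line is the first one quoted elsewhere in this thread, and the helper name first_hit is made up for the example.

```perl
use strict;
use warnings;

my $dom   = qr{0[1-9]|[12][0-9]|3[01]};
my $month = qr{0[1-9]|1[012]};
my $year4 = qr{2003};

my $MMDDYYYY = qr{$month$dom$year4};
my $MMDD     = qr{$month$dom};

my $line = 'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr';

# Same two alternatives, opposite order.
my $long_first  = qr{($MMDDYYYY|$MMDD)};
my $short_first = qr{($MMDD|$MMDDYYYY)};

# Hypothetical helper: returns the matched text and the offset where it starts.
sub first_hit {
    my ($re, $str) = @_;
    return unless $str =~ $re;
    return ($1, $-[0]);
}

for my $re ($long_first, $short_first) {
    my ($text, $offset) = first_hit($re, $line);
    print "matched '$text' at offset $offset\n";
}
```

    Both orderings report the same four digits near the start of the line, because the engine settles on the first starting position where any alternative succeeds; the order of the alternatives only matters among candidates that start at that same position.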

    It's not completely clear to me from the two examples you gave, exactly which bits you want to match in the second example, but I think the fix is to make sure that the bit you are trying to match is bounded by anchors of some sort, maybe a non-digit char or the start or end of string. Something like this.

        my $Date = qr{(?:^|\D)($MMDDYYYY|$DDMMYYYY|$YYYYMMDD|$DDMMYY|$MMDDYY|$DDMM|$MMDD)(?:\D|$)};

    You might want to use different anchors, and you might need to apply them to the individual parts of the composite rather than the composite, but as I said, without a few more example of inputs and desired outputs your goal isn't exactly clear (to me:).
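
    A minimal sketch of the delimited version, assuming the same sub-patterns as the original post (minus the inner capture in $MMDD, so that $1 is always the whole date). The helper name delimited_date and the shortened first file name are made up for the illustration.

```perl
use strict;
use warnings;

my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $twoYear  = qr{03};

my $MMDDYYYY = qr{$month$dom$fourYear};
my $DDMMYYYY = qr{$dom$month$fourYear};
my $YYYYMMDD = qr{$fourYear$month$dom};
my $DDMMYY   = qr{$dom$month$twoYear};
my $MMDDYY   = qr{$month$dom$twoYear};
my $MMDD     = qr{$month$dom};
my $DDMM     = qr{$dom$month};

# A date only counts if it is bounded by a non-digit (or string edge) on both sides.
my $Date = qr{(?:^|\D)($MMDDYYYY|$DDMMYYYY|$YYYYMMDD|$DDMMYY|$MMDDYY|$DDMM|$MMDD)(?:\D|$)};

# Hypothetical helper: returns the delimited date, or undef if there is none.
sub delimited_date {
    my ($name) = @_;
    return $name =~ $Date ? $1 : undef;
}

for my $name (
    'F111406585_07222003_539.cdr',    # made-up, shortened name
    'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr',
) {
    my $hit = delimited_date($name);
    printf "%s => %s\n", $name, defined $hit ? $hit : 'no match';
}
```

    With the delimiters in place the nine-digit run after the 'F' no longer yields a four-digit hit. Note that the result is still the leftmost delimited date: on the full sample line that is the six-digit date after the 'D', not the eight-digit one further right, which is why different anchors or per-format handling may be needed.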


    Examine what is said, not who speaks.
    "Efficiency is intelligent laziness." -David Dunham
    "When I'm working on a problem, I never think about beauty. I think only how to solve the problem. But when I have finished, if the solution is not beautiful, I know it is wrong." -Richard Buckminster Fuller

      Thanks for your input. After some more thinking (and info from this thread) I did realize that I was looking at the match the wrong way and that it worked as you have stated here.

      Essentially what I am attempting to do is create a date recognition engine. We process about 100,000 files a day with a large variety of naming conventions. Each name contains a date in one format or another. Unfortunately the programs that generate these files do not adhere to one date format or field delimiter.

      I plan on adding a few more rules to it, such as that the file date found cannot be in the future or too far in the past (1923). Since these files are handled many times by many programs, the mtime and ctime are not reliable ways of dating the files.

      Now that I have this part figured out and that it won't work as I had thought I might start another thread for suggestions on a date recognition engine.

      I was able to make this work easily by placing the patterns into an array and then iterating through the array in order of preference and stopping when I reach a suitable fit.

Re: qr() match order with multiple patterns
by broquaint (Abbot) on Jul 23, 2003 at 13:37 UTC
    The problem is that for a given line with matches for both $MMDDYYYY and $DDMM the pattern for $DDMM is reported even though the text that matches $MMDDYYYY is first in the line.
    Drop the /g modifier in your match condition and you should get the desired results (i.e $MMDDYYYY should match first).
    HTH

    _________
    broquaint

      Whoops! Kind of a typo, I was trying something different and forgot about that. I just went and changed it but I am still having the problem.

      Here are two example lines I am examining:

      F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr
      USL_111406585_P085_A87_030723

      The first line, the date is 07222003, but it reports the 1114 that is found just after the 'F' in the beginning of the line. If in my code I take out the last $MMDD I then get '1406' returned for both lines. If I now take out $DDMM as well I then get 072203 for line 1 and a statement that there is no match for line 2. Now I take out $MMDDYY and I finally get 07222003 for line one and a statement that line 2 does not match.

      It appears as if the match is happening from right to left, but that just didn't make sense to me.

        The first line, the date is 07222003, but it reports the 1114 that is found just after the 'F' in the beginning of the line
        Correct, as that matches $MMDD. This is because the alternation tries to match at every point of the string, and because $MMDD matches 1114, it's the first date to be returned. For some really detailed output on the workings of this regex behaviour try adding use re 'debug' to the top of your script to see exactly what the regex engine is doing at every step.
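
        As a lighter-weight check than the full use re 'debug' trace, each format can be run on its own and the starting offset of its first match printed. This is only a sketch: four of the formats are shown, and the helper name first_match is made up for the example.

```perl
use strict;
use warnings;

my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $twoYear  = qr{03};

# Label/pattern pairs, kept in an array so the print order is stable.
my @formats = (
    [ MMDDYYYY => qr{$month$dom$fourYear} ],
    [ MMDDYY   => qr{$month$dom$twoYear}  ],
    [ DDMM     => qr{$dom$month}          ],
    [ MMDD     => qr{$month$dom}          ],
);

my $line = 'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr';

# Hypothetical helper: matched text and start offset of the first match, or an empty list.
sub first_match {
    my ($re, $str) = @_;
    return unless $str =~ /($re)/;
    return ($1, $-[0]);
}

for my $f (@formats) {
    my ($label, $re) = @$f;
    my ($text, $offset) = first_match($re, $line);
    printf "%-8s %s\n", $label,
        defined $text ? "'$text' at offset $offset" : 'no match';
}
```

        The shorter formats find a hit within the first few characters while the eight-digit format only matches much later in the line, so in a single alternation the short hit near the start is the one reported.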

        Probably what you want instead of alternation, which will not do what you want in this particular case, is code that will match a given string with a list of regexes where the order correlates to the precedence of the regex e.g

            my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
            my $month    = qr{0[1-9]|1[012]};
            my $fourYear = qr{2003};
            my $twoYear  = qr{03};

            my $MMDDYYYY = qr{$month$dom$fourYear};
            my $DDMMYYYY = qr{$dom$month$fourYear};
            my $YYYYMMDD = qr{$fourYear$month$dom};
            my $DDMMYY   = qr{$dom$month$twoYear};
            my $MMDDYY   = qr{$month$dom$twoYear};
            my $MMDD     = qr{($month$dom)};
            my $DDMM     = qr{$dom$month};

            # Ordered by preference: earlier entries win.
            my @date_regexes = (
                $MMDDYYYY, $DDMMYYYY, $YYYYMMDD,
                $DDMMYY, $MMDDYY, $MMDD, $DDMM,
            );

            my $line = 'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr';
            print "date is - ", match_precedence(\@date_regexes, $line), $/;

            # Try each regex in turn and return the first one that matches anywhere.
            sub match_precedence {
                my($regs, $str) = @_;
                for(@$regs) {
                    return $1 if $str =~ /($_)/
                }
                return;
            }

            __output__
            date is - 07222003

        That's not great code, but hopefully it'll give you something to work with.

        Update - well, you can use a regex with alternation, but it ain't pretty

            my $date_regex = qr{
                (?:
                    .*($MMDDYYYY)|.*($DDMMYYYY)|.*($YYYYMMDD)|.*($DDMMYY)|
                    .*($MMDDYY)|.*($MMDD)|.*($DDMM)
                )
            }x;

        Shudder, backtracking hell basically. It'll work, but it'll be hugely slow on big strings, so the above regex is really for "you can do it" purposes; I wouldn't advise using it!
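
        For completeness, a sketch of how the hit could be pulled out of that alternation. It assumes the sub-patterns carry no capture groups of their own (the posted $MMDD does), and the helper name date_by_precedence is made up. Since only one alternative takes part in a successful match, the first defined capture is the date.

```perl
use strict;
use warnings;

my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $twoYear  = qr{03};

my $MMDDYYYY = qr{$month$dom$fourYear};
my $DDMMYYYY = qr{$dom$month$fourYear};
my $YYYYMMDD = qr{$fourYear$month$dom};
my $DDMMYY   = qr{$dom$month$twoYear};
my $MMDDYY   = qr{$month$dom$twoYear};
my $MMDD     = qr{$month$dom};
my $DDMM     = qr{$dom$month};

# Each alternative may start anywhere thanks to the leading .*, so the
# alternatives are effectively tried in order of preference from offset 0.
my $date_regex = qr{
    (?:
        .*($MMDDYYYY) | .*($DDMMYYYY) | .*($YYYYMMDD) | .*($DDMMYY) |
        .*($MMDDYY)   | .*($MMDD)     | .*($DDMM)
    )
}x;

# Hypothetical helper: returns the preferred date, or undef if nothing matches.
sub date_by_precedence {
    my ($str) = @_;
    my @caps = $str =~ $date_regex;      # empty list when the match fails
    my ($hit) = grep { defined } @caps;  # only one group takes part in a match
    return $hit;
}

for my $name (
    'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr',
    'USL_111406585_P085_A87_030723',
) {
    my $hit = date_by_precedence($name);
    printf "%s => %s\n", $name, defined $hit ? $hit : 'no match';
}
```

        Because .* is greedy, each alternative picks up the rightmost occurrence of its format rather than the leftmost, which may or may not be what is wanted.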
        HTH

        _________
        broquaint

Re: qr() match order with multiple patterns
by RollyGuy (Chaplain) on Jul 23, 2003 at 13:39 UTC
    When you actually use your regex $Date, you specify the g modifier, which asks for a global search. In scalar context that makes each successive match against the same string resume where the previous one ended (see pos), so repeated matches on one string can step past the first date and report a later one. Since you only want one date per name, the /g is not needed.
    Enjoy.
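
    To see what /g does in scalar context, a minimal check. The sample string is made up, and $MMDD here is a self-contained stand-in for the pattern in the original post.

```perl
use strict;
use warnings;

my $MMDD = qr{(?:0[1-9]|1[012])(?:0[1-9]|[12][0-9]|3[01])};
my $name = 'a0722b1114';    # made-up string containing two MMDD candidates

my @hits;
# First scalar-context /g match: starts at the beginning of the string.
if ($name =~ /($MMDD)/g) { push @hits, $1 }
# Second /g match on the SAME string: resumes from pos($name).
if ($name =~ /($MMDD)/g) { push @hits, $1 }

print "hits in order: @hits\n";
```

    The position is stored per string, so matching a different string (for instance the next file name in a loop) starts again from its beginning.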
