F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr
USL_111406585_P085_A87_030723
In the first line the date is 07222003, but the match reports the 1114 that is found just after the 'F' at the beginning of the line. If I take the last pattern, $MMDD, out of my code, I get '1406' returned for both lines. If I then take out $DDMM as well, I get 072203 for line 1 and a message that there is no match for line 2. If I also take out $MMDDYY, I finally get 07222003 for line 1 and a message that line 2 does not match.
It appears as if the match is happening from right to left, but that just doesn't make sense to me.
The first line, the date is 07222003, but it reports the 1114 that is found just after the 'F' in the beginning of the line
Correct, as that matches $MMDD. The regex engine tries each starting position in the string from left to right and, at each position, tries the alternatives in order; the first position at which any alternative matches wins. Because $MMDD matches 1114 at the second character, that is the date returned, even though the longer patterns would match further along the string. For some really detailed output on this behaviour, try adding use re 'debug' to the top of your script to see exactly what the regex engine is doing at every step.
Probably what you want instead of alternation, which will not do what you want in this particular case, is code that matches a given string against a list of regexes, where the order of the list determines the precedence of each regex, e.g.
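To make the leftmost-match behaviour concrete, here is a minimal sketch. It assumes cut-down versions of the patterns discussed in this thread, and first_match is a hypothetical helper name used only for illustration. It reports where each match starts, using Perl's @- and @+ offset arrays.

```perl
use strict;
use warnings;

# Cut-down versions of the patterns from this thread (year pinned to 2003).
my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $MMDDYYYY = qr{$month$dom$fourYear};
my $MMDD     = qr{$month$dom};

my $line = 'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr';

# first_match is a hypothetical helper: it returns the matched text and the
# offset at which the match starts, or an empty list if nothing matches.
sub first_match {
    my ($re, $str) = @_;
    return unless $str =~ $re;
    return (substr($str, $-[0], $+[0] - $-[0]), $-[0]);
}

# The longer pattern is listed first, yet the alternation still returns the
# match that starts earliest in the string.
my ($text, $offset) = first_match(qr{$MMDDYYYY|$MMDD}, $line);
print "alternation matched '$text' at offset $offset\n";

# On its own, the longer pattern only matches much further along.
my ($full, $full_offset) = first_match($MMDDYYYY, $line);
print "MMDDYYYY alone matched '$full' at offset $full_offset\n";
```

The alternation reports 1114 at offset 1, while $MMDDYYYY on its own reports 07222003 at offset 43. The order of the alternatives only matters between candidates that start at the same position.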
use strict;
use warnings;

# Building blocks: day of month, month, and the year pinned to 2003.
my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $twoYear  = qr{03};

my $MMDDYYYY = qr{$month$dom$fourYear};
my $DDMMYYYY = qr{$dom$month$fourYear};
my $YYYYMMDD = qr{$fourYear$month$dom};
my $DDMMYY   = qr{$dom$month$twoYear};
my $MMDDYY   = qr{$month$dom$twoYear};
my $MMDD     = qr{$month$dom};    # no capture here; match_precedence adds one
my $DDMM     = qr{$dom$month};

# Most specific patterns first; the list order is the precedence.
my @date_regexes = (
    $MMDDYYYY, $DDMMYYYY, $YYYYMMDD, $DDMMYY, $MMDDYY, $MMDD, $DDMM,
);

my $line =
    'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr';
print "date is - ", match_precedence(\@date_regexes, $line), $/;

# Return the text matched by the first regex in the list that matches.
sub match_precedence {
    my ($regs, $str) = @_;
    for (@$regs) { return $1 if $str =~ /($_)/ }
    return;
}
__output__
date is - 07222003
That's not great code, but hopefully it'll give you something to work with.
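If it helps to know which format matched as well as the date, the same idea can be sketched with labelled patterns. This is only a variation on the approach above: match_with_label and the label strings are names made up for this example.

```perl
use strict;
use warnings;

my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $twoYear  = qr{03};

# Ordered from most to least specific; array order is the precedence.
my @date_formats = (
    [ MMDDYYYY => qr{$month$dom$fourYear} ],
    [ DDMMYYYY => qr{$dom$month$fourYear} ],
    [ YYYYMMDD => qr{$fourYear$month$dom} ],
    [ DDMMYY   => qr{$dom$month$twoYear}  ],
    [ MMDDYY   => qr{$month$dom$twoYear}  ],
    [ MMDD     => qr{$month$dom}          ],
    [ DDMM     => qr{$dom$month}          ],
);

# match_with_label is a hypothetical name. It returns (label, matched text)
# for the first format that matches, or an empty list if none does.
sub match_with_label {
    my ($formats, $str) = @_;
    for my $format (@$formats) {
        my ($label, $re) = @$format;
        return ($label, $1) if $str =~ /($re)/;
    }
    return;
}

for my $line (
    'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr',
    'USL_111406585_P085_A87_030723',
) {
    my ($label, $date) = match_with_label(\@date_formats, $line);
    print defined $label ? "$label: $date\n" : "no match\n";
}
```

With these patterns the first sample reports MMDDYYYY: 07222003. The second sample falls all the way through to MMDD: 1114, because none of the longer patterns fit it. If its trailing 030723 is a YYMMDD date (an assumption on my part), a pattern for that format would have to be added ahead of the four-digit ones.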
Update - well, you can use a regex with alternation, but it ain't pretty
my $date_regex = qr{
(?: .*($MMDDYYYY)|.*($DDMMYYYY)|.*($YYYYMMDD)|.*($DDMMYY)|
.*($MMDDYY)|.*($MMDD)|.*($DDMM) )
}x;
Shudder, backtracking hell basically. It'll work, but it'll be hugely slow on big strings, so the above regex is really just to show that it can be done; I wouldn't advise using it!
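For completeness, here is a sketch of how the result could be read back out of an alternation like that, since each branch has its own capture group and only one of them is set after a match. It assumes a cut-down three-pattern version of the regex, and match_alternation is a hypothetical helper name.

```perl
use strict;
use warnings;

my $dom      = qr{0[1-9]|[12][0-9]|3[01]};
my $month    = qr{0[1-9]|1[012]};
my $fourYear = qr{2003};
my $twoYear  = qr{03};

my $MMDDYYYY = qr{$month$dom$fourYear};
my $MMDDYY   = qr{$month$dom$twoYear};
my $MMDD     = qr{$month$dom};

# Every branch starts with .*, so each one can match from offset 0 and the
# branches are effectively tried in precedence order.
my $date_regex = qr{
    (?: .*($MMDDYYYY) | .*($MMDDYY) | .*($MMDD) )
}x;

# match_alternation is a hypothetical helper. A list-context match returns
# one value per capture group; only the branch that matched is defined.
sub match_alternation {
    my ($str) = @_;
    my @captures = $str =~ $date_regex or return;
    my ($date) = grep { defined } @captures;
    return $date;
}

print "date is - ",
    match_alternation(
        'F111406585_D072203_B085_E087_T047-P085_FCC_07222003_539.cdr'),
    "\n";
```

Because .* is greedy, each branch picks up the rightmost occurrence of its pattern rather than the leftmost, which differs from the list-based approach when a format appears more than once in a line.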
HTH
_________ broquaint