First of all, that regex, despite it being commented, isn't very easy on the eyes. I know that this goes for regexen in general, but I find that \s+ is much cleaner than [ ][ ]*?. But I'm just nitpicking.
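One caveat to my own nitpick: the two spellings aren't strictly interchangeable, because \s also matches tabs and newlines while [ ] only matches a literal space. A minimal sketch, with sample strings I made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical sample inputs: one separated by spaces, one by a tab.
my $spaced = "Mar   11";
my $tabbed = "Mar\t11";

my $literal = qr/Mar[ ][ ]*?11/;   # literal spaces only
my $class   = qr/Mar\s+11/;        # any whitespace

my $literal_spaced = $spaced =~ $literal ? 1 : 0;
my $literal_tabbed = $tabbed =~ $literal ? 1 : 0;
my $class_spaced   = $spaced =~ $class   ? 1 : 0;
my $class_tabbed   = $tabbed =~ $class   ? 1 : 0;

print "literal spaces: spaced=$literal_spaced tabbed=$literal_tabbed\n";
print "whitespace class: spaced=$class_spaced tabbed=$class_tabbed\n";
```

If your log lines can only ever contain plain spaces, the difference doesn't matter and \s+ is simply easier to read.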
The last comment in your code confuses me a bit: it says "Date may exist more than once", but ? means "zero times or once". Then, later on, you rely on $1 and $2 to read out the matched dates, suggesting that you only care about the first two occurrences of a date substring.
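To illustrate the difference, here is a small sketch (the sample line and the simplified HH:MM:SS pattern are my own assumptions, not your code). $1 and $2 refer to the first and second capture group of a single match, whereas matching with /g in list context returns every occurrence:

```perl
use strict;
use warnings;

# Hypothetical sample line with three timestamps.
my $line = "Mar 11 08:02:08 host Mar 11 2011 08:02:09 Apr 1 11:12:13";
my $time = qr/\d\d:\d\d:\d\d/;

# One match with two capture groups: this is what $1 and $2 would hold.
my ($first, $second) = $line =~ /($time).*?($time)/;
print "first=$first second=$second\n";

# One capture group, matched repeatedly: every occurrence is returned.
my @all = $line =~ /($time)/g;
print scalar(@all), " timestamps: @all\n";
```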
What I also find confusing is that you say that dates can optionally specify a year, but your first regex doesn't allow for those four digits. Your regex consists of two halves, both intended to match a date that should be composed according to a specific format:
Month + (space) + Day + (space) + (Optional: Year + (space) ) + Hour + (colon) + Minute + (colon) + Seconds + (Optional: (decimal point) + Fraction )
but the first half of your regex never tries to match the year. In other words, the two attempts in your regex to capture strings that should adhere to the same format aren't compatible with each other.
So I've taken the liberty of rewriting your regex (and your code in general), so that it:
1) accepts all of the formats you've specified for a date string;
2) captures all date substrings in the string, not just the first two;
3) allows you to work with only the first two dates anyway, if that's what you want.
use strict;
use warnings;

my $current_line = " Mar 11 08:02:08 172.28.17.253 Mar 11 2011 08:02:08 DR-FW-1 :";
$current_line .= "And also Apr 1 11:12:13 -- April's Fool is a fun day :)";

my @dates = $current_line =~ m/
    (                       # Match and capture...
        (?:Jan|Feb|Mar|     # ... one of the twelve months
           Apr|May|Jun|     #     I prefer to be explicit about this: it's
           Jul|Aug|Sep|     #     a very limited set of strings that we accept
           Oct|Nov|Dec      #     and [A-Z][a-z]{2} is too unrestrictive
        )
        \s+                 # ... and one or more spaces
        \d\d?               # ... and one or two digits for the day, but note
                            #     that this will also match for Feb 30,
                            #     which doesn't exist, or for
                            #     a day such as 54
        \s+                 # ... and one or more spaces
        (?:\d{4}\s+)?       # ... and optionally four digits for the year,
                            #     followed by one or more spaces
        \d\d:\d\d:\d\d      # ... HH:MM:SS, but note that this will also
                            #     accept hours such as 34, or minutes such as
                            #     84, so this isn't the best we can do!
        (?:\.\d*)?          # ... and optionally a decimal point, which, if
                            #     present, is optionally followed by fraction
                            #     of second
    )                       # End of capture.
/gxms;

print "I got ", scalar(@dates), " dates:\n";
print " => $_\n" for @dates;
print "But the first two are $dates[0] and $dates[1]\n";
Output:
I got 3 dates:
 => Mar 11 08:02:08
 => Mar 11 2011 08:02:08
 => Apr 1 11:12:13
But the first two are Mar 11 08:02:08 and Mar 11 2011 08:02:08
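As the comments in the regex admit, days such as 54 and hours such as 34 slip through. One possible way to tighten that, sketched here with helper patterns ($day, $hour, $minsec) and test strings that are my own invention, is to spell out the valid digit ranges. This is still only an approximation: it knows nothing about month lengths, so Feb 30 continues to pass.

```perl
use strict;
use warnings;

# Hypothetical building blocks with restricted digit ranges.
my $month  = qr/(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)/;
my $day    = qr/(?:3[01]|[12]\d|0?[1-9])/;   # 1 to 31, optional leading zero
my $hour   = qr/(?:[01]\d|2[0-3])/;          # 00 to 23
my $minsec = qr/[0-5]\d/;                    # 00 to 59

my $date = qr/
    $month \s+
    $day \s+
    (?:\d{4}\s+)?                # optional year
    $hour : $minsec : $minsec
    (?:\.\d*)?                   # optional fraction of a second
/x;

# Made-up candidates: three valid ones, then deliberately bad ones.
my @candidates = (
    "Mar 11 08:02:08",
    "Mar 11 2011 08:02:08.123",
    "Apr 1 11:12:13",
    "Mar 54 08:02:08",
    "Mar 11 34:02:08",
    "Mar 11 08:84:08",
    "Feb 30 08:02:08",           # nonexistent date, but still accepted
);

my %accepted;
for my $c (@candidates) {
    $accepted{$c} = $c =~ /\A$date\z/ ? 1 : 0;
    printf "%-26s %s\n", $c, $accepted{$c} ? "accepted" : "rejected";
}
```

If you need real calendar validation, it is probably cleaner to capture the pieces and hand them to a date module than to push more logic into the regex.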
In reply to Re: regex for multiple dates
by muba
in thread regex for multiple dates
by Anonymous Monk