Anonymous Monk has asked for the wisdom of the Perl Monks concerning the following question:

Hi Monks

Sorry if my question is a little basic. I know some basic pattern matching, and have tried the tutorials section, but I'm still having a little trouble with the following task.

I have a load of date/time strings that have the following format:
mmm d{d} yyyy hh:mm:ss:sss{AM/PM}

eg. Oct 16 2004 11:09:19:943AM (example 1)
or Mar 3 2007 10:30:31:170PM (example 2)

What I want to do is to extract the day, month and year, and stick them together in the format yyyymmdd so I can sort by date easily. So for the above examples you'd get:
20041016 and
20070303

I've tried the following bit of code:

if ($data=~/(\D{3}) *?(\d{1,2}) *?(\d{4})/)
# i.e. look for 3 non-digits, then any number of spaces, then either 1 or two digits,
# then any number of spaces, then 4 digits. (I wasn't sure how many spaces separated each part.)
{
    $month= $datehash{$1}; # datehash is a lookup for the month
    $day=$2;
    if ($day=~/\d{1}/)
    {
        $day='0'.$day; # i.e. add a '0' in front of days consisting of 1 digit, so 16 stays as 16, but 1 turns into 01
    }
    $year=$3;
    $sortingdate=join '', $year, $month, $day;
}

Now this works for example 2 (I do get 20070303), but not for example 1: for some reason I get 1016 instead of 20041016. Why?

Replies are listed 'Best First'.
Re: Pattern matching for extracting dates
by GrandFather (Saint) on Apr 16, 2007 at 12:42 UTC

    Because your second regex (if ($day=~/\d{1}/)) clobbers $3. I'd rework the code somewhat to:

    use strict;
    use warnings;

    my %datehash = (
        jan => 1, feb => 2, mar => 3, apr => 4, may => 5, jun => 6,
        jul => 7, aug => 8, sep => 9, oct => 10, nov => 11, dec => 12
    );

    while (<DATA>) {
        chomp;
        my $data = lc $_;
        next unless $data =~ /(\D{3}) \s* (\d{1,2}) \s* (\d{4})/x;
        next unless exists $datehash{$1};
        my $month = $datehash{$1};
        printf "%04d%02d%02d\n", $3, $month, $2;
    }

    __DATA__
    Oct 16 2004 11:09:19:943AM
    Mar 3 2007 10:30:31:170PM

    Prints:

    20041016
    20070303

    Note the use of \s to match any white space and that a greedy match is used because it makes no difference in this context. Note too that the /x flag is used so that white space can be used in the regex to make the different parts more obvious.
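To make the clobbering concrete, here is a minimal sketch (the variable names are made up for illustration). It reads $3 before and after a second, capture-free match, and then shows the alternative of copying the captures into lexicals in a single list assignment.

```perl
use strict;
use warnings;

my $data = 'Oct 16 2004 11:09:19:943AM';

# Original shape: read $3 only after a second, capture-free match.
my ($year_before, $year_after);
if ($data =~ /(\D{3}) *?(\d{1,2}) *?(\d{4})/) {
    $year_before = $3;              # still the year from the first match
    my $day = $2;
    if ($day =~ /\d{1}/) {          # succeeds, and has no capture groups
        $year_after = $3;           # undef: $3 now belongs to this match
    }
}

# Safer shape: copy the captures into lexicals in one list assignment.
my ($mon, $day, $year) = $data =~ /(\D{3})\s*(\d{1,2})\s*(\d{4})/;
$day = "0$day" if $day =~ /^\d$/;   # later matches cannot touch $year

print 'before: ', $year_before, "\n";
print 'after:  ', defined $year_after ? $year_after : 'undef', "\n";
print 'copied: ', $year, "\n";
```

The capture variables always describe the most recent successful match, so anything you need from them is best copied out before the next regex runs.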

    printf is a cleaner way of generating the string. If you want to assign the result to a variable, use sprintf instead.
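As a rough sketch of the sprintf variant: the helper name sorting_date is hypothetical, and the input list is just the two sample strings from the question. It returns the yyyymmdd key so it can be stored or used directly in a sort.

```perl
use strict;
use warnings;

my %datehash = (
    jan => 1, feb => 2, mar => 3, apr => 4, may => 5, jun => 6,
    jul => 7, aug => 8, sep => 9, oct => 10, nov => 11, dec => 12,
);

# Hypothetical helper: returns a yyyymmdd string, or nothing if the
# input does not look like "mmm d{d} yyyy ...".
sub sorting_date {
    my ($data) = @_;
    my ($mon, $day, $year) = lc($data) =~ /(\D{3})\s*(\d{1,2})\s*(\d{4})/
        or return;
    return unless exists $datehash{$mon};
    return sprintf '%04d%02d%02d', $year, $datehash{$mon}, $day;
}

my @dates = (
    'Mar 3 2007 10:30:31:170PM',
    'Oct 16 2004 11:09:19:943AM',
);

# Fixed-width keys mean a plain string comparison sorts chronologically.
my @sorted = sort { sorting_date($a) cmp sorting_date($b) } @dates;
print "$_\n" for @sorted;
```

For a long list it may be worth computing each key once (for example with a Schwartzian transform) rather than inside the sort block.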

    For future reference it is worth giving a stand alone runnable sample such as this so others can reproduce your problem easily.


    DWIM is Perl's answer to Gödel
      Thanks for the replies, people!
Re: Pattern matching for extracting dates
by Krambambuli (Curate) on Apr 16, 2007 at 11:53 UTC
    Try something like
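    use strict;
    use warnings;

    # Month lookup (not shown in the original post; defined here so the
    # snippet runs under strict). Values are zero-padded month numbers.
    my %date_hash;
    @date_hash{ qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec ) }
        = map { sprintf '%02d', $_ } 1 .. 12;

    my @dates = (
        'Oct 16 2004 11:09:19:943AM',
        'Mar 3 2007 10:30:31:170PM',
    );

    foreach my $date (@dates) {
        my ($month, $day, $year, $time) = split( /\s+/, $date );
        my $sorting_date = $year . $date_hash{$month} . sprintf( '%02d', $day);
        print "$sorting_date\n";
    }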
Re: Pattern matching for extracting dates
by Anno (Deacon) on Apr 16, 2007 at 12:05 UTC
    I have reformatted your code to make it more readable:
    if ($data=~/(\D{3}) *?(\d{1,2}) *?(\d{4})/) {
        $month = $datehash{$1};
        $day=$2;
        if ($day=~/\d{1}/) {
            $day='0'.$day;
        }
        $year=$3;
        $sortingdate=join '', $year, $month, $day;
    }
    Your regular expression is okay, as far as it goes. The error is in the formatting. The test $day=~/\d{1}/ will match every one- or two-digit day specification, so you will always add a zero. Also, do the values of %datehash have leading zeroes where required? Here is how I would re-write your code:
    my %datehash;
    @datehash{ qw( Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) } =
        map sprintf( '%02d', $_), 1 .. 12;
    my $mon_re = join '|', keys %datehash;

    while ( <DATA> ) {
        my ( $mon, $day, $year) = /($mon_re)\s+(\d\d?)\s+(\d{4})/;
        $day = sprintf '%02d', $day;
        print "$year$datehash{ $mon}$day\n";
    }
    __DATA__
    Oct 16 2004 11:09:19:943AM
    Mar 3 2007 10:30:31:170PM
    I have made the regular expression more specific for the month names (since %datehash is already there) and less specific for white space (allowing any, not only blanks).
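The point about the unanchored day test can be checked with a small sketch. Both helper names (pad_unanchored, pad_anchored) are made up for illustration: the first pads the way the original test does, the second only when the whole string is a single digit.

```perl
use strict;
use warnings;

# Pads whenever the string contains a digit anywhere, as /\d{1}/ does.
sub pad_unanchored {
    my ($day) = @_;
    return $day =~ /\d{1}/ ? "0$day" : $day;
}

# Anchored test: pads only when the whole string is a single digit.
sub pad_anchored {
    my ($day) = @_;
    return $day =~ /^\d$/ ? "0$day" : $day;
}

for my $day (qw(3 16)) {
    printf "%-2s unanchored: %-3s anchored: %s\n",
        $day, pad_unanchored($day), pad_anchored($day);
}
```

Using sprintf '%02d' sidesteps the question entirely, since it pads only when padding is needed.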

    Anno

Re: Pattern matching for extracting dates
by Herkum (Parson) on Apr 16, 2007 at 11:47 UTC

    People cannot answer your question because we don't have the exact data that you are working with. You gave an example, but we can never be sure whether you typed it in correctly and simply missed the reason it isn't working.

    As an alternative, why not split the data:

    my ($month_abbrev, $day, $year) = split qr{\s+}, $data, 4; # a limit of 3 would leave the time attached to $year

    This should give you the same information but is easier to understand and should be a little faster than a regexp.
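A fuller sketch of the split approach follows. The %month_num lookup and the sorting_date_split name are assumptions for illustration, not something from the original post. It uses split ' ' (which also skips leading whitespace) and relies on Perl's implicit limit when splitting into a list of variables.

```perl
use strict;
use warnings;

# Hypothetical lookup: month abbreviation => month number.
my %month_num;
@month_num{ qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec) } = 1 .. 12;

# Hypothetical helper: returns yyyymmdd, or nothing for unusable input.
sub sorting_date_split {
    my ($data) = @_;

    # With three variables on the left, split stops after four fields,
    # so the time portion is discarded and $year holds only the year.
    my ($month_abbrev, $day, $year) = split ' ', $data;

    return unless defined $year && exists $month_num{$month_abbrev};
    return sprintf '%04d%02d%02d', $year, $month_num{$month_abbrev}, $day;
}

print sorting_date_split($_), "\n"
    for 'Oct 16 2004 11:09:19:943AM', 'Mar 3 2007 10:30:31:170PM';
```

Note that this version does no validation of the day or year fields beyond checking that they are present; a regex is still the better tool if the input may be malformed.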

Re: Pattern matching for extracting dates
by johngg (Canon) on Apr 16, 2007 at 14:35 UTC
    Other Monks have given you advice to solve your question but I note that you say

    I have a load of date/time strings that have the following format

    Your approach using yyyymmdd as the sorting string may not be adequate if more than one of those date/time strings falls on the same day. If this is likely to be a problem you should consider transforming the date/time into an epoch time value using the timelocal() subroutine in the Time::Local module. This can then be sorted numerically.
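A minimal sketch of that idea, assuming the timestamp layout shown in the question. The helper name epoch_from_stamp and the %month_index lookup are hypothetical, and the trailing three digits are treated as milliseconds. Note that timelocal() expects a 0-based month and interprets the time in the local time zone.

```perl
use strict;
use warnings;
use Time::Local qw(timelocal);

# timelocal() wants months numbered 0 .. 11.
my %month_index;
@month_index{ qw(jan feb mar apr may jun jul aug sep oct nov dec) } = 0 .. 11;

# Hypothetical helper: epoch seconds (with fractional milliseconds),
# or nothing if the string does not match the expected layout.
sub epoch_from_stamp {
    my ($stamp) = @_;
    my ($mon, $mday, $year, $hour, $min, $sec, $ms, $ampm) =
        lc($stamp) =~ /^\s* ([a-z]{3}) \s+ (\d{1,2}) \s+ (\d{4}) \s+
                       (\d{1,2}) : (\d{2}) : (\d{2}) : (\d{3}) \s* (am|pm)/x
        or return;
    return unless exists $month_index{$mon};

    $hour %= 12;                    # 12AM and 12PM both become 0 here
    $hour += 12 if $ampm eq 'pm';   # then PM hours are shifted to 12 .. 23

    return timelocal($sec, $min, $hour, $mday, $month_index{$mon}, $year)
         + $ms / 1000;
}

my @stamps = (
    'Mar 3 2007 10:30:31:170PM',
    'Oct 16 2004 11:09:19:943AM',
);
my @sorted = sort { epoch_from_stamp($a) <=> epoch_from_stamp($b) } @stamps;
print "$_\n" for @sorted;
```

timelocal() croaks on impossible dates, so wrapping the call in eval may be worthwhile if the input is not trusted.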

    I hope this is of use.

    Cheers,

    JohnGG

      Thanks very much for your reply, JohnGG. The strings are generated once per day, so I should be OK, but thanks for your time anyway. It amazes me how quickly people come back with useful, accurate answers. Probably because I'm a Perl newb...