PerlGrrl has asked for the wisdom of the Perl Monks concerning the following question:

I've been trying to find date strings of different formats using regexes, but it seems really cumbersome, and almost illegible. Any suggestions as to how I can tidy this code up without skipping any strings? Also, would like to trim the comma char...not sure how. Can anyone help? My data is a series of large text files containing phone numbers, names, post codes, long sentences...
while(<FH1>) {
    if #(/(19\d{2})/gi)
    (
        (/(\d{1,2}\-\d{1,2}\s\w+\s(19\d{2}))/gi) ||
        (/(\d{1,2}\w{2}\s\w+\s(19\d{2}))/gi) ||
        (/(\d{1,2}\s\w+\s(19\d{2}))/gi) ||
        (/(\w+\s\d{1,2},\s(19\d{2}))/gi) ||
        (/(\w+\s\d{1,2}\-\d{1,2},\s(19\d{2}))/gi)
    )
    {
        print OUTFILE ("$filename\t $1\t \n");
    }
}

Re: My code seems messy...
by Limbic~Region (Chancellor) on May 13, 2006 at 15:18 UTC
    PerlGrrl,
    There are many ways to make your code look more pristine. They almost always involve trade-offs. One example would be to hide all your regexes in a subroutine or an object. Either approach could be separated further by having a master subroutine/method that calls each regex-specific routine in sequence, stopping at the first match. This allows you to add new matches relatively easily.
    while ( <$fh> ) {
        my ($yr, $mon, $day) = parse_date($_);
        next if ! defined $yr;
        # ...
    }
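    A minimal sketch of what such a parse_date could look like, assuming English month names and only two of the formats discussed in this thread. The helper names (_month_day_year, _day_month_year, @PARSERS, %MONTH_NUM) are hypothetical and chosen only for illustration:

```perl
use strict;
use warnings;

# Assumed lookup table: three-letter English month abbreviations to numbers.
my %MONTH_NUM;
my @names = qw(jan feb mar apr may jun jul aug sep oct nov dec);
@MONTH_NUM{@names} = (1 .. 12);

# Each routine knows exactly one format and returns (year, month, day),
# or an empty list when its format is not present.
sub _month_day_year {    # e.g. "April 15, 1994"
    my ($name, $day, $yr) =
        $_[0] =~ /\b([A-Za-z]{3})[a-z]*\s+(\d{1,2}),\s+(\d{4})\b/
        or return;
    my $mon = $MONTH_NUM{ lc $name } or return;
    return ($yr, $mon, $day);
}

sub _day_month_year {    # e.g. "11 February 1994", "16th September 1994"
    my ($day, $name, $yr) =
        $_[0] =~ /\b(\d{1,2})(?:st|nd|rd|th)?\s+([A-Za-z]{3})[a-z]*\s+(\d{4})\b/
        or return;
    my $mon = $MONTH_NUM{ lc $name } or return;
    return ($yr, $mon, $day);
}

# Supporting a new format means writing one more routine and listing it here.
my @PARSERS = (\&_month_day_year, \&_day_month_year);

# Master routine: try each parser in order and stop at the first match.
sub parse_date {
    my ($text) = @_;
    for my $parser (@PARSERS) {
        my @ymd = $parser->($text);
        return @ymd if @ymd;
    }
    return;
}
```

    Because parse_date returns an empty list when nothing matches, $yr stays undefined in the loop above and the line is skipped. The sketch only looks at the first candidate per routine, so a line like "Room 15, 1994" is rejected rather than searched further.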
    Additionally, you probably want to have a look at Date::Manip's 'ParseDate', Date::Calc's 'Parse_Date', Date::Parse, and Regexp::Common::time for ideas on improving or expanding your matching criteria (possibly as a replacement).

    Cheers - L~R

Re: My code seems messy...
by davidrw (Prior) on May 13, 2006 at 14:53 UTC
    Two general comments: use code comments, e.g. || /foo\d+/ # match foo1234, and also look at the /x modifier in perlre for embedding whitespace and comments inside the regex itself.
    Also, the parens are unnecessary in (/foo/) and (19\d{2}).
    Here's an (untested) crack at it (also restructuring into one RE, since there are commonalities):
    if( /(                            # start re & capture
          (?:                         # non-capturing OR clause for the following cases:
              \d{1,2}\-\d{1,2}\s\w+   # "03-25 X"
            | \d{1,2}\w{2}\s\w+       # "03aa X"
            | \d{1,2}\s\w+            # "12 X"
            | \w+\s\d{1,2},           # "X 12,"
            | \w+\s\d{1,2}\-\d{1,2},  # "X 12-20,"
          )
          \s19\d{2}                   # the <spaceChar>19xx marker
        )/gxi
    ){
        ...
    }
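    On the comma question from the original post: one possible approach, which nobody in this thread proposed, is to delete the commas from the captured text after the match instead of complicating the regex. A minimal sketch, where strip_commas is a hypothetical helper name:

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of the matched text with every comma removed.
sub strip_commas {
    my ($match) = @_;
    (my $clean = $match) =~ tr/,//d;    # tr with /d deletes the listed characters
    return $clean;
}

print strip_commas("September 21-23, 1994"), "\n";
```

    Working on a copy leaves $1 untouched, so the original text is still available if it is needed later.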
Re: My code seems messy...
by dsheroh (Monsignor) on May 13, 2006 at 15:17 UTC
    Have you considered trying Date::Parse from CPAN? I don't know whether it supports all date formats that you might encounter, but it does handle several.

    (Hmm... I thought perlmonks autolinked references to CPAN modules, but I guess not... What has to be done to get that?)

        Aha! So that's the syntax... I found a reference to the cpan://Date::Parse notation, but missed that it needed brackets around it. Thanks!
Re: My code seems messy...
by Miguel (Friar) on May 13, 2006 at 14:05 UTC
    There are so many date formats around the world that it would be easier for us to help you if you posted some sample lines from your files.
Re: My code seems messy...
by perl-diddler (Chaplain) on May 13, 2006 at 23:45 UTC
    Sorry to be dense, but what types of date strings are you trying to match? I.e.:
    Recognize formats: mm-dd-yy, mm-dd-yyyy, mm/dd/yy, MMM-dd, yyyy, etc...?

    Are you searching for date strings that may be embedded anywhere in a line, or do you know where the date field begins and only need to parse it? Do you need any date validity checking?

    Would the ParseDate routine from Date::Manip (on CPAN) be useful?

Re: My code seems messy...
by moklevat (Priest) on May 13, 2006 at 20:24 UTC
    If you want to go beyond davidrw's fine suggestions, I would recommend you take a look at TheDamian's book Perl Best Practices. It has excellent suggestions for writing readable and maintainable code.
      My text files include dates in string format such as the following:
      ...the next social club meeting is on April 15, 1994...

      ...September 21-23, 1994 we will be hosting visitors...

      ...submissions should be made by 11 February 1994...

      ...Mail sent 7 Feb 1994...

      ...On the 16th September 1994 Mr X will be giving a talk on ...

      ...unconfirmed conference dates are March 4, 5 and 6, 1994...

      (...is MY emphasis - for clarity's sake)
      I have absolutely no knowledge of the surrounding boundaries, nor of where the strings can occur in the text; I just have to pull them all out so that I can then do some date normalisation. Also, I've found that some of my patterns sometimes match purely numeric strings...
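      A likely reason for the numeric matches is that \w matches digits (and underscores) as well as letters, so \w+ in the month position accepts "5550" just as readily as "Feb". A minimal sketch of the difference, assuming English month names; the $MONTH, $loose and $strict names and the first sample line are made up for illustration:

```perl
use strict;
use warnings;

# Assumed month list: English names and their three-letter abbreviations.
my $MONTH = qr/(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?
               |Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)/x;

my $loose  = qr/(\d{1,2}\s\w+\s19\d{2})/;          # \w+ also matches digits
my $strict = qr/\b(\d{1,2}\s$MONTH\s19\d{2})\b/;   # only a real month name will do

for my $line ("ring 01 5550 1994 today", "Mail sent 7 Feb 1994") {
    printf "%-26s loose=%-3s strict=%s\n", $line,
        ($line =~ $loose  ? "yes" : "no"),
        ($line =~ $strict ? "yes" : "no");
}
```

      The loose pattern accepts both lines, while the strict one accepts only the line that contains a month name.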

        Here's my attempt. I've used two regexes for the general cases of day month and month day.

        #!/usr/bin/perl
        use strict;
        use warnings;

        while (<DATA>){
            if (
                # month day
                /
                (
                [JFMASOND][a-z]{2,8}    # full or 3 letter month name
                \s
                [123]?\d                # day number
                (?:-[123]?\d)?          # optional dash and day number
                (?:,\s[123]?\d\s)*      # optional list of day numbers
                (?:and\s)?              # optional and
                (?:[123]?\d,?\s)?       # optional end of list
                (?:19|20)\d{2}          # year starting 19 or 20
                )
                /x
            ) {
                print "1: $1\n";
            }
            elsif (
                # day month
                /
                (
                [123]?\d                # day number
                (?:st|nd|rd|th)?        # optional st, nd etc.
                \s+
                [JFMASOND][a-z]{2,8}    # month name
                \s+
                (?:19|20)\d{2}          # year starting 19 or 20
                )
                /x
            ) {
                print "2: $1\n";
            }
        }

        __DATA__
        ...the next social club meeting is on April 15, 1994...
        ...September 21-23, 1994 we will be hosting visitors...
        ...submissions should be made by 11 February 1994...
        ...Mail sent 7 Feb 1994...
        ...On the 16th September 1994 Mr X will be giving a talk on ...
        ...unconfirmed conference dates are March 4, 5 and 6, 1994...
        output:
        ---------- Capture Output ----------
        > "C:\Perl\bin\perl.exe" _new.pl
        1: April 15, 1994
        1: September 21-23, 1994
        2: 11 February 1994
        2: 7 Feb 1994
        2: 16th September 1994
        1: March 4, 5 and 6, 1994
        > Terminated with exit code 0.
        Here I start by checking if there's a valid month name in the string. If there is, I start extracting the dates from the string; otherwise I skip that line.
        Note that I had to add a space after the first '...' because '...September' is not a valid month name.
        Updated to return valid dates
        #!/usr/bin/perl
        # filename: extract_dates.pl
        use strict;
        use warnings;
        use Data::Dumper;
        use Date::Manip;
        $Data::Dumper::Indent = 1;

        my @dates;
        push @dates, findMonth($_) while (<DATA>);

        print Dumper \@dates; # or do something else with your dates

        # Look for a month name among the space-separated words of a line.
        sub findMonth {
            my @words = split / /,shift;
            my %months = map {$_ => 1 } qw/
                january jan february feb march mar
                april apr may june jun
                july jul august aug september sep
                october oct november nov december dec
            /;
            foreach (@words) {
                if (exists $months{lc($_)} ) {
                    return extractDate( { MONTH => $_, STRING => "@words" } );
                }
            }
            return;
        }

        # Capture the text around the month name that looks like a date.
        sub extractDate {
            my $self = shift;
            return makeValidDate( {STRING=>$self->{STRING},DATE=>$1} )
                if ($self->{STRING} =~ /
                    (
                      (?:
                        ( [123]?\d (st|nd|rd|th)? \s+ )?
                        $self->{MONTH}
                        \s\d{1,4}
                        (
                          (-[123]?\d)?
                          (,\s[123]?\d\s)*
                          (and\s\d+)?
                          (,\s\d{4})?
                        )?
                      )
                    )
                    /x );
            return;
        }

        # Split the captured text into days, month and year, then normalise with Date::Manip.
        sub makeValidDate {
            my $self = shift;
            my ($string) = $self->{STRING} =~/^(.+)$/;

            $self->{DATE} =~s/-/X/g;    # mark day ranges: "21-23" becomes "21X23"
            # Blank out ordinal suffixes, "and" and punctuation. Suffixes are only
            # removed after a digit, so the "st" in "August" is left alone.
            $self->{DATE} =~s/(?<=\d)(?:st|nd|rd|th)\b|\band\b|\W/ /gi;

            my @date = split /\s+/,$self->{DATE};
            my $date = {};
            foreach (@date) {
                if    ($_=~/^\d{1,2}$/)   { push @{$date->{days}},$_ }
                elsif ($_=~/^\d{4}$/)     { $date->{year} = $_ }
                elsif ($_=~/(\d+)X(\d+)/) { push @{$date->{days}},$1 .. $2 }
                else                      { $date->{month} = lc($_) }
            }

            my $out_date = {};
            foreach (@{$date->{days}}) {
                my $temp_date = ParseDate( $_ . " " . $date->{month} . " " . $date->{year} );
                $temp_date = UnixDate($temp_date,"%D");
                push @{$out_date->{dates}},$temp_date;
                $out_date->{string} = $string;
            }
            return $out_date;
        }

        __DATA__
        ... the next social club meeting is on April 15, 1994...
        ... September 21-23, 1994 we will be hosting visitors...
        ... submissions should be made by 11 February 1994...
        ... Mail sent 7 Feb 1994...
        ... On the 16th September 1994 Mr X will be giving a talk on ...
        ... unconfirmed conference dates are March 4, 5 and 6, 1994...
        $VAR1 = [
          {
            'dates' => [
              '04/15/94'
            ],
            'string' => '... the next social club meeting is on April 15, 1994...'
          },
          {
            'dates' => [
              '09/21/94',
              '09/22/94',
              '09/23/94'
            ],
            'string' => '... September 21-23, 1994 we will be hosting visitors...'
          },
          {
            'dates' => [
              '02/11/94'
            ],
            'string' => '... submissions should be made by 11 February 1994...'
          },
          {
            'dates' => [
              '02/07/94'
            ],
            'string' => '... Mail sent 7 Feb 1994...'
          },
          {
            'dates' => [
              '09/16/94'
            ],
            'string' => '... On the 16th September 1994 Mr X will be giving a talk on ...'
          },
          {
            'dates' => [
              '03/04/94',
              '03/05/94',
              '03/06/94'
            ],
            'string' => '... unconfirmed conference dates are March 4, 5 and 6, 1994...'
          }
        ];