Perl-chick has asked for the wisdom of the Perl Monks concerning the following question:

I'm new to Perl (and to programming in general, in fact), and I'm trying to write a program that examines text files to determine the different formats in which dates appear and then captures those dates. The results need to be written to a file (one date per line, annotated with the file name). Once that's done (hopefully), I would also like to modify the program so that all the dates are normalised to the form dd-mm-yyyy. If anyone has any suggestions for suitable regexes, links, or snippets (an entire program wouldn't be knocked back either), then I would be a very happy chick. Thanks in advance.

Re: Date extraction
by lhoward (Vicar) on Jun 13, 2000 at 17:11 UTC
    Perl has several good date/time modules: Date::Manip, Time::CTime, Time::ParseDate

    Time::ParseDate can be used to parse many common date formats into a Unix timestamp. Time::CTime can be used to format a Unix timestamp into just about any date format you want.

    Date::Manip is the granddaddy of all date manipulation modules. It has functions to both parse and format dates in many formats. Date::Manip is large, but it will work for dates before 1970 and after 2038 (the limitation of a 32-bit Unix timestamp, and hence of the two Time:: modules mentioned above).

    For extracting the dates from your files, you will probably need to examine the files to see what date formats are used (or in what locations they occur) and then write a regular expression to extract them based on format (or location). Once you have them extracted, you could use the modules above to parse them and convert them to a common date format.
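
    As a rough sketch of that extract-then-normalise idea using only core Perl: the helper name extract_dates and the three formats it recognises (yyyy-mm-dd, d/m/yyyy read day-first, and "13 June 2000") are assumptions made for illustration, not formats known to be in the files. It does no range checking, so something like 99/99/2000 would pass through.

```perl
use strict;
use warnings;

# Three-letter month abbreviations mapped to month numbers.
my %MONTH = (jan => 1, feb => 2, mar => 3, apr => 4,  may => 5,  jun => 6,
             jul => 7, aug => 8, sep => 9, oct => 10, nov => 11, dec => 12);

# Full or abbreviated English month names, case-insensitive.
my $MONTH_RE = qr/(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|Jun(?:e)?|Jul(?:y)?|Aug(?:ust)?|Sep(?:t(?:ember)?)?|Oct(?:ober)?|Nov(?:ember)?|Dec(?:ember)?)/i;

# Hypothetical helper: return every date found in $text as dd-mm-yyyy.
sub extract_dates {
    my ($text) = @_;
    my @found;
    while ($text =~ m{
          \b(\d{4})-(\d{2})-(\d{2})\b              # 2000-06-13
        | \b(\d{1,2})/(\d{1,2})/(\d{4})\b          # 13/06/2000, assumed day first
        | \b(\d{1,2})\s+($MONTH_RE)\s+(\d{4})\b    # 13 June 2000
    }gx) {
        # Only the groups of the alternative that matched are defined.
        my ($d, $m, $y) = defined $1 ? ($3, $2, $1)
                        : defined $4 ? ($4, $5, $6)
                        :              ($7, $MONTH{ lc substr($8, 0, 3) }, $9);
        push @found, sprintf '%02d-%02d-%04d', $d, $m, $y;
    }
    return @found;
}
```

    Calling extract_dates on each line of a file gives a list of normalised dates; new formats would be added as further alternatives in the pattern.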

Re: Date extraction
by brick (Sexton) on Jun 14, 2000 at 07:53 UTC
    Well, 
    
    If I'm reading your question right...you'd want to lock and 
    pop open the file(s) and sift for something like:
    
    [0-3]?[0-9]\s(?:April|Apr|August|...)\s[0-9]{4}\s(?:[aAbB]\.?[cCdD]\.?)*
    My regex isn't awesome...but that should look for chunks that
    begin with zero to one instances of 0-3, followed by a 0-9,
    followed by a space, followed by that list of months {which is
    the one section I'm a lot fuzzy on.} then followed by a space
    and four consecutive digits 0-9 and then followed entirely by 
    that last chunk that looks for zero or more instances of A.D.,
    AD, a.d., ad, B.C., BC, b.c., and bc. {Although I haven't tried
    it, either...I'm trying freehand on it, but it's a start.}
    
    _If_ I'm reading you right.
    
    Then you'd take that matched bit and dump it to your opened
    output file until you had no more input file to munge. Close
    your files and unlock them and you'd be done.
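    
    A minimal sketch of that open/lock/sift/dump loop, assuming the date pattern is passed in as a compiled regex. The helper name dump_dates and the "filename: date" output layout are made up for illustration.

```perl
use strict;
use warnings;
use Fcntl qw(:flock);

# Hypothetical helper: scan each input file with $date_re and append
# "filename: date" lines to $out_path. Returns the number of dates written.
sub dump_dates {
    my ($date_re, $out_path, @in_paths) = @_;
    my $count = 0;

    open my $out, '>>', $out_path or die "Cannot open $out_path: $!";
    flock $out, LOCK_EX or die "Cannot lock $out_path: $!";

    for my $path (@in_paths) {
        open my $in, '<', $path or die "Cannot open $path: $!";
        flock $in, LOCK_SH or die "Cannot lock $path: $!";
        while (my $line = <$in>) {
            # /g in scalar context walks through every match on the line.
            while ($line =~ /($date_re)/g) {
                print {$out} "$path: $1\n";
                $count++;
            }
        }
        close $in;    # closing the handle also releases the lock
    }

    close $out or die "Cannot close $out_path: $!";
    return $count;
}
```

    For example, dump_dates(qr/\d{1,2}\s(?:Apr|Aug)\s\d{4}/, 'dates.txt', @ARGV) would sift every file named on the command line.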
    
    If you knew what date formats to expect, you could fiddle 
    with the regex a bit, ditching the ad/bc part if unnecessary
    or shortening the month listing to either full months or the
    abbreviated ones.
    
    just my shot at it.
    
    -brick.
    
RE: Date extraction
by BigJoe (Curate) on Jun 13, 2000 at 20:08 UTC
    I know I am going to get shunned for not showing you the ways of CPAN, but I have a piece of code that I just copy into any script that needs a time stamp. I know it works on Win NT and Linux.
    @months = ('Jan','Feb','Mar','Apr','May','June','July','Aug','Sept','Oct','Nov','Dec');
    @days = ('Sun','Mon','Tue','Wed','Thu','Fri','Sat');
    ($sec,$min,$hour,$mday,$mon,$year,$wday) = (localtime(time))[0,1,2,3,4,5,6];
    $year = $year + 1900;
    $date = " $days[$wday], $months[$mon]/$mday/$year $hour:$min:$sec";
    With this it is very simple to change the format of the date or test certain parts of the date.
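
    As one possible variation, the same localtime fields can be handed to strftime from the core POSIX module to get the dd-mm-yyyy form asked about in the question. The helper name dd_mm_yyyy is made up for this sketch.

```perl
use strict;
use warnings;
use POSIX qw(strftime);

# Hypothetical helper: format an epoch timestamp as dd-mm-yyyy in local time.
sub dd_mm_yyyy {
    my ($epoch) = @_;
    # localtime in list context supplies the sec, min, hour, mday, mon, year...
    # fields that strftime expects after the format string.
    return strftime('%d-%m-%Y', localtime $epoch);
}
```

    dd_mm_yyyy(time) gives today's date; the format string can be changed to suit other layouts.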

    --Joe
Re: Date extraction
by slayven (Pilgrim) on Jun 13, 2000 at 19:50 UTC
    for a simple script I always use
    $date = `date "+%d-%b-%Y"`;
    instead of loading a module.

    forgive me for preferring shell commands sometimes :)
      For just getting the date and time, I use:

      $date = scalar gmtime;
      No modules or shell commands!