in reply to Parsing email for headers

CountZero and zwon have provided guidance on parsing the headers. This expands on zwon's referral to Date::Parse.

#!/usr/bin/perl
use warnings;
use strict;
use Date::Parse;
# 799107

my $data = '';
while ( <DATA> ) {
    # print $_;
    $data = $_;
    chomp ($data);
    if ($data =~ /^DATE:\s+(\w{3}, \d+ \w{3} \d{4} \d\d:\d\d:\d\d) ([+-]\d{4})/ ) {
        my $date = $1;   # Don't do this; check existence of $1
        my $zone = $2;   # and $2 before you try to use them!
        print "'Date' found: $date which converts to: ";
        my $time = str2time($date);
        print "$time Zone Offset: $zone\n";
        print "\t Restringified: " . localtime($time) . "\n";   # reconvert, solely as a check on above
    }
}

=head OUTPUT

'Date' found: Sat, 9 Feb 2008 17:14:18 which converts to: 1202595258 Zone Offset: -0730
	 Restringified: Sat Feb 9 17:14:18 2008
'Date' found: Sun, 10 Feb 2008 04:23:55 which converts to: 1202635435 Zone Offset: +0400
	 Restringified: Sun Feb 10 04:23:55 2008

=cut

__DATA__
SUBJECT: test
FROM: John Smith
DATE: Sat, 9 Feb 2008 17:14:18 -0730
TO: Joe Doe
additional for demo only
DATE: Sun, 10 Feb 2008 04:23:55 +0400
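Note that the script above captures the zone offset in $2 but hands only $1 to str2time, so the epoch value is computed as if the time were in the local zone. Date::Parse can also be given the full string, zone included. If you would rather see the arithmetic spelled out, here is a minimal sketch that applies the offset by hand using only the core Time::Local module. The helper name rfc_date_to_epoch and the %MONTH table are hypothetical, introduced purely for illustration, and the regex assumes the same date layout as the sample data.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);   # core module

# Month abbreviation => zero-based month number, as timegm expects.
my %MONTH = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper (not part of any module): convert a date such as
# "Sat, 9 Feb 2008 17:14:18 -0730" to epoch seconds, honouring the offset.
# Returns undef (in scalar context) if the string does not match.
sub rfc_date_to_epoch {
    my ($str) = @_;
    my ( $mday, $mon, $year, $h, $m, $s, $sign, $zh, $zm ) =
        $str =~ /(\d{1,2}) (\w{3}) (\d{4}) (\d\d):(\d\d):(\d\d) ([+-])(\d\d)(\d\d)/
        or return;
    return unless exists $MONTH{$mon};

    # Offset east of UTC in seconds; negative for zones west of UTC.
    my $offset = ( $zh * 60 + $zm ) * 60;
    $offset = -$offset if $sign eq '-';

    # Treat the fields as if they were UTC, then subtract the offset.
    return timegm( $s, $m, $h, $mday, $MONTH{$mon}, $year ) - $offset;
}

for my $line ( 'DATE: Sat, 9 Feb 2008 17:14:18 -0730',
               'DATE: Sun, 10 Feb 2008 04:23:55 +0400' ) {
    my $epoch = rfc_date_to_epoch($line);
    print "$line => $epoch (", scalar gmtime($epoch), " UTC)\n";
}
```

Because the result no longer depends on the machine's local zone, two messages sent from different zones can be compared or sorted directly by their epoch values.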

You could also roll your own (not recommended) if, for some reason, you want to avoid using an email module. Read on:

/me didn't bother to create a set of .eml files to read. Reading from files rather than from __DATA__ is left as an exercise to the OP.

#!/usr/bin/perl
use warnings;
use strict;
use Date::Parse;
# 799107

my $data = '';
while ( <DATA> ) {
    # print $_;
    $data = $_;
    chomp ($data);
    if ($data =~ /(^SUBJECT: .*)/) {
        print;
    }
    if ($data =~ /(^FROM: .*)/) {
        print;
    }
    if ($data =~ /(\w{3}, \d+ \w{3} \d{4} \d\d:\d\d:\d\d) ([+-]\d{4})/ ) {
        my $date = $1;   # Don't do this; check existence of $1
        my $zone = $2;   # and $2 before you try to use them!
        print "'Date' found: $date which converts to: ";
        my $time = str2time($date);
        print "$time Zone Offset: $zone\n";
        print "\t Restringified: " . localtime($time) . "\n";   # reconvert, solely as a check on above
    }
    if ($data =~ /Message-ID/ix) {
        print "'ID' found: $data \n";
    }
    if ($data =~ /(^TO: .*)/) {
        print;
        # last;   # uncomment to make "the script ... quit reading the mail" after the "TO:" field
    }
    if ($data =~ /-{4}=_Part_.*|-{4}_{0,1}={0,1}_{0,1}NextPart_{0,1}.*/x ) {
        print "'Part' header found: $data\n";
    }
}

=head OUTPUT

SUBJECT: test
FROM: John Smith
'Date' found: Sat, 9 Feb 2008 17:14:18 which converts to: 1202595258 Zone Offset: -0730
	 Restringified: Sat Feb 9 17:14:18 2008
TO: Joe Doe
'ID' found: Message-ID <F6E1D1E016C6A7468EEA1708CA24F72B1E363E@SERVER.fake.local>
'ID' found: Message-Id <6A7468EEA17F6E1D1E016C08CA24F72B1E363E@SERVER.fake.com>
'Part' header found: ----=_Part_abcd
'Part' header found: ----_=_NextPart_1234
'Part' header found: ----NextPartXYZ

=cut

__DATA__
SUBJECT: test
FROM: John Smith
DATE: Sat, 9 Feb 2008 17:14:18 -0730
TO: Joe Doe
Message-ID <F6E1D1E016C6A7468EEA1708CA24F72B1E363E@SERVER.fake.local>
Message-Id <6A7468EEA17F6E1D1E016C08CA24F72B1E363E@SERVER.fake.com>
----=_Part_abcd
----_=_NextPart_1234
----NextPartXYZ
Text of message here.

Your post seems to be a bit conflicted: you say you need only the first four fields, but then discuss the variance in the "...Part..." as if it were an issue. Since you haven't provided any guidance on how you may want to use them, the above merely demonstrates the use of a case-insensitive regex.
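As for the exercise of reading from a file rather than from __DATA__, here is one possible sketch, not a definitive implementation. It reads header lines from any filehandle, stops at the first blank line (so the body is never scanned), lower-cases field names so that "Message-ID" and "Message-Id" land under the same key, and joins folded continuation lines. The helper name read_headers is hypothetical, and the sample message assumes "Name: value" lines with a colon, which the Message-ID lines in the demo data above do not have.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read "Name: value" lines from a filehandle up to the
# first blank line and return a hashref keyed by lower-cased field name.
# Only the first occurrence of each field is kept.
sub read_headers {
    my ($fh) = @_;
    my ( %header, $last );
    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;            # strip LF or CRLF line ending
        last if $line eq '';             # blank line ends the header block
        if ( defined $last && $line =~ /^\s+(.*)/ ) {
            $header{$last} .= " $1";     # folded continuation of previous field
            next;
        }
        if ( $line =~ /^([^:\s]+):\s*(.*)/ ) {
            my $name = lc $1;
            if ( exists $header{$name} ) {
                undef $last;             # ignore repeats and their continuations
                next;
            }
            $header{$name} = $2;
            $last = $name;
        }
    }
    return \%header;
}

# Demo on an in-memory "file"; for a real message, open the .eml path instead:
#   open my $fh, '<', $path or die "Cannot open $path: $!";
my $mail = join "\n",
    'SUBJECT: test',
    'FROM: John Smith',
    'DATE: Sat, 9 Feb 2008 17:14:18 -0730',
    'TO: Joe Doe',
    'Message-Id: <6A7468EEA17F6E1D1E016C08CA24F72B1E363E@SERVER.fake.com>',
    '',
    'Text of message here.',
    '';
open my $fh, '<', \$mail or die "Cannot open in-memory file: $!";
my $header = read_headers($fh);
close $fh;

for my $name (qw(subject from date to message-id)) {
    printf "%-10s => %s\n", $name, $header->{$name} // '(missing)';
}
```

Stopping at the blank line is what makes the commented-out "last" in the script above unnecessary: the loop quits on its own once the header block ends, whatever order the fields arrive in.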

And, just BTW, you probably meant "disparate" rather than "desparate."
  :-)

Re^2: Parsing email for headers
by PoorLuzer (Beadle) on Oct 05, 2009 at 05:01 UTC
    Woops! "disparate" it should be :blush: but maybe it goes to show you the state of mind I was in when I posted the thread :-D

    Well, to answer some of my own questions :

    1. MIME::Parser is way too heavy for this purpose. You have to call $parser->filer->purge(); to delete all the files created from each of the mails, which really seems too much to "just read 4 fields from an email header".

    MIME::Head however seems to fit the bill very well :

    use MIME::Head;

    my $head = MIME::Head->read( \*FILE );
    # TODO : Does it read the WHOLE email or skips the remaining mail after reading the header?
    $head->unfold;
    # Was a "Subject:" field given?
    # $subject_was_given = $head->count('subject');
    print $head->get('subject');
    print $head->get('Message-ID');
    print $head->get('from');
    print $head->get('date');

    2. I would appreciate some answers to this.

    Of course missing IDs will be logged and error handling done, but I was wondering if there are any servers with such known behaviour.

    3. This works just dandy :

    use Date::Manip;
    Date_Init("ConvTZ=IGNORE","TZ=GMT");
    my $date = UnixDate( $head->get('date') , '%Y_%m_%Q-%H%M%S');
    print $date;

    This can convert, e.g., "Sat, 9 Feb 2008 17:04:08 EET" to "2008_02_20080209-170408"
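If pulling in Date::Manip is not an option, a similar stamp can be built with the core POSIX module. This is only a minimal sketch, assuming an epoch value is already at hand (for example from str2time). The helper name epoch_to_stamp is hypothetical, the time is rendered in UTC, and the format uses %d for the day of the month, so the result is not identical to the %Q-based pattern above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);   # core module

# Hypothetical helper: format epoch seconds as a UTC stamp of the form
# year_month_day-HHMMSS, suitable for use in a file name.
sub epoch_to_stamp {
    my ($epoch) = @_;
    return strftime( '%Y_%m_%d-%H%M%S', gmtime $epoch );
}

# 1202604258 is 10 Feb 2008 00:44:18 UTC.
print epoch_to_stamp(1202604258), "\n";
```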

    Thanks guys for the great insights! Further tips/tricks/etc are welcome though.