Re: Parsing email for headers

CountZero and zwon have provided guidance on parsing the headers. This expands on zwon's referral to Date::Parse.

#!/usr/bin/perl
use warnings;
use strict;
use Date::Parse;

# 799107 

my $data = '';

while ( <DATA> ) {
    # print $_;
    $data = $_;
    chomp ($data);
    if ($data =~ /^DATE:\s+(\w{3}, \d+ \w{3} \d{4} \d\d:\d\d:\d\d) ([+-]\d{4})/ ) {
        my $date = $1;       # Don't do this; check existence of $1 
        my $zone = $2;         # and $2 before you try to use them!
        print "'Date' found: $date which converts to: ";
        my $time = str2time($date);
        print "$time Zone Offset: $zone\n";
        print "\t Restringified: " . localtime($time) . "\n"; # reconvert, solely as a check on above
    }
}

=head1 OUTPUT
'Date' found: Sat, 9 Feb 2008 17:14:18 which converts to: 1202595258 Zone Offset: -0730
         Restringified: Sat Feb  9 17:14:18 2008
'Date' found: Sun, 10 Feb 2008 04:23:55 which converts to: 1202635435 Zone Offset: +0400
         Restringified: Sun Feb 10 04:23:55 2008
=cut

__DATA__
SUBJECT: test
FROM: John Smith
DATE: Sat, 9 Feb 2008 17:14:18 -0730
TO: Joe Doe
additional for demo only
DATE:  Sun, 10 Feb 2008 04:23:55 +0400
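One thing to watch in the script above: str2time is handed only $1, without the zone, so the epoch is computed in whatever timezone the machine happens to run in. Passing the full string including the offset to str2time should take care of that. Below is a minimal sketch of the same idea using only core modules: it captures in list context (so the variables are set only when the match succeeds) and folds the offset into the epoch. The names parse_date_line and %MONTH are made up for illustration and are not part of Date::Parse.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Month abbreviations as they appear in mail dates (0-based for timegm).
my %MONTH = (
    Jan => 0, Feb => 1, Mar => 2, Apr => 3, May => 4,  Jun => 5,
    Jul => 6, Aug => 7, Sep => 8, Oct => 9, Nov => 10, Dec => 11,
);

# Hypothetical helper: returns (epoch, zone) for a "DATE:" line,
# or an empty list if the line does not look like one.
sub parse_date_line {
    my ($line) = @_;

    # List-context match: nothing is assigned unless the regex matched.
    my ($mday, $mon, $year, $h, $m, $s, $sign, $zh, $zm) = $line =~ m{
        ^DATE: \s+
        \w{3}, \s+ (\d{1,2}) \s+ (\w{3}) \s+ (\d{4}) \s+
        (\d\d) : (\d\d) : (\d\d) \s+
        ([+-]) (\d\d) (\d\d)
    }xi or return;

    my $mon_idx = $MONTH{ ucfirst(lc $mon) };
    return unless defined $mon_idx;

    # Treat the wall-clock fields as UTC, then take the offset back out:
    # "-0730" means the sender's clock was 7h30m behind UTC.
    my $offset = ($zh * 3600 + $zm * 60) * ($sign eq '-' ? -1 : 1);
    my $utc    = eval { timegm($s, $m, $h, $mday, $mon_idx, $year) };
    return unless defined $utc;    # timegm dies on impossible dates

    return ($utc - $offset, "$sign$zh$zm");
}

for my $line (
    'DATE: Sat, 9 Feb 2008 17:14:18 -0730',
    'DATE:  Sun, 10 Feb 2008 04:23:55 +0400',
    'TO: Joe Doe',
) {
    if ( my ($epoch, $zone) = parse_date_line($line) ) {
        print "$epoch (zone $zone) = ", scalar gmtime($epoch), " UTC\n";
    }
    else {
        print "no date in: $line\n";
    }
}
```

With the offset applied, the result no longer depends on the local timezone of the box running the script.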

You could also roll your own if, for some reason, you want to avoid using an email module (not recommended). Read on:

/me didn't bother to create a set of .eml files to read. Reading from files rather than from __DATA__ is left as an exercise for the OP.

#!/usr/bin/perl
use warnings;
use strict;
use Date::Parse;

# 799107 

my $data = '';

while ( <DATA> ) {
    # print $_;
    $data = $_;
    chomp ($data);
    if ($data =~ /(^SUBJECT: .*)/) {
        print;
    }
    if ($data =~ /(^FROM: .*)/) {
        print;
    }
    if ($data =~ /(\w{3}, \d+ \w{3} \d{4} \d\d:\d\d:\d\d) ([+-]\d{4})/ ) {
        my $date = $1;       # Don't do this; check existence of $1 
        my $zone = $2;         # and $2 before you try to use them!
        print "'Date' found: $date which converts to: ";
        my $time = str2time($date);
        print "$time Zone Offset: $zone\n";
        print "\t Restringified: " . localtime($time) . "\n"; # reconvert, solely as a check on above
    }

    if ($data =~ /Message-ID/ix) {
        print "'ID' found:  $data \n";
    }
    if ($data =~ /(^TO: .*)/) {
        print;
        # last;     # uncomment to make "the script ... quit reading the mail" after the "TO:" field
    }
    if ($data =~ /-{4}=_Part_.*|-{4}_{0,1}={0,1}_{0,1}NextPart_{0,1}.*/x ) {
        print "'Part' header found: $data\n";
    }
}

=head1 OUTPUT
SUBJECT: test
FROM: John Smith
'Date' found: Sat, 9 Feb 2008 17:14:18 which converts to: 1202595258 Zone Offset: -0730
         Restringified: Sat Feb  9 17:14:18 2008
TO: Joe Doe
'ID' found:  Message-ID <F6E1D1E016C6A7468EEA1708CA24F72B1E363E@SERVER.fake.local>
'ID' found:  Message-Id <6A7468EEA17F6E1D1E016C08CA24F72B1E363E@SERVER.fake.com>
'Part' header found: ----=_Part_abcd
'Part' header found: ----_=_NextPart_1234
'Part' header found: ----NextPartXYZ
=cut

__DATA__
SUBJECT: test
FROM: John Smith
DATE: Sat, 9 Feb 2008 17:14:18 -0730
TO: Joe Doe
Message-ID <F6E1D1E016C6A7468EEA1708CA24F72B1E363E@SERVER.fake.local>
Message-Id <6A7468EEA17F6E1D1E016C08CA24F72B1E363E@SERVER.fake.com>
----=_Part_abcd
----_=_NextPart_1234
----NextPartXYZ
Text of message here.
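If the goal really is just a handful of fields, another way to roll your own is to collect the header block into a hash and stop at the first blank line, which is where the body starts. The sketch below assumes real-world "Name: value" headers (the sample data above omits the colon after Message-ID) and lower-cases the field names so that "Message-ID" and "Message-Id" end up under the same key. read_headers is a made-up helper name, and a repeated field simply overwrites the earlier one.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: read header lines from a filehandle into a hash
# keyed by lower-cased field name, stopping at the first blank line.
sub read_headers {
    my ($fh) = @_;
    my (%header, $last);

    while ( my $line = <$fh> ) {
        $line =~ s/\r?\n\z//;          # strip LF or CRLF line endings
        last if $line eq '';           # blank line ends the header block

        if ( defined $last && $line =~ /^\s+(.*)/ ) {
            $header{$last} .= " $1";   # folded continuation of previous field
        }
        elsif ( $line =~ /^([^\s:]+):\s*(.*)/ ) {
            $last = lc $1;             # case-insensitive field names
            $header{$last} = $2;       # a repeated field overwrites the first
        }
    }
    return %header;
}

# Demo on an in-memory "file" so the sketch runs without any .eml files.
my $mail = join "\n",
    'SUBJECT: test',
    'FROM: John Smith',
    'DATE: Sat, 9 Feb 2008 17:14:18 -0730',
    'TO: Joe Doe',
    'Message-Id: <6A7468EEA17F6E1D1E016C08CA24F72B1E363E@SERVER.fake.com>',
    '',
    'Text of message here.',
    'FROM: not a header, this line is body text',
    '';

open my $fh, '<', \$mail or die "cannot open in-memory file: $!";
my %h = read_headers($fh);
close $fh;

for my $field (qw(subject from date to message-id)) {
    printf "%-10s => %s\n", $field, $h{$field} // '(missing)';
}
```

Because the loop ends at the blank line, a "FROM:" that happens to appear in the body is never looked at, which the line-by-line regex approach cannot guarantee.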

Your post seems to be a bit conflicted: you say you need only the first four fields, but then discuss the variance in the "...Part..." lines as if it were an issue. Since you haven't provided any guidance on how you may want to use them, the above merely demonstrates use of a case-insensitive regex.

And, just BTW, you probably meant "disparate" rather than "desparate."
:-)


Replies are listed 'Best First'.
Re^2: Parsing email for headers
by PoorLuzer (Beadle) on Oct 05, 2009 at 05:01 UTC

Woops! "disparate" it should be :blush: but maybe it goes to show you the state of mind I was in when I posted the thread :-D

Well, to answer some of my own questions:

1. MIME::Parser is way too heavy for this purpose: you have to call $parser->filer->purge(); to delete all the files created from each of the mails. That really seems too much to "just read 4 fields from an email header".

MIME::Head, however, seems to fit the bill very well:

my $head = MIME::Head->read( \*FILE ); # TODO: Does it read the WHOLE email, or skip the rest of the mail after reading the header?
$head->unfold;

# Was a "Subject:" field given?
# $subject_was_given = $head->count('subject');

print $head->get('subject');
print $head->get('Message-ID');
print $head->get('from');
print $head->get('date');

2. I would appreciate some answers to this.

Of course missing IDs will be logged and error handling done, but I was wondering if there are any servers with such known behaviour.

3. This works just dandy:

use Date::Manip;
Date_Init("ConvTZ=IGNORE","TZ=GMT");

my $date = UnixDate( $head->get('date') , '%Y_%m_%Q-%H%M%S');
print $date;

This can convert, e.g., "Sat, 9 Feb 2008 17:04:08 EET" to "2008_02_20080209-170408".

Thanks guys for the great insights! Further tips/tricks/etc. are welcome though.
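A note on the format string in point 3: as far as I know, %Q in Date::Manip expands to the whole date as YYYYMMDD, which would explain why the year and month show up twice in "2008_02_20080209-170408". If a plain year_month_day stamp is what is wanted, here is a minimal core-only sketch that, like ConvTZ=IGNORE, keeps the wall-clock fields and ignores the zone. date_stamp and %MONTH_NUM are made-up names for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Month abbreviation => month number, for building the stamp.
my %MONTH_NUM = (
    Jan => 1, Feb => 2, Mar => 3, Apr => 4,  May => 5,  Jun => 6,
    Jul => 7, Aug => 8, Sep => 9, Oct => 10, Nov => 11, Dec => 12,
);

# Hypothetical helper: turn a Date header value into YYYY_MM_DD-HHMMSS.
# The timezone is deliberately ignored; returns nothing if parsing fails.
sub date_stamp {
    my ($value) = @_;
    return unless defined $value;

    my ($mday, $mon, $year, $h, $m, $s) = $value =~ m{
        (\d{1,2}) \s+ ([A-Za-z]{3}) \s+ (\d{4}) \s+
        (\d\d) : (\d\d) : (\d\d)
    }x or return;

    my $mon_num = $MONTH_NUM{ ucfirst(lc $mon) } or return;

    return sprintf '%04d_%02d_%02d-%02d%02d%02d',
        $year, $mon_num, $mday, $h, $m, $s;
}

print date_stamp('Sat, 9 Feb 2008 17:04:08 EET'), "\n";
```

The same value could be fed straight from $head->get('date'), after a chomp if the header value still carries its newline.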
