PoorLuzer has asked for the wisdom of the Perl Monks concerning the following question:

How can I know where the header of an email starts and ends?

I am using a Python based tool called GMailBackup to archive my GMail locally.

This tool downloads all mails as a whole (headers and all) and stores them as .eml files

In order to overcome some limitations in the source code, I need to parse these .eml files in order to grab these 4 fields:

  1. Date
  2. From
  3. Subject
  4. Message-ID

Seemed like a simple slurp and parse operation to me until I ran the program on .emls from desparate sources.

It's a nightmare. Issues range from trivial differences (for example, the Outlook desktop client seems to send the field as "Message-ID", but the web client sends it as "Message-Id") to larger ones, such as how different mail servers mark the boundaries between headers and body.

For example, GMail and other email servers separate the header using "----=_Part_".

However, some M$ Servers seem to use "----_=_NextPart_", and others "----NextPart" and so on.

I have three questions :

1. Is there some module/subroutine/script that I can use to parse these 4 fields reliably from raw mails? The mails can be long (even some hundreds of MBs) and so the script should quit reading the mail as soon as these values are found from the header.

2. Is there any possibility that a "Message-ID" is not part of the mail header? I have not come across any such email in the 4GB of mail I have downloaded so far, but are there any misbehaving servers we should be aware of?

This ID is used to keep track of which mail has already been downloaded etc - a sort of a unique identifier for every email.

3. I would like to parse the "Date" field that seems to be universally in the strftime format, like "Sat, 9 Feb 2008 17:14:18 -0730"

I tried to use

    if ( ( $year, $month, $day ) = Date::Calc::Parse_Date("Sat, 9 Feb 2008 17:14:18 -0730") ) {
        printf "\n[*] %d %d %d", $year, $month, $day;
    }

but it fails.

I need to convert something like "Sat, 9 Feb 2008 17:14:18 -0730" to "20080209171418"

Re: Parsing email for headers
by CountZero (Bishop) on Oct 04, 2009 at 16:45 UTC
    Did you already have a look at Email::MIME and more specifically Email::MIME::Header? Or perhaps Email::Simple which has a header_pairs method that returns a list of pairs describing the contents of the header. Every other value, starting with and including zeroth, is a header name and the value following it is the header value.

    If you are bothered by differences in upper-case/lower-case use of the headers, why don't you normalize all header names extracted from the email to lower-case and check against the lower-cased name of the header?
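
    The lower-casing idea can be sketched without any mail module. This is only an illustration: index_headers is a hypothetical helper name, and the sample list is made-up data in the same flat name/value shape that header_pairs is described as returning.

```perl
use strict;
use warnings;

# Turn a flat (name, value, name, value, ...) list into a hash keyed by
# the lower-cased header name. Repeated headers (e.g. Received) are
# collected into an array reference so nothing is lost.
sub index_headers {
    my @pairs = @_;
    my %by_name;
    while ( my ( $name, $value ) = splice @pairs, 0, 2 ) {
        push @{ $by_name{ lc $name } }, $value;
    }
    return \%by_name;
}

# Hypothetical sample data standing in for a parsed header.
my @pairs = (
    'Message-Id' => '<abc@example.com>',
    'SUBJECT'    => 'test',
    'Received'   => 'from a.example.com',
    'Received'   => 'from b.example.com',
);

my $headers = index_headers(@pairs);
print "message-id: $headers->{'message-id'}[0]\n";
print "received count: ", scalar @{ $headers->{received} }, "\n";
```

    With the names folded to lower case, "Message-ID" and "Message-Id" land under the same key, so the lookup no longer depends on which client wrote the mail.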

    CountZero

    "A program should be light and agile, its subroutines connected like a string of pearls. The spirit and intent of the program should be retained throughout. There should be neither too little nor too much, neither needless loops nor useless variables, neither lack of structure nor overwhelming rigidity." - The Tao of Programming, 4.1 - Geoffrey James

Re: Parsing email for headers
by zwon (Abbot) on Oct 04, 2009 at 16:54 UTC
Re: Parsing email for headers
by ww (Archbishop) on Oct 05, 2009 at 00:49 UTC

    CountZero and zwon have provided guidance on parsing the headers. This expands on zwon's referral to Date::Parse.

        #!/usr/bin/perl
        use warnings;
        use strict;
        use Date::Parse;
        # 799107

        my $data = '';
        while ( <DATA> ) {
            # print $_;
            $data = $_;
            chomp ($data);
            if ($data =~ /^DATE:\s+(\w{3}, \d+ \w{3} \d{4} \d\d:\d\d:\d\d) ([+-]\d{4})/ ) {
                my $date = $1;   # Don't do this; check existence of $1
                my $zone = $2;   # and $2 before you try to use them!
                print "'Date' found: $date which converts to: ";
                my $time = str2time($date);
                print "$time Zone Offset: $zone\n";
                print "\t Restringified: " . localtime($time) . "\n";   # reconvert, solely as a check on above
            }
        }

        =head OUTPUT
        'Date' found: Sat, 9 Feb 2008 17:14:18 which converts to: 1202595258 Zone Offset: -0730
                 Restringified: Sat Feb  9 17:14:18 2008
        'Date' found: Sun, 10 Feb 2008 04:23:55 which converts to: 1202635435 Zone Offset: +0400
                 Restringified: Sun Feb 10 04:23:55 2008
        =cut

        __DATA__
        SUBJECT: test
        FROM: John Smith
        DATE: Sat, 9 Feb 2008 17:14:18 -0730
        TO: Joe Doe
        additional for demo only
        DATE: Sun, 10 Feb 2008 04:23:55 +0400

    You could also roll your own if, for some unreasonable reason, you want to avoid using an email module (not recommended). Read on:

    And, just BTW, you probably meant "disparate" rather than "desparate."
      :-)

      Woops! "disparate" it should be :blush: but maybe it goes to show you the state of mind I was in when I posted the thread :-D

      Well, to answer some of my own questions :

      1. MIME::Parser is way too heavy for this purpose. You have to call $parser->filer->purge(); to delete all the files created from each of the mails, which really seems too much to "just read 4 fields from an email header".

      MIME::Head however seems to fit the bill very well :

          use MIME::Head;

          # FILE is assumed to be a filehandle already opened on the .eml file
          my $head = MIME::Head->read( \*FILE );   # TODO : Does it read the WHOLE email or skips the remaining mail after reading the header?
          $head->unfold;

          # Was a "Subject:" field given?
          # $subject_was_given = $head->count('subject');

          print $head->get('subject');
          print $head->get('Message-ID');
          print $head->get('from');
          print $head->get('date');

      2. I would appreciate some answers to this.

      Of course missing id's will be logged and error handling done, but I was wondering if there are any servers with such known behaviour.

      3. This works just dandy :

          use Date::Manip;

          Date_Init("ConvTZ=IGNORE","TZ=GMT");
          my $date = UnixDate( $head->get('date') , '%Y_%m_%Q-%H%M%S');
          print $date;

      This can convert, for example, "Sat, 9 Feb 2008 17:04:08 EET" to "2008_02_20080209-170408"
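
      For the plain "20080209171418" form asked for in the original question, a regex-based sketch using only core Perl might look like the following. It is not a replacement for a real date module: compact_date is a hypothetical helper, it assumes English three-letter month names and a four-digit year, and it deliberately keeps the wall-clock time as written and ignores the zone offset.

```perl
use strict;
use warnings;

my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = 1 .. 12;

# Convert an RFC 2822 style date to YYYYMMDDHHMMSS, keeping the wall-clock
# time as written and ignoring the zone offset. Returns undef on no match.
sub compact_date {
    my ($raw) = @_;
    return unless defined $raw;
    my ( $day, $mon, $year, $h, $m, $s ) = $raw =~ m{
        (?: [A-Za-z]{3} , \s* )?                  # optional day of week
        (\d{1,2}) \s+ ([A-Za-z]{3}) \s+ (\d{4}) \s+
        (\d{2}) : (\d{2}) (?: : (\d{2}) )?        # seconds are optional
    }x or return;
    my $month = $MONTH{ ucfirst lc $mon } or return;
    return sprintf '%04d%02d%02d%02d%02d%02d',
        $year, $month, $day, $h, $m, $s // 0;
}

print compact_date('Sat, 9 Feb 2008 17:14:18 -0730'), "\n";    # 20080209171418
```

      Because the offset is dropped, two mails sent at the same instant from different zones get different stamps. If that matters, convert to a common zone with a date module first.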

      Thanks guys for the great insights! Further tips/tricks/etc are welcome though.

Re: Parsing email for headers
by McD (Chaplain) on Oct 05, 2009 at 02:06 UTC
    In addition to echoing what everyone else has to say about modules to use...

    2. Is there any possibility where a "Message-ID" is not part of the mail header?

    It's possible, but highly unlikely. Virtually every mail transport agent (sendmail, Exchange, etc.) will add a Message-ID to a mail message if one isn't already there. Still, if this is production-level code, you'd better test for its existence and complain if it's not found.

    Mail header keywords are case-insensitive per the spec, and are separated from the message body by a single blank line.
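
    The "test for the existence and complain" step could look roughly like this. It is a sketch only: missing_fields is a hypothetical helper, and the %headers hash is made-up sample data standing in for whatever your parser returns.

```perl
use strict;
use warnings;

# Return the names from @required that have no non-empty value in %$headers.
# Header names are compared case-insensitively.
sub missing_fields {
    my ( $headers, @required ) = @_;
    my %seen = map { ( lc($_) => $headers->{$_} ) } keys %$headers;
    return grep {
        my $v = $seen{ lc $_ };
        !defined $v || $v !~ /\S/;
    } @required;
}

# Hypothetical parsed header with no Date and no Subject.
my %headers = ( 'From' => 'a@example.com', 'Message-Id' => '<1@example.com>' );

my @missing = missing_fields( \%headers, qw(Date From Subject Message-ID) );
warn "missing header(s): @missing\n" if @missing;
```

    A mail that turns up without a Message-ID can then be logged and skipped, or keyed on something else, instead of silently colliding with other mails.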

    Peace,
    -McD
Re: Parsing email for headers
by dsheroh (Monsignor) on Oct 05, 2009 at 10:34 UTC
    To answer your unnumbered opening question, email headers start with the five characters "From " (note the trailing space and no colon, which distinguish it from the "From:" header) and end when a blank line is encountered. The various "---Part"-type separators, if present, occur within the body of the email message after a blank line has indicated end-of-headers, so you don't need to rely on (or try to parse) them.
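
    A minimal core-Perl sketch of that rule, for anyone who does want to roll their own: read lines until the first blank one, join folded continuation lines, and never touch the body. read_header_fields is a hypothetical name and the sample message is invented. The sketch assumes the leading "From " envelope line may or may not be present (it is simply skipped), and it keeps only the first occurrence of a repeated field.

```perl
use strict;
use warnings;

# Read only the header block from an open filehandle and return a hash
# reference of lower-cased field name => value (first occurrence wins).
# Reading stops at the first blank line, so the body is never read.
sub read_header_fields {
    my ($fh) = @_;
    my %field;
    my $target;    # reference to the value a continuation line extends
    while ( defined( my $line = <$fh> ) ) {
        $line =~ s/\r?\n\z//;                    # accept LF and CRLF endings
        last if $line eq '';                     # blank line: end of headers
        if ( $line =~ /^[ \t]+(.*)$/ ) {         # folded continuation line
            $$target .= " $1" if $target;
        }
        elsif ( $line =~ /^([^\s:]+)[ \t]*:[ \t]*(.*)$/ ) {
            my $name  = lc $1;
            my $value = $2;
            if ( exists $field{$name} ) {
                $target = undef;                 # ignore repeats
            }
            else {
                $field{$name} = $value;
                $target = \$field{$name};
            }
        }
        else {
            $target = undef;                     # e.g. an mbox-style "From " line
        }
    }
    return \%field;
}

# Invented sample message; the MIME separator sits in the body.
my $message = join "\r\n",
    'From: John Smith <john@example.com>',
    'Subject: a long',
    ' folded subject',
    'Message-Id: <1234@example.com>',
    'Date: Sat, 9 Feb 2008 17:14:18 -0730',
    '',
    '------=_Part_1',
    'Subject: this line is in the body and is never read',
    '';

open my $fh, '<', \$message or die "cannot open in-memory handle: $!";
my $h = read_header_fields($fh);
close $fh;

print "$_: $h->{$_}\n" for qw(date from subject message-id);
```

    On a real .eml file you would open the file instead of the in-memory string. Since the loop exits at the blank line, a mail of several hundred MB costs no more than its header to process.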