How can I know where the header of an email starts and ends?

I am using a Python based tool called GMailBackup to archive my GMail locally.

This tool downloads each mail whole (headers and all) and stores it as an .eml file.

To work around some limitations in its source code, I need to parse these .eml files and grab these 4 fields:

  1. Date
  2. From
  3. Subject
  4. Message-ID

Seemed like a simple slurp-and-parse operation to me until I ran the program on .eml files from disparate sources.

It's a nightmare. The issues range from trivial differences (for example, the Outlook desktop client seems to send the field as "Message-ID", while the web client sends it as "Message-Id") to bigger ones, such as how different mail servers mark the boundary between headers and body.

For example, GMail and other email servers separate the header using "----=_Part_".

However, some M$ Servers seem to use "----_=_NextPart_", and others "----NextPart" and so on.

I have three questions:

1. Is there some module/subroutine/script that I can use to parse these 4 fields reliably from raw mails? The mails can be large (some are hundreds of MB), so the script should stop reading a mail as soon as these values have been found in the header.

2. Is it possible for a mail to have no "Message-ID" in its header? I have not come across any such mail in the 4GB I have downloaded so far, but are there misbehaving servers I should be aware of?

This ID is used to keep track of which mail has already been downloaded etc - a sort of a unique identifier for every email.

3. I would like to parse the "Date" field, which seems to be universally in a strftime-style format, like "Sat, 9 Feb 2008 17:14:18 -0730".

I tried to use

if (($year,$month,$day) = Date::Calc::Parse_Date("Sat, 9 Feb 2008 17:14:18 -0730")) { printf "\n[*] %d %d %d", $year,$month,$day; }

but it fails.

I need to convert something like "Sat, 9 Feb 2008 17:14:18 -0730" to "20080209171418"
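
The conversion described above can be sketched with a plain regular expression and no extra modules. This is only a minimal sketch under stated assumptions: compact_date() is a hypothetical helper name, the input is assumed to look like the example (optional day of week, day, English month abbreviation, four-digit year, time), and the UTC offset is deliberately ignored so that the result matches the wall-clock value asked for.

```perl
use strict;
use warnings;

# Map English month abbreviations to month numbers.
my %MONTH;
@MONTH{qw(Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec)} = (1 .. 12);

# compact_date() is a hypothetical helper, not part of any module.
# It returns YYYYMMDDHHMMSS built from the time as written in the header
# (the UTC offset is ignored), or nothing if the string does not match.
sub compact_date {
    my ($raw) = @_;
    return unless defined $raw;
    my ($day, $mon, $year, $h, $m, $s) = $raw =~ m{
        (?: [A-Za-z]{3} , \s* )?               # optional day of week
        (\d{1,2}) \s+ ([A-Za-z]{3}) \s+ (\d{4}) \s+
        (\d{1,2}) : (\d{2}) (?: : (\d{2}) )?   # seconds are optional
    }x or return;
    my $mnum = $MONTH{ ucfirst lc $mon } or return;   # unknown month name
    return sprintf '%04d%02d%02d%02d%02d%02d',
        $year, $mnum, $day, $h, $m, $s // 0;
}

print compact_date('Sat, 9 Feb 2008 17:14:18 -0730'), "\n";   # prints 20080209171418
```

If the timestamps need to be comparable across time zones, the offset would have to be applied before formatting; the sketch does not attempt that.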


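On question 1, one relevant detail is that RFC 5322 ends the header section at the first empty line, so a reader can stop there and never touch the body. Strings such as "----=_Part_" are MIME part boundaries inside the body, not header terminators. Field names are case-insensitive, so "Message-ID" and "Message-Id" name the same field. The following is a minimal sketch under those assumptions; read_headers() is a hypothetical helper name, the sample message and its addresses are invented for illustration, and real-world mail may need a tested CPAN parser instead.

```perl
use strict;
use warnings;

# read_headers() is a hypothetical helper. It reads from an open filehandle
# up to the first empty line and returns a hash reference mapping
# lower-cased field names to values (the first occurrence of a field wins).
sub read_headers {
    my ($fh, @wanted) = @_;
    my %want = map { ( lc($_) => 1 ) } @wanted;
    my (%found, $current);
    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;          # accept CRLF and LF line endings
        last if $line eq '';           # empty line: the header section is over
        if ($line =~ /^[ \t]/) {       # folded continuation of the previous field
            next unless defined $current;
            (my $more = $line) =~ s/^[ \t]+//;
            $found{$current} .= " $more";
        }
        elsif ($line =~ /^([^:\s]+)[ \t]*:[ \t]*(.*)$/) {
            my ($name, $value) = (lc $1, $2);
            if ($want{$name} && !exists $found{$name}) {
                $found{$name} = $value;
                $current = $name;
            }
            else { undef $current }    # a field we do not collect
        }
        else { undef $current }        # not a header line; skip it
    }
    return \%found;
}

# Invented sample: mixed-case field name, a folded Subject, CRLF endings,
# and a body line that looks like a header but is never read.
my $sample = join "\r\n",
    'Message-Id: <abc123@example.invalid>',
    'Date: Sat, 9 Feb 2008 17:14:18 -0730',
    'From: Someone <someone@example.invalid>',
    'Subject: a subject that is folded',
    "\tonto a second line",
    'Content-Type: multipart/mixed; boundary="----=_Part_0"',
    '',
    '------=_Part_0',
    'Subject: this line is in the body',
    '';

open my $fh, '<', \$sample or die "cannot open in-memory file: $!";
my $h = read_headers($fh, qw(Date From Subject Message-ID));
close $fh;
print "$_: $h->{$_}\n" for sort keys %$h;
```

Because the loop stops at the first empty line, a message of several hundred MB costs only as much I/O as its header section. On question 2, RFC 5322 says a message SHOULD carry a Message-ID rather than requiring one, so callers of a helper like this should check exists $h->{'message-id'} before relying on it.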
In reply to Parsing email for headers by PoorLuzer
