Interwebs has asked for the wisdom of the Perl Monks concerning the following question:

Hello Gracious Monks,

I have been tearing out my hair for several days now trying to write a perl script that will take a MIME email, parse the header, parse the parts, decode(if necessary) and stick the parts that are text/plain into $body, and decode(if necessary) and strip tags from the parts that are text/html, then also placing them into $body. (This will lead to its actual function down the line.

The problem is, MIME::Decoder seems unable to parse my messages, and it claims that the message does not have separate parts. It also doesn't parse my header correctly, but that's not a big deal, as Email::MIME does.

Here is what I wrote so far (some code was grabbed from other perlmonks nodes):
#!/usr/local/bin/perl use MIME::Parser; use MIME::Entity; use MIME::Body; my (@body, $i, $subentity); my $parser = new MIME::Parser; $parser->output_to_core(1); my $entity = $parser->parse_open("C:\\Documents and Settings\\user\\My + Documents\\email.txt"); %request = (); my $header = $entity->head; $request{SUBJECT} = $header->get('Subject'); chomp $request +{SUBJECT}; $request{TO} = $header->get('To'); chomp $request{T +O}; $request{FROM} = $header->get('From'); chomp $reque +st{FROM}; $request{CC} = $header->get('CC'); chomp $request{CC +}; if ($entity->parts > 0){ for ($i=0; $i<$entity->parts; $i++){ $subentity = $entity->parts($i); if (($subentity->mime_type =~ m/text\/html/i) || ($subentity-> +mime_type =~ m/text\/plain/i)){ $body = join "", @{$subentity->body}; next; } #Multipart/Alternative elsif ($subentity->mime_type =~ m/multipart\/alternative/i){ $body = join "", @{$subentity->body}; @body = split /------=_NextPart_\S*\n/, $body; $body = $body[1]; $body =~ s/^Content-Type.*Content-Transfer-Encoding.*?\n+/ +/is; next; } } } else { print("No body text."); } $body = &untag($body); #remove extra space $body =~ tr/ //s; $body =~ s/\n+ +\n//g; $body =~ s/\n\n{2,}//g; print($body); sub untag { local $_ = $_[0] || $_; # ALGORITHM: # find < , # comment <!-- ... -->, # or comment <? ... ?> , # or one of the start tags which require correspond # end tag plus all to end tag # or if \s or =" # then skip to next " # else [^>] # > s{ < # open tag (?: # open group (A) (!--) | # comment (1) or (\?) | # another comment (2) or (?i: # open group (B) for /i ( TITLE | # one of start tags SCRIPT | # for which APPLET | # must be skipped OBJECT | # all content STYLE # to correspond ) # end tag (3) ) | # close group (B), or ([!/A-Za-z]) # one of these chars, remember in (4) ) # close group (A) (?(4) # if previous case is (4) (?: # open group (C) (?! # and next is not : (D) [\s=] # \s or "=" ["`'] # with open quotes ) # close (D) [^>] | # and not close tag or [\s=] # \s or "=" with `[^`]*` | # something in quotes ` or [\s=] # \s or "=" with '[^']*' | # something in quotes ' or [\s=] # \s or "=" with "[^"]*" # something in quotes " )* # repeat (C) 0 or more times | # else (if previous case is not (4)) .*? # minimum of any chars ) # end if previous char is (4) (?(1) # if comment (1) (?<=--) # wait for "--" ) # end if comment (1) (?(2) # if another comment (2) (?<=\?) # wait for "?" ) # end if another comment (2) (?(3) # if one of tags-containers (3) </ # wait for end (?i:\3) # of this tag (?:\s[^>]*)? # skip junk to ">" ) # end if (3) > # tag closed }{}gsx; # STRIP THIS TAG return $_ ? $_ : ""; }
An email I'd parse might look like this:
MIME-Version: 1.0 Received: by xx.xxx.xxx.xxx with HTTP; Mon, 28 Jun 2010 11:16:13 -0700 + (PDT) X-Goomoji-Body: true Date: Mon, 28 Jun 2010 14:16:13 -0400 Delivered-To: xxxxxx@gmail.com Message-ID: <xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@mail.gmail.com> Subject: dfgyhh From: Foo Bar <xxxxxxx@gmail.com> To: xxxxxxx@gmail.com Content-Type: multipart/mixed; boundary=0016e646072eab7f26048a1b1d96 --0016e646072eab7f26048a1b1d96 Content-Type: multipart/related; boundary=0016e646072eab7f24048a1b1d95 --0016e646072eab7f24048a1b1d95 Content-Type: multipart/alternative; boundary=0016e646072eab7f20048a1b +1d94 --0016e646072eab7f20048a1b1d94 Content-Type: text/plain; charset=ISO-8859-1 *This is totally obnoxious HTML EMAIL! I'm writing from GMAIL* [?][?][ +?] AND IT MAKES *ME Want to throw up :(* *Look at my rad alignment* Lookie <www.yahoo.com> --0016e646072eab7f20048a1b1d94 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable <b>This is totally obnoxious HTML EMAIL! I&#39;m writing from <span st +yle= =3D"background-color: rgb(255, 102, 0);">GMAIL</span></b> <img goomoji +=3D"4= EF" style=3D"margin: 0pt 0.2ex; vertical-align: middle;" src=3D"cid:4E +F@goo= moji.gmail"><img goomoji=3D"32F" style=3D"margin: 0pt 0.2ex; vertical- +align= : middle;" src=3D"cid:32F@goomoji.gmail"><img goomoji=3D"1B2" style=3D +"marg= in: 0pt 0.2ex; vertical-align: middle;" src=3D"cid:1B2@goomoji.gmail"> + <fon= t size=3D"6">AND IT MAKES <u>ME <font face=3D"georgia,serif">Want to t +hrow = up :(</font></u></font> <br> <div style=3D"text-align: right;"><font face=3D"arial black,sans-serif +"><u>= <font size=3D"2">Look at my rad alignment</font></u></font><br><br><a +href= =3D"http://www.yahoo.com">Lookie</a><br><br> </div> --0016e646072eab7f20048a1b1d94-- --0016e646072eab7f24048a1b1d95 Content-Type: image/gif; name="4EF.gif" Content-Transfer-Encoding: base64 X-Attachment-Id: 4EF@goomoji.gmail Content-ID: <4EF@goomoji.gmail> R0lGODlhDwAPAJEDAABGdACJ46PQ7QAAACH5BAEAAAMALAAAAAAPAA8AAAI4nI+ZwKwHhJ +TgMdGC gxFPEGhV54UBZl6acLIRKLosu8JqlL6hWl9wnPqYREPRbzdoKJeVh/OpKAAAOw== --0016e646072eab7f24048a1b1d95 Content-Type: image/gif; name="32F.gif" Content-Transfer-Encoding: base64 X-Attachment-Id: 32F@goomoji.gmail Content-ID: <32F@goomoji.gmail> R0lGODlhDAAMAKIGAIoAAf/v7/+goAAAAP/X1/9YWP///wAAACH5BAEAAAYALAAAAAAMAA +wAAAMt aLXcGjDKFcgg+GJq9dCUII4isYiFkK4q66JFQJJmjAE3Vlc3rnO6oEnRKBoSADs= --0016e646072eab7f24048a1b1d95 Content-Type: image/gif; name="1B2.gif" Content-Transfer-Encoding: base64 X-Attachment-Id: 1B2@goomoji.gmail Content-ID: <1B2@goomoji.gmail> R0lGODlhEAAMAKIGAOoDA4oAAf/v7wAAAP9YWP/IyAAAAAAAACH5BAEAAAYALAAAAAAQAA +wAAAM9 SLbU3ssQ4IS91oEAgChgGHaBEQRCIILqGRHfIMqF8sb0QNvMt4Y8ie9XiwgLqlAyCGOdks +UbERSU PK6RBAA7 --0016e646072eab7f24048a1b1d95-- --0016e646072eab7f26048a1b1d96 Content-Type: application/pdf; name="something.pdf" Content-Disposition: attachment; filename="something.pdf" Content-Transfer-Encoding: base64 X-Attachment-Id: f_gazmqs6a1 [REALLY LONG ATTACHMENT] --0016e646072eab7f26048a1b1d96--
I know it's not pretty, but please bear with me, I'm fairly new to perl. My question is this: what's the best approach to this task? Am I way off? Did I make a stupid mistake? Thanks, Nick

Replies are listed 'Best First'.
Re: Trying to correctly parse MIME
by SilasTheMonk (Chaplain) on Jun 29, 2010 at 06:20 UTC

    Hi Interwebs

    I have downloaded your files and had a play. First of all your script is not terribly informative. I was debugging at around line 16 and I think at this stage you find it helpful to add this there:

    die "empty header" if !$header; print $header->as_string;
    This shows that if you remove the initial empty line at the start of the email file it at least it picks up the headers. Did you try debugging the code and using the various as_string methods in MIME::Tools to get more insight into what is happening?