Interwebs has asked for the wisdom of the Perl Monks concerning the following question:
I have been tearing out my hair for several days now trying to write a perl script that will take a MIME email, parse the header, parse the parts, decode(if necessary) and stick the parts that are text/plain into $body, and decode(if necessary) and strip tags from the parts that are text/html, then also placing them into $body. (This will lead to its actual function down the line.
The problem is, MIME::Decoder seems unable to parse my messages, and it claims that the message does not have separate parts. It also doesn't parse my header correctly, but that's not a big deal, as Email::MIME does.
Here is what I wrote so far (some code was grabbed from other perlmonks nodes):An email I'd parse might look like this:#!/usr/local/bin/perl use MIME::Parser; use MIME::Entity; use MIME::Body; my (@body, $i, $subentity); my $parser = new MIME::Parser; $parser->output_to_core(1); my $entity = $parser->parse_open("C:\\Documents and Settings\\user\\My + Documents\\email.txt"); %request = (); my $header = $entity->head; $request{SUBJECT} = $header->get('Subject'); chomp $request +{SUBJECT}; $request{TO} = $header->get('To'); chomp $request{T +O}; $request{FROM} = $header->get('From'); chomp $reque +st{FROM}; $request{CC} = $header->get('CC'); chomp $request{CC +}; if ($entity->parts > 0){ for ($i=0; $i<$entity->parts; $i++){ $subentity = $entity->parts($i); if (($subentity->mime_type =~ m/text\/html/i) || ($subentity-> +mime_type =~ m/text\/plain/i)){ $body = join "", @{$subentity->body}; next; } #Multipart/Alternative elsif ($subentity->mime_type =~ m/multipart\/alternative/i){ $body = join "", @{$subentity->body}; @body = split /------=_NextPart_\S*\n/, $body; $body = $body[1]; $body =~ s/^Content-Type.*Content-Transfer-Encoding.*?\n+/ +/is; next; } } } else { print("No body text."); } $body = &untag($body); #remove extra space $body =~ tr/ //s; $body =~ s/\n+ +\n//g; $body =~ s/\n\n{2,}//g; print($body); sub untag { local $_ = $_[0] || $_; # ALGORITHM: # find < , # comment <!-- ... -->, # or comment <? ... ?> , # or one of the start tags which require correspond # end tag plus all to end tag # or if \s or =" # then skip to next " # else [^>] # > s{ < # open tag (?: # open group (A) (!--) | # comment (1) or (\?) | # another comment (2) or (?i: # open group (B) for /i ( TITLE | # one of start tags SCRIPT | # for which APPLET | # must be skipped OBJECT | # all content STYLE # to correspond ) # end tag (3) ) | # close group (B), or ([!/A-Za-z]) # one of these chars, remember in (4) ) # close group (A) (?(4) # if previous case is (4) (?: # open group (C) (?! # and next is not : (D) [\s=] # \s or "=" ["`'] # with open quotes ) # close (D) [^>] | # and not close tag or [\s=] # \s or "=" with `[^`]*` | # something in quotes ` or [\s=] # \s or "=" with '[^']*' | # something in quotes ' or [\s=] # \s or "=" with "[^"]*" # something in quotes " )* # repeat (C) 0 or more times | # else (if previous case is not (4)) .*? # minimum of any chars ) # end if previous char is (4) (?(1) # if comment (1) (?<=--) # wait for "--" ) # end if comment (1) (?(2) # if another comment (2) (?<=\?) # wait for "?" ) # end if another comment (2) (?(3) # if one of tags-containers (3) </ # wait for end (?i:\3) # of this tag (?:\s[^>]*)? # skip junk to ">" ) # end if (3) > # tag closed }{}gsx; # STRIP THIS TAG return $_ ? $_ : ""; }
I know it's not pretty, but please bear with me, I'm fairly new to perl. My question is this: what's the best approach to this task? Am I way off? Did I make a stupid mistake? Thanks, NickMIME-Version: 1.0 Received: by xx.xxx.xxx.xxx with HTTP; Mon, 28 Jun 2010 11:16:13 -0700 + (PDT) X-Goomoji-Body: true Date: Mon, 28 Jun 2010 14:16:13 -0400 Delivered-To: xxxxxx@gmail.com Message-ID: <xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx@mail.gmail.com> Subject: dfgyhh From: Foo Bar <xxxxxxx@gmail.com> To: xxxxxxx@gmail.com Content-Type: multipart/mixed; boundary=0016e646072eab7f26048a1b1d96 --0016e646072eab7f26048a1b1d96 Content-Type: multipart/related; boundary=0016e646072eab7f24048a1b1d95 --0016e646072eab7f24048a1b1d95 Content-Type: multipart/alternative; boundary=0016e646072eab7f20048a1b +1d94 --0016e646072eab7f20048a1b1d94 Content-Type: text/plain; charset=ISO-8859-1 *This is totally obnoxious HTML EMAIL! I'm writing from GMAIL* [?][?][ +?] AND IT MAKES *ME Want to throw up :(* *Look at my rad alignment* Lookie <www.yahoo.com> --0016e646072eab7f20048a1b1d94 Content-Type: text/html; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable <b>This is totally obnoxious HTML EMAIL! I'm writing from <span st +yle= =3D"background-color: rgb(255, 102, 0);">GMAIL</span></b> <img goomoji +=3D"4= EF" style=3D"margin: 0pt 0.2ex; vertical-align: middle;" src=3D"cid:4E +F@goo= moji.gmail"><img goomoji=3D"32F" style=3D"margin: 0pt 0.2ex; vertical- +align= : middle;" src=3D"cid:32F@goomoji.gmail"><img goomoji=3D"1B2" style=3D +"marg= in: 0pt 0.2ex; vertical-align: middle;" src=3D"cid:1B2@goomoji.gmail"> + <fon= t size=3D"6">AND IT MAKES <u>ME <font face=3D"georgia,serif">Want to t +hrow = up :(</font></u></font> <br> <div style=3D"text-align: right;"><font face=3D"arial black,sans-serif +"><u>= <font size=3D"2">Look at my rad alignment</font></u></font><br><br><a +href= =3D"http://www.yahoo.com">Lookie</a><br><br> </div> --0016e646072eab7f20048a1b1d94-- --0016e646072eab7f24048a1b1d95 Content-Type: image/gif; name="4EF.gif" Content-Transfer-Encoding: base64 X-Attachment-Id: 4EF@goomoji.gmail Content-ID: <4EF@goomoji.gmail> R0lGODlhDwAPAJEDAABGdACJ46PQ7QAAACH5BAEAAAMALAAAAAAPAA8AAAI4nI+ZwKwHhJ +TgMdGC gxFPEGhV54UBZl6acLIRKLosu8JqlL6hWl9wnPqYREPRbzdoKJeVh/OpKAAAOw== --0016e646072eab7f24048a1b1d95 Content-Type: image/gif; name="32F.gif" Content-Transfer-Encoding: base64 X-Attachment-Id: 32F@goomoji.gmail Content-ID: <32F@goomoji.gmail> R0lGODlhDAAMAKIGAIoAAf/v7/+goAAAAP/X1/9YWP///wAAACH5BAEAAAYALAAAAAAMAA +wAAAMt aLXcGjDKFcgg+GJq9dCUII4isYiFkK4q66JFQJJmjAE3Vlc3rnO6oEnRKBoSADs= --0016e646072eab7f24048a1b1d95 Content-Type: image/gif; name="1B2.gif" Content-Transfer-Encoding: base64 X-Attachment-Id: 1B2@goomoji.gmail Content-ID: <1B2@goomoji.gmail> R0lGODlhEAAMAKIGAOoDA4oAAf/v7wAAAP9YWP/IyAAAAAAAACH5BAEAAAYALAAAAAAQAA +wAAAM9 SLbU3ssQ4IS91oEAgChgGHaBEQRCIILqGRHfIMqF8sb0QNvMt4Y8ie9XiwgLqlAyCGOdks +UbERSU PK6RBAA7 --0016e646072eab7f24048a1b1d95-- --0016e646072eab7f26048a1b1d96 Content-Type: application/pdf; name="something.pdf" Content-Disposition: attachment; filename="something.pdf" Content-Transfer-Encoding: base64 X-Attachment-Id: f_gazmqs6a1 [REALLY LONG ATTACHMENT] --0016e646072eab7f26048a1b1d96--
|
|---|
| Replies are listed 'Best First'. | |
|---|---|
|
Re: Trying to correctly parse MIME
by SilasTheMonk (Chaplain) on Jun 29, 2010 at 06:20 UTC |