http://qs1969.pair.com?node_id=194574

RuneK has asked for the wisdom of the Perl Monks concerning the following question:

Hi all, I'm writing a script to decode a mail so that I ONLY get the:

to from cc subject body

I'm using MIME::Parser to decode the mail, but I run into some very annoying mail format problems.

When I send a mail from Hotmail, the body contains a lot of HTML that I do not need; I only want the plain text. The From field sometimes contains an alias, like this:

"Usern name" <test@test.dk>

where I only want the pure mail address.

So my question is: what is the trick to get ONLY the pure information from a mail when the mail's format can vary a lot in the header and body sections?

Best regards, RuneK

Use Mail::MboxParser for extracting body text
by jlongino (Parson) on Sep 02, 2002 at 17:31 UTC
    In my recent mail-munging experiments, I found that the easiest way to get the simple mail info was to use Mail::MboxParser. This won't directly help with the e-mail address problem (which can probably be done via regex) but it should help in most cases with the message body. I'm a novice at regexp, so I'll leave that part for someone more qualified.

    The problem is that, depending on the client software and its preference settings, a message may be sent with an HTML-only body. Here is an example that extracts the plain-text message body if one exists. Note that it can read single or multiple messages:

    #!/usr/local/bin/perl
    use strict;
    use warnings;
    use Mail::MboxParser;

    my $mbox = \*STDIN;
    my $mb   = Mail::MboxParser->new($mbox);

    for my $msg ($mb->get_messages) {
        my $to       = $msg->header->{to};
        my $from     = $msg->header->{from};
        my $cc       = $msg->header->{cc}      || " ";
        my $subject  = $msg->header->{subject} || '<No Subject:>';
        my $body     = $msg->body($msg->find_body, 0);
        my $body_str = $body->as_string || '<No message text>';
        print "To: $to\n",
              "From: $from\n",
              "Cc: $cc\n",
              "Subject: $subject\n",
              "Message Text: $body_str\n";
        print "~" x 77, "\n\n";
    }
    Just remember to handle errors/warnings for fields that are undefined.
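
One way to handle the undefined fields mentioned above is to substitute a default before printing. This is only a minimal sketch, assuming Perl 5.10 or later for the `//` (defined-or) operator. The helper name `field_or` and the sample hash are made up for illustration; the hash merely stands in for whatever `$msg->header` returns.

```perl
use strict;
use warnings;

# Hypothetical helper: return the named header field, or a default
# when the field is missing or undefined.
sub field_or {
    my ($header, $name, $default) = @_;
    return $header->{$name} // $default;
}

# Sample data standing in for a parsed header (no Cc: or Subject:).
my %header = (
    to   => 'someone@example.com',
    from => 'test@test.dk',
);

printf "To: %s\n",      field_or(\%header, 'to',      '<No To:>');
printf "Cc: %s\n",      field_or(\%header, 'cc',      ' ');
printf "Subject: %s\n", field_or(\%header, 'subject', '<No Subject:>');
```

Unlike `||`, the `//` operator keeps a defined but false value, such as a subject line of "0".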

    You might check this node for an example of how to select only text attachments when dealing with multi-part MIME messages.

    Update1: After a bit of R&R (reflection & research), I figured it would be best to mention a few things about the To: field format. I probably shouldn't have suggested using a regexp to scarf the real e-mail address. Check out RFC 822 for details about the various ways header fields can be formatted, such as folding and multiple addresses in the To: field, to name a few.

    One option that may prove worth looking at is Email::Valid. From the docs:

    Let's see an example of how the address may be modified:

    $addr = Email::Valid->address('Alfred Neuman <Neuman @ foo.bar>');
    print "$addr\n"; # prints Neuman@foo.bar
    This would do what you need for a simple single address but you would still have to figure out a way to parse multiple addresses. Maybe someone else knows of another module or method for the address field problem.

    Update2: Some related nodes found via Super Search.

    --Jim

Re: Decode a mail with MIME::Parser
by blahblahblah (Priest) on Sep 03, 2002 at 04:11 UTC
    A simple regex that seems to always work for me for pulling the 'real' address out is this:
    $from = $1 if $from =~ /<(\S+)>/;
    If $from was "Usern name" <test@test.dk>, you'd wind up with 'test@test.dk'. If $from was just 'test@test.dk', it would stay that way, since the regex wouldn't match.
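
To make the two cases concrete, here is a small sketch that wraps the regex above in a helper and runs it on both forms. The name `bare_address` is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper wrapping the regex above: return the part inside
# <...> if present, otherwise the string unchanged.
sub bare_address {
    my ($from) = @_;
    $from = $1 if $from =~ /<(\S+)>/;
    return $from;
}

for my $from ('"Usern name" <test@test.dk>', 'test@test.dk') {
    print bare_address($from), "\n";   # prints test@test.dk both times
}
```

Because the pattern uses `\S+`, an address written with spaces inside the angle brackets will not match and is returned unchanged.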

    As far as the problems of folding and multiple addresses in a field that jlongino pointed out, I think MIME::Head does a good job of taking care of the messiness of that for you. I have a script that used to do a lot of that manually, but I was always tweaking it for special cases that I'd never thought of. For example, addresses like "Name, User" <user@whatever> threw off my parsing of multiples because of the commas. And I started out thinking I only had to handle folding (wrapping onto multiple lines) for Subjects, but have now seen it in every header you mentioned, even the From header (The person had a very long user name in quotes). The perldoc is very long, but take a look at the 'get' and the 'unfold' methods to handle the problems of getting multiples and folding.
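
For anyone who wants to see what the folding and quoted-comma problems look like in code, here is a rough sketch using only core Perl. It is not a substitute for MIME::Head or a full RFC 822 parser, and the names `unfold_header` and `split_addresses` are made up for this example. It assumes continuation lines start with a space or tab and that display names containing commas are wrapped in double quotes.

```perl
use strict;
use warnings;

# Join a folded header value (continuation lines begin with a space
# or tab) back into a single line.
sub unfold_header {
    my ($value) = @_;
    $value =~ s/\r?\n[ \t]+/ /g;
    return $value;
}

# Split an address list on commas that are outside double quotes,
# then reduce each entry to the part inside <...> when there is one.
sub split_addresses {
    my ($value) = @_;
    my $line  = unfold_header($value);
    my @parts = $line =~ /((?:"(?:[^"\\]|\\.)*"|[^,"])+)/g;
    my @addresses;
    for my $part (@parts) {
        my $addr = $part =~ /<(\S+)>/ ? $1 : $part;
        $addr =~ s/^\s+|\s+$//g;
        push @addresses, $addr if length $addr;
    }
    return @addresses;
}

# A folded To: value with a comma inside the quoted display name.
my $to = '"Name, User" <user@example.com>,' . "\n\t" . 'other@example.com';
print "$_\n" for split_addresses($to);
```

The comma inside "Name, User" stays with its entry because quoted strings are consumed as a unit before the splitter looks for the next comma.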

Re: Decode a mail with MIME::Parser
by RuneK (Sexton) on Sep 03, 2002 at 08:02 UTC
    Hi again, Thanks for the answer but I cannot seem to get your part with the mbox to work.
    I'm using the .procmailrc with the following statement:

    :0
    | /tmp/mailparser.pl > /tmp/test.log

    So I'm passing the mail directly to the script, which should read it from STDIN.
    I changed the print (to,from,subject,body) to print to a file instead because that is what I need, but the file is empty. The script now looks like this:
    #!/usr/bin/perl -w
    use strict;
    use warnings;
    use Mail::MboxParser;

    my $mbox = \*STDIN;
    my $mb   = Mail::MboxParser->new($mbox);
    open (MSG,">mailmessage");

    for my $msg ($mb->get_messages) {
        my $to       = $msg->header->{to};
        my $from     = $msg->header->{from};
        my $cc       = $msg->header->{cc}      || " ";
        my $subject  = $msg->header->{subject} || '<No Subject:>';
        my $body     = $msg->body($msg->find_body, 0);
        my $body_str = $body->as_string || '<No message text>';
        print MSG "To: $to\n",
                  "From: $from\n",
                  "Cc: $cc\n",
                  "Subject: $subject\n",
                  "Message Text: $body_str\n";
        print MSG "~" x 77, "\n\n";
        close MSG;
    }

    The file is created so it's not a permission problem but it is empty. The MboxParser.pm is just installed.
    Any ideas?
    Thanks, Rune
      I'd move
      close MSG;

      outside of the for loop just for aesthetic purposes and check for errors on your open statement:

      open (MSG, ">>mailmessage") or die "File open failed: $!\n";
      but otherwise it performs as I would expect. Try saving a copy of your mail message to a file and invoke the program like so:
      /tmp/mailparser.pl < mail.message.file

      This suggests that you'll probably have to modify your procmail config line in some way, but I'm not familiar with procmail.
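
A quick way to check what the script actually receives is to summarise its input before handing it to the parser. The sketch below is only a diagnostic idea: the helper name `summarise_input` is made up, and the demonstration reads an in-memory sample rather than real mail. It reports how many lines arrived and whether the first one looks like the "From " separator line that mbox files use between messages. If the pipe delivers nothing, or the input does not start with such a line, that might explain an empty result.

```perl
use strict;
use warnings;

# Summarise what arrived on a filehandle: how many lines were read and
# whether the first line looks like an mbox "From " separator line.
sub summarise_input {
    my ($fh) = @_;
    my @lines = <$fh>;
    my $first = @lines ? $lines[0] : '';
    my %info  = (
        lines         => scalar @lines,
        has_from_line => ($first =~ /^From /) ? 1 : 0,
    );
    return \%info;
}

# Demonstration on an in-memory sample rather than real mail.
my $sample = "From sender\@example.com Mon Sep  2 17:31:00 2002\n"
           . "Subject: test\n\nbody\n";
open my $fh, '<', \$sample or die "Cannot open in-memory sample: $!\n";
my $info = summarise_input($fh);
close $fh;
print "lines=$info->{lines} has_from_line=$info->{has_from_line}\n";
```

In the real script you would pass `\*STDIN` instead of the sample handle and write the summary to a log file, since output printed under procmail is easy to lose.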

      Update: I've had a similar problem using the /etc/aliases file to strip/redirect mail to a special account. I solved my problem by parsing the message and then resending it to a different account. This is sort of ugly, but it works until I can figure out the correct method:

      #/etc/aliases entry
      ## The below entry saves an unaltered copy on account original, and
      ## pipes the incoming message through our program which
      ## resends to the final destination account. Be sure to execute
      ## 'newaliases' after you modify the /etc/aliases for the changed
      ## entry to be updated.
      original: \original, "|/tmp/mailparser.pl"
      The modified perl program:
      ##
      # MboxParser stuff here
      ##
      use MIME::Lite;
      my $msg = MIME::Lite->new(
          To       => 'targetaccount@same.site.net',
          From     => $from,
          Cc       => $cc,
          Subject  => $subject,
          Type     => 'TEXT',
          Comments => $to,        ## save original to: if needed
          Data     => $body_str,
      );
      MIME::Lite->send('smtp', "some.smtp.net", Timeout=>60);
      $msg->send;

      --Jim

      I faced the same problem with a pipe from .procmailrc: reading from \*STDIN does not seem to work.