ksublondie has asked for the wisdom of the Perl Monks concerning the following question:

Dear Monks,

I'm using Mail::POP3Client to retrieve messages from an Exchange account over an SSL connection. I've run into a few problems when getting the email body using Body():

1. HTML emails contain more than just HTML tags: they include some header info and miscellaneous information that varies depending on which carrier the email is from. Text is even duplicated, and special characters are converted differently. I've tried using regular expressions to clean it up, but it's getting hairy to account for all the discrepancies, and I want to make sure this will work for ANY carrier. I can make all this work, however...

2. Emails originating from Cox (there may be more providers; this is just the one that I've discovered so far) get translated to gibberish. For example, the original plain text mail says 'This is a test...', but Body() returns 'VGhpcyBpcyBhIHRlc3QuLi4NCg==' -- that's it -- and the HTML email returns a basic header followed by lines and lines of seemingly random characters. The same email looks normal when viewed in Outlook.

The ultimate goal is to take the body of the message in plain simple text and send a text message to a cell phone. Any extra/confusing characters are unacceptable. Is there a better module out there that I should be using instead or is there a way to make this work?

for (my $i=1; $i<=$messages; $i++){
    foreach( $pop->Head($i)){
        if($_=~/From:[^<>]*\<(.*)\>/){
            print "From: $1 ";
            $emails[$i]->{'from'}=$1;
        }
        if($_=~/Subject:(.*)/){
            #print "To: $1 ";
            $emails[$i]->{'subject'}=$1;
        }
    }
    my $body=$pop->Body($i);
    $body=~s/\n/ /g;
    $body=~s/\r/ /g;
    $body=~s/^.*\<body[^<>]*\>(.*)/$1/;
    $body=~s/(.*)\<\/body[^<>]*\>.*$/$1/;
    while($body=~/[<>]/){
        $body=~s/\<[^<>]*\>(.*)/$1/;
    }
    $body=~s/&#8217;/\'/g;
    $body=~s/&#39;/\'/g;
    $body=~s/\=92/\'/g;
    $body=~s/\=A0/ /g;
    $body=~s/\=\s //g;
    $body=~s/\&nbsp;/ /g;
    $body=~s/\s{2,}/ /g;
    $emails[$i]->{'body'}=$body;
    #$pop->Delete($i);
    print "Message: ".$body."\n\n";
}
$pop->Close();

Re: Mail::Pop3Client - want to get consistent body text
by zwon (Abbot) on May 19, 2009 at 19:58 UTC
      Email::MIME returns the text for plain text emails, but body() on HTML emails returns nothing. I still need a way to get the HTML emails from Cox.

      Still playing with MIME::Tools...

        If you show us your code, it would really help us to help you. Note that you should pass the whole message, including headers (at least the MIME headers), to Email::MIME so that it can parse it. Also, HTML messages are usually multipart, so you have to invoke parts() before body().
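The advice above could look roughly like the following. This is a minimal sketch, not the poster's actual code: `plain_text_of` and `sample_message` are hypothetical helpers invented for illustration, and the sample message is made up (it reuses the base64 string from the question). It assumes Email::MIME is installed from CPAN, and that the full message is fetched with Mail::POP3Client's HeadAndBody() rather than Body().

```perl
use strict;
use warnings;
use Email::MIME;

# Hypothetical stand-in for a message fetched from the server.
# With Mail::POP3Client this would presumably be something like
#   my $raw = $pop->HeadAndBody($i);   # headers AND body, not just Body($i)
sub sample_message {
    return join "\n",
        'From: Sender <sender@example.com>',
        'Subject: test',
        'MIME-Version: 1.0',
        'Content-Type: multipart/alternative; boundary="b1"',
        '',
        '--b1',
        'Content-Type: text/plain; charset="us-ascii"',
        'Content-Transfer-Encoding: base64',
        '',
        'VGhpcyBpcyBhIHRlc3QuLi4NCg==',
        '--b1',
        'Content-Type: text/html; charset="us-ascii"',
        '',
        '<html><body><p>This is a test...</p></body></html>',
        '--b1--',
        '';
}

# Return the decoded text of the first text/plain part, or nothing
# if the message has no such part. Recurses into multipart containers.
sub plain_text_of {
    my ($part) = @_;
    my @children = $part->subparts;
    if (@children) {
        for my $child (@children) {
            my $text = plain_text_of($child);
            return $text if defined $text;
        }
        return;
    }
    # A part without a Content-Type header is treated as text/plain.
    my $type = $part->content_type // 'text/plain';
    return unless $type =~ m{^text/plain}i;
    # body_str undoes the transfer encoding (base64 here) and the charset.
    return $part->body_str;
}

my $email = Email::MIME->new(sample_message());
print plain_text_of($email);
```

Because the parser sees the Content-Transfer-Encoding header, the base64 payload comes back as readable text instead of the raw encoded string. A message that carries only a text/html part would still need its markup stripped separately.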

Re: Mail::Pop3Client - want to get consistent body text
by trwww (Priest) on May 19, 2009 at 21:57 UTC

    Hello,

    The problem is that an email message can be formatted in quite a few different ways.

    Mail::Box is a pretty good way of quickly normalizing email messages into the format you need.
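To illustrate two of those formats: the "gibberish" in the question is a base64 transfer encoding, and sequences such as =92 and =A0 are quoted-printable escapes for bytes in a Windows code page. The sketch below uses only core Perl modules. `decode_body` is a hypothetical helper, and its `$cte` and `$charset` arguments are assumed to have been read from the part's Content-Transfer-Encoding and Content-Type headers, which a MIME-aware module would normally do for you.

```perl
use strict;
use warnings;
use MIME::Base64      qw(decode_base64);
use MIME::QuotedPrint qw(decode_qp);
use Encode            qw(decode);

# Undo the transfer encoding first, then convert the raw bytes
# to Perl characters using the declared charset.
sub decode_body {
    my ($body, $cte, $charset) = @_;
    $cte = lc($cte // '7bit');
    my $bytes = $cte eq 'base64'           ? decode_base64($body)
              : $cte eq 'quoted-printable' ? decode_qp($body)
              :                              $body;   # 7bit, 8bit, binary
    return decode($charset // 'us-ascii', $bytes);
}

# The string from the question, treated as a base64 body.
my $plain = decode_body('VGhpcyBpcyBhIHRlc3QuLi4NCg==', 'base64', 'us-ascii');

# =92 is a curly apostrophe in cp1252 (Windows-1252).
my $quoted = decode_body('It=92s a test', 'quoted-printable', 'cp1252');
```

Decoding according to the headers avoids maintaining a hand-written list of substitutions such as s/\=92/\'/g for every escape a provider might send.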

    Regards,