ady has asked for the wisdom of the Perl Monks concerning the following question:

Wise Monks!

I need to parse the TEXT portion out of emails;

I've looked at
Email::Simple Mail::Internet MIME::Tools
but these modules only offer to extract mail header & body (including all the HTML tagging & binary mail encoding), so i have to filter out the TEXT myself.

I may just as well filter the raw email message instead, -- which is what i currently do...
There's sure to be a better way tho'. A pointer in the right direction would be much appreciated.

best regards,
allan

===========================================================
As the eternal tranquility of Truth reveals itself to us, this very place is the Land of Lotuses

-- Hakuin Ekaku Zenji

Replies are listed 'Best First'.
Re: Extracting TEXT from email
by PodMaster (Abbot) on Apr 30, 2005 at 11:03 UTC
    but these modules only offer to extract mail header & body (including all the HTML tagging & binary mail encoding), so i have to filter out the TEXT myself
    Those modules give you everything you need to get at the message. If somebody sends you html, you should first complain, and then grab something from the HTML:: namespace (like ::Stripper, ::Scrubber) so you can remove it.

    update: MIME::Tools -> examples -> mimeexplode

    MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
    I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
    ** The third rule of perl club is a statement of fact: pod is sexy.

      Thanks.
      For further clarification : an example:

      use Mail::Internet;

      $msgfile = "Angelee.msg";
      open(MSG, $msgfile) or die "Can't open $msgfile: $!\n";
      $msg = Mail::Internet->new(\*MSG);
      close(MSG);

      $body = $msg->body();
      $msg->print_body(\*STDOUT);

      The message body as dumped to the terminal contains approx. 80% mail binary and HTML formatting chars and only 20% corresponding to the transmitted TEXT payload.

      I'd like a function to strip off all that junk:
      msg_clean($body);

      Ok, so i've written some regex filtering to do it, but that's hardly as flexible & robust as real MIME-aware parsing could be.
      I expect there would be a msg_clean or body2text or equivalent function out there?

      allan
        How about you go study mimeexplode?


Re: Extracting TEXT from email
by jhourcle (Prior) on Apr 30, 2005 at 13:48 UTC

    It would help if you explained what the input was. How is the text that you are trying to get encoded into the email? What are you qualifying as text? (ie, based on the input, what are you trying to get as the output?)

    There are many, many ways to encode text into an e-mail (MIME, PGP, PGP+MIME, UUEncode, BinHex, BinHex+MIME, Quoted Printable, etc.) Without knowing what you're dealing with, we can only guess at what it is that you're asking for.
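
A minimal sketch of what undoing two of these encodings looks like, assuming the part's Content-Transfer-Encoding is already known. `decode_transfer` is a hypothetical helper name, not a function from any module; `MIME::Base64` and `MIME::QuotedPrint` ship with Perl.

```perl
use strict;
use warnings;
use MIME::Base64      qw(decode_base64);
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: undo one Content-Transfer-Encoding.
# $encoding is the value of that header (may be undef), $raw the encoded body.
sub decode_transfer {
    my ($encoding, $raw) = @_;
    $encoding = defined $encoding ? lc $encoding : '7bit';
    return decode_base64($raw) if $encoding eq 'base64';
    return decode_qp($raw)     if $encoding eq 'quoted-printable';
    return $raw;    # 7bit, 8bit, binary: nothing to undo
}

print decode_transfer('base64',           "aGFpa3U="),      "\n";
print decode_transfer('quoted-printable', "morning=20dew"), "\n";
```

This only covers the transfer encoding of a single part; PGP, UUEncode and BinHex each need their own handling.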

      I'm at the "receiving end" of the mail wire: i receive mails in my (MS Windows Exchange) inbox encoded in standard mail/MIME format.

      I'm interested in the text part of the body of these mails, that is: "what follows the mail header" (ie. the From:, Sent: To: Issue: stuff). The body contains haiku entries, that i parse and reshuffle into a voting list, and subsequently rank according to received votes, -- but the app as such is not that interesting in this context.
      An example of the text part of a mail message is:

      [author] xxx yyy
      [1] clouds . . . the distance blossoming between two crows
      [2] a morning without incident dead fly
      [3] sunrise ceremony the holy man's third eye bloodshot
      [4] morning dew bell bottoms darkened by mayflies

      This is what i'm interested in parsing out, and this is the text part of the message, that is displayed in the mail client (in casu: MS Outlook).

      The problem is that the above text is not what i get from the mail body handed over by the mentioned MIME modules. Instead i get the full mail body segment, including binary MIME encodings and HTML tagging.

      So i have to do some filtering to get at the text "payload" that i need for the app. Now i was wondering if anybody had already wrapped this functionality into a function, possibly in a MIME module. That was my question.

      I haven't worked with email before, so maybe i'm simply overlooking some basic assumptions about the MIME format & parsing...
      -- allan

        I find it's good to understand what you're working with. (this being said as I deal with data at work that I have absolutely no idea what it actually means)

        Basically, email is sent using SMTP as what it calls a mail object, which is composed of headers, an empty line, and a message body.

        Bodies are required to be ASCII, which limits you to 7 bits, but someone thought it would be a good idea to send non-text files, so came up with MIME. Using MIME, the body may be identified as being one or more encapsulated objects. To mark the body as being MIME encoded, there are additional headers inserted into the heading of the email message.

        There's a fair bit of background information in the MIME::Tools documentation.
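
The layout described above (headers, an empty line, then the body) can be sketched in plain Perl. This is an illustration only, not a replacement for a real parser: `split_message` is a hypothetical helper, and it keeps just the last value of any repeated header.

```perl
use strict;
use warnings;

# Split a raw message into a header hash and the body at the first empty line.
sub split_message {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;

    # Unfold continuation lines: a line starting with whitespace
    # continues the previous header.
    $head =~ s/\r?\n[ \t]+/ /g;

    my %header;
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $header{ lc $name } = $value;    # last one wins; fine for a sketch
    }
    return (\%header, $body);
}

my $raw = join "\n",
    'From: someone@example.com',
    'Subject: haiku',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative;',
    ' boundary="xyz"',
    '',
    'This is the body.';

my ($header, $body) = split_message($raw);
print "MIME message\n" if exists $header->{'mime-version'};
print "Content-Type: $header->{'content-type'}\n";
print "Body: $body\n";
```

The MIME-Version and Content-Type headers are the "additional headers" that tell a reader the body needs further unpacking.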

        what about a glance at perlretut?
        language is a virus from outer space.
Re: Extracting TEXT from email
by Anonymous Monk on Apr 30, 2005 at 19:10 UTC
    Interesting. Many if not most HTML messages are sent multipart--several MIME types in one message, including at minimum an HTML "part" for HTMLized email clients and a plain text part for older email clients or people who prefer to read their messages in that format. Any attachments will also have their own MIME part of the appropriate type.

    I used Email::MIME, which no one has yet mentioned, to take emails I send from my mobile phone and turn them into Web posts, with plain text and/or JPEG photo. Here is a modified untested version of that code which may suit your purposes:

    use Email::MIME;
    use HTML::TokeParser;

    # $message is the full text of the entire email message
    my $parsed = Email::MIME->new($message)
        or die "Could not parse email message: $!";

    foreach my $part ($parsed->parts) {
        if ($part->content_type =~ /text\/plain/i) {
            # You have a plain text part
            # Do stuff here with $part->body
        }
        elsif ($part->content_type =~ /image\/jpeg/i) {
            # You have a JPEG part in $part->body
        }
        elsif ($part->content_type =~ /text\/html/i) {
            # You have an HTML part in $part->body
            my $html = $part->body;
            my $plain_text;
            my $parsed_text = HTML::TokeParser->new(\$html)
                or die "Cannot read message text for parsing and cleaning: $!";
            while (my $token = $parsed_text->get_token) {
                if ($token->[0] eq 'T') {    # text
                    $plain_text .= $token->[1];
                }
            }
            # Do stuff with $plain_text extracted from HTML here
        }
    }

    Notice the HTML::TokeParser part inside the HTML section. You'll only want to use that if the plain text part is unavailable to you. HTH.
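
The same idea can be wrapped into the kind of function the original question asks for. What follows is a sketch under assumptions, not a tested library routine: `msg_clean`, `leaf_parts` and `html_to_text` are hypothetical names, and the code assumes Email::MIME and the HTML::Parser distribution (which provides HTML::TokeParser and HTML::Entities) are installed. It descends into nested multipart containers, prefers a text/plain part, and falls back to stripping the HTML part.

```perl
use strict;
use warnings;
use Email::MIME;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Collect all leaf parts, descending into nested multipart containers.
sub leaf_parts {
    my ($part) = @_;
    my @kids = $part->subparts;
    return $part unless @kids;
    return map { leaf_parts($_) } @kids;
}

# Keep only the text tokens of an HTML document and decode entities.
sub html_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html) or return '';
    my $text = '';
    while (my $token = $parser->get_token) {
        $text .= $token->[1] if $token->[0] eq 'T';
    }
    return decode_entities($text);
}

# Hypothetical msg_clean: plain text if present, else text pulled from HTML.
sub msg_clean {
    my ($message) = @_;
    my @leaves = leaf_parts(Email::MIME->new($message));

    my ($plain) = grep { ($_->content_type || 'text/plain') =~ m{^text/plain}i } @leaves;
    return $plain->body if $plain;

    my ($html) = grep { ($_->content_type || '') =~ m{^text/html}i } @leaves;
    return html_to_text($html->body) if $html;

    return;    # no textual part found
}
```

Note that `body` returns bytes with the transfer encoding undone but without charset decoding, and that the text tokens include the contents of any script or style elements, so real-world HTML mail may need more filtering than this.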

      I like this approach, but alas it didn't recognize any text or html parts in my message.

      #===========================================================
      # Program EM.pl
      #!/usr/bin/perl -w
      #use strict;
      use Email::MIME;
      use HTML::TokeParser;
      use Data::Dumper;

      my $msgfile = "Andrew.msg";    # A test message file from MS Outlook
      open (MSG, "$msgfile") or die "Can't open $msgfile: $!\n";
      my $message = do { local $/; <MSG> };    # $/=undef; my $e=<FH>;
      close(MSG);

      # $message is the full text of the entire email message
      my $parsed = Email::MIME->new($message)
          or die "Could not parse email message: $!";

      foreach my $part ($parsed->parts) {
          if ($part->content_type =~ /text\/plain/i) {
              # You have a plain text part: do stuff here with $part->body
              print $part->body;
          }
          elsif ($part->content_type =~ /image\/jpeg/i) {
              # You have a JPEG part in $part->body
          }
          elsif ($part->content_type =~ /text\/html/i) {
              # You have an HTML part in part body
              my $html = $part->body;
              my $plain_text;
              my $parsed_text = HTML::TokeParser->new(\$html)
                  or die "Cannot read message text for parsing and cleaning: $!";
              while (my $token = $parsed_text->get_token) {
                  if ($token->[0] eq 'T') { $plain_text .= $token->[1]; }    # text
              }
              # Do stuff with $plain_text extracted from HTML here
              print $plain_text;
          }
          else {
              print "NO MATCH\n";
              foreach (keys %$part) { $part->{$_} =~ s/\W*//g; }    # Zap non-word
              print Data::Dumper->Dump( [%$part] );                 # for test output
          }
      }
      #===========================================================
      C:\Perl\Test\MIME>perl -w EM.pl
      NO MATCH
      $VAR1 = 'body';
      $VAR2 = 'PPPBYahooGroupsLinksBBRPULLITovisityourgrouponthewebgotoBRAhrefhttpgroupsyahoocomgrouphaiku
      kaiIIIhttpgroupsyahoocomgrouphaikukaiIIIABRnbspLITounsubscribefromthisgroupsendanemailtoBRAhref
      <cut... a lot more lines of this stuff>
      mailt stg10_5FF70102DFA__properties_version100X99q___Nd_Ad0';
      $VAR3 = 'head';
      $VAR4 = 'HASH0x1625814';
      $VAR5 = 'mycrlf';
      $VAR6 = '';
      $VAR7 = 'header_names';
      $VAR8 = 'HASH0x1c60278';
      $VAR9 = 'order';
      $VAR10 = 'ARRAY0x1af5550';
      $VAR11 = 'parts';
      $VAR12 = 'ARRAY0x16259ac';
      $VAR13 = 'ct';
      $VAR14 = 'HASH0x1aa33e4';
      C:\Perl\Test\MIME>
      #===========================================================
      My conclusion is, that there's probably no simple :
      my $text = msg_clean($email);
      function out there, and i'll have to do a top-down parsing of the MIME object to get at the part of the email, that interests me (As an alternative to the simple brute force regex filtering, that i'm using right now. The latter approach works ok as long as the text is enclosed in proper tags, but it easily breaks, if it isn't)

      This is basically what several of you (actually all of you) have tried to tell me, but i didn't quite want to give up on my laziness up front... A full parsing of the email is more work, but also more robust and surely in the long run will allow me to be lazy at a higher level...

      So i think i'll start digging into the MIME::Tools
      thanks for your patience!
      Best regards
      -- Allan
        Hi, I wrote the node you are replying to, thought I was logged in but wasn't.

        It looks like you can just take the body of the email -- it may not be MIME encoded at all from the looks of things -- and run it straight through the HTML::TokeParser code I included to take out the text, without checking the type. In other words, just take the code starting at #You have an HTML part in part body through print $plain_text; and put it in your final else block where the Data Dumper code is right now.
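
One way to check the "may not be MIME encoded at all" suspicion before choosing a parser is to look at the first bytes of the file. The sketch below rests on an assumption the thread does not confirm: that a .msg file saved from Outlook is an OLE2 compound file (the `__properties_version` fragment in the dump hints at this) rather than RFC 822 text. `looks_like_ole` and `looks_like_rfc822` are hypothetical helper names.

```perl
use strict;
use warnings;

# First eight bytes of an OLE2 compound file.
my $OLE_MAGIC = "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";

# True if the data starts with the OLE2 signature.
sub looks_like_ole {
    my ($data) = @_;
    return substr($data, 0, 8) eq $OLE_MAGIC ? 1 : 0;
}

# True if the data starts like a mail header ("Name:") or an mbox "From " line.
sub looks_like_rfc822 {
    my ($data) = @_;
    return $data =~ /\A(?:From |[!-9;-~]+:)/ ? 1 : 0;
}

# Usage: perl check.pl Andrew.msg
if (@ARGV) {
    my $file = shift @ARGV;
    open my $fh, '<:raw', $file or die "Can't open $file: $!\n";
    my $data = do { local $/; <$fh> };
    close $fh;

    if    (looks_like_ole($data))    { print "$file: binary OLE container, not a MIME message\n" }
    elsif (looks_like_rfc822($data)) { print "$file: looks like an RFC 822 message\n" }
    else                             { print "$file: unknown format\n" }
}
```

If the file turns out to be a binary container, a MIME parser will treat the whole thing as one opaque body, which would match the NO MATCH output shown earlier.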