in reply to Extracting TEXT from email

Interesting. Many if not most HTML messages are sent multipart--several MIME types in one message, including at minimum an HTML "part" for HTMLized email clients and a plain text part for older email clients or people who prefer to read their messages in that format. Any attachments will also have their own MIME part of the appropriate type.

I used Email::MIME, which no one has yet mentioned, to take emails I send from my mobile phone and turn them into Web posts, with plain text and/or a JPEG photo. Here is a modified, untested version of that code which may suit your purposes:

    use Email::MIME;
    use HTML::TokeParser;

    # $message is the full text of the entire email message
    my $parsed = Email::MIME->new($message)
        or die "Could not parse email message: $!";

    foreach my $part ($parsed->parts) {
        if ($part->content_type =~ /text\/plain/i) {
            # You have a plain text part
            # Do stuff here with $part->body
        }
        elsif ($part->content_type =~ /image\/jpeg/i) {
            # You have a JPEG part in $part->body
        }
        elsif ($part->content_type =~ /text\/html/i) {
            # You have an HTML part in $part->body
            my $html = $part->body;
            my $plain_text = '';
            my $parsed_text = HTML::TokeParser->new(\$html)
                or die "Cannot read message text for parsing and cleaning: $!";
            while (my $token = $parsed_text->get_token) {
                if ($token->[0] eq 'T') {    # text token
                    $plain_text .= $token->[1];
                }
            }
            # Do stuff with $plain_text extracted from HTML here
        }
    }

Notice the HTML::TokeParser part inside the HTML section. You'll only want to use that if the plain text part is unavailable to you. HTH.
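
The preference described above (use the plain text part when there is one, and strip the HTML part only as a fallback) could be sketched roughly as follows. This is a minimal sketch, not the code from the original Web-post script: the helper names `leaf_parts`, `html_to_text` and `text_from_email` are hypothetical, and it assumes Email::MIME and HTML::TokeParser are installed.

```perl
use strict;
use warnings;
use Email::MIME;
use HTML::TokeParser;

# Hypothetical helper: return the leaf (non-container) parts of a message,
# descending into nested multipart sections via subparts().
sub leaf_parts {
    my ($part) = @_;
    my @subparts = $part->subparts;
    return ($part) unless @subparts;
    return map { leaf_parts($_) } @subparts;
}

# Hypothetical helper: concatenate the text tokens of an HTML string.
# Entities such as &amp; are left undecoded, as in the loop above.
sub html_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML part";
    my $text = '';
    while (my $token = $parser->get_token) {
        $text .= $token->[1] if $token->[0] eq 'T';
    }
    return $text;
}

# Return the first text/plain body if there is one; otherwise the text
# extracted from the first text/html body; otherwise undef.
sub text_from_email {
    my ($message) = @_;
    my @parts = leaf_parts(Email::MIME->new($message));

    my ($plain) = grep { ($_->content_type // '') =~ m{^text/plain}i } @parts;
    return $plain->body if $plain;

    my ($html) = grep { ($_->content_type // '') =~ m{^text/html}i } @parts;
    return $html ? html_to_text($html->body) : undef;
}
```

Calling `text_from_email($message)` on a message with neither kind of part returns undef, so the caller can decide what to do with mail that matches nothing.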

Re^2: Extracting TEXT from email
by ady (Deacon) on May 01, 2005 at 09:45 UTC
    I like this approach, but alas it didn't recognize any text or HTML parts in my message.

        #===========================================================
        # Program EM.pl
        #!/usr/bin/perl -w
        #use strict;
        use Email::MIME;
        use HTML::TokeParser;
        use Data::Dumper;

        my $msgfile = "Andrew.msg";    # A test message file from MS Outlook
        open (MSG, "$msgfile") or die "Can't open $msgfile: $!\n";
        my $message = do { local $/; <MSG> };    # $/=undef; my $e=<FH>;
        close(MSG);

        my $parsed = Email::MIME->new($message)
            or die "Could not parse email message: $!";
        #$message is full text of entire email message
        foreach my $part ($parsed->parts) {
            if ($part->content_type =~ /text\/plain/i) {
                #You have a plain text part: do stuff here with $part->body
                print $part->body;
            }
            elsif ($part->content_type =~ /image\/jpeg/i) {
                #You have a JPEG part in $part->body
            }
            elsif ($part->content_type =~ /text\/html/i) {
                #You have an HTML part in part body
                my $html = $part->body;
                my $plain_text;
                my $parsed_text = HTML::TokeParser->new(\$html)
                    or die "Cannot read message text for parsing and cleaning: $!";
                while (my $token = $parsed_text->get_token) {
                    if ($token->[0] eq 'T') { $plain_text .= $token->[1]; }    # text
                }
                #Do stuff with $plain_text extracted from HTML here
                print $plain_text;
            }
            else {
                print "NO MATCH\n";
                foreach (keys %$part) { $part->{$_} =~ s/\W*//g; }    # Zap non-word
                print Data::Dumper->Dump( [%$part] );                 # for test output
            }
        }
        #===========================================================
        C:\Perl\Test\MIME>perl -w EM.pl
        NO MATCH
        $VAR1 = 'body';
        $VAR2 = 'PPPBYahooGroupsLinksBBRPULLITovisityourgrouponthewebgotoBRAhrefhttpgroupsyahoocomgrouphaikukaiIIIhttpgroupsyahoocomgrouphaikukaiIIIABRnbspLITounsubscribefromthisgroupsendanemailtoBRAhref
        <cut... a lot more lines of this stuff>
        mailt stg10_5FF70102DFA__properties_version100X99q___Nd_Ad0';
        $VAR3 = 'head';
        $VAR4 = 'HASH0x1625814';
        $VAR5 = 'mycrlf';
        $VAR6 = '';
        $VAR7 = 'header_names';
        $VAR8 = 'HASH0x1c60278';
        $VAR9 = 'order';
        $VAR10 = 'ARRAY0x1af5550';
        $VAR11 = 'parts';
        $VAR12 = 'ARRAY0x16259ac';
        $VAR13 = 'ct';
        $VAR14 = 'HASH0x1aa33e4';
        C:\Perl\Test\MIME>
        #===========================================================
    My conclusion is that there's probably no simple
    my $text = msg_clean($email);
    function out there, and I'll have to parse the MIME object top-down to get at the part of the email that interests me. (The alternative is the simple brute-force regex filtering that I'm using right now. That approach works OK as long as the text is enclosed in proper tags, but it breaks easily if it isn't.)

    This is basically what several of you (actually all of you) have tried to tell me, but I didn't quite want to give up on my laziness up front... A full parsing of the email is more work, but it is also more robust, and in the long run it will surely allow me to be lazy at a higher level...

    So I think I'll start digging into MIME::Tools.
    Thanks for your patience!
    Best regards
    -- Allan
      Hi, I wrote the node you are replying to; I thought I was logged in but wasn't.

      It looks like you can just take the body of the email (from the looks of things it may not be MIME encoded at all) and run it straight through the HTML::TokeParser code I included to pull out the text, without checking the type. In other words, take the code from #You have an HTML part in part body through print $plain_text; and put it in your final else block, where the Data::Dumper code is right now.
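
As a rough sketch of that suggestion, the token loop can be pulled out into a helper that accepts any string and never looks at a content type. The name `strip_html` is hypothetical, and the entity decoding via HTML::Entities is an addition that is not in the code posted in this thread; both modules are assumed to be installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: keep only the text tokens of an HTML string.
sub strip_html {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot read message text for parsing and cleaning";
    my $plain_text = '';
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq 'T';    # skip tags, comments, etc.
        # Addition not in the thread's code: decode entities such as &amp;
        # unless the parser flags the text as literal data ($token->[2]).
        $plain_text .= $token->[2]
            ? $token->[1]
            : decode_entities($token->[1]);
    }
    return $plain_text;
}
```

In the final else block this would be called as `print strip_html($part->body);`. Whether the result is useful depends on the body actually containing HTML markup rather than some other format.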