Re^2: Extracting TEXT from email

I like this approach, but alas it didn't recognize any text or HTML parts in my message.

#===========================================================
# Program EM.pl
#!/usr/bin/perl -w
#use strict;
use Email::MIME;
use HTML::TokeParser;
use Data::Dumper;

my $msgfile = "Andrew.msg";    # A test message file from MS Outlook

open (MSG, "$msgfile") or die "Can't open $msgfile: $!\n";
my $message = do { local $/; <MSG> };        # $/=undef; my $e=<FH>;
close(MSG);

my $parsed = Email::MIME->new($message)
   or die "Could not parse email message: $!";
   #$message is full text of entire email message
    
foreach my $part ($parsed->parts) {

   if ($part->content_type =~ /text\/plain/i) {
   #You have a plain text part: do stuff here with $part->body
      print $part->body;
   }
   elsif ($part->content_type =~ /image\/jpeg/i) {
   #You have a JPEG part in $part->body

   } elsif ($part->content_type =~ /text\/html/i) {
      #You have an HTML part in part body
      my $html = $part->body;
      my $plain_text;
      my $parsed_text = HTML::TokeParser->new(\$html)
         or die "Cannot read message text for parsing and cleaning: $!";
      while (my $token = $parsed_text->get_token) {
         if ($token->[0] eq 'T') { $plain_text .= $token->[1];} # text
      }
      #Do stuff with $plain_text extracted from HTML here
      print $plain_text;

   } else {
      print "NO MATCH\n";
      foreach (keys %$part) { $part->{$_} =~ s/\W*//g; } # Zap non-word characters
      print Data::Dumper->Dump( [%$part] );           # for test output
   }
}

#===========================================================
C:\Perl\Test\MIME>perl -w EM.pl
NO MATCH
$VAR1 = 'body';
$VAR2 = 
'PPPBYahooGroupsLinksBBRPULLITovisityourgrouponthewebgotoBRAhrefhttpgroupsyahoocomgrouphaiku
kaiIIIhttpgroupsyahoocomgrouphaikukaiIIIABRnbspLITounsubscribefromthisgroupsendanemailtoBRAhref
<cut... a lot more lines of this stuff> mailt stg10_5FF70102DFA__properties_version100X99q___Nd_Ad0';
$VAR3 = 'head';
$VAR4 = 'HASH0x1625814';
$VAR5 = 'mycrlf';
$VAR6 = '';
$VAR7 = 'header_names';
$VAR8 = 'HASH0x1c60278';
$VAR9 = 'order';
$VAR10 = 'ARRAY0x1af5550';
$VAR11 = 'parts';
$VAR12 = 'ARRAY0x16259ac';
$VAR13 = 'ct';
$VAR14 = 'HASH0x1aa33e4';

C:\Perl\Test\MIME>
#===========================================================

My conclusion is that there's probably no simple

   my $text = msg_clean($email);

function out there, and I'll have to do a top-down parsing of the MIME object to get at the part of the email that interests me. (This would be an alternative to the simple brute-force regex filtering that I'm using right now. That approach works OK as long as the text is enclosed in proper tags, but it breaks easily if it isn't.)

This is basically what several of you (actually all of you) have tried to tell me, but I didn't quite want to give up on my laziness up front... A full parsing of the email is more work, but it is also more robust, and in the long run it will surely allow me to be lazy at a higher level...
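For what it's worth, here is a minimal sketch of what such a top-down walk might look like. It uses Email::MIME (the module from the program above) rather than MIME::Tools, and `msg_clean` and `html_to_text` are hypothetical helper names, not functions from any module. The sample message is made up for the demo. It assumes that text tokens from HTML::TokeParser are good enough as "plain text"; entities such as `&amp;` are not decoded here.

```perl
use strict;
use warnings;
use Email::MIME;
use HTML::TokeParser;

# Hypothetical helper: keep only the text tokens of an HTML string.
sub html_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML";
    my $text = '';
    while (my $token = $parser->get_token) {
        $text .= $token->[1] if $token->[0] eq 'T';    # 'T' marks a text token
    }
    return $text;
}

# Hypothetical helper: walk the MIME tree top-down and collect readable text.
# Accepts either a raw message string or an Email::MIME object.
sub msg_clean {
    my ($part) = @_;
    $part = Email::MIME->new($part) unless ref $part;

    # A multipart container: recurse into each child and join the results.
    my @subparts = $part->subparts;
    return join '', map { msg_clean($_) } @subparts if @subparts;

    # A leaf part: decide by content type (assume text/plain if none is given).
    my $type = $part->content_type || 'text/plain';
    return $part->body               if $type =~ m{^text/plain}i;
    return html_to_text($part->body) if $type =~ m{^text/html}i;
    return '';                       # skip images and other attachments
}

# Made-up two-part message, just to exercise the walk.
my $raw = join "\n",
    'Subject: demo',
    'MIME-Version: 1.0',
    'Content-Type: multipart/mixed; boundary="XX"',
    '',
    '--XX',
    'Content-Type: text/plain',
    '',
    'Plain part.',
    '--XX',
    'Content-Type: text/html',
    '',
    '<p>HTML <b>part</b>.</p>',
    '--XX--',
    '';
print msg_clean($raw), "\n";
```

Note that for a multipart/alternative message this sketch would return both the plain and the HTML rendering of the same text; picking just one of them would need an extra check on the container's content type.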

So I think I'll start digging into MIME::Tools.
Thanks for your patience!
Best regards,
-- Allan


Replies are listed 'Best First'.
Re^3: Extracting TEXT from email by ryantate (Friar) on May 02, 2005 at 18:57 UTC
Hi, I wrote the node you are replying to; I thought I was logged in but wasn't. It looks like you can just take the body of the email -- it may not be MIME encoded at all, from the looks of things -- and run it straight through the HTML::TokeParser code I included to extract the text, without checking the type. In other words, just take the code starting at `#You have an HTML part in part body` through `print $plain_text;` and put it in your final `else` block, where the Data::Dumper code is right now.
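A rough sketch of that rearrangement, assuming the same Email::MIME and HTML::TokeParser modules as the original program: the `else` branch now strips tags from whatever body it gets instead of dumping the object. The names `strip_html` and `extract_text` are made up for illustration, and the sample message is invented, not taken from the Outlook file.

```perl
use strict;
use warnings;
use Email::MIME;
use HTML::TokeParser;

# Hypothetical helper: return only the text tokens of an HTML fragment.
sub strip_html {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot read message text for parsing and cleaning";
    my $plain_text = '';
    while (my $token = $parser->get_token) {
        $plain_text .= $token->[1] if $token->[0] eq 'T';    # text token
    }
    return $plain_text;
}

# Hypothetical helper mirroring the original loop, with the fallback applied.
sub extract_text {
    my ($message) = @_;
    my $parsed = Email::MIME->new($message);
    my $out = '';
    foreach my $part ($parsed->parts) {
        my $type = $part->content_type || '';    # header may be absent
        if ($type =~ m{text/plain}i) {
            $out .= $part->body;
        }
        elsif ($type =~ m{image/jpeg}i) {
            next;                                # nothing textual to extract
        }
        else {
            # text/html, or no usable type at all: strip tags regardless
            $out .= strip_html($part->body);
        }
    }
    return $out;
}

# Invented message with no Content-Type header and an HTML-looking body.
my $demo = "Subject: test\n\n<P><B>Group Links</B><BR>To visit your group</P>\n";
print extract_text($demo), "\n";
```

The trade-off is that any unrecognized part, including binary attachments, also goes through the tag stripper, so this is only reasonable when the message is known to be mostly text.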
