Extracting TEXT from email

ady has asked for the wisdom of the Perl Monks concerning the following question:


Re: Extracting TEXT from email
by PodMaster (Abbot) on Apr 30, 2005 at 11:03 UTC

but these modules only offer to extract mail header & body (including all the HTML tagging & binary mail encoding), so i have to filter out the TEXT myself

HTML::

update: MIME::Tools -> examples -> mimeexplode

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.


Re^2: Extracting TEXT from email

by ady (Deacon) on Apr 30, 2005 at 11:32 UTC

  use Mail::Internet;
  $msgfile = "Angelee.msg";
  open (MSG, "$msgfile") or die "Can't open $msgfile: $!\n";
  $msg = new Mail::Internet \*MSG;
  close (MSG);
  $body = $msg->body();
  $msg->print_body(\*STDOUT);

  msg_clean($body);

msg_clean

body2text


Re^3: Extracting TEXT from email

by PodMaster (Abbot) on Apr 30, 2005 at 11:52 UTC


Re^4: Extracting TEXT from email

by ady (Deacon) on Apr 30, 2005 at 12:03 UTC

Re^5: Extracting TEXT from email

by PodMaster (Abbot) on Apr 30, 2005 at 20:54 UTC

Re: Extracting TEXT from email
by jhourcle (Prior) on Apr 30, 2005 at 13:48 UTC

It would help if you explained what the input was. How is the text that you are trying to get encoded into the email? What are you qualifying as text? (ie, based on the input, what are you trying to get as the output?)

There are many, many ways to encode text into an e-mail (MIME, PGP, PGP+MIME, UUEncode, BinHex, BinHex+MIME, Quoted Printable, etc.) Without knowing what you're dealing with, we can only guess at what it is that you're asking for.


Re^2: Extracting TEXT from email

by ady (Deacon) on Apr 30, 2005 at 15:36 UTC

From:, Sent: To: Issue:

[author]   xxx yyy

[1]  
clouds . . .
the distance blossoming
between two crows

[2]
a morning
without incident
dead fly

[3]
sunrise ceremony
the holy man's third eye
bloodshot

[4]
morning dew
bell bottoms darkened
by mayflies

not


Re^3: Extracting TEXT from email

by jhourcle (Prior) on Apr 30, 2005 at 19:23 UTC

I find it's good to understand what you're working with. (this being said as I deal with data at work that I have absolutely no idea what it actually means)

Basically, email is sent using SMTP as what it calls a mail object, which is composed of headers, an empty line, and a message body.
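
The headers / empty line / body layout described above can be sketched in plain Perl with no modules. This is only an illustration: `split_message` is a hypothetical helper name, the sample message is made up, and a real parser such as the modules discussed in this thread handles many cases this ignores (repeated headers, encoded words, MIME parts).

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw message into a header hash and a body string.
sub split_message {
    my ($raw) = @_;

    # The header block ends at the first empty line (CRLF or bare LF endings)
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $body = '' unless defined $body;

    # Unfold continuation lines: a line starting with whitespace
    # belongs to the previous header
    $head =~ s/\r?\n[ \t]+/ /g;

    my %header;
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $header{lc $name} = $value;    # a repeated header overwrites the earlier one
    }
    return (\%header, $body);
}

# Made-up sample message, with a folded Subject header
my $raw = join "\n",
    'From: author@example.com',
    'Subject: morning',
    ' haiku',
    'Content-Type: text/plain',
    '',
    'clouds . . .',
    'the distance blossoming',
    '';

my ($header, $body) = split_message($raw);
print "Subject: $header->{subject}\n";
print $body;
```

Keeping the header names lower-cased makes lookups such as `$header->{'content-type'}` independent of how the sending client capitalised them.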

Bodies are required to be ASCII, which limits you to 7 bits, but someone thought it would be a good idea to send non-text files, and so came up with MIME. Using MIME, the body may be identified as being one or more encapsulated objects. To mark the body as MIME encoded, additional headers are inserted into the header section of the email message.
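
One of those additional headers is Content-Transfer-Encoding, which says how the body was squeezed into 7-bit ASCII. Here is a minimal sketch of undoing it with the core modules MIME::Base64 and MIME::QuotedPrint. `decode_body` is a hypothetical helper name and the sample strings are made up. It handles only a single, non-multipart body and does nothing about character sets.

```perl
use strict;
use warnings;
use MIME::Base64 qw(decode_base64);
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper: undo the transfer encoding named in the
# Content-Transfer-Encoding header so the body is readable again.
sub decode_body {
    my ($encoding, $body) = @_;
    $encoding = lc($encoding || '7bit');    # a missing header means 7bit

    return decode_base64($body) if $encoding eq 'base64';
    return decode_qp($body)     if $encoding eq 'quoted-printable';
    return $body;                           # 7bit, 8bit and binary need no decoding
}

# Quoted-printable: =27 is the apostrophe
print decode_body('quoted-printable', "the holy man=27s third eye\n");

# Base64 of "dead fly\n"
print decode_body('base64', 'ZGVhZCBmbHkK');
```

The modules mentioned in this thread (MIME::Tools, Email::MIME) do this step for you; the sketch only shows what is going on underneath.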

There's a fair bit of background information in the MIME::Tools documentation.


Re^3: Extracting TEXT from email

by thcsoft (Monk) on May 01, 2005 at 11:11 UTC

language is a virus from outer space.


Re: Extracting TEXT from email
by Anonymous Monk on Apr 30, 2005 at 19:10 UTC

I used Email::MIME, which no one has yet mentioned, to take emails I send from my mobile phone and turn them into Web posts, with plain text and/or a JPEG photo. Here is a modified, untested version of that code which may suit your purposes:

use Email::MIME;
use HTML::TokeParser;

# $message is the full text of the entire email message
my $parsed = Email::MIME->new($message)
  or die "Could not parse email message: $!";
foreach my $part ($parsed->parts) {
  if ($part->content_type =~ /text\/plain/i) {
    #You have a plain text part
    #Do stuff here with $part->body
  } elsif ($part->content_type =~ /image\/jpeg/i) {
    #You have a JPEG part
    #in $part->body
  } elsif ($part->content_type =~ /text\/html/i) {

    #You have an HTML part
    #in part body
    my $html =   $part->body;
    my $plain_text;
    my $parsed_text = HTML::TokeParser->new(\$html)
      or die "Cannot read message text for parsing and cleaning: $!";
    while (my $token = $parsed_text->get_token) {
      if ($token->[0] eq 'T') {
        # text
        $plain_text .= $token->[1];
      }
    }
#Do stuff with $plain_text extracted from HTML here
  }
}

Notice the HTML::TokeParser part inside the HTML section. You'll only want to use that if the plain text part is unavailable to you. HTH.


Re^2: Extracting TEXT from email

by ady (Deacon) on May 01, 2005 at 09:45 UTC

#===========================================================
# Program EM.pl
#!/usr/bin/perl -w
#use strict;
use Email::MIME;
use HTML::TokeParser;
use Data::Dumper;

my $msgfile = "Andrew.msg";    # A test message file from MS Outlook

open (MSG, "$msgfile") or die "Can't open $msgfile: $!\n";
my $message = do { local $/; <MSG> };        # $/=undef; my $e=<FH>;
close(MSG);

my $parsed = Email::MIME->new($message)
   or die "Could not parse email message: $!";
   #$message is full text of entire email message
    
foreach my $part ($parsed->parts) {

   if ($part->content_type =~ /text\/plain/i) {
   #You have a plain text part: do stuff here with $part->body
      print $part->body;
   }
   elsif ($part->content_type =~ /image\/jpeg/i) {
   #You have a JPEG part in $part->body

   } elsif ($part->content_type =~ /text\/html/i) {
      #You have an HTML part in part body
      my $html = $part->body;
      my $plain_text;
      my $parsed_text = HTML::TokeParser->new(\$html)
         or die "Cannot read message text for parsing and cleaning: $!";
      while (my $token = $parsed_text->get_token) {
         if ($token->[0] eq 'T') { $plain_text .= $token->[1];} # text
      }
      #Do stuff with $plain_text extracted from HTML here
      print $plain_text;

   } else {
      print "NO MATCH\n";
      foreach (keys %$part) { $part->{$_} =~ s/\W*//g; }   # Zap non-word characters
      print Data::Dumper->Dump( [%$part] );                # for test output
   }
}

#===========================================================
C:\Perl\Test\MIME>perl -w EM.pl
NO MATCH
$VAR1 = 'body';
$VAR2 = 
'PPPBYahooGroupsLinksBBRPULLITovisityourgrouponthewebgotoBRAhrefhttpgroupsyahoocomgrouphaiku
kaiIIIhttpgroupsyahoocomgrouphaikukaiIIIABRnbspLITounsubscribefromthisgroupsendanemailtoBRAhref
<cut... a lot more lines of this stuff> mailt stg10_5FF70102DFA__properties_version100X99q___Nd_Ad0';
$VAR3 = 'head';
$VAR4 = 'HASH0x1625814';
$VAR5 = 'mycrlf';
$VAR6 = '';
$VAR7 = 'header_names';
$VAR8 = 'HASH0x1c60278';
$VAR9 = 'order';
$VAR10 = 'ARRAY0x1af5550';
$VAR11 = 'parts';
$VAR12 = 'ARRAY0x16259ac';
$VAR13 = 'ct';
$VAR14 = 'HASH0x1aa33e4';

C:\Perl\Test\MIME>
#===========================================================

My conclusion

   my $text = msg_clean($email);

MIME::Tools


Re^3: Extracting TEXT from email

by ryantate (Friar) on May 02, 2005 at 18:57 UTC

It looks like you can just take the body of the email -- it may not be MIME encoded at all from the looks of things -- and run it straight through the HTML::TokeParser code I included to take out the text, without checking the type. In other words, just take the code starting at #You have an HTML part in part body through print $plain_text; and put it in your final else block where the Data Dumper code is right now.
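
As a rough sketch of that suggestion, the tag-stripping loop can be wrapped in a small function and fed the body directly. `html_to_text` is a hypothetical name and the sample body below is made up to resemble the dumped output above. The entity decoding via HTML::Entities is my addition and was not in the code earlier in the thread. Both modules come from the HTML-Parser distribution on CPAN, so this assumes it is installed.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: strip the tags from an HTML string and return the text.
sub html_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html)
        or die "Cannot parse HTML string";

    my $text = '';
    while (my $token = $parser->get_token) {
        next unless $token->[0] eq 'T';    # keep text tokens only

        # Text tokens still carry raw entities such as &amp;, so decode a copy
        my $chunk = $token->[1];
        decode_entities($chunk);
        $text .= $chunk;
    }
    return $text;
}

# Made-up sample body; in EM.pl this would be the undecoded message body
my $body = '<P><B>Yahoo! Groups Links</B><BR>'
         . 'To visit your group on the web, go to '
         . '<A href="http://example.com/">the group page</A></P>';

print html_to_text($body), "\n";
```

Note that text inside script and style elements also arrives as text tokens, so a message with embedded stylesheets would need those filtered out as well.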
