Re: Extracting TEXT from email
by PodMaster (Abbot) on Apr 30, 2005 at 11:03 UTC
Thanks.
For further clarification, an example:
use strict;
use warnings;
use Mail::Internet;
my $msgfile = "Angelee.msg";
open(my $fh, '<', $msgfile) or die "Can't open $msgfile: $!\n";
my $msg = Mail::Internet->new($fh);
close($fh);
my $body = $msg->body();    # array ref of body lines
$msg->print_body(\*STDOUT);
The message body as dumped to the terminal is roughly 80% binary MIME data and HTML formatting characters; only about 20% corresponds to the transmitted TEXT payload.
I'd like a function to strip off all that junk:
msg_clean($body);
OK, so I've written some regex filtering to do it, but that's hardly as flexible and robust as real MIME-aware parsing would be. I expect there is a msg_clean or body2text or equivalent function out there?
allan
How about you go study mimeexplode?
Re: Extracting TEXT from email
by jhourcle (Prior) on Apr 30, 2005 at 13:48 UTC
It would help if you explained what the input is. How is the text that you are trying to get encoded into the email? What are you qualifying as text? (I.e., based on the input, what are you trying to get as the output?)
There are many, many ways to encode text into an e-mail (MIME, PGP, PGP+MIME, UUEncode, BinHex, BinHex+MIME, Quoted Printable, etc.). Without knowing what you're dealing with, we can only guess at what it is that you're asking for.
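To make the point about encodings concrete, here is a minimal sketch of undoing the two most common MIME transfer encodings with modules that ship with Perl (MIME::Base64 and MIME::QuotedPrint). The helper name decode_transfer is made up for illustration; it is not part of any module, and it ignores the other encodings listed above.

```perl
use strict;
use warnings;
use MIME::Base64      qw(decode_base64);
use MIME::QuotedPrint qw(decode_qp);

# Hypothetical helper (not from any module): decode one body according
# to the value of its Content-Transfer-Encoding header.
sub decode_transfer {
    my ($encoding, $raw) = @_;
    $encoding = lc($encoding // '7bit');
    return decode_base64($raw) if $encoding eq 'base64';
    return decode_qp($raw)     if $encoding eq 'quoted-printable';
    return $raw;    # 7bit, 8bit, binary: nothing to undo
}

# Quoted-printable: "=\n" is a soft line break, "=E9" is one byte.
print decode_transfer('quoted-printable', "morning dew=\n bell bottoms\n");
# Base64: the encoded form of a short text line.
print decode_transfer('base64', "ZGVhZCBmbHkK");
```

Which branch applies depends on the Content-Transfer-Encoding header of each part, which is why knowing the input format matters before any text extraction.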
I'm at the "receiving end" of the mail wire: I receive mails in my (MS Windows Exchange) inbox encoded in standard mail/MIME format.
I'm interested in the text part of the body of these mails, that is, "what follows the mail header" (i.e. the From:, Sent:, To:, Issue: stuff). The body contains haiku entries that I parse and reshuffle into a voting list and subsequently rank according to received votes, but the app as such is not that interesting in this context.
An example of the text part of a mail message is:
[author] xxx yyy
[1]
clouds . . .
the distance blossoming
between two crows
[2]
a morning
without incident
dead fly
[3]
sunrise ceremony
the holy man's third eye
bloodshot
[4]
morning dew
bell bottoms darkened
by mayflies
This is what I'm interested in parsing out, and this is the text part of the message that is displayed in the mail client (in this case, MS Outlook).
The problem is that the above text is not what I get from the mail body handed over by the mentioned MIME modules. Instead I get the full mail body segment, including binary MIME encodings and HTML tagging.
So I have to do some filtering to get at the text "payload" that I need for the app. Now I was wondering if anybody had already wrapped this functionality into a function, possibly in a MIME module. That was my question.
I haven't worked with email before, so maybe I'm simply overlooking some basic assumptions about the MIME format and parsing...
-- allan
I find it's good to understand what you're working with. (this being said as I deal with data at work that I have absolutely no idea what it actually means)
Basically, email is sent using SMTP as what it calls a mail object, which is composed of headers, an empty line, and a message body.
Bodies are required to be ASCII, which limits you to 7 bits, but someone thought it would be a good idea to send non-text files, so they came up with MIME. Using MIME, the body may be identified as being one or more encapsulated objects. To mark the body as MIME encoded, additional headers are inserted into the header section of the email message.
There's a fair bit of background information in the MIME::Tools documentation.
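As a rough illustration of the "headers, an empty line, and a message body" layout, here is a core-Perl sketch that splits a raw message at the first empty line and checks for the extra MIME headers. The names split_message and looks_like_mime are invented for this example, and a real parser such as MIME::Tools handles many cases this sketch ignores.

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw message at the first empty line and
# return (\%headers, $body). Header names are lowercased, folded
# continuation lines are joined, and a repeated header keeps its last value.
sub split_message {
    my ($raw) = @_;
    my ($head, $body) = split /\r?\n\r?\n/, $raw, 2;
    $head = '' unless defined $head;
    $body = '' unless defined $body;
    $head =~ s/\r?\n[ \t]+/ /g;    # unfold continuation lines
    my %headers;
    for my $line (split /\r?\n/, $head) {
        my ($name, $value) = $line =~ /^([^:\s]+):\s*(.*)$/ or next;
        $headers{lc $name} = $value;
    }
    return (\%headers, $body);
}

# Hypothetical helper: true if the headers mark the body as MIME.
sub looks_like_mime {
    my ($headers) = @_;
    return exists($headers->{'mime-version'})
        || ($headers->{'content-type'} // '') =~ m{^multipart/}i;
}

my $raw = "From: a\@example.com\r\nMIME-Version: 1.0\r\n"
        . "Content-Type: multipart/alternative;\r\n boundary=\"x\"\r\n"
        . "\r\nbody text\r\n";
my ($headers, $body) = split_message($raw);
print "MIME message\n" if looks_like_mime($headers);
```

If looks_like_mime is false, the body is plain text as it stands; if true, the Content-Type header says how the body is divided into parts.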
Re: Extracting TEXT from email
by Anonymous Monk on Apr 30, 2005 at 19:10 UTC
Interesting. Many if not most HTML messages are sent multipart--several MIME types in one message, including at minimum an HTML "part" for HTMLized email clients and a plain text part for older email clients or people who prefer to read their messages in that format. Any attachments will also have their own MIME part of the appropriate type.
I used Email::MIME, which no one has yet mentioned, to take emails I send from my mobile phone and turn them into Web posts, with plain text and/or JPEG photo. Here is a modified, untested version of that code which may suit your purposes:
use Email::MIME;
use HTML::TokeParser;

# $message is the full text of the entire email message
my $parsed = Email::MIME->new($message)
    or die "Could not parse email message";
foreach my $part ($parsed->parts) {
    if ($part->content_type =~ /text\/plain/i) {
        # You have a plain text part
        # Do stuff here with $part->body
    } elsif ($part->content_type =~ /image\/jpeg/i) {
        # You have a JPEG part in $part->body
    } elsif ($part->content_type =~ /text\/html/i) {
        # You have an HTML part in $part->body
        my $html = $part->body;
        my $plain_text = '';
        my $parsed_text = HTML::TokeParser->new(\$html)
            or die "Cannot read message text for parsing and cleaning: $!";
        while (my $token = $parsed_text->get_token) {
            if ($token->[0] eq 'T') {
                # text
                $plain_text .= $token->[1];
            }
        }
        # Do stuff with $plain_text extracted from HTML here
    }
}
Notice the HTML::TokeParser part inside the HTML section. You'll only want to use that if the plain text part is unavailable to you. HTH.
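One case the loop above does not cover is nested multiparts (for example a multipart/alternative inside a multipart/mixed). A possible sketch, assuming Email::MIME is installed, is a small recursive search using its subparts method. The helper find_part is a made-up name for illustration, and treating a missing Content-Type as text/plain is an assumption based on the MIME default.

```perl
use strict;
use warnings;
use Email::MIME;

# Hypothetical helper: depth-first search for the first leaf part whose
# Content-Type starts with $type. Returns the Email::MIME part or undef.
sub find_part {
    my ($part, $type) = @_;
    my @kids = $part->subparts;
    if (!@kids) {
        # A part without a Content-Type header defaults to text/plain.
        my $ct = $part->content_type // 'text/plain';
        return $ct =~ m{^\Q$type\E}i ? $part : undef;
    }
    for my $kid (@kids) {
        my $hit = find_part($kid, $type);
        return $hit if $hit;
    }
    return;
}

# A tiny two-part test message built by hand.
my $raw = join "\r\n",
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="b1"',
    '',
    '--b1',
    'Content-Type: text/plain; charset=us-ascii',
    '',
    'dead fly',
    '--b1',
    'Content-Type: text/html; charset=us-ascii',
    '',
    '<p>dead fly</p>',
    '--b1--',
    '';

my $msg  = Email::MIME->new($raw);
my $part = find_part($msg, 'text/plain') || find_part($msg, 'text/html');
print $part->body if $part;
```

Trying text/plain first and falling back to text/html matches the advice above: only strip HTML when no plain text part exists.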
I like this approach, but alas, it didn't recognize any text or HTML parts in my message.
#===========================================================
# Program EM.pl
#!/usr/bin/perl -w
#use strict;
use Email::MIME;
use HTML::TokeParser;
use Data::Dumper;
my $msgfile = "Andrew.msg"; # A test message file from MS Outlook
open (MSG, "$msgfile") or die "Can't open $msgfile: $!\n";
my $message = do { local $/; <MSG> }; # $/=undef; my $e=<FH>;
close(MSG);
my $parsed = Email::MIME->new($message)
or die "Could not parse email message: $!";
#$message is full text of entire email message
foreach my $part ($parsed->parts) {
if ($part->content_type =~ /text\/plain/i) {
#You have a plain text part: do stuff here with $part->body
print $part->body;
}
elsif ($part->content_type =~ /image\/jpeg/i) {
#You have a JPEG part in $part->body
} elsif ($part->content_type =~ /text\/html/i) {
#You have an HTML part in part body
my $html = $part->body;
my $plain_text;
my $parsed_text = HTML::TokeParser->new(\$html)
or die "Cannot read message text for parsing and cleaning: $!";
while (my $token = $parsed_text->get_token) {
if ($token->[0] eq 'T') { $plain_text .= $token->[1];} # text
}
#Do stuff with $plain_text extracted from HTML here
print $plain_text;
} else {
print "NO MATCH\n";
foreach (keys %$part) { $part->{$_} =~ s/\W*//g; } # Zap non-word chars
print Data::Dumper->Dump( [%$part] ); # for test output
}
}
#===========================================================
C:\Perl\Test\MIME>perl -w EM.pl
NO MATCH
$VAR1 = 'body';
$VAR2 =
'PPPBYahooGroupsLinksBBRPULLITovisityourgrouponthewebgotoBRAhrefhttpgroupsyahoocomgrouphaiku
kaiIIIhttpgroupsyahoocomgrouphaikukaiIIIABRnbspLITounsubscribefromthisgroupsendanemailtoBRAhref
<cut... a lot more lines of this stuff> mailt stg10_5FF70102DFA__properties_version100X99q___Nd_Ad0';
$VAR3 = 'head';
$VAR4 = 'HASH0x1625814';
$VAR5 = 'mycrlf';
$VAR6 = '';
$VAR7 = 'header_names';
$VAR8 = 'HASH0x1c60278';
$VAR9 = 'order';
$VAR10 = 'ARRAY0x1af5550';
$VAR11 = 'parts';
$VAR12 = 'ARRAY0x16259ac';
$VAR13 = 'ct';
$VAR14 = 'HASH0x1aa33e4';
C:\Perl\Test\MIME>
#===========================================================
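A side note on the dump above: the fragment ending in "__properties_version100" resembles a stream name used inside Outlook's binary .msg files, which are OLE compound documents rather than RFC 822 text. If that is what the test file is (an assumption, since the file itself isn't shown), no MIME parser will find parts in it. A quick core-Perl check of the first bytes can tell the two apart; the helper names here are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the data starts with the 8-byte OLE2
# compound-file signature rather than header text.
sub is_ole_compound {
    my ($bytes) = @_;
    return substr($bytes, 0, 8) eq "\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1";
}

# Hypothetical helper: true if the data starts like an RFC 822 header
# line, i.e. a field name of printable ASCII followed by a colon.
sub looks_like_rfc822 {
    my ($bytes) = @_;
    return $bytes =~ /\A[\x21-\x39\x3B-\x7E]+:/;
}

# Usage: read the start of the file in binary mode, then test it.
my $file = shift @ARGV;
if (defined $file) {
    open(my $fh, '<:raw', $file) or die "Can't open $file: $!\n";
    read($fh, my $start, 512);
    close($fh);
    $start = '' unless defined $start;
    print is_ole_compound($start)   ? "OLE compound file (Outlook .msg?)\n"
        : looks_like_rfc822($start) ? "looks like an RFC 822 message\n"
        :                             "unknown format\n";
}
```

If the file turns out to be a compound document, the message would need to be exported as plain RFC 822 text (or read with a module that understands the .msg format) before Email::MIME or MIME::Tools can do anything useful with it.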
My conclusion is that there's probably no simple
my $text = msg_clean($email);
function out there, and I'll have to do a top-down parse of the MIME object to get at the part of the email that interests me (as an alternative to the simple brute-force regex filtering that I'm using right now. The latter approach works OK as long as the text is enclosed in proper tags, but it easily breaks if it isn't).
This is basically what several of you (actually all of you) have tried to tell me, but I didn't quite want to give up on my laziness up front... A full parse of the email is more work, but also more robust, and in the long run it will surely allow me to be lazy at a higher level...
So I think I'll start digging into MIME::Tools.
thanks for your patience!
Best regards -- Allan
Hi, I wrote the node you are replying to, thought I was logged in but wasn't.
It looks like you can just take the body of the email -- it may not be MIME encoded at all from the looks of things -- and run it straight through the HTML::TokeParser code I included to take out the text, without checking the type. In other words, just take the code starting at #You have an HTML part in part body through print $plain_text; and put it in your final else block where the Data Dumper code is right now.
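A minimal sketch of that suggestion, wrapped in a function so it can be applied to any HTML string without checking the part type. It assumes HTML::TokeParser and HTML::Entities (both from the HTML-Parser distribution) are installed. The name html_to_text is made up, and turning br, p and div start tags into newlines is an extra assumption to keep haiku lines apart, not something the code above did.

```perl
use strict;
use warnings;
use HTML::TokeParser;
use HTML::Entities qw(decode_entities);

# Hypothetical helper: keep only the text tokens of an HTML string,
# decode entities such as &amp; and start a new line at a few
# block-level tags.
sub html_to_text {
    my ($html) = @_;
    my $parser = HTML::TokeParser->new(\$html) or return '';
    my $text = '';
    while (my $token = $parser->get_token) {
        my $kind = $token->[0];
        if ($kind eq 'S' && $token->[1] =~ /^(?:br|p|div)$/) {
            $text .= "\n";                           # start tag: line break
        } elsif ($kind eq 'T') {
            $text .= decode_entities($token->[1]);   # text token
        }
    }
    return $text;
}

print html_to_text('<p>morning dew<br>bell bottoms &amp; mayflies</p>'), "\n";
```

This still passes through text inside script and style elements, so it is a starting point rather than a full HTML-to-text converter.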