in reply to Extracting TEXT from email

but these modules only offer to extract the mail header & body (including all the HTML tagging & binary mail encoding), so I have to filter out the TEXT myself
Those modules give you everything you need to get at the message. If somebody sends you HTML, you should first complain, and then grab something from the HTML:: namespace (like ::Stripper, ::Scrubber) so you can remove it.
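
As a rough illustration of the HTML:: suggestion, here is a minimal sketch using HTML::Scrubber from CPAN. It assumes the module is installed; the helper name strip_html and the sample input are made up for the example. A scrubber created with no "allow" rules denies every tag, so only the text content should come back.

```perl
use strict;
use warnings;
use HTML::Scrubber;

# Hypothetical helper: remove all markup from an HTML string.
sub strip_html {
    my ($html) = @_;
    # No "allow" rules are given, so every tag is denied and dropped.
    my $scrubber = HTML::Scrubber->new;
    return $scrubber->scrub($html);
}

my $text = strip_html('<p>Hello <b>world</b></p>');   # made-up sample input
print "$text\n";
```

This does not decode character entities such as &amp;, so a real filter would probably need one more pass for that.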

update: MIME::Tools -> examples -> mimeexplode

MJD says "you can't just make shit up and expect the computer to know what you mean, retardo!"
I run a Win32 PPM repository for perl 5.6.x and 5.8.x -- I take requests (README).
** The third rule of perl club is a statement of fact: pod is sexy.

Re^2: Extracting TEXT from email
by ady (Deacon) on Apr 30, 2005 at 11:32 UTC
    Thanks.
    For further clarification, an example:

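    use strict;
    use Mail::Internet;

    my $msgfile = "Angelee.msg";
    open(my $fh, '<', $msgfile) or die "Can't open $msgfile: $!\n";
    my $msg = Mail::Internet->new($fh);   # parse header and body
    close($fh);

    my $body = $msg->body();              # array ref of raw body lines
    $msg->print_body(\*STDOUT);           # dump the undecoded body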

    The message body as dumped to the terminal consists of approx. 80% binary mail encoding and HTML formatting characters; only about 20% corresponds to the transmitted TEXT payload.

    I'd like a function to strip off all that junk:
    msg_clean($body);

    OK, so I've written some regex filtering to do it, but that's hardly as flexible & robust as real MIME-aware parsing could be.
    I expect there is a msg_clean or body2text or equivalent function out there?

    allan
      How about you go study mimeexplode?

        I did try to run mimeexplode on a msg file, and it placed the body of the msg in a .txt file in a subdirectory.

        But this "exploded" file contains the same amount of binary & HTML "junk", so it does not spare me the job of post-filtering to get at the payload TEXT.
        -- allan
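
For what it's worth, something close to the msg_clean / body2text asked for above can be sketched with MIME::Parser from MIME-tools, the same distribution that ships mimeexplode. This is only a sketch under assumptions: the helper name msg_text is hypothetical, the input is assumed to be an RFC 822 / MIME message file, and only text/plain parts are collected. A message that carries nothing but a text/html part would return an empty string here and still need its tags stripped separately.

```perl
use strict;
use warnings;
use MIME::Parser;

# Hypothetical helper: return the decoded text/plain parts of a message file.
sub msg_text {
    my ($file) = @_;

    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);      # keep decoded bodies in memory, not on disk
    my $entity = $parser->parse_open($file);

    my @text;
    my @todo = ($entity);
    while (my $part = shift @todo) {
        if (my @children = $part->parts) {
            unshift @todo, @children;    # multipart: descend, keeping document order
        }
        elsif ($part->effective_type eq 'text/plain') {
            # bodyhandle gives the body after base64 / quoted-printable decoding
            push @text, $part->bodyhandle->as_string;
        }
    }
    return join "\n", @text;
}

# Usage: perl msg_text.pl some-message-file
print msg_text($ARGV[0]) if @ARGV;
```

Walking the parts by hand rather than reading the exploded files keeps the choice of content type in one place, so preferring text/plain over text/html (or the reverse) is a one-line change.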