PerlMonks  

Getting rid of HTML in POP3Client

by Massyn (Hermit)
on Dec 27, 2002 at 10:02 UTC ( [id://222509]=perlquestion )

Massyn has asked for the wisdom of the Perl Monks concerning the following question:

#!/fellow/monks.pl

In the world of Outlook and Outlook Express, we are *cough* blessed with HTML text within our emails. When these emails reach my Perl script, which uses the POP3Client module, it downloads all the HTML nicely. One problem though: my text parser can't read the HTML. Everything works fine when the mail is sent as plain text, but not when it is sent as HTML.

Is there perhaps a module available that will take the entire body of the email as input and just give me the output of the text portion, excluding any HTML pieces?

Thanks!

#!/massyn.pl The more I learn, the more I realize I don't know. Albert Einstein 1879-1955

Replies are listed 'Best First'.
Re: Getting rid of HTML in POP3Client
by BrowserUk (Patriarch) on Dec 27, 2002 at 11:27 UTC

    The cookbook offers this:

    use HTML::TreeBuilder;
    use HTML::FormatText;
    $html = HTML::TreeBuilder->new();
    $html->parse($document);
    $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
    $ascii = $formatter->format($html);

    It assumes the HTML to be stripped is in the variable $document.
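As a hedged sketch, the cookbook snippet above can be wrapped in a reusable sub. The name html_to_text and the sample string are made up for illustration. The sketch also calls eof once parsing is done and delete once the tree is no longer needed, which HTML::TreeBuilder provides for signalling end of input and releasing the tree.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: turn an HTML string into formatted plain text.
sub html_to_text {
    my ($document) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->parse($document);
    $tree->eof;       # tell the parser that the input is complete

    my $formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
    my $text = $formatter->format($tree);

    $tree->delete;    # release the parse tree
    return $text;
}

# Sample input, invented for this example.
my $sample = '<html><body><p>Hello <b>world</b></p></body></html>';
print html_to_text($sample);
```

This only handles a string that is already pure HTML; pulling that string out of a mail message is a separate step.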


    Examine what is said, not who speaks.

Re: Getting rid of HTML in POP3Client
by tachyon (Chancellor) on Dec 27, 2002 at 10:32 UTC

    You can use MIME::Parser to process the body into a MIME::Entity. Both are part of MIME::Tools; this suite has a few dependencies, i.e. other modules you will need (from memory, IO::Stringy and some others).

    I promised you some code for a webmail app and will email you the base code now. I have been doing the family thing and have not packaged it up yet. It uses a custom MIME parser to process emails. The whole app is some 5000 lines, so it takes a bit of getting your head around, but you will find what you want in parse_head(), parse_body() and related functions.
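A minimal sketch of that approach, assuming the whole message as fetched from the POP3 server is available as one string. The sub name plain_text_of and the sample message are hypothetical; this is not the webmail code mentioned in this reply. It returns the first text/plain part it finds and ignores charset handling.

```perl
use strict;
use warnings;
use MIME::Parser;

# Hypothetical helper: return the decoded body of the first text/plain part.
sub plain_text_of {
    my ($raw) = @_;

    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);    # keep decoded parts in memory, not in files
    my $top = $parser->parse_data($raw);

    # parts_DFS returns the top entity and all nested parts, depth first.
    for my $part ($top->parts_DFS) {
        next unless $part->effective_type eq 'text/plain';
        my $body = $part->bodyhandle or next;
        return $body->as_string;
    }
    return;                        # no text/plain part found
}

# Sample message, invented for this example.
my $raw = <<'END';
From: sender@example.com
To: me@example.com
Subject: test
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary="XYZ"

--XYZ
Content-Type: text/plain; charset=us-ascii

Hello from the plain part.
--XYZ
Content-Type: text/html; charset=us-ascii

<html><body><p>Hello from the <b>HTML</b> part.</p></body></html>
--XYZ--
END

my $text = plain_text_of($raw);
print defined $text ? "$text\n" : "no plain text part\n";
```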

    cheers

    tachyon

    s&&rsenoyhcatreve&&&s&n.+t&"$'$`$\"$\&"&ee&&y&srve&&d&&print

Re: Getting rid of HTML in POP3Client
by IlyaM (Parson) on Dec 27, 2002 at 10:30 UTC
Re: Getting rid of HTML in POP3Client
by davorg (Chancellor) on Dec 27, 2002 at 13:51 UTC

    Any sane email client will (or, at least, can be configured to) send a plain text version alongside the HTML version.

    If your correspondents have their email client misconfigured so it doesn't do that, then I'd be very tempted to simply return the message to them with an appropriate message ("My email client will not read HTML email as the javascript often found within such messages can be dangerous, please resend in plain text"). It would be very easy to configure something like Procmail to do that automatically.

    It's all very well applying a local patch, but I think you'll find it far more satisfying in the long term to fix the source of the problem.

    --
    <http://www.dave.org.uk>

    "The first rule of Perl club is you do not talk about Perl club."
    -- Chip Salzenberg

Re: Getting rid of HTML in POP3Client
by McD (Chaplain) on Dec 27, 2002 at 14:33 UTC
    I'm not 100% sure what your eventual goal is, so I'll speak in general terms.

    Most HTML mail is made up of one or more MIME attachments. The MIME::Tools suite that an earlier poster mentioned will let you take those apart and look at the pieces.

    MIME messages contain entities, each of which in turn can contain their own entities. A single entity is something like text, or a picture, or a binary attachment of some sort. Usually, the HTML attachment is a single entity.

    Having used MIME::Tools to break your message down into a collection of entities, you can use the HTML modules another poster mentioned to construct just the formatted text from an HTML entity.
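One way the two steps might be combined, as a sketch only: prefer an existing text/plain part, and fall back to formatting the first text/html part when there is none. The sub name message_text and the sample message are assumptions made for illustration, and charsets and multiple HTML parts are not handled.

```perl
use strict;
use warnings;
use MIME::Parser;
use HTML::TreeBuilder;
use HTML::FormatText;

# Hypothetical helper: best-effort plain text for a raw mail message.
sub message_text {
    my ($raw) = @_;

    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);    # keep decoded parts in memory
    my $top = $parser->parse_data($raw);

    my ($plain, $html);
    for my $part ($top->parts_DFS) {
        my $body = $part->bodyhandle or next;    # multipart containers have no body
        my $type = $part->effective_type;
        $plain = $body->as_string if !defined $plain && $type eq 'text/plain';
        $html  = $body->as_string if !defined $html  && $type eq 'text/html';
    }

    return $plain if defined $plain;   # a plain text version already exists
    return unless defined $html;       # nothing textual at all

    # Fall back to formatting the HTML entity as text.
    my $tree = HTML::TreeBuilder->new;
    $tree->parse($html);
    $tree->eof;
    my $text = HTML::FormatText->new(leftmargin => 0, rightmargin => 72)
                               ->format($tree);
    $tree->delete;
    return $text;
}

# Sample HTML-only message, invented for this example.
my $html_only = <<'END';
From: sender@example.com
Subject: html only
MIME-Version: 1.0
Content-Type: text/html; charset=us-ascii

<html><body><p>Hello from an <b>HTML only</b> mail.</p></body></html>
END

print message_text($html_only);
```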

    At that point, it gets a little tricky. What will you do with this formatted text? Replace the HTML entity with an entity of type text/plain? Turn the root entity into a multipart/alternative message and attach your new text/plain entity as an alternate? What if there already was an alternate? What if there's more than one HTML entity? What if this wasn't a MIME message, but just a plain old mail message which happened to contain HTML?
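To make one of those options concrete, here is a sketch of building a fresh multipart/alternative message with MIME::Entity, with the text version attached first and the HTML version last. The sub name build_alternative, its named arguments, and the addresses are made up for illustration. It does not answer the harder questions above, such as what to do when an alternate already exists.

```perl
use strict;
use warnings;
use MIME::Entity;

# Hypothetical helper: wrap a text version and an HTML version of the
# same content in one multipart/alternative entity.
sub build_alternative {
    my (%arg) = @_;

    my $top = MIME::Entity->build(
        Type    => 'multipart/alternative',
        From    => $arg{from},
        To      => $arg{to},
        Subject => $arg{subject},
    );

    # Simplest alternative first, richest last.
    $top->attach(Type => 'text/plain', Data => $arg{text});
    $top->attach(Type => 'text/html',  Data => $arg{html});
    return $top;
}

# Sample values, invented for this example.
my $msg = build_alternative(
    from    => 'sender@example.com',
    to      => 'me@example.com',
    subject => 'test',
    text    => "Hello world\n",
    html    => "<html><body><p>Hello <b>world</b></p></body></html>\n",
);
print $msg->stringify;
```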

    (Or maybe you just want to do some processing on the text and leave the mail unchanged, I don't know.)

    All kinds of things are possible, but you need to ramp up a little on MIME messages (and how Outlook will process them) before you can really make an informed decision. O'Reilly's Programming Internet Email is a great resource for this.

    Peace,
    -McD
•Re: Getting rid of HTML in POP3Client
by merlyn (Sage) on Dec 27, 2002 at 16:09 UTC

Node Type: perlquestion [id://222509]
Approved by rob_au