Email parsing CPAN module?

woei has asked for the wisdom of the Perl Monks concerning the following question:

Hi fellow monks!

I'm currently working on a script that retrieves an email via POP3, performs some text replacements on the email body, and sends it out again.

Right now it performs the text replacements via regex on the raw message content. That works OK except when emails are encoded as quoted-printable, where long lines are split across multiple lines, like this:

Thank you Joanne for your informative email and for organising the =
mailing list - let's all commit to...

Since words can be split across multiple lines, my regexes fail for any text that happens to fall on a line boundary.

I'm aware of the MIME::QuotedPrint module, so I tried stitching the email lines together (separated by "\n"s) and passing the result through MIME::QuotedPrint::decode, but the real trouble comes when some lines are encoded like this:

centre)<BR></FONT>&nbsp;<BR><FONT color=3D#c00000><STRONG>31/5/09=20
</STRONG></FONT><FONT color=3D#3f3f3f>Medical work (including=20

I have not had a chance to experiment with emails that contain MIME-encoded file attachments yet, but I suspect passing them through MIME::QuotedPrint::decode will not work well either.

What's the best way I can go about solving this problem?
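
For reference, here is a minimal sketch of what MIME::QuotedPrint does with the two samples above. It assumes the body has already been separated from the headers and that the whole body is quoted-printable. The sample string and the replacement pattern are made up for illustration. A trailing "=" is a soft line break and disappears on decoding, while "=3D" and "=20" decode to "=" and a space.

```perl
use strict;
use warnings;
use MIME::QuotedPrint qw(decode_qp encode_qp);

# Quoted-printable body text as it would appear in the raw message
# (illustrative sample assembled from the snippets in the question).
my $encoded = join "\n",
    'Thank you Joanne for your informative email and for organising the =',
    "mailing list - let's all commit to...",
    '<FONT color=3D#c00000><STRONG>31/5/09=20',
    '</STRONG></FONT>',
    '';

# Decoding removes soft line breaks ("=" at end of line) and expands =XX escapes.
my $decoded = decode_qp($encoded);

# The regex now sees the phrase on one line (the pattern is only an example).
(my $edited = $decoded) =~ s/organising the mailing list/organising the announcements list/;

# Re-encode before putting the body back into a quoted-printable message.
my $reencoded = encode_qp($edited);

print $edited;
```

This only covers a single quoted-printable body. Multipart messages and attachments each carry their own Content-Transfer-Encoding, so they need a real MIME parser rather than one decode call over the whole message.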


Replies are listed 'Best First'.
Re: Email parsing CPAN module? by GrandFather (Saint) on May 20, 2009 at 02:11 UTC
A starting point may be MIME::Parser to pick apart the message, but it looks like you need something like HTML::Parser or HTML::TreeBuilder to pull out the text you need to process from an HTML document.

True laziness is hard work
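
A rough sketch of that suggestion, assuming MIME::Parser (from MIME-tools) and HTML::TreeBuilder (from HTML-Tree) are installed from CPAN. The message in $raw is a hand-written stand-in for one fetched over POP3, and the addresses are placeholders. The parser undoes each part's transfer encoding, so the text handed to HTML::TreeBuilder is already free of "=" line breaks and "=3D" escapes.

```perl
use strict;
use warnings;
use MIME::Parser;         # MIME-tools distribution
use HTML::TreeBuilder;    # HTML-Tree distribution

# Hypothetical single-part HTML message, quoted-printable encoded.
my $raw = join "\n",
    'From: sender@example.com',
    'To: list@example.com',
    'Subject: test',
    'MIME-Version: 1.0',
    'Content-Type: text/html; charset="us-ascii"',
    'Content-Transfer-Encoding: quoted-printable',
    '',
    '<FONT color=3D#3f3f3f>Medical work (including=20',
    'travel)</FONT> for the =',
    'mailing list',
    '';

my $parser = MIME::Parser->new;
$parser->output_to_core(1);    # keep decoded bodies in memory, not in files
my $entity = $parser->parse_data($raw);

my @texts;
for my $part ($entity->parts_DFS) {           # the entity itself, then any subparts
    my $body = $part->bodyhandle or next;     # multipart containers have no body
    my $type = $part->effective_type;
    next unless $type =~ m{^text/};           # skip attachments and other binaries

    my $content = $body->as_string;           # transfer encoding already decoded
    if ($type eq 'text/html') {
        my $tree = HTML::TreeBuilder->new_from_content($content);
        $content = $tree->as_text;            # visible text with the markup removed
        $tree->delete;
    }
    push @texts, $content;
}

print "$_\n" for @texts;
```

This only extracts the text for matching. Writing replacements back into the HTML part and re-encoding the message for sending is a further step that the sketch does not attempt.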