Re: Parsing HTML

HTML is the markup language used to format the text. If you strip it out, you can approximate some things (paragraph breaks, for example), but you lose bold, italics, tables, etc., at least in any OS-independent way. If formatting isn't an absolute requirement, read on...

As far as removing all HTML goes, I happen to like HTML::Strip. It's easy to use, and it produces pretty readable output. It has a habit of indenting a lot, but that's easy to strip out too if you want. Here's the synopsis from its POD:

use HTML::Strip;

my $hs = HTML::Strip->new();

my $clean_text = $hs->parse( $raw_html );
$hs->eof;

$clean_text will now contain the HTML-free version of $raw_html. It's as easy to use as LWP::Simple.
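
If the extra indentation bothers you, a few substitutions will tidy it up. This is only a rough sketch: tidy_text is a helper name I made up (it is not part of HTML::Strip), the sample string is invented, and the rules for what counts as unwanted whitespace are my own assumptions, so adjust them to taste.

```perl
use strict;
use warnings;

# tidy_text is a hypothetical helper, not something HTML::Strip provides.
sub tidy_text {
    my ($text) = @_;
    $text =~ s/^[ \t]+//mg;      # drop leading indentation on every line
    $text =~ s/[ \t]+$//mg;      # drop trailing spaces and tabs on every line
    $text =~ s/[ \t]{2,}/ /g;    # collapse runs of spaces/tabs inside a line
    $text =~ s/\n{3,}/\n\n/g;    # allow at most one blank line in a row
    return $text;
}

# Invented sample standing in for the output of $hs->parse(...)
my $sample = "   Hello   world  \n\n\n\n\t  Second   paragraph\n";
print tidy_text($sample);
```

You would call it on $clean_text after the parse, e.g. my $tidy = tidy_text($clean_text);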

Dave
